Cassandra Architecture and Operation

Architecture

Cassandra can be configure to be rack and data center aware.
Cassandra usages Gossip protocol to failure detection. It uses an algorithm for distributed computing called Phi Accrual Failure Detection. The white paper can be found here.
Cassandra uses snitch to determine relative host proximity for each node in a cluster and to gather information about the network topology. Cassandra implements different types of snitch that can be configure. It is possible to build your own snitch as well.
Partitions are distributed by Cassandra assigning each nodes a token. Tokens are reassign when a node is taken out of the cluster by using vnode.
Cassandra keyspace can be replicated by setting the replication factor. Cassandra can replicate data depending on the network topology. Replicas can be set across data centers and racks.
Consistency levels can be control by the client:
- One -> One replica node must response to WRITE or READ request.
- Two -> Two replica nodes must response to WRITE or READ request.
- Three -> Three replica nodes must response to WRITE or READ request.
- Quorum -> (replication factor / 2 + 1) to WRITE or READ request.
- All -> All replica nodes must response to WRITE or READ request.
- Queries can be run in any of the Cassandra nodes in a cluster. This node will become the coordinator node and coordinator node will identify which node the data has to be written or read from.
Cassandra keeps internal data in 3 places Memtables, SSTables and commit logs.
- Commit logs are immediately written when a WRITE operation happen. This logs are use to recover data when Cassandra is restarted and the data has not been saved to disk.
- Memtables is Cassandra's in-memory storage. This data is not kept in JVM but in a native memory implementation. Each commit log has a one bit flag to determine if data need to be flush to disk.
- SSTables are use to keep Cassandra data after it has been flush from memory. SStables are immutable. SSTables are only removed after a mergesort and the old files are remove on success.
- Memtables and SSTables are read to find data values for a READ query.
- Cassandra Caching layers
- Key Cache stores a map of partition key to row index entries making it easy to read access into SSTables. This cache is kept in the JVM heap.
- Row Cache stores a full row to improve reads for frequently access rows. This cache is kept in an off-heap memory. This is off by default.
- Counter Cache is use to reduce lock contention for the most frequently access counters. This is store in the JVM heap.
If a node is offline where a write needs to happen the coordinator node will hold the write as part of a hinted handoff. This write will later be send to the offline node after it gets back online. Only applies to ANY consistency level.
Cassandra support lightweight transaction by using IF NOT EXISTS
Cassandra doesn't delete rows but marks it for deleting and this tombstones are GC after 10 days. 10 days is the default setting. The GC is delayed to give offline nodes enough time to come back online. If the node down then the node is treated as failed and replaced.
Cassandra Compacts larger files into one and create new index of the new sorted data. A new SSTables is created. Compaction increase performance as well by reducing the number of required seeks. During compacting disk IO spikes.
Cassandra reads data from multiple replicas in order to achieve the requested consistency level, and detects if any replicas have out of date values. If the number of nodes have a out of date values, a read repair is immediately done. Otherwise, the repair happens in the background.
Anti-entropy repair is done manually after a node has been offline for a while. This is done using the nodetool repair.

Performance

Create a performance goal and use cassandra-stress tool to generate test load. It is better to do this in the production hardware.
It is recommend to keep commit logs and SSTables in separate hard drives to improve performance.
Use tracing to see how a query is behaving.
Make sure to use caching when possible. Caching can be configure per table
Caches can be save to disk and be reuse after a node is restarted.
Cassandra will use JVM heap memory and off-heap memory of Memtables. You can influence cassandra to use different type of memory with memtable_allocation_type. off-heap memory is affected less by the JVM GC.
It is better to use multiple processing cores over one or two very fast ones.
Cassanra has write_survey mode. This mode allows the live traffic to hit this node but won't affect the cluster. The node never joins and will never handle real traffic.

Maintenance

Flush Memtables to disk with nodetool flush on the node. You can also flush specific keyspace with nodetool flush keyspace_name.
nodetool cleanup scans all nodes and discards any data that is no owned by the node.
If a node gets out of sync and the cluster has been gc, you will need to repair the cluster with nodetool repair. The repair command can be limited to a data center with -in-local-dc or -dc . After repairing a table you should rebuild the index with nodetool rebuild_index. Both repair and rebuild_index are CPU and I/O intensive.
You can decommission a node by running nodetool decommission on the node. You can see the progess of the decommission by using nodetool netstats.
To remove a down node you can use nodetool removenode.
When upgrading a cluster make sure to read NEWS.txt and follow the upgrading instructions there.
In most cases Cassandra can be upgraded by a rolling upgrade.
- Run nodetool drain to flush writes to disk and stop the node processing writes. Another node will take over the writes.
- Stop the node.
- Make sure to backup cassandra.yml and cassandra-env.sh.
- Install new version.
- update new cassandra.yml and cassandra-env.sh to match your backup configuration.
Backing up Cassandra
- Full backups
- Incremental backups
You can configure Cassandra to take auto snapshots when a DROP keyspace, DROP table or TRUNCATE.

Monitoring

Cassandra exposes JMX with all its metrics points
Heap Usage
CPU Usage
nodetool can also be helpful to check the status of a cluster.
nodetool tpstats
nodetool proxyhistogram show how the whole cluster is performing.
nodetool tablehistogram show how the table is performing.
Metrics are reset after a node is restarted.

Resources
https://www.amazon.com/Cassandra-Definitive-Guide-Eben-Hewitt/dp/1449390412
https://academy.datastax.com

Architecture

Performance

Maintenance

Monitoring

Resources