For my master's thesis I'm going to be working with Cassandra, an open source distributed database management system, and therefore I'll probably write a lot about it throughout the next year. To get started let's take a look at one of the biggest differences from this kind of DBMS to the classical relational systems, the data model.
Cassandra that was created on Facebook, first started as an incubation project at Apache in January of 2009 and is based on Dynamo and BigTable. This system can be defined as an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database.
Cassandra is distributed, which means that it is capable of running on multiple machines while the users see it as if it was running in only one. More than that, Cassandra is built and optimized to run in more than one machine. So much that you cannot take full advantage of all of its features without doing so. In Cassandra, all of the nodes are identical, there is no such thing as a node that is responsible for certain organizing operations, as in BigTable or HBase. Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead.
Being decentralized means that there is no single point of failure, because all the servers are symmetrical. The main advantages of decentralization are that it is easier to use than master/slave and it helps to avoid suspension in service, thus supporting high availability.
Scalability is the ability to have little degradation in performance when facing a greater number of requests. It can be of two types:
- Vertical - Adding hardware capacity and/or memory
- Horizontal - Adding more machines with all or some of the data so that all of it is replicated at least in two machines. The software must keep all the machines in sync.
Elastic scalability refers to the capability of a cluster to seamlessly accept new nodes or removing them without any need to change the queries, rebalance data manually or restart the system.
Cassandra is highly available in the sense that if a node fails it can be replaced with no downtime and the data can be replicated through data centers to prevent that same downtime in the case of one of them experiencing a catastrophe, such as an earthquake or flood.
Eric Brewer's CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency
- Availability
- Partition Tolerance
Consistency essentially means that a read always return the most recently written value, which is guaranteed to happen when the state of a write is consistent among all nodes that have that data (the updates have a global order). Most NoSQL implementations, including Cassandra, focus on availability and partition tolerance, relaxing the consistency guarantee, providing eventual consistency.
Eventual consistency is seen by many as impracticable for sensitive data, data that cannot be lost. The reality is not so black and white, and the binary opposition between consistent and not-consistent is not truly reflected in practice, there are instead degrees of consistency such as serializability and causal consistency. In the particular case of Cassandra the consistency can be considered tuneable in the sense that the number of replicas that will block on an update can be configured on an operation basis by setting the consistency level combined with the replication factor.
The NoSQL movement members (which includes Cassandra) focus on the last two, relaxing the consistency bit. They provide what is know as eventual consistency. Having said that, let's take a closer look at Cassandra's data model.
Usually, NoSQL implementations are key-value stores that have nearly no structure in their data model apart from what can be perceived as an associative array. On the other hand, Cassandra is a row oriented database system, with a rather complex data model. It is frequently referred to as column oriented, and this is not wrong in the sense that it is not relational. But data in Cassandra is actually stored in rows indexed by a unique key, but each row does not need to have the same columns (number or type) as the ones in the same column family.
The basic building block of Cassandra are the Columns. They are nothing but a tuple with three elements, a name, a value and a timestamp. The name of column can be a string but, unlike its relational counterpart, can also be long integers, UUIDs or any kind of byte array.
Sets of columns are organized in rows that are referenced by a unique key, the row key, as demonstrated in the following figure. A row can have any number of columns that are relevant, there is no schema binding it to a predefined structure. Rows have a very important feature, that is that every operation under a single row key is atomic per replica, despite the number of columns affected. This is the only concurrency control mechanism provided by Cassandra.
The maximum level of complexity is achieved with the column families, which "glue" this whole system together, it is a structure that can keep an infinite (limited by physical storage space) number of rows, has a name and a map of keys to rows as shown here:
Cassandra also provides another dimension to columns, the SuperColumns, these are also tuples, but only have two elements, the name and the value. The value has the particularity of being a map of keys to columns (the key has to be the same as the column's name).
Multiple column families can coexist in an outer container called keyspace. The system allows for multiple keyspaces, but most of deployments have only one.
This is pretty much it. Now, it all depends on the way you use these constructs.
Be aware of one thing when using Cassandra, the values on the timestamps can be anything, but they must be consistent throughout the cluster, since this value is what allows Cassandra to define which updates are new and which are outdated (an update can be an insert, a delete or an actual update of a record).