
Saturday, October 16, 2010

Cassandra, the Data Model

UPDATE: Sorry for the images being down for so long. I've finally had the time to re-upload them, and while I was at it, I rewrote the whole post.

For my master's thesis I'm going to be working with Cassandra, an open source distributed database management system, and therefore I'll probably write a lot about it throughout the next year. To get started, let's take a look at one of the biggest differences between this kind of DBMS and the classical relational systems: the data model.

Cassandra was created at Facebook, became an incubation project at Apache in January of 2009, and is based on Dynamo and BigTable. The system can be defined as an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database.

Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to its users as if it were running on only one. More than that, Cassandra is built and optimized to run on more than one machine, so much so that you cannot take full advantage of all of its features without doing so. In Cassandra, all nodes are identical: there is no such thing as a node responsible for certain organizing operations, as in BigTable or HBase. Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead.

Being decentralized means that there is no single point of failure, because all the servers are symmetrical. The main advantages of decentralization are that it is easier to operate than master/slave and that it helps to avoid interruptions in service, thus supporting high availability.

Scalability is the ability to suffer little degradation in performance when facing a growing number of requests. It can be of two types:

  • Vertical - Adding hardware capacity and/or memory
  • Horizontal - Adding more machines, each holding all or some of the data, so that every piece of it is replicated on at least two machines. The software must keep all the machines in sync.

Elastic scalability refers to the capability of a cluster to seamlessly accept new nodes, or remove existing ones, without any need to change the queries, rebalance data manually or restart the system.

Cassandra is highly available in the sense that a failed node can be replaced with no downtime, and data can be replicated across data centers to prevent that same downtime in case one of them experiences a catastrophe, such as an earthquake or flood.

Eric Brewer's CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency
  • Availability
  • Partition Tolerance
The next figure provides a visual explanation of the theorem, with a focus on the two guarantees given by Cassandra.

[Figure: the CAP theorem triangle, highlighting the two guarantees Cassandra chooses, Availability and Partition Tolerance]

Consistency essentially means that a read always returns the most recently written value, which is guaranteed to happen when the state of a write is consistent among all nodes that have that data (the updates have a global order). Most NoSQL implementations, including Cassandra, focus on availability and partition tolerance, relaxing the consistency guarantee and providing eventual consistency instead.

Eventual consistency is seen by many as impracticable for sensitive data, data that cannot be lost. The reality is not so black and white: the binary opposition between consistent and not-consistent is not truly reflected in practice. There are instead degrees of consistency, such as serializability and causal consistency. In the particular case of Cassandra, consistency can be considered tuneable in the sense that the number of replicas that will block on an update can be configured on a per-operation basis by setting the consistency level, combined with the replication factor.

Having said that, let's take a closer look at Cassandra's data model.

Usually, NoSQL implementations are key-value stores that have nearly no structure in their data model, apart from what can be perceived as an associative array. Cassandra, on the other hand, is a row-oriented database system with a rather complex data model. It is frequently referred to as column-oriented, and this is not wrong in the sense that it is not relational. But data in Cassandra is actually stored in rows indexed by a unique key, and each row does not need to have the same columns (in number or type) as the other rows in the same column family.

The basic building block of Cassandra is the column. It is nothing but a tuple with three elements: a name, a value and a timestamp. The column name is often a string but, unlike its relational counterpart, it can also be a long integer, a UUID or any kind of byte array.

[Figure: a column, a (name, value, timestamp) tuple]

Sets of columns are organized in rows that are referenced by a unique key, the row key, as demonstrated in the following figure. A row can have any number of columns; there is no schema binding it to a predefined structure. Rows have a very important property: every operation under a single row key is atomic per replica, no matter how many columns are affected. This is the only concurrency control mechanism provided by Cassandra.

[Figure: a row, a row key referencing a set of columns]

The next level of complexity is the column family, which "glues" this whole system together. It is a structure that can keep a virtually infinite (limited only by physical storage space) number of rows; it has a name and a map of keys to rows, as shown here:

[Figure: a column family, a name plus a map of row keys to rows]

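To make these constructs concrete, here is a minimal sketch of how they look from cassandra-cli, the command-line client that ships with Cassandra, assuming the sample Keyspace1 keyspace and Standard1 column family from the default configuration (exact names and output format may differ between versions):

$ bin/cassandra-cli --host localhost --port 9160

cassandra> set Keyspace1.Standard1['jsmith']['first'] = 'John'
cassandra> set Keyspace1.Standard1['jsmith']['last'] = 'Smith'
cassandra> get Keyspace1.Standard1['jsmith']
=> (column=last, value=Smith, timestamp=...)
=> (column=first, value=John, timestamp=...)

Here 'jsmith' is the row key, and first and last are two columns in that row; nothing prevents the next row of Standard1 from having a completely different set of columns.
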
Cassandra also provides another dimension to columns: the SuperColumn. These are also tuples, but with only two elements, the name and the value. The value has the particularity of being a map of keys to columns (each key has to be the same as the corresponding column's name).

[Figure: a super column, a name plus a map of column names to columns]

There is a variation of the ColumnFamily, the SuperColumnFamily. The only difference is that where a ColumnFamily holds a collection of name/value pairs, a SuperColumnFamily holds subcolumns (named groups of columns).
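
Continuing the sketch above, a super column family such as the sample Super1 adds one more level of nesting. The cli syntax of that era looked roughly like this (again, an illustration under the same assumptions, not a reference):

cassandra> set Keyspace1.Super1['jsmith']['address']['street'] = 'Main St'
cassandra> set Keyspace1.Super1['jsmith']['address']['city'] = 'Lisbon'
cassandra> get Keyspace1.Super1['jsmith']['address']

Here 'address' is the super column: a named group holding the street and city columns inside the jsmith row.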

Multiple column families can coexist in an outer container called a keyspace. The system allows for multiple keyspaces, but most deployments have only one.

This is pretty much it. Now, it all depends on the way you use these constructs.

Be aware of one thing when using Cassandra: the values of the timestamps can be anything, but they must be consistent throughout the cluster, since this value is what allows Cassandra to decide which updates are new and which are outdated (an update can be an insert, a delete or an actual update of a record).

Sunday, October 3, 2010

Bash 101: Variables and Conditions

First off, I would like to make a little note on the use of quotes. In the shell, words are separated by whitespace; if you want whitespace characters to belong to a variable's value, you'll have to quote them.

There are 3 types of quoting, double quotes, single quotes and the backslash, with the following results (demonstrated in the short script after this list):
  • Double - Accepts whitespace and expands other variables
  • Single - Accepts whitespace and doesn't expand other variables
  • Backslash - Escapes the next character (e.g. \$ for a literal $)
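
A minimal script showing the three behaviors:

#!/bin/bash

greeting="hello world"          # the quotes keep the space in the value
echo "double: $greeting"        # expands: double: hello world
echo 'single: $greeting'        # no expansion: single: $greeting
echo "backslash: \$greeting"    # escaped $: backslash: $greeting

exit 0
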
In my previous post I talked about normal variables; here I'll talk about the other two types of variables: environment and parameter.

Environment Variables


At the start of any shell script, some variables are initialized with values defined in the environment (you can change them with export or in the .bash_profile file). By convention these variables are all uppercase, as opposed to user-defined ones, which should be lowercase. Here's a list of the main ones with a brief description.
  • $HOME - Home directory of the current user
  • $PATH - The list of directories to search commands
  • $PS1 - The definition of the command prompt (e.g. \h:\W \u\$)
  • $PS2 - The secondary prompt, usually >
  • $IFS - Input Field Separator. List of characters used to separate words when reading input
  • $0 - The name of the shell script
  • $# - The number of parameters passed
  • $$ - The PID of the shell script (normally used to create temporary files)
If you want to check all your environment variables just type printenv in the shell.
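
A quick way to see a few of these is a script like the following (the last three are only meaningful inside a script):

#!/bin/bash

echo "HOME = $HOME"
echo "PATH = $PATH"
echo "script name = $0"
echo "number of parameters = $#"
echo "PID = $$"

exit 0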

IBM has a pretty good hands-on post on setting and unsetting environment variables, here, and this other post gives a more extensive overview of variables.

Parameter Variables


If your script is invoked with parameters, some more variables are defined (you can check whether there are any parameters by testing whether the $# variable has a value greater than 0). These are the parameter variables:
  • $1, $2, ... - The parameters given to the script in order
  • $* - A list of all parameters in a single variable, separated by the first character in IFS
  • $@ - A variation of $* that always uses a space to separate the parameters
I've written a small script that should make the difference between $* and $@ clear:

#!/bin/bash

export IFS=*                  # make * the field separator (default is space/tab/newline)
echo "IFS = $IFS"
echo "With IFS - $*"          # "$*" joins the parameters with the first character of IFS
echo "Without IFS - $@"       # "$@" keeps the parameters as separate, space-separated words

exit 0

Run it as ./script param1 param2 param3 ...

Note: The export command sets the variable for the script and all its subordinates.
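
A tiny illustration: a variable that is not exported is invisible to child processes, while an exported one is inherited:

#!/bin/bash

plain=parent_only            # not exported: the child shell below won't see it
export shared=for_everyone   # exported: inherited by child processes
bash -c 'echo "plain=[$plain] shared=[$shared]"'

exit 0

This prints plain=[] shared=[for_everyone].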

Conditions


One of the fundamental features of any programming language is the ability to test conditions. A shell script can test the exit code of any command it invokes, even of scripts written by you. That is why it is very important to include an exit command with a value (0 if everything is ok) at the end of all your scripts.
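
For example, the special variable $? holds the exit code of the last command, so you can inspect it directly:

#!/bin/bash

grep -q root /etc/passwd     # -q suppresses output; we only want the exit code
echo "grep exited with $?"   # 0 if a match was found, non-zero otherwise

exit 0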

The commands used to test conditions are two synonyms, test and [. Obviously, if you use [ it must have a matching ], and because it makes your code much easier to read, this is the most used construct.

In a shell script, a test looks something like this:

if [ -f file ]
then
    echo "File exists"
else
    echo "File does not exist"
fi

The exit code of either of these commands is what determines the truth of the statement (again, 0 for true and 1 for false). A little thing to remember is that [ is a command, and therefore you must put spaces between it and the condition, or else it won't work.
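
To see why, compare these two lines; the second fails because bash parses [-f as the name of a (nonexistent) command:

if [ -f /etc/hosts ]; then echo "found"; fi   # correct: spaces around [ and ]
if [-f /etc/hosts]; then echo "found"; fi     # error: command "[-f" not found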

There are 3 types of conditions that can be used with these commands:

String Comparison

  • string1 = string2 - True if strings are equal
  • string1 != string2 - True if strings are not equal
  • -n string - True if string is not null
  • -z string - True if string is null (empty)

Arithmetic Comparison

  • exp1 -eq exp2 - True if the expressions are equal
  • exp1 -ne exp2 - True if the expressions are not equal
  • exp1 -gt exp2 - True if exp1 is greater than exp2
  • exp1 -ge exp2 - True if exp1 is greater than or equal to exp2
  • exp1 -lt exp2 - True if exp1 is less than exp2
  • exp1 -le exp2 - True if exp1 is less than or equal to exp2
  • ! exp - True if the expression is false and vice versa

File Conditional

  • -d file - True if the file is a directory
  • -e file - True if the file exists (-f is usually used instead)
  • -f file - True if the file is a regular file
  • -g file - True if set-group-id is set on file
  • -u file - True if the set-user-id is set on file
  • -s file - True if the file has nonzero size
  • -r file - True if the file is readable
  • -w file - True if the file is writable
  • -x file - True if the file is executable
These are the more commonly used options; for a complete list, type help test in your bash.
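
Putting a few of these together, here is a small script combining file tests with an arithmetic comparison (it assumes /etc/passwd exists, as it does on most Unix systems):

#!/bin/bash

file=/etc/passwd

if [ -f "$file" ] && [ -r "$file" ]   # a regular, readable file?
then
    lines=$(wc -l < "$file")
    if [ "$lines" -gt 0 ]             # arithmetic comparison
    then
        echo "$file has $lines lines"
    fi
else
    echo "$file is missing or unreadable"
fi

exit 0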