
Showing posts with label Cassandra.

Tuesday, May 31, 2011

Bridging the gap between SQL and NoSQL: A state of the art

Here is a state-of-the-art report I wrote on SQL and NoSQL, and on a way to bring them closer together. This is actually the theme of my master's thesis, so you'll probably see more posts on this topic in the future.

See you. ;)

Artigo-MI-STAR

Monday, May 9, 2011

Running a Cassandra cluster with only one machine

I've noticed that if you want to run a Cassandra cluster on your own PC, for the purpose of small tests, there is no guide in the wiki on how to do just that.

Therefore, here is how I've done it.

First off, you'll need to create an alias for your network interface:

Mac OS
ifconfig en0 alias 192.168.0.2

Linux
ifconfig eth0:0 192.168.0.2

Here I've chosen the en0 (or eth0) interface, but you can choose whichever interface and IP address you like.

The first file you'll have to edit is the conf/cassandra.yaml:

- Change the commit_log, data and saved_caches directories, so they don't conflict with those of the previous "node";
- Change the rpc_port (used for Thrift or Avro) to one that is free;
- Change the listen_address to the IP of your "fake" interface.
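
For example, a second node's conf/cassandra.yaml might end up looking something like this. The paths and the alternative RPC port are illustrative (pick whatever is free on your machine); the listen_address matches the alias created above:

```yaml
# node2's conf/cassandra.yaml (illustrative excerpt)
data_file_directories:
    - /XXX/YYY/node2/data
commitlog_directory: /XXX/YYY/node2/commitlog
saved_caches_directory: /XXX/YYY/node2/saved_caches

rpc_port: 9161              # default is 9160; pick a free port
listen_address: 192.168.0.2 # the IP of the "fake" interface
```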

Next open the conf/cassandra-env.sh file and change the JMX_PORT.

The last file to edit is bin/cassandra.in.sh, where you'll need to change all occurrences of $CASSANDRA_HOME to the path of the "node". For example, if your bin directory is /XXX/YYY/node2/bin, the path is /XXX/YYY/node2.

You can repeat this to create as many nodes as you want, and then just run them as usual, with bin/cassandra -f.

Sunday, May 1, 2011

Inserting data with Thrift and Cassandra 0.7

A lot has changed from Cassandra 0.6 to 0.7, and sometimes it is hard to find examples of how things work. I'll be posting how-tos, written in Java, on some of the most common operations you might want to perform when using Cassandra.

First off, you have to establish a connection to the server:

TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
transport.open();

Here I'm using localhost and the default port for Cassandra, but this can be configured.

One difference from previous versions of Cassandra is that the connection can be bound to a keyspace, like so:

client.set_keyspace("Keyspace");

With the connection established, you now only need the data to insert. This data is passed to the server in the form of mutations (org.apache.cassandra.thrift.Mutation).

In this example, I'll be adding a column to a row in a column family in the predefined keyspace.


List<Mutation> insertion_list = new ArrayList<Mutation>();

Column col_to_add = new Column(
        ByteBuffer.wrap("name".getBytes("UTF8")),
        ByteBuffer.wrap("value".getBytes("UTF8")),
        System.currentTimeMillis());

Mutation mut = new Mutation();
mut.setColumn_or_supercolumn(new ColumnOrSuperColumn().setColumn(col_to_add));
insertion_list.add(mut);

Map<String,List<Mutation>> columnFamilyValues = new HashMap<String,List<Mutation>>();
columnFamilyValues.put("columnFamily",insertion_list);

Map<ByteBuffer,Map<String,List<Mutation>>> rowDefinition = new HashMap<ByteBuffer,Map<String,List<Mutation>>>();
rowDefinition.put(ByteBuffer.wrap(("key").getBytes("UTF8")),columnFamilyValues);

client.batch_mutate(rowDefinition,ConsistencyLevel.ONE);


The code is pretty much self-explanatory, apart from some values that can be reconfigured at will, such as the encoding of the strings (I've used UTF8) and the consistency level of the insertion (I've used ONE).

In the case of consistency levels, you should check out Cassandra's wiki to better understand their usage.

Closing the connection to the server is as easy as:

transport.close();

Hope you find this useful. Next I'll give an example of how to get data from the server, as soon as I have some time. ;)

Saturday, October 16, 2010

Cassandra, the Data Model

UPDATE: Sorry for the images being down for so long. I've finally had the time to re-upload them, and while I was at it, I rewrote the whole post.

For my master's thesis I'm going to be working with Cassandra, an open source distributed database management system, and therefore I'll probably write a lot about it throughout the next year. To get started, let's take a look at one of the biggest differences between this kind of DBMS and classical relational systems: the data model.

Cassandra, which was created at Facebook, started as an incubation project at Apache in January 2009 and is based on Dynamo and BigTable. It can be defined as an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database.

Cassandra is distributed, which means that it is capable of running on multiple machines while users see it as if it were running on only one. More than that, Cassandra is built and optimized to run on more than one machine, so much so that you cannot take full advantage of all of its features without doing so. In Cassandra, all nodes are identical; there is no node responsible for special organizing operations, as in BigTable or HBase. Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead.

Being decentralized means that there is no single point of failure, because all the servers are symmetrical. The main advantages of decentralization are that it is easier to use than master/slave and it helps to avoid suspension in service, thus supporting high availability.

Scalability is the ability to have little degradation in performance when facing a greater number of requests. It can be of two types:

  • Vertical - Adding hardware capacity and/or memory
  • Horizontal - Adding more machines holding all or some of the data, so that each piece of it is replicated on at least two machines. The software must keep all the machines in sync. 

Elastic scalability refers to the capability of a cluster to seamlessly accept new nodes or remove existing ones, without any need to change queries, rebalance data manually, or restart the system.

Cassandra is highly available in the sense that a failed node can be replaced with no downtime, and data can be replicated across data centers to prevent downtime in case one of them experiences a catastrophe, such as an earthquake or flood.

Eric Brewer's CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency
  • Availability
  • Partition Tolerance
The next figure provides a visual explanation of the theorem, with a focus on the two guarantees given by Cassandra.



Consistency essentially means that a read always returns the most recently written value, which is guaranteed to happen when the state of a write is consistent among all nodes that have that data (the updates have a global order). Most NoSQL implementations, including Cassandra, focus on availability and partition tolerance, relaxing the consistency guarantee and providing eventual consistency instead.

Eventual consistency is seen by many as impracticable for sensitive data, data that cannot be lost. The reality is not so black and white: the binary opposition between consistent and not-consistent is not truly reflected in practice; there are instead degrees of consistency, such as serializability and causal consistency. In the particular case of Cassandra, consistency can be considered tuneable in the sense that the number of replicas that will block on an update can be configured per operation, by setting the consistency level combined with the replication factor.
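
As a back-of-the-envelope illustration of how the consistency level and the replication factor interact (this is my own sketch of the well-known R + W > N rule, not anything from Cassandra's API):

```java
// Illustrative only: the R + W > N overlap rule for replicated reads/writes.
public class QuorumCheck {

    // A read contacting R replicas and a write acknowledged by W replicas
    // are guaranteed to overlap in at least one replica when R + W > N,
    // where N is the replication factor; the read then sees the latest write.
    static boolean stronglyConsistent(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;              // replication factor
        int quorum = n / 2 + 1; // QUORUM = 2 when N = 3
        System.out.println(stronglyConsistent(n, quorum, quorum)); // true
        System.out.println(stronglyConsistent(n, 1, 1));           // false (ONE/ONE)
    }
}
```

Reading and writing at QUORUM with a replication factor of 3 behaves consistently, while ONE/ONE trades that guarantee for speed.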


Having said that, let's take a closer look at Cassandra's data model.

Usually, NoSQL implementations are key-value stores with nearly no structure in their data model, apart from what can be perceived as an associative array. Cassandra, on the other hand, is a row-oriented database system with a rather complex data model. It is frequently referred to as column-oriented, and this is not wrong in the sense that it is not relational. But data in Cassandra is actually stored in rows indexed by a unique key, and each row need not have the same columns (in number or type) as the other rows in the same column family.

The basic building block of Cassandra is the column. It is nothing but a tuple with three elements: a name, a value and a timestamp. The name of a column can be a string but, unlike its relational counterpart, it can also be a long integer, a UUID or any kind of byte array.



Sets of columns are organized in rows that are referenced by a unique key, the row key, as demonstrated in the following figure. A row can have any number of columns; there is no schema binding it to a predefined structure. Rows have a very important property: every operation under a single row key is atomic per replica, no matter how many columns are affected. This is the only concurrency control mechanism provided by Cassandra.




The maximum level of complexity is achieved with column families, which "glue" this whole system together. A column family is a structure that has a name and can keep a virtually unlimited number of rows (bounded only by physical storage space), mapping keys to rows as shown here:
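
The nesting described above can be pictured with plain Java collections. This is purely an illustration of the shape of the data (maps of maps), not Cassandra's actual storage or API; the keys and values are made up, and the timestamp element of each column is left out for brevity:

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnFamilySketch {

    // A column family: row key -> (column name -> column value).
    static Map<String, Map<String, String>> columnFamily() {
        Map<String, String> row = new HashMap<>();   // one row's columns
        row.put("name", "value");

        Map<String, Map<String, String>> cf = new HashMap<>();
        cf.put("key", row);                          // rows indexed by row key
        return cf;
    }

    public static void main(String[] args) {
        System.out.println(columnFamily().get("key").get("name")); // prints "value"
    }
}
```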



Cassandra also provides another dimension to columns: SuperColumns. These are also tuples, but they only have two elements, a name and a value. The value has the particularity of being a map of keys to columns (each key has to be the same as the corresponding column's name).




There is a variation of ColumnFamilies: SuperColumnFamilies. The only difference is that where a ColumnFamily holds rows of plain columns (name/value pairs), a SuperColumnFamily holds rows of SuperColumns (named groups of columns).
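
The extra level can again be pictured with plain Java collections. As before, this only illustrates the shape of the data, with made-up keys and values, and is not Cassandra's API:

```java
import java.util.HashMap;
import java.util.Map;

public class SuperColumnSketch {

    // One row of a SuperColumnFamily:
    // SuperColumn name -> (column name -> column value).
    static Map<String, Map<String, String>> row() {
        Map<String, String> address = new HashMap<>(); // columns grouped under "address"
        address.put("city", "Braga");
        address.put("street", "Main St.");

        Map<String, Map<String, String>> row = new HashMap<>();
        row.put("address", address);
        return row;
    }

    public static void main(String[] args) {
        System.out.println(row().get("address").get("city")); // prints "Braga"
    }
}
```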

Multiple column families can coexist in an outer container called a keyspace. The system allows for multiple keyspaces, but most deployments have only one.

This is pretty much it. Now, it all depends on the way you use these constructs.

Be aware of one thing when using Cassandra: the timestamp values can be anything, but they must be consistent throughout the cluster, since this value is what allows Cassandra to decide which updates are new and which are outdated (an update can be an insert, a delete or an actual update of a record).
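
That reconciliation rule can be sketched in a few lines of plain Java (my own illustration of the last-write-wins idea, not Cassandra's internals):

```java
public class LastWriteWins {

    // A column value together with its timestamp.
    record Versioned(String value, long timestamp) {}

    // Per column, Cassandra keeps the update with the highest timestamp,
    // which is why timestamps must agree across the cluster.
    static Versioned reconcile(Versioned a, Versioned b) {
        return a.timestamp() >= b.timestamp() ? a : b;
    }

    public static void main(String[] args) {
        Versioned stale = new Versioned("old", 100L);
        Versioned fresh = new Versioned("new", 200L);
        System.out.println(reconcile(stale, fresh).value()); // prints "new"
    }
}
```

If a node's clock lags behind, its updates can silently lose to older writes with larger timestamps, which is exactly the failure mode the warning above is about.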