Sunday, 15 May 2011

Understanding Cassandra model through the CLI

Cassandra ships with a simple built-in command-line client. Using this client we will learn a little about the capabilities of the Cassandra data model. The NoSQL model requires breaking with habits acquired from working with an RDBMS. At the beginning of this tutorial, however, we will lean on analogies with concepts known from the RDBMS world.

Start CLI:
$CASSANDRA_HOME/bin/cassandra-cli --host localhost

Most tutorials available on the web describe the Twitter model, and I do not intend to duplicate that pattern. Due to the nature of my work, I will suggest a slightly less intuitive, but quite interesting and still simple, schema that allows us to store various structures and characteristics of the human genome. Let's create a keyspace:
create keyspace Genomics;

By analogy, a keyspace can be described as an Oracle tablespace or a MySQL database. The command below also lists, among others, the 'system' keyspace, which is reminiscent of a schema in MySQL:
show keyspaces;

Let's switch to the keyspace we just created. Again, as in MySQL:
use Genomics;

For the moment we will stay with the SQL analogy, so let's create our 'table'. In Cassandra a table is called a 'column family'. Each row of our column family will represent a position in the genome, so the whole column family will represent the whole genome with its properties. The name 'hg19' comes from 'human genome assembly 19':
create column family hg19 with comparator = 'UTF8Type' and key_validation_class = UTF8Type;

First, notice that we did not describe any columns when defining the column family. This is an important point: we are slowly starting to see the properties that distinguish Cassandra from an RDBMS. We will describe columns in a moment, and then it will become clear why we do not have to define them up front, and also why a column family is not a table in the RDBMS sense. In fact, a Cassandra column family is not a relation (a subset of a Cartesian product).

Second, we used the concept of a comparator. The comparator in a column family definition determines how column names are compared with each other, and therefore how they are ordered. Why would we ever need that? It will become clear in a moment. Before that, however, we need to fill our column family with some data.

We will put different values describing the genome into our column family. The keys will be positions in the genome: the first four or five characters are the name of the chromosome, followed by a colon and then the position on that chromosome. Let's store in the database whether each position belongs to an exon:
set hg19['chr1:000000003']['isExon']=T;
set hg19['chr1:000000004']['isExon']=T;
set hg19['chr1:000000005']['isExon']=T;

What have we done? We inserted the value 'T' into the column 'isExon' for the key 'chr1:000000003' in the previously created column family 'hg19', and then did the same for the other two keys.

Where did this column come from? We did not define it. Here we can compare a row to a map, more specifically to a sorted map as known, e.g., from Java. Since a row is a map, the number of keys (columns) in it is virtually unlimited. So what is a column family? Simple: it is a map of maps.
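The 'map of maps' picture can be sketched in a few lines of Python. This is only an illustration of the data model, not of how Cassandra stores anything. The names `column_family`, `set_column` and `get_row` and the column 'anotherColumn' are made up for this sketch, and plain string ordering stands in for the UTF8Type comparator.

```python
# Toy model of a column family: row key -> {column name -> value}.
# All names here are hypothetical; this is not a Cassandra client API.
column_family = {}


def set_column(cf, row_key, column_name, value):
    """Mimic the CLI statement: set cf[row_key][column_name] = value;"""
    cf.setdefault(row_key, {})[column_name] = value


def get_row(cf, row_key):
    """Return the row's columns ordered by column name.

    Plain string order is used as a stand-in for the comparator.
    """
    return sorted(cf.get(row_key, {}).items())


# Columns are never declared; a row simply gains a new map entry.
set_column(column_family, "chr1:000000003", "isExon", "T")
set_column(column_family, "chr1:000000003", "anotherColumn", "x")

row = get_row(column_family, "chr1:000000003")
print(row)
```

Because each row is its own map, two rows in the same column family do not need to share any column names.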

Let's query for data:
get hg19['chr1:000000003'];

As a result we should get something like:
=> (column=isExon, value=54, timestamp=1309424263154000)
Returned 1 results.

What the hell? We entered 'T' and got '54' back. This is because the default validation class for column values is BytesType, so the CLI shows the raw bytes in hexadecimal, and 0x54 is the ASCII code of 'T'. Let's change it to UTF8Type:
update column family hg19 with 
column_metadata =  
[
{column_name: 'isExon', validation_class: UTF8Type},
];

Query:
get hg19['chr1:000000003'];

Let's check the value:
=> (column=isExon, value=T, timestamp=1309424263154000)
Returned 1 results.
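The '54' from the earlier output and the 'T' shown now are the same stored byte, displayed in two different ways. A minimal Python sketch of that relationship (it only illustrates the encoding, it does not talk to Cassandra):

```python
# The value 'T' as raw bytes, shown in hexadecimal.
raw = "T".encode("utf-8")
as_hex = raw.hex()
print(as_hex)

# Reading the same bytes back as UTF-8 text recovers the original value.
decoded = bytes.fromhex(as_hex).decode("utf-8")
print(decoded)
```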

It's OK now. Let's query for rows that are exons:
get hg19 where isExon = T;

We got an error:
No indexed columns present in index clause with operator EQ

The error occurs because we have not set up a secondary index on the column. Let's do that:
update column family hg19 with 
column_metadata =  
[
{column_name: 'isExon', validation_class: UTF8Type, index_type: KEYS},
];

Query again:
get hg19 where isExon = T;

We should get:
-------------------
RowKey: chr1:000000003
=> (column=isExon, value=T, timestamp=1309424263154000)
-------------------
RowKey: chr1:000000005
=> (column=isExon, value=T, timestamp=1309424263233000)
-------------------
RowKey: chr1:000000004
=> (column=isExon, value=T, timestamp=1309424263230000)

HINT: if no results appear at this point, it is because of Cassandra caching. The easiest way to get results is simply to restart Cassandra. Tuning the caching properties is an option for more advanced users.
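Conceptually, a KEYS index can be pictured as one more map, from a column value to the row keys that hold it. The Python sketch below shows that idea only; `build_index` and `is_exon_index` are hypothetical names, and the real index is maintained by Cassandra itself.

```python
from collections import defaultdict

# The three rows inserted earlier, as a plain map of maps.
rows = {
    "chr1:000000003": {"isExon": "T"},
    "chr1:000000004": {"isExon": "T"},
    "chr1:000000005": {"isExon": "T"},
}


def build_index(rows, column_name):
    """Map each value of `column_name` to the set of row keys holding it."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        if column_name in columns:
            index[columns[column_name]].add(row_key)
    return index


is_exon_index = build_index(rows, "isExon")

# Equivalent in spirit to: get hg19 where isExon = T;
matches = sorted(is_exon_index["T"])
print(matches)
```

Without such a map the only way to answer the query would be to scan every row, which is why the CLI refuses an equality query on a column that has no index.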

Let's add another column, 'cons', which will hold the conservation score for particular positions:
update column family hg19 with 
column_metadata =  
[
{column_name: 'isExon', validation_class: UTF8Type, index_type: KEYS},
{column_name: 'cons', validation_class: IntegerType, index_type: KEYS}
];

Let's add some values into 'cons' column:
set hg19['chr1:000000001']['cons']=0;
set hg19['chr1:000000002']['cons']=2;
set hg19['chr1:000000003']['cons']=3;
set hg19['chr1:000000004']['cons']=13;
set hg19['chr1:000000005']['cons']=4;

Query:
get hg19 where isExon = T and cons > 3;

We should get:

-------------------
RowKey: chr1:000000005
=> (column=cons, value=4, timestamp=1309427028936000)
=> (column=isExon, value=T, timestamp=1309424263233000)
-------------------
RowKey: chr1:000000004
=> (column=cons, value=13, timestamp=1309427026753000)
=> (column=isExon, value=T, timestamp=1309424263230000)
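One way to read this query is as two steps: an equality lookup on an indexed column selects the candidate rows, and the remaining condition filters them. The Python sketch below mimics that on the data inserted so far; the function `query` and its parameters are assumptions made for illustration, not Cassandra's actual execution plan.

```python
# Rows as they stand after the 'cons' inserts.
rows = {
    "chr1:000000001": {"cons": 0},
    "chr1:000000002": {"cons": 2},
    "chr1:000000003": {"isExon": "T", "cons": 3},
    "chr1:000000004": {"isExon": "T", "cons": 13},
    "chr1:000000005": {"isExon": "T", "cons": 4},
}


def query(rows, eq_column, eq_value, gt_column, gt_threshold):
    """Select rows by equality first, then filter by a greater-than test."""
    candidates = [key for key, cols in rows.items()
                  if cols.get(eq_column) == eq_value]
    return sorted(key for key in candidates
                  if gt_column in rows[key]
                  and rows[key][gt_column] > gt_threshold)


# Equivalent in spirit to: get hg19 where isExon = T and cons > 3;
result = query(rows, "isExon", "T", "cons", 3)
print(result)
```

Row 'chr1:000000003' drops out because its score is exactly 3, and rows 1 and 2 never become candidates because they have no 'isExon' column.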

In the final script we will create a column family 'hg18' of type Super. What is the Super type? It allows the column family to hold super columns. WTF is a super column? Easy: it is a map of columns. Let's look at the example script:

drop column family hg18;
create column family hg18 with 
column_type = Super
and comparator = UTF8Type
and subcomparator = UTF8Type
and key_validation_class = UTF8Type
and column_metadata =[
  {column_name:isExon, validation_class:UTF8Type},
  {column_name:cons, validation_class:IntegerType},
  ]
;
set hg18['chr1:000000004'][geneFeatures][isExon] = T;
set hg18['chr1:000000004'][conservation][cons] = 13;

Query:
list hg18;

Result:
-------------------
RowKey: chr1:000000004
=> (super_column=conservation,
     (column=cons, value=13, timestamp=1309435176079000))
=> (super_column=geneFeatures,
     (column=isExon, value=T, timestamp=1309435176073000))

1 Row Returned.

In summary, we created a map (the keyspace 'Genomics') of maps (column families, e.g. 'hg18') of maps (rows, keyed by genome position) of maps (super columns, e.g. 'geneFeatures') of columns (e.g. 'isExon'), which is semi-structured thanks to column_metadata.
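The same nesting can be written down as nested Python dictionaries. This is a sketch of the shape of the data only; `genomics` and `set_subcolumn` are hypothetical names chosen for the illustration.

```python
# keyspace -> column family -> row key -> super column -> column -> value
genomics = {"hg18": {}}


def set_subcolumn(cf, row_key, super_column, column, value):
    """Mimic the CLI statement: set cf[row_key][super_column][column] = value;"""
    cf.setdefault(row_key, {}).setdefault(super_column, {})[column] = value


hg18 = genomics["hg18"]
set_subcolumn(hg18, "chr1:000000004", "geneFeatures", "isExon", "T")
set_subcolumn(hg18, "chr1:000000004", "conservation", "cons", 13)

# Super column names sorted as strings, standing in for the comparator.
super_columns = sorted(hg18["chr1:000000004"])
print(super_columns)
print(hg18["chr1:000000004"]["conservation"]["cons"])
```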

That's it. The idea is very simple.

Monday, 9 May 2011

Installation of Hadoop and HBase in standalone mode

Manual installation of Hadoop is described very well on Michael Noll's blog:
Link

However, if we also want to install HBase, a problem arises: particular HBase versions are incompatible with particular Hadoop versions. For example, Hadoop 0.21 does not work with HBase 0.90. The problem is described here:
Link

Therefore, it is much easier to use the ready-made packages distributed by Cloudera. Personally, I use Ubuntu Linux, so all commands below come from that system. The installation process on other platforms is described on Cloudera's web page:
Link

First, add Cloudera repository to apt:
echo "deb http://archive.cloudera.com/debian maverick-cdh3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib" | sudo tee /etc/apt/sources.list.d/cloudera.list

Second, install curl:
sudo apt-get install curl

Add Cloudera key to trusted keys:
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

Update:
sudo apt-get update

Install Hadoop:
sudo apt-get install hadoop-0.20

Next, install HBase:
sudo apt-get install hadoop-hbase

Finally, install the HBase Master:
sudo apt-get install hadoop-hbase-master

Run:
hbase shell