LargeData: 2011

czwartek, 3 listopada 2011

Installing Java6 JDK on Ubuntu 11.10

sudo add-apt-repository ppa:ferramroberto/java
sudo apt-get update
sudo apt-get install sun-java6-jdk sun-java6-plugin

niedziela, 31 lipca 2011

Initialization of github repository

sudo apt-get install git-core git-gui git-doc
ssh -T git@github.com
git config --global user.name "John Doe"
git config --global user.email "doe.john@gmail.com"
git config --global github.user jodoe
git config --global github.token 12345yourtokenabcdef
git init
git add .
git commit -m 'Initial commit'
git remote add origin git@github.com:jodoe/jodoeproject.git
git push -u origin master

[solved] Cassandra TApplicationError type: 6

Today I have got strange error while using cassandra:

me.prettyprint.hector.api.exceptions.HCassandraInternalException:
Cassandra encountered an internal error processing this request:
TApplicationError type: 6 message:Internal error processing
batch_mutate

It was because I am using OrderPreservingPartitioner with IntegerType as key validator.
Trying to insert key over 127 gives above error.
This partitioner sholud be used with one of String types (e.g. utf8).

Sollution: ByteOrderedPartitioner.

środa, 6 lipca 2011

[solved] Missing artifact hector-core

Another trap...

While trying to add hector-core dependency through the m2eclipse I got this error:

Missing artifact me.prettyprint:hector-core:bundle:0.8.0-1:compile

Hint: this is not a bundle...

40 minutes is lost...

wtorek, 5 lipca 2011

Plotting ROC curves in ggplot2

Default ROC curves in R are disgusting. ggplot2 comes to the rescue.

First, let's write some data generating function that will be useful for ROC:

getExampleDataForROC <- function(len = 100, distort = 1) {
  half <- round(len/2)
  ### Group ###
  group <- vector()
  group[1:half] <- "Disease"
  group[(half + 1):len] <- "Control"
  group <- as.factor(group)
  ### Score ###
  score <- vector()
  score[1:half] <- 1
  score[(half + 1):len] <- -1
  for (i in 1:len)
    score[i] <- score[i] + rnorm(1, 0, distort)
  order <- order(score)
  data.frame(pos = score[order], group = group[order])
}

Let's write ROC plotting functions:

getPointsForROC <- function(len = 100, distort = 1) {
  basal <- getExampleDataForROC(len, distort)
  tp <- vector(); tn <-vector(); fp <-vector(); fn <- vector()
  tpr <- vector(); fpr <- vector()
  acc <- vector(); spc <- vector()
  len <- dim(basal)[1]
  for(i in 1:len-1) {
    tp[i] <- sum(basal[(i+1):len,2] == "Disease")
    tn[i] <- sum(basal[1:i,2] == "Control")
    fp[i] <- sum(basal[(i+1):len,2] == "Control")
    fn[i] <- sum(basal[1:i,2] == "Disease")
    tpr[i] <- tp[i] / (tp[i] + fn[i])
    fpr[i] <- fp[i] / (fp[i] + tn[i])
    acc[i] <- (tp[i] + tn[i]) / ((tp[i] + fn[i]) +  (fp[i] + tn[i]))
    spc[i] <- 1 - fpr[i]
  }
  points <- (cbind(fpr,tpr))[(len-1):1,]
  points <- rbind(points, c(1,1))
  return(points)
} 


plotROC <- function(len = 100) {
  points.part1 <- getPointsForROC(len, 1)
  desc.part1 <- rep("Distort = 1",len)
  points.part2 <- getPointsForROC(len, 1.5)
  desc.part2 <- rep("Distort = 1.5",len)
  points.part3 <- getPointsForROC(len, 2)
  desc.part3 <- rep("Distort = 2",len)
  points.part4 <- getPointsForROC(len, 2.5)
  desc.part4 <- rep("Distort = 2.5",len)

  points <- rbind(points.part1, points.part2, points.part3, points.part4)
  desc <- c(desc.part1, desc.part2, desc.part3, desc.part4)
  data <- data.frame(TPR = points[,2], FPR = points[,1], GeneSet = desc)
  colors = brewer.pal(5, "YlOrRd")[2:5]
  qplot(FPR, TPR, data = data, geom = "blank", main = "ROC curve", xlab = "False Positive Rate (1-Specificity)", ylab = "True Positive Rate (Sensitivity)" ) + geom_line(aes(x = FPR, y = TPR, data = data, colour = GeneSet), size = 2, alpha = 0.7) + scale_colour_manual(values=colors)
}

Finally, let's generate plot:

plotROC(1000)

poniedziałek, 4 lipca 2011

Plotting PCA results in ggplot2

Default PCA plots in R are disgusting. ggplot2 comes to the rescue.

First, let's write some data generating functions that will be useful for PCA:

getVector <- function(n = 20, dims = 2) {
  out <- vector()
  breaks <- round((n/dims)*(1:dims))
  start <- 1  
  for(i in 1:dims) {
    num <- rnorm(1, 10, 1)
    for(j in start:(breaks[i])) {
      out[j] <- num + rnorm(1, 0, 0.2)
    }
    start <- breaks[i] + 1
  } 
  out
}

getData <- function(rows = 10, cols = 20, groups = 2, dims = 2, distort = 0.5) {
  out <- matrix(data = 10, nrow = rows, ncol = cols)
  ### make groups ###
  breaks <- round((rows/groups)*(1:groups))
  start <- 1  
  for(i in 1:groups) {
      vector <- getVector(n = cols, dim = 2)
    for(j in start:(breaks[i])) {
      out[j,] <- vector
    }
    start <- breaks[i] + 1
  }
  ### make error ###
  for(i in 1:rows) {
    out[i,] <- out[i,] + rnorm(n = cols, mean = 0, sd = distort)
  }
  ### return ###
  rownames(out) <- paste("Row ", 1:rows, sep = "")
  colnames(out) <- paste("Col ", 1:cols, sep = "")
  out
}

getGroup <- function(rows = 10, groups = 2) {
  out <- vector()
  breaks <- round((rows/groups)*(1:groups))
  start <- 1  
  for(i in 1:groups) {
    for(j in start:(breaks[i])) {
      out[j] <- i
    }
    start <- breaks[i] + 1
  }
  out <- as.factor(out)
  out
}

Let's write PCA plotting function:

makePcaPlot <- function(x = getData(), group = getGroup(), title = "") {
  require(ggplot2)
  require(RColorBrewer)
  data <- x
  data <- t(apply(data, 1, scale))
  rownames(data) <- rownames(x)
  colnames(data) <- colnames(x)
  mydata <- t(data)  
  groupFactor <- group
  mydata.pca <- prcomp(mydata, retx=TRUE, center=TRUE, scale.=TRUE)  

  percent <- round((((mydata.pca$sdev)^2 / sum(mydata.pca$sdev^2))*100)[1:2])
  loadings <- mydata.pca$rotation
  rownames(loadings) <- colnames(mydata)
  scores <- mydata.pca$x
  cex = 3
  group <- as.character(groupFactor)
  what <- groupFactor == "1"; group[what] <- "One"
  what <- groupFactor == "2"; group[what] <- "Two"
  group <- factor(group)
  group <- factor(group, levels=c("Two", "One"), ordered=TRUE)
  colors = sample(1:11, size = length(levels(group)))


  PCA1 <- scores[,1]
  PCA2 <- scores[,2]
  base_size = 10

  qplot(PCA2, PCA1, geom="blank", main = title, xlab = paste("PCA2 (", percent[2], "%)", sep = ""), ylab = paste("PCA1 (", percent[1], "%)", sep = "")) + geom_point(aes(colour = group), size = 6, alpha = 3/4) + scale_colour_manual(values=brewer.pal(11, "RdYlGn")[11:1][colors]) + opts(
      axis.text.x = theme_text(size = base_size * 1.3 , lineheight = 0.9, colour = "grey50", hjust = 1, angle = 90),
      axis.text.y =       theme_text(size = base_size * 1.3, lineheight = 0.9, colour = "grey50", hjust = 1)
      )
}

Finally, let's generate plot:

makePcaPlot(getData(30,4,2,distort = 0.7), getGroup(30,2))

sobota, 2 lipca 2011

Spring Tool Suite (STS) Maven ‘Add Dependency’ does not work

There is small but annoying issue in Spring Tool Suite. After right-click on a project and selection of "add dependency" no search results are found.

1. Preferences -> Maven -> Download repository index on startup
2. Restart STS
3. Wait for index download

Or add to pom.xml


<repositories>

 <repository>

       <id>central</id>

       <name>Maven Repository Switchboard</name>

       <layout>default</layout>

       <url>http://repo1.maven.org/maven2</url>

       <snapshots>

         <enabled>false</enabled>

       </snapshots>

    </repository>

</repositories>

niedziela, 15 maja 2011

Understanding Cassandra model through the CLI

Cassandra has a built-in simple command-line client. Using this client we will learn a little about cassandra model capabilities. NoSQL model requires a separation from the habits acquired by work with the RDBMS. However, at the beginning of this tutorial we will use analogies of concepts known from the RDBMS.

Start CLI:

$CASSANDRA_HOME/bin/cassandra-cli --host localhost

Most of the tutorials available on the web describe the Twitter model. I do not intend to duplicate this pattern. Due to the nature of my work, I will suggest a little less intuitive, but quite interesting and yet simple scheme that allows to store various structures and characteristics of the human genome. Let's create a keyspace.

create keyspace Genomics;

We can describe keyspace by analogy as a Oracle tablespace or MySQL database. Moreover the below command shows among others 'Keyspace system', which reminds a scheme of MySQL:

show keyspaces;

Let's switch to our set up keyspace. And again, as in MySQL:

use Genomics;

For the moment we will use analogy with the SQL. So, let's create our 'table'. In the Cassandra table is called 'column family'. Each row of our column family will represent the position in the genome. Thus a whole column family will represent the whole genome with it's properties. Table name 'hg19' comes from the 'human genome assembly 19':

create column family hg19 with comparator = 'UTF8Type' and key_validation_class = UTF8Type;

First, let's pay attention on no description of columns when defining column family. This is an important point. Slowly we start to notice the properties which differ Cassandra from RDBMS. We will describe columns in a moment and then we will understand why we do not have to define them in our column family. We will also understand why it is not table known from the RDBMS. In fact cassandra column family is not a binary relation (a subset of Cartesian product).

Second, we used a comparator concept. Comparator in the definition of 'column family' is used to determine how to compare column names with each other. Why would we ever need that? We will get enlightened in a moment. However, before that we need to fill our column family with some data.

We will put different values describing the genome into our column family. The keys will be positions in the genome (the first four or five characters is the name of the chromosome: then the position on a chromosome). Let's put into the database description if the position belongs to exon or not:

set hg19['chr1:000000003']['isExon']=T;
set hg19['chr1:000000004']['isExon']=T;
set hg19['chr1:000000005']['isExon']=T;

What have we done? We inserted into the previously created column family called 'hg19' value 'T' in column 'isExon' for key 'chr1: 000000001'. And then similarly for other two keys.

From where did this column came off? We did not define it. Here, we can try compare row by analogy to map concept, more specifically to sorted map known eg from Java. Since this is a map, the number of keys in the map is virtually unlimited. So what is the column family? Simple. This is a map of maps...

Let's query for data:

get hg19['chr1:000000003'];

As a result we should get something like:

=> (column=isExon, value=54, timestamp=1309424263154000)
Returned 1 results.

What the hell? We inputed 'T' and we have got '54' as a value. It is because default data type for column is ByteType. Let's change it for UTF8Type:

update column family hg19 with 
column_metadata =  
[
{column_name: 'isExon', validation_class: UTF8Type},
];

Query:

get hg19['chr1:000000003'];

Let's check the value:

=> (column=isExon, value=T, timestamp=1309424263154000)
Returned 1 results.

It's OK now. Let's query for rows that are exons:

get hg19 where isExon = T;

We got an error:

No indexed columns present in index clause with operator EQ

The error occurs because we did not set up secondary index on column. Let's do it:

update column family hg19 with 
column_metadata =  
[
{column_name: 'isExon', validation_class: UTF8Type, index_type: KEYS},
];

Query again:

get hg19 where isExon = T;

We should get:

-------------------
RowKey: chr1:000000003
=> (column=isExon, value=T, timestamp=1309424263154000)
-------------------
RowKey: chr1:000000005
=> (column=isExon, value=T, timestamp=1309424263233000)
-------------------
RowKey: chr1:000000004
=> (column=isExon, value=T, timestamp=1309424263230000)

HINT: if no results occurs at this point it is because Cassandra caching. The easiest way to get results at this point is to just restart Cassandra. Setting up caching properties is preferred for more advanced users.

Let's add next column 'cons' that will represent conservation score for particular positions:

update column family hg19 with 
column_metadata =  
[
{column_name: 'isExon', validation_class: UTF8Type, index_type: KEYS},
{column_name: 'cons', validation_class: IntegerType, index_type: KEYS}
];

Let's add some values into 'cons' column:

set hg19['chr1:000000001']['cons']=0;
set hg19['chr1:000000002']['cons']=2;
set hg19['chr1:000000003']['cons']=3;
set hg19['chr1:000000004']['cons']=13;
set hg19['chr1:000000005']['cons']=4;

Query:

get hg19 where isExon = T and cons > 3;

We should get:


-------------------
RowKey: chr1:000000005
=> (column=cons, value=4, timestamp=1309427028936000)
=> (column=isExon, value=T, timestamp=1309424263233000)
-------------------
RowKey: chr1:000000004
=> (column=cons, value=13, timestamp=1309427026753000)
=> (column=isExon, value=T, timestamp=1309424263230000)

In the final script we will create column family 'hg18' of type super. What is super type? It allows for having super columns. WTF is super column? Easy, it is map of columns. Let's look at the example script:

drop column family hg18;
create column family hg18 with 
column_type = Super
and comparator = UTF8Type
and subcomparator = UTF8Type
and key_validation_class = UTF8Type
and column_metadata =[
  {column_name:isExon, validation_class:UTF8Type},
  {column_name:cons, validation_class:IntegerType},
  ]
;
set hg18['chr1:000000004'][geneFeatures][isExon] = T;
set hg18['chr1:000000004'][conservation][cons] = 13;

Query:

list hg18;

Result:

-------------------
RowKey: chr1:000000004
=> (super_column=conservation,
     (column=cons, value=13, timestamp=1309435176079000))
=> (super_column=geneFeautres,
     (column=isExon, value=T, timestamp=1309435176073000))

1 Row Returned.

In summary we created a map (Keyspace: 'Genomics') of maps (Column families: eg 'hg18') of maps (Super columns: eg 'geneFeatures') of maps (Columns: eg isExon) which is semi structured (Column_metadata).

That's it. The idea is very simple.

poniedziałek, 9 maja 2011

Installation of Hadoop and HBase in standalone mode

Manual installation of Hadoop is perfectly described on Michael Noll's blog:
Link

However, if we are going to install HBase a problem occurs. Particular HBase versions are incompatible with particular Hadoop versions. For example, Hadoop 0.21 doesn't work with HBase 0.90. The problem is described here:
Link

Therefore, it is much easier to use ready packages distributed by Cloudera. Personally, I enjoy the Ubuntu Linux. Thus, all commands come from this system. Installation process on different platforms is described on Cloudera's web page:
Link

First, add Cloudera repository to apt:

echo "deb http://archive.cloudera.com/debian maverick-cdh3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib" | sudo tee /etc/apt/sources.list.d/cloudera.list

Second, install curl:

sudo apt-get install curl

Add Cloudera key to trusted keys:

curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

Update:

sudo apt-get update

Install Hadoop:

sudo apt-get install hadoop-0.20

Finally, install HBase:

sudo apt-get install hadoop-hbase

Install HBase Master:

sudo apt-get install hadoop-hbase-master

Run:

hbase shell