sudo add-apt-repository ppa:ferramroberto/java
sudo apt-get update
sudo apt-get install sun-java6-jdk sun-java6-plugin
czwartek, 3 listopada 2011
Installing Java6 JDK on Ubuntu 11.10
niedziela, 31 lipca 2011
Initialization of github repository
sudo apt-get install git-core git-gui git-doc
ssh -T git@github.com
git config --global user.name "John Doe"
git config --global user.email "doe.john@gmail.com"
git config --global github.user jodoe
git config --global github.token 12345yourtokenabcdef
git init
git add .
git commit -m 'Initial commit'
git remote add origin git@github.com:jodoe/jodoeproject.git
git push -u origin master
ssh -T git@github.com
git config --global user.name "John Doe"
git config --global user.email "doe.john@gmail.com"
git config --global github.user jodoe
git config --global github.token 12345yourtokenabcdef
git init
git add .
git commit -m 'Initial commit'
git remote add origin git@github.com:jodoe/jodoeproject.git
git push -u origin master
[solved] Cassandra TApplicationError type: 6
Today I have got strange error while using cassandra:
me.prettyprint.hector.api.exceptions.HCassandraInternalException:
Cassandra encountered an internal error processing this request:
TApplicationError type: 6 message:Internal error processing
batch_mutate
It was because I am using OrderPreservingPartitioner with IntegerType as key validator.
Trying to insert key over 127 gives above error.
This partitioner sholud be used with one of String types (e.g. utf8).
Sollution: ByteOrderedPartitioner.
me.prettyprint.hector.api.exceptions.HCassandraInternalException:
Cassandra encountered an internal error processing this request:
TApplicationError type: 6 message:Internal error processing
batch_mutate
It was because I am using OrderPreservingPartitioner with IntegerType as key validator.
Trying to insert key over 127 gives above error.
This partitioner sholud be used with one of String types (e.g. utf8).
Sollution: ByteOrderedPartitioner.
środa, 6 lipca 2011
[solved] Missing artifact hector-core
Another trap...
While trying to add hector-core dependency through the m2eclipse I got this error:
Hint: this is not a bundle...
40 minutes is lost...
While trying to add hector-core dependency through the m2eclipse I got this error:
Missing artifact me.prettyprint:hector-core:bundle:0.8.0-1:compile
Hint: this is not a bundle...
40 minutes is lost...
wtorek, 5 lipca 2011
Plotting ROC curves in ggplot2
Default ROC curves in R are disgusting. ggplot2 comes to the rescue.
First, let's write some data generating function that will be useful for ROC:
Let's write ROC plotting functions:
First, let's write some data generating function that will be useful for ROC:
getExampleDataForROC <- function(len = 100, distort = 1) {
half <- round(len/2)
### Group ###
group <- vector()
group[1:half] <- "Disease"
group[(half + 1):len] <- "Control"
group <- as.factor(group)
### Score ###
score <- vector()
score[1:half] <- 1
score[(half + 1):len] <- -1
for (i in 1:len)
score[i] <- score[i] + rnorm(1, 0, distort)
order <- order(score)
data.frame(pos = score[order], group = group[order])
}
Let's write ROC plotting functions:
getPointsForROC <- function(len = 100, distort = 1) {
basal <- getExampleDataForROC(len, distort)
tp <- vector(); tn <-vector(); fp <-vector(); fn <- vector()
tpr <- vector(); fpr <- vector()
acc <- vector(); spc <- vector()
len <- dim(basal)[1]
for(i in 1:len-1) {
tp[i] <- sum(basal[(i+1):len,2] == "Disease")
tn[i] <- sum(basal[1:i,2] == "Control")
fp[i] <- sum(basal[(i+1):len,2] == "Control")
fn[i] <- sum(basal[1:i,2] == "Disease")
tpr[i] <- tp[i] / (tp[i] + fn[i])
fpr[i] <- fp[i] / (fp[i] + tn[i])
acc[i] <- (tp[i] + tn[i]) / ((tp[i] + fn[i]) + (fp[i] + tn[i]))
spc[i] <- 1 - fpr[i]
}
points <- (cbind(fpr,tpr))[(len-1):1,]
points <- rbind(points, c(1,1))
return(points)
}
plotROC <- function(len = 100) {
points.part1 <- getPointsForROC(len, 1)
desc.part1 <- rep("Distort = 1",len)
points.part2 <- getPointsForROC(len, 1.5)
desc.part2 <- rep("Distort = 1.5",len)
points.part3 <- getPointsForROC(len, 2)
desc.part3 <- rep("Distort = 2",len)
points.part4 <- getPointsForROC(len, 2.5)
desc.part4 <- rep("Distort = 2.5",len)
points <- rbind(points.part1, points.part2, points.part3, points.part4)
desc <- c(desc.part1, desc.part2, desc.part3, desc.part4)
data <- data.frame(TPR = points[,2], FPR = points[,1], GeneSet = desc)
colors = brewer.pal(5, "YlOrRd")[2:5]
qplot(FPR, TPR, data = data, geom = "blank", main = "ROC curve", xlab = "False Positive Rate (1-Specificity)", ylab = "True Positive Rate (Sensitivity)" ) + geom_line(aes(x = FPR, y = TPR, data = data, colour = GeneSet), size = 2, alpha = 0.7) + scale_colour_manual(values=colors)
}
Finally, let's generate plot:plotROC(1000)
poniedziałek, 4 lipca 2011
Plotting PCA results in ggplot2
Default PCA plots in R are disgusting. ggplot2 comes to the rescue.
First, let's write some data generating functions that will be useful for PCA:
Let's write PCA plotting function:
First, let's write some data generating functions that will be useful for PCA:
getVector <- function(n = 20, dims = 2) {
out <- vector()
breaks <- round((n/dims)*(1:dims))
start <- 1
for(i in 1:dims) {
num <- rnorm(1, 10, 1)
for(j in start:(breaks[i])) {
out[j] <- num + rnorm(1, 0, 0.2)
}
start <- breaks[i] + 1
}
out
}
getData <- function(rows = 10, cols = 20, groups = 2, dims = 2, distort = 0.5) {
out <- matrix(data = 10, nrow = rows, ncol = cols)
### make groups ###
breaks <- round((rows/groups)*(1:groups))
start <- 1
for(i in 1:groups) {
vector <- getVector(n = cols, dim = 2)
for(j in start:(breaks[i])) {
out[j,] <- vector
}
start <- breaks[i] + 1
}
### make error ###
for(i in 1:rows) {
out[i,] <- out[i,] + rnorm(n = cols, mean = 0, sd = distort)
}
### return ###
rownames(out) <- paste("Row ", 1:rows, sep = "")
colnames(out) <- paste("Col ", 1:cols, sep = "")
out
}
getGroup <- function(rows = 10, groups = 2) {
out <- vector()
breaks <- round((rows/groups)*(1:groups))
start <- 1
for(i in 1:groups) {
for(j in start:(breaks[i])) {
out[j] <- i
}
start <- breaks[i] + 1
}
out <- as.factor(out)
out
}
Let's write PCA plotting function:
makePcaPlot <- function(x = getData(), group = getGroup(), title = "") {
require(ggplot2)
require(RColorBrewer)
data <- x
data <- t(apply(data, 1, scale))
rownames(data) <- rownames(x)
colnames(data) <- colnames(x)
mydata <- t(data)
groupFactor <- group
mydata.pca <- prcomp(mydata, retx=TRUE, center=TRUE, scale.=TRUE)
percent <- round((((mydata.pca$sdev)^2 / sum(mydata.pca$sdev^2))*100)[1:2])
loadings <- mydata.pca$rotation
rownames(loadings) <- colnames(mydata)
scores <- mydata.pca$x
cex = 3
group <- as.character(groupFactor)
what <- groupFactor == "1"; group[what] <- "One"
what <- groupFactor == "2"; group[what] <- "Two"
group <- factor(group)
group <- factor(group, levels=c("Two", "One"), ordered=TRUE)
colors = sample(1:11, size = length(levels(group)))
PCA1 <- scores[,1]
PCA2 <- scores[,2]
base_size = 10
qplot(PCA2, PCA1, geom="blank", main = title, xlab = paste("PCA2 (", percent[2], "%)", sep = ""), ylab = paste("PCA1 (", percent[1], "%)", sep = "")) + geom_point(aes(colour = group), size = 6, alpha = 3/4) + scale_colour_manual(values=brewer.pal(11, "RdYlGn")[11:1][colors]) + opts(
axis.text.x = theme_text(size = base_size * 1.3 , lineheight = 0.9, colour = "grey50", hjust = 1, angle = 90),
axis.text.y = theme_text(size = base_size * 1.3, lineheight = 0.9, colour = "grey50", hjust = 1)
)
}
Finally, let's generate plot:makePcaPlot(getData(30,4,2,distort = 0.7), getGroup(30,2))
sobota, 2 lipca 2011
Spring Tool Suite (STS) Maven ‘Add Dependency’ does not work
There is small but annoying issue in Spring Tool Suite. After right-click on a project and selection of "add dependency" no search results are found.
1. Preferences -> Maven -> Download repository index on startup
2. Restart STS
3. Wait for index download
Or add to pom.xml
1. Preferences -> Maven -> Download repository index on startup
2. Restart STS
3. Wait for index download
Or add to pom.xml
<repositories>
<repository>
<id>central</id>
<name>Maven Repository Switchboard</name>
<layout>default</layout>
<url>http://repo1.maven.org/maven2</url>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
</repositories>
niedziela, 15 maja 2011
Understanding Cassandra model through the CLI
Cassandra has a built-in simple command-line client. Using this client we will learn a little about cassandra model capabilities. NoSQL model requires a separation from the habits acquired by work with the RDBMS. However, at the beginning of this tutorial we will use analogies of concepts known from the RDBMS.
Start CLI:
Most of the tutorials available on the web describe the Twitter model. I do not intend to duplicate this pattern. Due to the nature of my work, I will suggest a little less intuitive, but quite interesting and yet simple scheme that allows to store various structures and characteristics of the human genome. Let's create a keyspace.
We can describe keyspace by analogy as a Oracle tablespace or MySQL database. Moreover the below command shows among others 'Keyspace system', which reminds a scheme of MySQL:
Let's switch to our set up keyspace. And again, as in MySQL:
For the moment we will use analogy with the SQL. So, let's create our 'table'. In the Cassandra table is called 'column family'. Each row of our column family will represent the position in the genome. Thus a whole column family will represent the whole genome with it's properties. Table name 'hg19' comes from the 'human genome assembly 19':
First, let's pay attention on no description of columns when defining column family. This is an important point. Slowly we start to notice the properties which differ Cassandra from RDBMS. We will describe columns in a moment and then we will understand why we do not have to define them in our column family. We will also understand why it is not table known from the RDBMS. In fact cassandra column family is not a binary relation (a subset of Cartesian product).
Second, we used a comparator concept. Comparator in the definition of 'column family' is used to determine how to compare column names with each other. Why would we ever need that? We will get enlightened in a moment. However, before that we need to fill our column family with some data.
We will put different values describing the genome into our column family. The keys will be positions in the genome (the first four or five characters is the name of the chromosome: then the position on a chromosome). Let's put into the database description if the position belongs to exon or not:
What have we done? We inserted into the previously created column family called 'hg19' value 'T' in column 'isExon' for key 'chr1: 000000001'. And then similarly for other two keys.
From where did this column came off? We did not define it. Here, we can try compare row by analogy to map concept, more specifically to sorted map known eg from Java. Since this is a map, the number of keys in the map is virtually unlimited. So what is the column family? Simple. This is a map of maps...
Let's query for data:
As a result we should get something like:
What the hell? We inputed 'T' and we have got '54' as a value. It is because default data type for column is ByteType. Let's change it for UTF8Type:
Query:
Let's check the value:
It's OK now. Let's query for rows that are exons:
We got an error:
The error occurs because we did not set up secondary index on column. Let's do it:
Query again:
We should get:
HINT: if no results occurs at this point it is because Cassandra caching. The easiest way to get results at this point is to just restart Cassandra. Setting up caching properties is preferred for more advanced users.
Let's add next column 'cons' that will represent conservation score for particular positions:
Let's add some values into 'cons' column:
Query:
We should get:
In the final script we will create column family 'hg18' of type super. What is super type? It allows for having super columns. WTF is super column? Easy, it is map of columns. Let's look at the example script:
Query:
Result:
In summary we created a map (Keyspace: 'Genomics') of maps (Column families: eg 'hg18') of maps (Super columns: eg 'geneFeatures') of maps (Columns: eg isExon) which is semi structured (Column_metadata).
That's it. The idea is very simple.
Start CLI:
$CASSANDRA_HOME/bin/cassandra-cli --host localhost
Most of the tutorials available on the web describe the Twitter model. I do not intend to duplicate this pattern. Due to the nature of my work, I will suggest a little less intuitive, but quite interesting and yet simple scheme that allows to store various structures and characteristics of the human genome. Let's create a keyspace.
create keyspace Genomics;
We can describe keyspace by analogy as a Oracle tablespace or MySQL database. Moreover the below command shows among others 'Keyspace system', which reminds a scheme of MySQL:
show keyspaces;
Let's switch to our set up keyspace. And again, as in MySQL:
use Genomics;
For the moment we will use analogy with the SQL. So, let's create our 'table'. In the Cassandra table is called 'column family'. Each row of our column family will represent the position in the genome. Thus a whole column family will represent the whole genome with it's properties. Table name 'hg19' comes from the 'human genome assembly 19':
create column family hg19 with comparator = 'UTF8Type' and key_validation_class = UTF8Type;
First, let's pay attention on no description of columns when defining column family. This is an important point. Slowly we start to notice the properties which differ Cassandra from RDBMS. We will describe columns in a moment and then we will understand why we do not have to define them in our column family. We will also understand why it is not table known from the RDBMS. In fact cassandra column family is not a binary relation (a subset of Cartesian product).
Second, we used a comparator concept. Comparator in the definition of 'column family' is used to determine how to compare column names with each other. Why would we ever need that? We will get enlightened in a moment. However, before that we need to fill our column family with some data.
We will put different values describing the genome into our column family. The keys will be positions in the genome (the first four or five characters is the name of the chromosome: then the position on a chromosome). Let's put into the database description if the position belongs to exon or not:
set hg19['chr1:000000003']['isExon']=T;
set hg19['chr1:000000004']['isExon']=T;
set hg19['chr1:000000005']['isExon']=T;
What have we done? We inserted into the previously created column family called 'hg19' value 'T' in column 'isExon' for key 'chr1: 000000001'. And then similarly for other two keys.
From where did this column came off? We did not define it. Here, we can try compare row by analogy to map concept, more specifically to sorted map known eg from Java. Since this is a map, the number of keys in the map is virtually unlimited. So what is the column family? Simple. This is a map of maps...
Let's query for data:
get hg19['chr1:000000003'];
As a result we should get something like:
=> (column=isExon, value=54, timestamp=1309424263154000)
Returned 1 results.
What the hell? We inputed 'T' and we have got '54' as a value. It is because default data type for column is ByteType. Let's change it for UTF8Type:
update column family hg19 with
column_metadata =
[
{column_name: 'isExon', validation_class: UTF8Type},
];
Query:
get hg19['chr1:000000003'];
Let's check the value:
=> (column=isExon, value=T, timestamp=1309424263154000)
Returned 1 results.
It's OK now. Let's query for rows that are exons:
get hg19 where isExon = T;
We got an error:
No indexed columns present in index clause with operator EQ
The error occurs because we did not set up secondary index on column. Let's do it:
update column family hg19 with
column_metadata =
[
{column_name: 'isExon', validation_class: UTF8Type, index_type: KEYS},
];
Query again:
get hg19 where isExon = T;
We should get:
-------------------
RowKey: chr1:000000003
=> (column=isExon, value=T, timestamp=1309424263154000)
-------------------
RowKey: chr1:000000005
=> (column=isExon, value=T, timestamp=1309424263233000)
-------------------
RowKey: chr1:000000004
=> (column=isExon, value=T, timestamp=1309424263230000)
HINT: if no results occurs at this point it is because Cassandra caching. The easiest way to get results at this point is to just restart Cassandra. Setting up caching properties is preferred for more advanced users.
Let's add next column 'cons' that will represent conservation score for particular positions:
update column family hg19 with
column_metadata =
[
{column_name: 'isExon', validation_class: UTF8Type, index_type: KEYS},
{column_name: 'cons', validation_class: IntegerType, index_type: KEYS}
];
Let's add some values into 'cons' column:
set hg19['chr1:000000001']['cons']=0;
set hg19['chr1:000000002']['cons']=2;
set hg19['chr1:000000003']['cons']=3;
set hg19['chr1:000000004']['cons']=13;
set hg19['chr1:000000005']['cons']=4;
Query:
get hg19 where isExon = T and cons > 3;
We should get:
-------------------
RowKey: chr1:000000005
=> (column=cons, value=4, timestamp=1309427028936000)
=> (column=isExon, value=T, timestamp=1309424263233000)
-------------------
RowKey: chr1:000000004
=> (column=cons, value=13, timestamp=1309427026753000)
=> (column=isExon, value=T, timestamp=1309424263230000)
In the final script we will create column family 'hg18' of type super. What is super type? It allows for having super columns. WTF is super column? Easy, it is map of columns. Let's look at the example script:
drop column family hg18;
create column family hg18 with
column_type = Super
and comparator = UTF8Type
and subcomparator = UTF8Type
and key_validation_class = UTF8Type
and column_metadata =[
{column_name:isExon, validation_class:UTF8Type},
{column_name:cons, validation_class:IntegerType},
]
;
set hg18['chr1:000000004'][geneFeatures][isExon] = T;
set hg18['chr1:000000004'][conservation][cons] = 13;
Query:
list hg18;
Result:
-------------------
RowKey: chr1:000000004
=> (super_column=conservation,
(column=cons, value=13, timestamp=1309435176079000))
=> (super_column=geneFeautres,
(column=isExon, value=T, timestamp=1309435176073000))
1 Row Returned.
In summary we created a map (Keyspace: 'Genomics') of maps (Column families: eg 'hg18') of maps (Super columns: eg 'geneFeatures') of maps (Columns: eg isExon) which is semi structured (Column_metadata).
That's it. The idea is very simple.
poniedziałek, 9 maja 2011
Installation of Hadoop and HBase in standalone mode
Manual installation of Hadoop is perfectly described on Michael Noll's blog:
Link
However, if we are going to install HBase a problem occurs. Particular HBase versions are incompatible with particular Hadoop versions. For example, Hadoop 0.21 doesn't work with HBase 0.90. The problem is described here:
Link
Therefore, it is much easier to use ready packages distributed by Cloudera. Personally, I enjoy the Ubuntu Linux. Thus, all commands come from this system. Installation process on different platforms is described on Cloudera's web page:
Link
First, add Cloudera repository to apt:
Second, install curl:
Add Cloudera key to trusted keys:
Update:
Install Hadoop:
Finally, install HBase:
Install HBase Master:
Run:
Link
However, if we are going to install HBase a problem occurs. Particular HBase versions are incompatible with particular Hadoop versions. For example, Hadoop 0.21 doesn't work with HBase 0.90. The problem is described here:
Link
Therefore, it is much easier to use ready packages distributed by Cloudera. Personally, I enjoy the Ubuntu Linux. Thus, all commands come from this system. Installation process on different platforms is described on Cloudera's web page:
Link
First, add Cloudera repository to apt:
echo "deb http://archive.cloudera.com/debian maverick-cdh3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3 contrib" | sudo tee /etc/apt/sources.list.d/cloudera.list
Second, install curl:
sudo apt-get install curl
Add Cloudera key to trusted keys:
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
Update:
sudo apt-get update
Install Hadoop:
sudo apt-get install hadoop-0.20
Finally, install HBase:
sudo apt-get install hadoop-hbase
Install HBase Master:
sudo apt-get install hadoop-hbase-master
Run:
hbase shell
Subskrybuj:
Posty (Atom)