LargeData: czerwca 2012

środa, 13 czerwca 2012

Getting linked exons BED from UCSC tables

Today I needed bigBed file for visualization of transcript positions in a browser which is bigBed and bigWig based. I have found a discussion where Katrina Learned from UCSC Genome Bioinformatics Group posted very usefull script. I am not going to reinvent the wheel.

nano genePredToBed

and paste this

#!/usr/bin/awk -f

#
# Convert genePred file to a bed file (on stdout)
#
BEGIN {
     FS="\t";
     OFS="\t";
}
{
     name=$1
     chrom=$2
     strand=$3
     start=$4
     end=$5
     cdsStart=$6
     cdsEnd=$7
     blkCnt=$8

     delete starts
     split($9, starts, ",");
     delete ends
     split($10, ends, ",");
     blkStarts=""
     blkSizes=""
     for (i = 1; i <= blkCnt; i++) {
         blkSizes = blkSizes (ends[i]-starts[i]) ",";
         blkStarts = blkStarts (starts[i]-start) ",";
     }

     print chrom, start, end, name, 1000, strand, cdsStart, cdsEnd, 0, 
blkCnt, blkSizes, blkStarts
}

We are almost ready to make our bed file:

chmod +x genePredToBed
wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/knownGene.txt.gz
gzip -d knownGene.txt.gz
cat knownGene.txt | ./genePredToBed > known.bed

Now we can use Jim Kent's bedToBigBed and we are done.

Getting GTF from UCSC with proper gene_id

While downloading GTF file (knownGenes or ensemblGenes) from UCSC Table browser an output has one serious issue. Transcript_id = gene_id. And in fact there is no gene_id. Below simple solution is presented for this problem (mm9 genome):

#prerequisities mysql
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf
chmod +x genePredToGtf
sudo ln -s ./genePredToGtf genePredToGtf
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from ensGene;" mm9 | cut -f2- | genePredToGtf file stdin ensGene.gtf