Blasted Bioinformatics!?: 2011

2011-12-13

Validating ID via Gravatar

Most people will have seen a Gravatar user icon online; the name is short for the rather grand sounding "Globally Recognized Avatar". For example GitHub.com and StackOverflow use them, and many blog platforms use them for user comments (sadly Blogger doesn't, yet). To get a user's icon, you construct a URL with the MD5 checksum of their email address - and if the user isn't registered you get a default image or a unique generated abstract icon. This means you can cross-reference a list of email addresses with a list of Gravatar icon URLs (i.e. a list of email MD5 checksums).
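As a rough illustration of that cross-referencing, here is a minimal Python sketch. It assumes Gravatar's documented convention of hashing the trimmed, lower-cased email address; the function names, the URL parsing and the example.com addresses are all made up for this example.

```python
import hashlib

GRAVATAR_BASE = "https://www.gravatar.com/avatar/"


def gravatar_hash(email):
    """MD5 hex digest of the trimmed, lower-cased email address."""
    cleaned = email.strip().lower()
    return hashlib.md5(cleaned.encode("utf-8")).hexdigest()


def gravatar_url(email, default="identicon"):
    """Build an icon URL; 'd' picks the fallback for unregistered users."""
    return f"{GRAVATAR_BASE}{gravatar_hash(email)}?d={default}"


def match_emails(emails, icon_urls):
    """Return {email: url} for each email whose hash appears in a URL."""
    by_hash = {gravatar_hash(e): e for e in emails}
    matches = {}
    for url in icon_urls:
        # Assume the hash is the last path component, before any query
        # string or image extension such as ".jpg".
        path = url.split("?", 1)[0].rstrip("/")
        digest = path.rsplit("/", 1)[-1].split(".", 1)[0].lower()
        if digest in by_hash:
            matches[by_hash[digest]] = url
    return matches


if __name__ == "__main__":
    known = ["alice@example.com", "bob@example.com"]
    seen = [gravatar_url(" Alice@Example.com ")]
    print(match_emails(known, seen))
```

Because the hash depends only on the normalised address, differences in case or surrounding whitespace do not prevent a match.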

Is IonTorrent open or not?

It seems IonTorrent are trying to present themselves as the open, democratising platform for high throughput sequencing, with their Ion Community, sample datasets and (in theory) open source software. That sounds great and much more open than Roche 454 or Illumina, but I don't think they're doing a very good job of it - apparently they're even managing to break the GPL (see below).

Update 14 Dec: See comments, Ion Torrent Suite v2 should be coming to GitHub in January under the GPL v2 - that counts as open in my book :)

Update 23 Jan: As planned, the Ion Torrent Suite is on GitHub under the GPL v2. Nice!

Random access to BZIP2?

In my last post I looked at how the GZIP variant BGZF (Blocked GNU Zip Format, used in BAM files) allowed efficient random access to large compressed files. This time I'm looking at bzip2 (bz2) which offers better compression than GZIP, but is also block based so in theory the same random access strategy can be employed.

BGZF - Blocked, Bigger & Better GZIP!

BAM files are compressed using a variant of GZIP (GNU ZIP), called BGZF (Blocked GNU Zip Format). Anyone who has read the SAM/BAM Specification will have seen the terms BGZF and virtual offsets, but what you may not realise is how general purpose this is for random access to any large compressed file. The take home message is:

BGZF files are bigger than GZIP files, but they are much faster for random access.
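To make the virtual offsets concrete: as described in the SAM/BAM Specification, a BGZF virtual offset packs the file offset of the start of a compressed block (upper 48 bits) together with the offset within that block's decompressed data (lower 16 bits, since a block holds at most 64 KiB uncompressed). The helper names below are made up for this sketch.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a block start and within-block offset into one 64-bit integer."""
    if not 0 <= within_block < 2**16:
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_start < 2**48:
        raise ValueError("block start offset must fit in 48 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Unpack a virtual offset into (block_start, within_block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


if __name__ == "__main__":
    voffset = make_virtual_offset(100000, 10)
    print(voffset)                        # 6553600010
    print(split_virtual_offset(voffset))  # (100000, 10)
```

To jump to a record you seek to the block start in the compressed file, decompress just that one block, and then skip the within-block offset into the decompressed data.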

FASTQ must die! Long live SAM/BAM!

I think it is time to retire the FASTQ file format in favour of storing unaligned reads in SAM/BAM format. I will try to explain, as this may not immediately strike everyone as logical, given SAM/BAM is primarily a sequence alignment/mapping format, while for "raw" reads FASTQ is near ubiquitous in Next Generation Sequencing (NGS), more sensibly known as High Throughput Sequencing (HTS).
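For a flavour of what an unaligned read looks like in SAM, here is a minimal Python sketch turning one FASTQ record into a SAM line. It assumes Sanger style FASTQ (PHRED+33), which is the same quality encoding SAM uses, and a single unpaired read; the function name and the example read are made up.

```python
def fastq_to_unaligned_sam(name, seq, qual):
    """Format one unpaired FASTQ record as an unaligned SAM line.

    'name' is the read identifier without the leading '@'.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    fields = [
        name,  # QNAME
        "4",   # FLAG: 0x4 means the read is unmapped
        "*",   # RNAME: no reference sequence
        "0",   # POS: no position
        "0",   # MAPQ
        "*",   # CIGAR: not available
        "*",   # RNEXT
        "0",   # PNEXT
        "0",   # TLEN
        seq,   # SEQ
        qual,  # QUAL, PHRED+33 as in Sanger FASTQ
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    print(fastq_to_unaligned_sam("read1", "ACGT", "IIII"))
```

Paired reads would need further FLAG bits, and any extra per-read annotation could go in optional tags rather than being squeezed into the read name.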

SAM/BAM without gapped reference

In my last post I talked about SAM/BAM with a gapped reference, and how this makes it much easier to work with inserted bases relative to the reference/consensus - especially for visualisation.

I should point out that some viewers do actually manage to show the inserts as columns even with the traditional ungapped/unpadded reference sequence - notably Gap5, Bambino, and the text based samtools tview, as shown in these tview screenshots. Press the "i" key to toggle this insert display, or "?" for help.

SAM/BAM with gapped reference

A lot of my time this week has gone into thinking and "talking" on the samtools-devel mailing list about the SAM/BAM file format and how it might be improved for (de novo) assemblies.

Why are NCBI GFF3 files still broken?

For the early part of my career in Bioinformatics I was able to avoid GFF3 files - initially I focused on finished annotated genomes from the NCBI in plain text GenBank format (which has complications of its own), but with genome sequencing becoming widespread, so too is genome assembly and annotation. And for this, you will have to learn about GFF3 files.

Opening up NCBI BLAST?

The BLAST chapter of the Biopython Tutorial (PDF) starts with these lines by Brad Chapman:

Hey, everybody loves BLAST right? I mean, geez, how can it get any easier to do comparisons between one of your sequences and every other sequence in the known world?

I know what he meant - but it turns out things could be easier, especially once you start running "standalone BLAST" on your own machines, rather than using the NCBI's ever improving BLAST website. Part of the problem is setting up BLAST and its databases can be complicated (especially on a cluster), but also inevitably, BLAST has bugs.

This isn't a slight on the NCBI; any non-trivial software product will have bugs. I'm more concerned with how they are dealt with.

Blasted Bioinformatics!?

2011-12-13

Validating ID via Gravatar

2011-12-12

Is IonTorrent open or not?

2011-11-22

Random access to BZIP2?

2011-11-08

BGZF - Blocked, Bigger & Better GZIP!

2011-10-21

FASTQ must die! Long live SAM/BAM!

2011-10-03

SAM/BAM without gapped reference

2011-09-22

SAM/BAM with gapped reference

2011-08-15

Why are NCBI GFF3 files still broken?

2011-08-11

Opening up NCBI BLAST?