Summarizing microarray normalized probe expression to gene expression

The usual and fundamental result of a microarray data analysis is a list of differentially expressed genes, always according to the statistical design and the biological questions asked. However, what we see most of the time in such a list is a set of differentially expressed probes, not a list of differentially expressed genes (named according to the official HUGO nomenclature, for example). Although this final list also contains gene names next to the probes, I have often been asked to provide a list of genes and discard the probes, as they tend to confuse bench biologists. The latter ask questions such as "which probe should I believe?" or "why is the expression of one probe so different from that of another probe corresponding to the same gene?"

The answer is not always easy,

t-test with F-test for equality of variance

The t-test between two samples/biological conditions is still one of the most common statistical tests used for the detection of differentially expressed genes in microarray studies, even though many specialized tests have been developed during the past 10 years. However, one step that many people skip, either because several microarray data analysis tutorials do not cover it or because it rarely affects the results, is the F-test for equality of the variances of the samples to be compared, performed prior to the t-test. It is still the correct thing to do, because the formula for the t-statistic changes if the variances are not assumed equal (Welch's t-test). Thus,
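To make the difference between the two formulas concrete, here is a minimal Python sketch that uses only the standard library. It computes the variance-ratio F statistic, the pooled (equal-variance) t-statistic and Welch's t-statistic with its Welch-Satterthwaite degrees of freedom. The function names are my own, and p-values are left out on purpose because they need the F and t distribution functions of a statistics package (for example, `scipy.stats.ttest_ind` has an `equal_var` argument that switches between the two tests).

```python
from math import sqrt
from statistics import mean, variance


def variance_ratio(x, y):
    """Return (F, df_numerator, df_denominator), larger variance on top."""
    vx, vy = variance(x), variance(y)
    if vx >= vy:
        return vx / vy, len(x) - 1, len(y) - 1
    return vy / vx, len(y) - 1, len(x) - 1


def student_t(x, y):
    """Classic t-statistic with a pooled variance (equal variances assumed)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))
    return t, df


def welch_t(x, y):
    """Welch's t-statistic (variances not assumed equal)."""
    nx, ny = len(x), len(y)
    sx, sy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(sx + sy)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df


# Toy expression values for one gene in two conditions (made up for illustration)
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [2.0, 4.0, 6.0, 8.0, 10.0]
f_stat, df1, df2 = variance_ratio(control, treated)
t_pooled, df_pooled = student_t(control, treated)
t_welch, df_welch = welch_t(control, treated)
```

With equal group sizes the two t-statistics coincide and only the degrees of freedom differ; with unequal group sizes the statistics themselves differ as well.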

Converting older SCARF format to FASTQ

Recently, a colleague asked me whether I recognized the following raw data format, which comes from a fairly old dataset produced by the first next-generation sequencers and the relatively old software used for base calling:


203K0:1:1:626:335:ATTCCATTCCATTCCATTCCATTCCATTCCAT:[[[[[[[[[[[[[[[[[[[[UUUUUUUUOUUU
203K0:1:1:119:614:TAAAAACTAGATAGAAGCAATGTCAGAACTTT:[[[[[[[[[[[[[[W[[[[[UUUUUUUUUUUU
203K0:1:1:114:772:TCCTAGCTAGTTCCCTGCAGCTTTTTATTAAC:[[[[[[[[[[[[[[[[[[WWUUUUUUUCIUUU
203K0:1:1:490:490:GTTGGTGCTTAAAAGTCTTGGATTTTGAAACA:[[[[[[[[[[[[[[W[[[[[UUUUUUOOIUUU
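As a rough illustration of the conversion, the Python sketch below turns one such record into a four-line FASTQ record. It assumes the layout visible in the lines above: five colon-separated identifier fields, then the sequence, then the quality string. The function name and the `to_phred33` switch are mine. The optional shift from an ASCII offset of 64 to the Sanger offset of 33 is also an assumption about how the qualities were encoded, and it does not rescale Solexa-type scores to Phred scores.

```python
def scarf_to_fastq(line, to_phred33=False):
    """Convert one SCARF line to a four-line FASTQ record (as a string)."""
    # Split on the first six colons only, so the quality field stays intact
    fields = line.rstrip("\n").split(":", 6)
    if len(fields) != 7:
        raise ValueError(f"not a SCARF record: {line!r}")
    *name_parts, seq, qual = fields
    if len(seq) != len(qual):
        raise ValueError(f"sequence/quality length mismatch: {line!r}")
    if to_phred33:
        # Assumed offset-64 input: shift every character down by 64 - 33 = 31
        qual = "".join(chr(ord(c) - 31) for c in qual)
    return "@{}\n{}\n+\n{}\n".format(":".join(name_parts), seq, qual)


# Short made-up record in the same layout as the reads above
record = scarf_to_fastq("M1:1:2:3:4:ACGT:[[UU")
shifted = scarf_to_fastq("M1:1:2:3:4:ACGT:[[UU", to_phred33=True)
```

Applying the function to every line of the input file and writing the results out gives a FASTQ file that current tools can read.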

Is Bioinformatics really hard?

I would like to share some thoughts that came to my mind today, after a specific event related to bioinformatics training. First of all, don't be confused by the title of the post. It might sound like a selfish, elitist or even racist comment! No, it has nothing to do with selfishness... Let me explain.

During my MSc in Bioinformatics, I met four kinds of people.

  1. Biologists or other people coming from life sciences and bench work who wanted either to switch to bioinformatics or to get basic training, but failed to do so because of their fear of sitting down and facing the evil "black screen" of a Unix command line, let alone other hierarchically lower demons such as basic statistics or "for" loops, or hierarchically higher demons such as R, algorithmics, basic Perl etc.
  2. Computer scientists and/or mathematicians (like myself) who wanted to apply their background knowledge at a more "practical" yet still scientific level than predicting algorithm complexities, solving partial differential equations or wandering inside Banach spaces. However, the "application" turned out to be quite difficult, as the types of RNA, the thousands of genes with strange names and those blurry gel images seemed noisier than an elegant solution of a differential equation or one more O(n log n) algorithm optimization.
  3. People of the first kind that made it.
  4. People of the second kind that made it.

Creating stranded signal tracks for the UCSC Genome Browser

Recently, one of my colleagues asked me whether there was a way to display stranded wiggle signal files. A couple of years ago I would have said that it was possible but quite messy, as the only way to display stranded wiggle signal files was to split the original genomic co-ordinate file (BED, SAM) per strand and then create two separate wiggle (or bigWig) tracks. This was necessary because the wiggle (and later bedGraph) specification does not allow for overlapping signals. Now it is much clearer (though slightly more complicated), using the ability of the UCSC Genome Browser to overlay tracks by creating a super/parent track and assigning child tracks to it. Unfortunately, this is only possible with track hubs.
In this post, I will show you how to create stranded wiggle (or in this example, bigWig) files by setting up a track hub that will host your signal files and which you can upload to the genome browser.
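To give an idea of what the parent/child arrangement looks like in a hub's trackDb file, here is a small Python sketch that writes the stanzas for one overlaid pair of strand-specific bigWig files. The settings it emits (`container multiWig`, `aggregate transparentOverlay`, `parent`, `bigDataUrl` and so on) are standard trackDb settings, but the function name, track names, colours and URLs are placeholders of mine and not necessarily the values used in the full post.

```python
def stranded_overlay_stanzas(name, plus_url, minus_url):
    """Return trackDb text for one multiWig parent with two strand children."""
    parent = [
        f"track {name}",
        "container multiWig",
        "aggregate transparentOverlay",
        "showSubtrackColorOnUi on",
        "type bigWig",
        "visibility full",
        f"shortLabel {name}",
        f"longLabel {name} stranded signal",
    ]
    stanzas = ["\n".join(parent)]
    # One child per strand; the colours are arbitrary placeholders
    for strand, url, color in (
        ("plus", plus_url, "0,0,200"),
        ("minus", minus_url, "200,0,0"),
    ):
        child = [
            f"    track {name}_{strand}",
            f"    parent {name}",
            f"    bigDataUrl {url}",
            "    type bigWig",
            f"    color {color}",
            f"    shortLabel {name} {strand}",
            f"    longLabel {name} {strand} strand signal",
        ]
        stanzas.append("\n".join(child))
    # Stanzas are separated by blank lines in trackDb files
    return "\n\n".join(stanzas) + "\n"


# Hypothetical file names, relative to the trackDb file
trackdb_text = stranded_overlay_stanzas(
    "sample1", "sample1_plus.bigWig", "sample1_minus.bigWig"
)
```

The returned text can be appended to the hub's trackDb file once per sample.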

Adding permanent custom annotation tracks to a local UCSC Genome Browser installation

As promised in the previous post, in this one I will show how one can "hack" into the (local) UCSC Genome Browser MySQL database in order to create permanent tracks (annotation, signal or other). Track hubs are of course one way to add tracks, but they are not permanent, in the sense that the user has to load them at least once before starting to work.
The concept is quite simple and follows the strategy that UCSC programmers use to host ENCODE tracks. In fact, ENCODE signal tracks are not stored as separate co-ordinate tables in the database; they are external bigWig and/or .bam files, stored in a genome-specific location under the /gbdb directory. The relative path is then stored in a table in the genome browser database. Please see this post for an explanation of the genome browser directory structure.
So before reading and executing the following MySQL script, you should first create the signal or annotation files. I am going to use the same annotation tracks described in this post, so make sure you have created the .bam files first by reading the post.
We have to modify two tables and create as many new ones as we have tracks; the steps are summarized below:

Creating track hubs for the UCSC genome browser with BAM files

One of the projects that I am currently involved in deals with the detection of novel long non-coding RNAs regulating the WNT signaling pathway in colon cancer. Since we have had our own local installation of the UCSC genome browser for some time now (and it makes a huge difference), I decided that it would be useful to gather some current knowledge about annotated human and mouse lincRNAs from a few sources. These sources are:

The NONCODE project contains lincRNAs mapped to the hg19 and mm10 genomes, while the Broad Institute lincRNA catalogue contains lincRNAs mapped to hg19. As our current browser installation also has mm9 and hg18, we will have to use the liftOver tool (on the command line) and the transformation chain files to adjust the coordinates.
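For reference, the liftOver command line takes four arguments: the input file, the chain file, the output file and a file that collects the records that could not be mapped. The Python sketch below only assembles and runs that command. The function names and file paths are placeholders of mine; the chain file shown is the hg19-to-hg18 chain distributed by UCSC, assumed here to have been downloaded locally.

```python
import subprocess


def liftover_command(bed_in, chain, bed_out, unmapped):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", bed_in, chain, bed_out, unmapped]


def run_liftover(bed_in, chain, bed_out, unmapped):
    """Run liftOver (must be on the PATH); raises if it exits with an error."""
    subprocess.run(liftover_command(bed_in, chain, bed_out, unmapped), check=True)


# Hypothetical file names for lifting hg19 lincRNA coordinates to hg18
cmd = liftover_command(
    "lincRNAs_hg19.bed",
    "hg19ToHg18.over.chain.gz",
    "lincRNAs_hg18.bed",
    "lincRNAs_unmapped.bed",
)
```

Records that end up in the unmapped file are worth a look before building tracks, since they will be missing from the lifted annotation.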
Copyright © Bioinformatics dance