Summarizing microarray normalized probe expression to gene expression

The usual and fundamental result of a microarray data analysis is a list of differentially expressed genes, always according to the statistical design and the biological questions asked. However, what we see most of the time in such a list is a set of differentially expressed probes rather than a list of differentially expressed genes (named according to the official HUGO nomenclature, for example). Although this final list also contains gene names next to the probes, I have often been asked to provide a list of genes and discard the probes, as they tend to confuse bench biologists. The latter ask questions such as "which probe should I believe?" or "why is the expression of one probe so different from the expression of another probe corresponding to the same gene?"

The answer is not always easy,

t-test with F-test for equality of variance

The t-test between two samples/biological conditions is still one of the most common statistical tests used for the detection of differentially expressed genes in microarray studies, even though many specialized tests have been developed during the past 10 years. However, one step that many people skip, either because it is not covered in several microarray data analysis tutorials or because it rarely affects the results, is the F-test for equality of variances of the samples to be compared, performed prior to the t-test. It is still the correct thing to do, because the formula for the t-statistic changes if the variances are not assumed equal (Welch's t-test). Thus,
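
To make the difference between the two formulas concrete, here is a minimal sketch in plain Python (standard library only). The function names are hypothetical and not taken from any package. It only computes the statistics and their degrees of freedom; turning them into p-values needs the F and t distributions from a statistics package, and samples with zero variance are not handled.

```python
from math import sqrt
from statistics import mean, variance


def variance_ratio_f(a, b):
    """F statistic for equality of variances: larger sample variance over
    the smaller one, with numerator and denominator degrees of freedom."""
    va, vb = variance(a), variance(b)
    if va >= vb:
        return va / vb, len(a) - 1, len(b) - 1
    return vb / va, len(b) - 1, len(a) - 1


def student_t(a, b):
    """Classic two-sample t statistic, assuming equal variances (pooled)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom,
    used when the variances are not assumed equal."""
    na, nb = len(a), len(b)
    sa, sb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df


# Made-up log2 expression values for one gene in two conditions.
control = [7.1, 7.3, 6.9, 7.2]
treated = [8.4, 9.6, 7.5, 10.1]
print("F:", variance_ratio_f(control, treated))
print("Student:", student_t(control, treated))
print("Welch:", welch_t(control, treated))
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ; with unequal group sizes the statistics themselves differ as well.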

Converting older SCARF format to FASTQ

Recently, a colleague asked me whether I recognized the following raw data format, coming from a rather old dataset produced by one of the first next-generation sequencers and the relatively old software used for base calling:


203K0:1:1:626:335:ATTCCATTCCATTCCATTCCATTCCATTCCAT:[[[[[[[[[[[[[[[[[[[[UUUUUUUUOUUU
203K0:1:1:119:614:TAAAAACTAGATAGAAGCAATGTCAGAACTTT:[[[[[[[[[[[[[[W[[[[[UUUUUUUUUUUU
203K0:1:1:114:772:TCCTAGCTAGTTCCCTGCAGCTTTTTATTAAC:[[[[[[[[[[[[[[[[[[WWUUUUUUUCIUUU
203K0:1:1:490:490:GTTGGTGCTTAAAAGTCTTGGATTTTGAAACA:[[[[[[[[[[[[[[W[[[[[UUUUUUOOIUUU
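
A minimal sketch of such a conversion in Python follows. It assumes the usual SCARF layout of seven colon-separated fields (machine, lane, tile, x, y, sequence, quality) and that the quality string uses an ASCII offset of 64; neither assumption is confirmed for this particular dataset. The function names are hypothetical. By default the quality string is copied unchanged; the optional shift to offset 33 is only exact for Phred-scaled scores and merely approximate for the older Solexa-scaled scores.

```python
def scarf_to_fastq(line, to_phred33=False):
    """Convert one SCARF line to a four-line FASTQ record (as a string)."""
    # Split on the first six colons only, so the last field stays intact.
    fields = line.rstrip("\r\n").split(":", 6)
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields: %r" % line)
    seq, qual = fields[5], fields[6]
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ: %r" % line)
    if to_phred33:
        # Assumption: offset-64 encoding; shift every character to offset 33.
        qual = "".join(chr(ord(c) - 31) for c in qual)
    # Read name built from machine:lane:tile:x:y.
    name = ":".join(fields[:5])
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual)


def convert(lines, to_phred33=False):
    """Yield FASTQ records for an iterable of SCARF lines, skipping blanks."""
    for line in lines:
        if line.strip():
            yield scarf_to_fastq(line, to_phred33)
```

To convert a whole file, open the SCARF file for reading and the FASTQ file for writing, then call `writelines(convert(infile))` on the output handle.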

Is Bioinformatics really hard?

I would like to share some thoughts that came to my mind today, after a specific event having to do with bioinformatics training. First of all, don't get confused by the title of the post. It might sound like a selfish, elitist or even racist comment! No, it has nothing to do with selfishness... Let me explain.

During my MSc in Bioinformatics, I met four kinds of people.

  1. Biologists or other people coming from life sciences and bench work who wanted either to switch to bioinformatics or to get basic training, but failed to do so because of their fear of sitting down and facing the evil "black screen" of a Unix command line, let alone other hierarchically lower demons such as basic statistics or "for" loops, or hierarchically higher demons such as R, algorithmics, basic Perl etc.
  2. Computer scientists and/or mathematicians (like myself) who wanted to apply their background knowledge at a more "practical" and at the same time still scientific level than predicting algorithm complexities, solving partial differential equations or wandering inside Banach spaces. However, the "application" turned out to be quite difficult, as the types of RNA, the thousands of genes with strange names and those blurry gel images seemed noisier than an elegant solution of a differential equation or one more O(n log n) algorithm optimization.
  3. People of the first kind that made it.
  4. People of the second kind that made it.

Creating stranded signal tracks for the UCSC Genome Browser

Recently, one of my colleagues asked me if there was a way to display stranded wiggle signal files. A couple of years ago I would have said that it is possible but quite messy, as the only way to display stranded wiggle signal files was to split the original genomic co-ordinate file (BED, SAM) per strand and then create two separate wiggle (or bigWig) tracks. This was because the wiggle (and later bedGraph) specification does not allow overlapping signals. Now it is much cleaner (though slightly more complicated), using the ability of the UCSC Genome Browser to overlay tracks by creating a super/parent track and assigning child tracks to it. Unfortunately, this is only possible with track hubs.
In this post, I will show you how to create stranded wiggle (or in this example, bigWig) files by setting up a track hub that will host your signal files and which you can upload to the genome browser.
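
As a rough illustration of the parent/child idea, the sketch below generates a trackDb.txt stanza for a hub using a `multiWig` container with two bigWig children, one per strand. The track names, labels, colors and file names are hypothetical placeholders, and the sketch assumes that the minus-strand bigWig already stores negative values so that it is drawn below the axis.

```python
def stranded_trackdb(name, plus_bw, minus_bw, label=None):
    """Return trackDb.txt text for an overlay (multiWig) container track
    with one child bigWig per strand."""
    label = label or name
    # Parent/container track: the children are overlaid transparently.
    parent = [
        "track %s" % name,
        "container multiWig",
        "shortLabel %s" % label,
        "longLabel %s stranded signal" % label,
        "type bigWig",
        "aggregate transparentOverlay",
        "showSubtrackColorOnUi on",
        "visibility full",
        "autoScale on",
    ]
    stanzas = ["\n".join(parent)]
    # One child per strand, each pointing at its own bigWig file.
    children = (("Plus", plus_bw, "0,0,200"), ("Minus", minus_bw, "200,0,0"))
    for strand, url, color in children:
        child = [
            "track %s%s" % (name, strand),
            "parent %s" % name,
            "bigDataUrl %s" % url,
            "shortLabel %s %s" % (label, strand.lower()),
            "longLabel %s %s strand" % (label, strand.lower()),
            "type bigWig",
            "color %s" % color,
        ]
        stanzas.append("\n".join("    " + line for line in child))
    # Stanzas are separated by a blank line.
    return "\n\n".join(stanzas) + "\n"


print(stranded_trackdb("mySignal", "mySignal_plus.bw", "mySignal_minus.bw"))
```

The printed text would go into the trackDb.txt file of the relevant genome directory of the hub, next to (or with URLs pointing at) the two bigWig files.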

Adding permanent custom annotation tracks to a local UCSC Genome Browser installation

As promised in the previous post, in this one I will show how one can "hack" into the (local) UCSC Genome Browser MySQL database in order to create permanent tracks (annotation, signal or other). One way is of course to use track hubs, but they are not permanent, in the sense that the user has to load them at least once before starting to work.
The concept is quite simple and follows the strategy that UCSC programmers use to host ENCODE tracks. In fact, ENCODE signal tracks are not separate co-ordinate tables in the database structure; they are external bigWig and/or .bam files, stored in a genome-specific location in the /gbdb directory. The relative path is then stored in a table in the genome browser database. Please see this post for an explanation of the genome browser directory structure.
So before reading and executing the following MySQL script, you should first create the signal or annotation files. I am going to use the same annotation tracks described in this post, so make sure you have created the .bam files first by reading the post.
We have to modify two tables and create one new table per track; the steps are summarized below:

Creating track hubs for the UCSC genome browser with BAM files

One of the projects that I am currently involved in deals with the detection of novel long non-coding RNAs regulating the WNT signaling pathway in colon cancer. Since we have had our own local installation of the UCSC genome browser for some time now (and it makes a huge difference), I decided that it would be useful to gather some current knowledge about annotated human and mouse lincRNAs from a few sources. These sources are:

The NONCODE project contains lincRNAs mapped to the hg19 and mm10 genomes, while the Broad Institute lincRNA catalogue contains lincRNAs mapped to hg19. As our current browser installation also has mm9 and hg18, we will have to use the liftOver tool (on the command line) and the corresponding chain files to convert the coordinates.
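
For illustration, the small Python sketch below assembles the liftOver command line for such a conversion. It relies on the standard argument order of the tool (input file, chain file, output file, file for unmapped records) and on the naming convention of the chain files distributed by UCSC (for example hg19ToHg18.over.chain.gz). The helper names, the BED file name and the output naming scheme are hypothetical.

```python
from pathlib import PurePosixPath


def chain_file_name(from_db, to_db):
    """UCSC-style chain file name, e.g. hg19 -> hg18 gives
    'hg19ToHg18.over.chain.gz'."""
    return "%sTo%s%s.over.chain.gz" % (from_db, to_db[0].upper(), to_db[1:])


def liftover_command(bed_in, from_db, to_db, chain_dir="."):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    src = PurePosixPath(bed_in)
    chain = PurePosixPath(chain_dir) / chain_file_name(from_db, to_db)
    # Hypothetical output naming: <input stem>.<target genome>.bed
    lifted = src.with_name("%s.%s.bed" % (src.stem, to_db))
    unmapped = src.with_name("%s.%s.unmapped.bed" % (src.stem, to_db))
    return ["liftOver", str(src), str(chain), str(lifted), str(unmapped)]


print(" ".join(liftover_command("lincRNAs.bed", "hg19", "hg18")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once liftOver is on the PATH and the chain file has been downloaded.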

Installation of UCSC Genome Browser in a local server

A. Preface and MySQL configuration


The UCSC Genome Browser is one of the most essential tools in genomics research. Its value keeps increasing, in proportion to the current explosion of available Next Generation Sequencing data. Installing it is not a mainstream task and requires a lot of patience and a little more than basic knowledge of the Linux environment and MySQL. Before you try it, make sure that you know how to install Linux packages (including from source), how to perform a basic MySQL and Apache setup and how to run Perl and shell scripts. This guide is not exactly a step-by-step procedure, as it often refers to external sources, blogs and wikis found around the web. Based on the work of others, I tried to install a version that is as customizable as possible on my server, to be used by several labs at the institution where I currently work.

My installation is performed on an Ubuntu 12.04 LTS Server. You can adjust it for your distribution. Throughout this guide we assume that our base storage location is /media/HD2/, so you will often see the shell variable STORAGE="/media/HD2" (referenced as $STORAGE). We also assume a temporary directory, $TEMP, which by default is the /tmp directory.

Welcome to another bioinformatics blog

Welcome to another bioinformatics blog, which is my first attempt to finally start sharing advice and pieces of code for everyday computational work, something that many others have done so wonderfully before me.

According to Wikipedia, bioinformatics is "...an interdisciplinary field that develops and improves upon methods for storing, retrieving, organizing and analyzing biological data..." and its main activities include "...to develop software tools to generate useful biological knowledge...". For me, Bioinformatics has been and remains an inspiring motivation to apply what I have learned in my first scientific discipline (Applied Mathematics) to something more "real" than all the theoretical and background knowledge I got during my studies. Others might choose finance or industry to do this. For me the choice was to switch fields (with lots of consequences, the very first being an immediate background knowledge gap) and try to work with biologists. This interaction, going on for some years now (MSc, PhD etc.), has given me a lot of extra knowledge and the opportunity to apply what I learned during my first studies to a quickly expanding and challenging field.

For a long time now, I have been reading posts in biology and bioinformatics blogs and forums and using very useful things posted there, including pieces of code and a lot of advice. Now it's time that I start doing it too, with a lot of delay! I am pretty sure that I won't really offer anything new out there, as plenty of smart people have been maintaining software, blogs and forums to this end for a long time now. However, it will finally give me a little more of the feeling of sharing, as Bioinformatics is a discipline where, by default, the majority of tools and algorithms are public and open-source.

So keep posted, and I hope that some day you will find something useful in this blog, something that will make you think that this guy has helped you a bit with your work by adding a very small stone to the pileup.
Copyright © Bioinformatics dance