INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed

kdbio

Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa

A Linear Time Biclustering Algorithm for Time Series Genomic Expression Data

02/24/2005, 16:30 - 17:30 (Etc/GMT)

Recent developments in DNA chips now enable the simultaneous measurement of the expression levels of a large number of genes (sometimes all the genes of an organism) under a given experimental condition. Most commonly, gene expression data are arranged in a data matrix, where each gene corresponds to one row and each condition to one column. The conditions may correspond to different time points, different environmental conditions, different organs or different individuals. Simply visualizing this kind of data is challenging. Using it to extract biologically relevant knowledge is even harder.

Several unsupervised machine learning methods have been used to analyze gene expression data obtained from microarray experiments. Recently, biclustering, an unsupervised approach that clusters the rows and the columns of the data matrix simultaneously, has been shown to be remarkably effective in a variety of applications. The goal of biclustering is to find subgroups of genes and subgroups of conditions such that the genes exhibit highly correlated behavior across those conditions. In the most common settings, biclustering is an NP-complete problem, and heuristic approaches are used to obtain sub-optimal solutions using reasonable computational resources.
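As a small illustration of what a bicluster is, the sketch below treats a bicluster as a subset of rows (genes) and a subset of columns (conditions) and scores it by the lowest pairwise Pearson correlation among its rows. The toy matrix, the function names and the choice of Pearson correlation as the coherence measure are assumptions made for this example only; the abstract does not specify a scoring criterion.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def submatrix(matrix, rows, cols):
    """Restrict the matrix to the chosen rows (genes) and columns (conditions)."""
    return [[matrix[r][c] for c in cols] for r in rows]


def min_pairwise_correlation(matrix, rows, cols):
    """Coherence score (an assumption of this sketch): the lowest correlation
    between any two selected genes over the selected conditions."""
    sub = submatrix(matrix, rows, cols)
    return min(pearson(a, b) for a, b in combinations(sub, 2))


# Hypothetical expression matrix: one row per gene, one column per condition.
expression = [
    [1.0, 2.0, 3.0, 0.5],  # gene 0
    [0.3, 0.9, 0.2, 0.8],  # gene 1
    [2.0, 4.0, 6.0, 9.0],  # gene 2
    [0.5, 1.0, 1.5, 0.1],  # gene 3
]

# Genes 0, 2 and 3 behave coherently on conditions 0..2 but not on all four.
score_bicluster = min_pairwise_correlation(expression, [0, 2, 3], [0, 1, 2])
score_all_columns = min_pairwise_correlation(expression, [0, 2, 3], [0, 1, 2, 3])
```

The same three genes look unrelated when every condition is considered, which is why clustering on both dimensions at once can reveal structure that row-only clustering would miss.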

In this talk, we describe a particular setting of the problem, where we are concerned with finding biclusters in time series expression data, and present a linear time biclustering algorithm to achieve this goal.

When analyzing time series expression data with the goal of isolating coherent activity between genes in a subset of conditions, it is reasonable to restrict attention to biclusters with contiguous columns. We support this view by assuming that the activation of a set of genes under specific conditions corresponds to the activation of a particular biological process. As time goes on, biological processes start and finish, leading to increased (or decreased) activity of sets of genes, which can be identified because they form biclusters with contiguous columns. In this setting, we are therefore interested in finding biclusters whose columns are consecutive in time. For this particular version of the problem, we propose an algorithm that finds and reports all relevant biclusters in time linear in the size of the data matrix. This reduction in complexity is obtained by manipulating a discretized version of the data matrix and by using advanced string manipulation techniques based on suffix trees.
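To make the contiguous-column setting concrete, here is a minimal sketch that discretizes each gene's time series into a string and then groups genes sharing the same substring at the same time window. It is not the algorithm presented in the talk: it enumerates windows naively with a dictionary, so it is not linear time, and it reports every matching window rather than only maximal ones. The up/down/no-change alphabet, the threshold and all names are assumptions of this example; the abstract says only that the matrix is discretized.

```python
from collections import defaultdict


def discretize(row, threshold=0.5):
    """Encode each transition between consecutive time points as
    'U' (up), 'D' (down) or 'N' (no change). Assumed scheme, for illustration."""
    symbols = []
    for prev, curr in zip(row, row[1:]):
        delta = curr - prev
        if delta > threshold:
            symbols.append("U")
        elif delta < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)


def contiguous_biclusters(matrix, threshold=0.5, min_rows=2, min_len=2):
    """Naive enumeration of biclusters with contiguous columns.

    Returns tuples (genes, start, end, pattern): the listed genes share
    `pattern` on the transitions start..end-1, i.e. time points start..end.
    A suffix-tree-based method would avoid this explicit window enumeration.
    """
    strings = [discretize(row, threshold) for row in matrix]
    n_symbols = len(strings[0]) if strings else 0
    found = []
    for start in range(n_symbols):
        for end in range(start + min_len, n_symbols + 1):
            groups = defaultdict(list)
            for gene, s in enumerate(strings):
                groups[s[start:end]].append(gene)
            for pattern, genes in groups.items():
                if len(genes) >= min_rows:
                    found.append((genes, start, end, pattern))
    return found


# Hypothetical time series: one row per gene, one column per time point.
expression = [
    [1.0, 2.0, 3.0, 1.0, 1.0],  # gene 0 -> "UUDN"
    [0.0, 1.0, 2.0, 0.0, 3.0],  # gene 1 -> "UUDU"
    [5.0, 4.0, 5.0, 6.0, 4.0],  # gene 2 -> "DUUD"
]

result = contiguous_biclusters(expression)
```

In this toy input, genes 0 and 1 share the pattern "UUD" over time points 0 to 3. Gene 2 shows the same pattern one step later, but it is not grouped with them because the columns of a bicluster must be the same for all of its genes.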

The talk will give a short introduction to biclustering and suffix trees, present the biclustering algorithm, and show results on synthetic data along with preliminary results on a real biological data set from yeast that demonstrate the effectiveness of the approach.