INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
-
technology from seed

kdbio

Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa

Seminars

Design and Implementation of a Domain Specific Language for Next Generation Sequence Analysis

02/06/2014 - 14:30
02/06/2014 - 15:30
Etc/GMT

Next Generation Sequencing (NGS) is a set of molecular biology technologies
which generate, at low cost, many millions of short nucleotide reads. Typical
datasets consist of tens of millions of reads, with each read comprising 35-500
basepairs (depending on the technology used, different read sizes can be
obtained).

There are many tools for handling these datasets. However, they must still be
combined to build a full analysis pipeline. Current solutions to build these
pipelines are Make-like tools which can handle text-files and Unix-like
commands. Several GUI-based solutions allow users who are not comfortable with
the command line to build and run these pipelines. However, they still operate
at the semantic level of Make: file dependencies and transformation commands.

Because each problem and each variation on the technology requires a
different processing pipeline, it would be impossible to design a single
pipeline for every need. This paper describes a context-aware tool
that supports the first phase of NGS analysis.

Evaluating differential gene expression using RNA-sequencing data

11/28/2013 - 14:30
11/28/2013 - 15:30
Etc/GMT

Unlike the genome, the cell transcriptome is dynamic and specific to a given developmental stage or physiological condition of the cell. Understanding the transcriptome is essential for interpreting the functional elements of the genome and revealing the molecular constituents of cells. Recent developments in high-throughput DNA sequencing have provided a new method for sequencing RNA at unprecedentedly high resolution. This method, termed RNA-Seq, has been emerging as the preferred technology for both characterizing and quantifying cell transcripts.

With this in mind, in this thesis I propose a bioinformatics pipeline to compare two RNA-Seq samples. This pipeline provides biological insight into the analysed samples by extracting the main biological processes that are differentially active between them. Building on this pipeline, I developed a novel methodology to inspect the activation of a given cellular pathway in a time-course RNA-Seq dataset.

Applying the developed tools to a Listeria monocytogenes RNA-Seq dataset confirmed that they function properly. It was possible to identify global changes in the human host transcriptome and to associate these changes with different stages of the Listeria monocytogenes infection lifecycle.

MetaGen-FRAME

10/31/2013 - 14:30
10/31/2013 - 15:30
Etc/GMT

Metagenomics is the study of metagenomes, unprocessed genetic material residing in the most varied
sites, without separation into individual organisms. Metagenomic approaches to the study of biological
communities are quickly changing our understanding of the function and inter-relationships among
living organisms in ecosystems. The rapid advances in metagenomics are largely due to the fast development
of high-throughput platforms for deoxyribonucleic acid (DNA) sequencing, which need to be
accompanied by significant advances in data analysis techniques.

With this work, I set out to develop and apply new data analysis techniques suited
to the large amounts of data generated by metagenomics. This document presents a proposal to address the
challenges posed by the storage and manipulation of this kind of information and the need to develop
new data analysis techniques that can be applied directly to this problem. For this purpose, I aimed
to harness the power of parallel computing.

The result of this thesis is MetaGen-FRAME, a metagenomic framework capable of handling
heterogeneous data types (from DNA sequences to genome, proteome and metabolome annotations)
through the use of different data structures and computational approaches.

On Multi-class Classification Problems Using Genetic Programming

10/24/2013 - 14:30
10/24/2013 - 15:30
Etc/GMT

Genetic Programming (GP) is a field under the umbrella of Evolutionary
Computing that has been successful in addressing a variety of
problems in data mining and machine learning, including
multi-class classification (mcc). However, this success has so far
been limited to extending binary GP classifiers to mcc problems,
which leaves a void: there are still no GP multi-class classifiers
that are efficient when compared to non-GP classifiers. In this work,
I will present a novel algorithm that incorporates some ideas on the
representation of the solution space for tree-based GP. It lays some
foundations for filling this void and might also lead to future
research in this direction. During the presentation, I will show the
success and competitiveness of this approach and discuss future
directions.

Quick Hyper-Volume

10/10/2013 - 14:30
10/10/2013 - 15:30
Etc/GMT

I will present a new algorithm to calculate exact hypervolumes. Given
a set of $d$-dimensional points, it computes the
hypervolume of the dominated space. Determining this value is an
important subroutine of Multiobjective Evolutionary Algorithms
(MOEAs). We analyze the "Quick Hypervolume" (QHV) algorithm
theoretically and experimentally. The theoretical results are
a significant contribution to the current state of the art. Moreover,
the experimental performance is also very competitive compared
with existing exact hypervolume algorithms.
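
To make the quantity being computed concrete, here is a minimal sketch of the hypervolume of the dominated space in the simplest case, d = 2. It is not the QHV algorithm from the talk; it is a plain sort-and-sweep, and it assumes that both objectives are minimized and that the dominated region is bounded by a reference point `ref`. The function name and these conventions are assumptions made for illustration.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`.

    Assumes minimization in both objectives; `ref` is a reference
    point that is worse than every point of interest.
    """
    # Points that do not strictly improve on the reference contribute nothing.
    candidates = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_y = ref[1]  # lowest second objective seen so far in the sweep
    for x, y in candidates:
        if y < best_y:
            # The point adds a new strip between y and the previous best_y.
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


front = [(1, 3), (2, 2), (3, 1)]
print(hypervolume_2d(front, ref=(4, 4)))  # 6.0
```

Adding a dominated point such as (3, 3) to `front` leaves the result unchanged, since it covers no area that the other points do not already cover.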

Parallel efficient alignment of reads for re-sequencing applications

09/26/2013 - 14:30
09/26/2013 - 15:30
Etc/GMT

In bioinformatics, in the context of resequencing projects,
the efficient and accurate mapping of reads to a reference
genome is a critical problem. One instance of this problem
is the local alignment of pyrosequencing reads produced
by the 454 GS FLX system against a reference sequence,
an instance for which the software tool TAPyR (Tool for
the Alignment of Pyrosequencing Reads) was developed.
TAPyR implements a methodology to efficiently solve this
problem, which proved to yield results of a quality (both in
terms of content and execution speed) higher than those of
mainstream applications. With the goal of further improving
this platform's results, we produced a parallel implementation
of the query and reference sequence access procedures
of the original version. Through the use of multithreading,
this new version, P-TAPyR, produces considerable
reductions in the processing time of queries, scaling with
the amount of hardware-supported threads (not accounting
for hyper-threading) available. For larger data sets, we
were able to observe running times roughly 26 times faster
than serial execution with 30 executing threads, showing
an experimental (progressively-decreasing) execution serial
fraction of 0.8% (determined by the Karp-Flatt metric described
in a later section). Herein we present the modifications
made to this software tool to allow for parallel
querying of reads against an indexed reference, which scales
proportionally to the number of available physical cores.
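
An experimentally determined serial fraction of this kind is conventionally computed with the Karp-Flatt metric, e = (1/S - 1/p) / (1 - 1/p), where S is the measured speedup and p the number of threads. The sketch below applies that formula to the rounded figures quoted in the abstract; the function name is an assumption made for illustration.

```python
def karp_flatt(speedup, threads):
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if threads < 2:
        raise ValueError("the metric is only defined for two or more threads")
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


# Rounded figures from the abstract: roughly 26x speedup with 30 threads.
e = karp_flatt(speedup=26, threads=30)
print(f"{e:.4f}")  # 0.0053
```

With these rounded inputs the result is about 0.5%, the same order of magnitude as the 0.8% reported; the abstract gives the speedup only approximately, so an exact match should not be expected.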

Host-pathogen interaction upon infection with Listeria using NGS techniques

06/07/2013 - 11:00
06/07/2013 - 12:00
Etc/GMT

Listeria monocytogenes is a model bacterial pathogen that, after internalization, is
capable of disrupting a double-membrane vacuole, replicating in the host cytosol and
manipulating the innate response triggered in the cytosol. Its intracellular lifecycle in the
human host provides insight into the dynamics of general host-pathogen
interactions. The identification of host sequences affected during these interactions is
paramount to our understanding of how pathogens engineer their cellular
environments.
The main goal of this project is, therefore, to understand how pathogens
influence human host cells, by identifying global changes in the host transcriptome
and characterizing the alterations in host nuclear architecture. Furthermore, it aims
to associate these changes with different stages of the Listeria monocytogenes infection
lifecycle. To that end, total RNA was extracted from three different cell populations at four
time-points (after 20, 60, 120 and 240 minutes) so that specific stages of the
bacterium lifecycle are represented.

Novel semantic approaches in Genetic Programming.

05/24/2013 - 11:00
05/24/2013 - 12:00
Etc/GMT

Evolutionary algorithms are stochastic optimization techniques based on the
principles of natural evolution, and Genetic Programming (GP) belongs to this family.

In recent years, the study of GP systems has been extended to phenotypic aspects, whereas previously it focused mainly on genotypic and syntactic aspects.

The phenotype, or semantics, is used to improve the capacity of GP algorithms to explore the solution space effectively: classifying similar individuals and exploring new semantic areas increases the probability of finding an optimal solution and of escaping local optima.

Currently, semantic GP is tightly tied to the evaluation of the behavior of individuals in the candidate population, and this evaluation is mainly obtained through the fitness function itself.

This work introduces a new way of measuring semantic similarity between individuals that is more independent of the fitness itself, allowing a fair comparison even when the fitness values involved are very far apart. This new measure enables a new series of techniques to tackle open problems in GP, such as bloat and over-fitting, and to preserve phenotypic variety, thereby enhancing performance. Preliminary results will be provided.

A new theoretical GP algorithm based on this semantic measure is also introduced, showing its potential advantages. Very early results from a first naive implementation give interesting insight into this potential in comparison with other state-of-the-art algorithms.
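
As background, the semantics of a GP individual is commonly taken to be the vector of outputs it produces on a set of sample inputs, and two individuals can then be compared by the distance between those vectors. The sketch below illustrates only this conventional notion. It is not the new measure proposed in the talk, which the abstract does not specify; the function names, the sample inputs and the choice of Euclidean distance are assumptions made for illustration.

```python
import math


def semantics(program, cases):
    """Output vector of `program` on the sample inputs `cases`."""
    return [program(x) for x in cases]


def semantic_distance(prog_a, prog_b, cases):
    """Euclidean distance between the output vectors of two programs."""
    out_a = semantics(prog_a, cases)
    out_b = semantics(prog_b, cases)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(out_a, out_b)))


cases = [0, 1, 2, 3]  # hypothetical sample inputs
square = lambda x: x * x
power = lambda x: x ** 2            # different syntax, same behavior
shifted = lambda x: x * x + 1       # differs by 1 on every case

print(semantic_distance(square, power, cases))    # 0.0
print(semantic_distance(square, shifted, cases))  # 2.0
```

The first comparison shows why semantics is a phenotypic rather than syntactic notion: two programs written differently are at distance zero when they behave identically on the sample inputs.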

Equilibria in a Repeated Epidemic Dissemination Game

05/10/2013 - 11:30
05/10/2013 - 12:00
Etc/GMT

Abstract: "Epidemic dissemination protocols are known to be extremely
scalable and robust. As a result, they are particularly well suited to
support the dissemination of information in large-scale peer-to-peer
systems. In such an environment, nodes do not belong to the same
administrative domain. On the contrary, many of these systems rely on
resources made available by rational nodes that are not necessarily
obedient to the protocol. There are two main incentive mechanisms that
can be used to deal with rational behavior. One is to rely on balanced
exchanges, which is feasible to implement in epidemic protocols where
interactions are symmetric. For the asymmetric case, incentives based on
a monitoring approach are more suited. Unfortunately, the literature
does not provide any meaningful theoretical results for this last type
of incentives. In this talk, I will present basic results that establish
a tradeoff between the amount of information provided by a monitor and
the ability to sustain cooperation among rational nodes, assuming
perfect monitoring."

Xavier Vilaça is a PhD student at IST and a researcher in the Distributed
Systems Group at INESC-ID. He received an MSc degree in Computer Science and
Engineering from IST in 2011 and a BSc, also in Computer Science and
Engineering, from the University of Minho in 2009.

This work is being presented as a final report for the Complex Network
Analysis course from the PhD program in Computer Science and
Engineering at IST.

Novel semantic approaches in Genetic Programming.

04/26/2013 - 11:00
04/26/2013 - 12:00
Etc/GMT

Evolutionary algorithms are stochastic optimization techniques based on the
principles of natural evolution, and Genetic Programming (GP) belongs to this family.

In recent years, the study of GP systems has been extended to phenotypic aspects, whereas previously it focused mainly on genotypic and syntactic aspects.

The phenotype, or semantics, is used to improve the capacity of GP algorithms to explore the solution space effectively: classifying similar individuals and exploring new semantic areas increases the probability of finding an optimal solution and of escaping local optima.

Currently, semantic GP is tightly tied to the evaluation of the behavior of individuals in the candidate population, and this evaluation is mainly obtained through the fitness function itself.

This work introduces a new way of measuring semantic similarity between individuals that is more independent of the fitness itself, allowing a fair comparison even when the fitness values involved are very far apart. This new measure enables a new series of techniques to tackle open problems in GP, such as bloat and over-fitting, and to preserve phenotypic variety, thereby enhancing performance. Preliminary results will be provided.

A new theoretical GP algorithm based on this semantic measure is also introduced, showing its potential advantages. Very early results from a first naive implementation give interesting insight into this potential in comparison with other state-of-the-art algorithms.