INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed


Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa


Identification and quantification of reachable attractors over asynchronous discrete dynamics

12/19/2014 - 14:30
12/19/2014 - 15:30

Models of discrete concurrent systems often lead to huge and complex
state transition graphs that represent their dynamics.
Here, we are particularly interested in logical models of biological
regulatory networks. Given an initial condition, it is of real interest
to identify reachable attractors that denote the potential asymptotic
behaviours of the system. These attractors are described as terminal
strongly connected components, which are either single (stable) states or
sets of states (denoting cyclical behaviours).

Beyond attractor identification, we propose to assess the probability of
reaching each of them from an initial condition or from any portion of the
state space, relying on the structure of the state transition graph.
First, we present a solution to the problem with an original algorithm
called FIREFRONT, based on the exhaustive exploration of the reachable
state space. Then, for the cases where FIREFRONT is not applicable, we
define a modified Monte Carlo simulation, termed AVATAR.
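
As a rough illustration of the problem (not of FIREFRONT or AVATAR themselves), the sketch below finds the attractors of a small state transition graph as terminal strongly connected components and estimates, by plain Monte Carlo simulation, the probability of reaching each one from an initial state. The toy graph, the function names, and the assumption that every successor is chosen with equal probability are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical toy state transition graph (not from the talk): state -> successors.
STG = {
    "s0": ["s1", "s2"],
    "s1": ["s3"],
    "s2": ["s4", "s5"],
    "s3": ["s3"],  # stable state: its only transition is a self-loop
    "s4": ["s5"],
    "s5": ["s4"],  # s4 and s5 form a cyclic attractor
}


def reachable(graph, start):
    """Return the set of states reachable from start, including start."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def attractors(graph):
    """Terminal SCCs: sets in which every member reaches exactly that set."""
    found = set()
    for state in graph:
        reach = reachable(graph, state)
        if all(reachable(graph, other) == reach for other in reach):
            found.add(frozenset(reach))
    return found


def estimate_probabilities(graph, start, runs=10_000, seed=0):
    """Plain Monte Carlo estimate, assuming uniform choice among successors."""
    rng = random.Random(seed)
    attractor_of = {s: a for a in attractors(graph) for s in a}
    hits = Counter()
    for _ in range(runs):
        state = start
        while state not in attractor_of:
            state = rng.choice(graph[state])
        hits[attractor_of[state]] += 1
    return {a: n / runs for a, n in hits.items()}


if __name__ == "__main__":
    for attractor, p in estimate_probabilities(STG, "s0").items():
        print(sorted(attractor), round(p, 3))
```

This brute-force approach recomputes reachability for every state, so it is only practical for very small graphs, which is precisely the limitation that motivates dedicated algorithms.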

A data mining approach to study disease presentation patterns in Primary Progressive Aphasia.

12/05/2014 - 14:30
12/05/2014 - 15:30

Nowadays the world faces an ageing population and the related challenges, including
healthcare issues arising from the current incidence of diseases more prevalent in the elderly, such as
neurodegenerative diseases. Primary Progressive Aphasia (PPA) is a neurodegenerative disease
characterized by a gradual dissolution of language abilities. These patients receive
special attention because they are at higher risk of progressing to dementia. Consequently,
discovering the different subtypes of PPA patients is fundamental to the timely administration
of pharmaceutical and therapeutic interventions, improving patients' quality of life.
This thesis proposes a data mining approach to extract relevant knowledge from
clinical data, namely to learn the variants of PPA. Initially, standard clustering algorithms were
applied to study the number of groups present in the dataset and
to investigate the potential existence of new groups, different from the PPA subtypes
already defined in the literature. Then, in a second phase, supervised learning techniques
were used to analyze patients according to their clinical classification into one of the three PPA
variants and to develop a new and accurate classification model.
The unsupervised learning analysis pointed to the existence of two main groups in the
dataset analyzed in this work. This study included the evaluation of diverse sets of attributes in
order to assess which type or set of attributes produced better results. Finally, two new
methodologies for classifying patients with PPA were developed, achieving good accuracy on
the dataset under study. One of those methodologies enables the identification of instances
that (potentially) do not belong to any of the three PPA subtypes already defined.

Design and Implementation of a Domain Specific Language for Next Generation Sequence Analysis

09/26/2014 - 14:30
09/26/2014 - 15:30

Next Generation Sequencing (NGS) is a set of molecular biology technologies
which generate, at low cost, many millions of short nucleotide reads. Typical
datasets consist of tens of millions of reads, with each read comprising 35-500
basepairs (depending on the technology used, different read sizes can be

There are many tools for handling these datasets. However, they must still be
combined to build a full analysis pipeline. Current solutions to build these
pipelines are Make-like tools which can handle text-files and Unix-like
commands. Several GUI-based solutions allow users who are not comfortable with
the command line to build and run these pipelines. However, they still operate
at the semantic level of Make: file dependencies and transformation commands.

Because each problem and each variation on the technology requires a
different processing pipeline, it would be impossible to design a single
pipeline for every need. This paper describes a context-aware tool
that supports the first phase of NGS analysis.

Data integration tools for pre-processing biological data

06/26/2014 - 14:30
06/26/2014 - 15:30

The increasing use of Electronic Health Records (EHRs) enables a better analysis of patient data, improving the quality of medical care. EHRs must be processed in order to provide a variety of services to the physician, such as risk classification and summarization. EHRs are usually stored in unstructured text or Excel files containing different data formats and types, missing information, and, sometimes, inconsistent information. Therefore, before analyzing the data, we often need to transform and integrate it. In this presentation, we show some examples of data integration tools that can be used to extract and transform data. As an example, we use an Excel file containing exam information regarding patients with ALS (Amyotrophic Lateral Sclerosis).
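
To make the kind of transformation involved concrete, here is a minimal sketch that normalizes mixed date formats and missing-value markers in a small table. It is not one of the tools shown in the presentation: the column names, the sample rows, the list of accepted date formats, and the use of CSV text as a stand-in for an Excel sheet are all assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical export; column names and values are invented for illustration.
RAW = """patient_id,exam_date,exam_value
P01,2014-03-05,87
P02,05/03/2014,
P03,2014-04-11,n/a
"""

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats, tried in order
MISSING = {"", "n/a", "na", "-"}         # assumed markers for missing values


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_number(text):
    """Return a float, or None when the cell holds a missing-value marker."""
    cleaned = text.strip().lower()
    return None if cleaned in MISSING else float(cleaned)


def clean(raw_csv):
    """Convert every row to a dictionary with consistent types."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({
            "patient_id": row["patient_id"].strip(),
            "exam_date": parse_date(row["exam_date"]),
            "exam_value": parse_number(row["exam_value"]),
        })
    return rows


if __name__ == "__main__":
    for record in clean(RAW):
        print(record)
```

Missing values are kept as None rather than replaced by a default, so that later analysis steps can decide how to treat them.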

The Biodegradation and Surfactants Database

06/12/2014 - 14:30
06/12/2014 - 15:30

The Biodegradation and Surfactants Database (BioSurfDB) is a curated relational information system currently integrating 14 metagenomes, 137 organisms, 73 biodegradation-relevant genes, 62 proteins and 6 of their metabolic pathways; 29 documented bioremediation experiments, with the treatment efficiencies of surfactant-producing organisms for specific pollutants; and a curated list of 46 biosurfactants, grouped by producing organism, surfactant name, class and reference.

Our goal is to gather published and novel information on the identification and characterization of genes involved in Oil Biodegradation and Bioremediation of polluted environments and provide it in a curated way together with a series of computational tools to aid biology studies.

Integrative biomarker discovery in neurodegenerative diseases: a survey

04/24/2014 - 14:30
04/24/2014 - 15:30

Data mining has been widely applied in biomarker discovery, resulting in
significant findings of different clinical and biological biomarkers. With
developments in technology, from genomics to proteomics analysis, a deluge
of data has become available, as well as standardized data repositories.
Nonetheless, researchers are still facing important challenges in
analyzing the data, especially when considering the complexity of pathways
involved in biological processes or diseases. Data from single sources
seem unable to explain complex processes, such as the ones involved in
brain-related disorders, thus raising the need for a more comprehensive
perspective. A possible solution relies on data and model integration,
where several data types are combined to provide complementary views,
which in turn can result in the discovery of previously unknown
biomarkers, by unravelling otherwise hidden relationships between data of
different sources. In this work, we review the different single-source
types of data used for biomarker discovery in neurodegenerative diseases,
and then provide an overview of recent efforts to perform
integrative analysis in these disorders, discussing major challenges and

Novel metric for the use of Minimum Spanning Trees in phylogenetic trees studies

04/03/2014 - 14:30
04/03/2014 - 15:30

The use of trees for phylogenetic representations started in the
middle of the 19th century. One of their most popular uses is Charles
Darwin's sole illustration in "The Origin of Species" [4]. The
simplicity of the tree representation makes it still the method of
choice today to easily convey the diversification and relationships
between species. Yet trees suffer from several drawbacks that are not
always clear to researchers. Since several different algorithms can be
used to infer and draw the tree, one must be aware of each algorithm's
set of assumptions.
In the analysis of sequence-based microbial typing methods, Minimum
Spanning Trees (MSTs) are becoming the standard for representing
relationships between strains. However, these suffer from several
limitations that can mislead the interpretation of the resulting
tree. The fact that a single tree is reported from a multitude of
possible and equally optimal solutions, and that no statistical metrics
exist to evaluate them, justified a recent heuristic approach to
address these issues.
We present a new edge betweenness metric for undirected and weighted
graphs. This metric is defined as the fraction of minimum spanning
trees where a given edge is present and it was motivated by the
necessity of evaluating phylogenetic trees. Moreover, we provide
results and methods concerning the exact computation of this metric
based on the well-known Kirchhoff's matrix tree theorem.
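
The definition of the metric can be illustrated on a tiny graph by brute force. The sketch below enumerates every spanning tree, keeps those of minimum total weight, and reports for each edge the fraction of minimum spanning trees that contain it. This is not the exact method based on Kirchhoff's matrix tree theorem mentioned above, only a direct reading of the definition; the example graph and function names are hypothetical, and the graph is assumed to be connected.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical weighted undirected graph: (u, v, weight).
EDGES = [
    ("a", "b", 1),
    ("b", "c", 1),
    ("a", "c", 1),
    ("c", "d", 2),
]


def is_spanning_tree(nodes, edges):
    """True if the given |nodes| - 1 edges are acyclic (checked with union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, _ in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            return False  # this edge would close a cycle
        parent[root_u] = root_v
    return True


def mst_edge_fractions(edges):
    """Map each edge to the fraction of minimum spanning trees containing it."""
    nodes = {n for u, v, _ in edges for n in (u, v)}
    trees = [t for t in combinations(edges, len(nodes) - 1)
             if is_spanning_tree(nodes, t)]
    best = min(sum(w for _, _, w in t) for t in trees)
    msts = [t for t in trees if sum(w for _, _, w in t) == best]
    return {e: Fraction(sum(e in t for t in msts), len(msts)) for e in edges}


if __name__ == "__main__":
    for edge, fraction in mst_edge_fractions(EDGES).items():
        print(edge, fraction)
```

In this example the three equally weighted edges of the triangle a, b, c each appear in two of the three minimum spanning trees, while the edge to d appears in all of them. Enumeration grows combinatorially with graph size, which is why an exact computation that avoids listing the trees is needed in practice.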

Extracting academic data and linked data anonymization

03/20/2014 - 14:30
03/20/2014 - 15:30

Data is becoming more valuable each day as more diverse and rich
data sources become available, allowing us to discover knowledge
in unprecedented ways.

IST uses the FénixEdu information system to manage most of its internal
data. The system contains data about students, teachers, employees,
courses, and all major aspects of IST as an organization. Such data
may be useful both for external agents and, more importantly, for IST
itself to study our academic environment. Data may be used as input
for state-of-the-art IR and KD technologies to extract newer and deeper
knowledge about academic agents, allowing us to solve problems in our
community and to understand it better.

Releasing this kind of data publicly requires an additional
step to preserve the privacy of the individuals referred to and,
as has been shown, simple de-identification may not be enough to achieve
this goal. On the other hand, we must deal with both internal and
external data, on top of an evolving environment, where linked data
based approaches can definitely help us to deal with such complexity.
In this talk we will discuss a solution for exposing, sharing, and
connecting data, information, and knowledge available in the IST information
system, taking into consideration privacy and anonymity issues.
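
The point that simple de-identification may not be enough can be illustrated with a small check: even after names are removed, a record may still be singled out if its combination of remaining attributes is unique. The sketch below is not the solution discussed in the talk; the records, attribute names, and function name are hypothetical.

```python
from collections import Counter

# Hypothetical de-identified records: names removed, other attributes kept.
RECORDS = [
    {"degree": "MSc", "birth_year": 1990, "postcode": "1000"},
    {"degree": "MSc", "birth_year": 1990, "postcode": "1000"},
    {"degree": "PhD", "birth_year": 1985, "postcode": "1049"},
    {"degree": "BSc", "birth_year": 1993, "postcode": "2780"},
]


def unique_combinations(records, attributes):
    """Return the records whose combination of the given attributes occurs once."""
    def key(record):
        return tuple(record[a] for a in attributes)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]


if __name__ == "__main__":
    exposed = unique_combinations(RECORDS, ["degree", "birth_year", "postcode"])
    print(f"{len(exposed)} of {len(RECORDS)} records have a unique combination")
```

Anyone who knows these three attributes for a person from another source could link that person to one of the unique records, which is why removing direct identifiers alone may not guarantee anonymity.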

Network mining based analysis of whole brain functional connectivity

03/06/2014 - 14:30
03/06/2014 - 15:30

Mapping the human brain has been a topic of interest for the last few
decades. In spite of its incredible complexity it is now possible to
map the brain using a combination of advanced data representation and
data processing algorithms supported by the huge computational power
that is available nowadays. In this work we describe an approach for
mapping whole-brain functional connectivity. The starting point of our
work is a set of high resolution functional magnetic resonance images
(fMRI) obtained with a 7T magnetic field that cover a wider brain
volume than usual. The fMRIs are then used to build the so-called
brain functional connectivity network. These networks extracted from
the brain can be represented as graphs, i.e., a set of nodes (regions)
and a set of edges connecting such nodes. With the networks
represented as graphs we apply network mining techniques to them,
namely clustering and modularity algorithms that allow us, for
instance, to identify functional modules of the brain. Presumably, the
increased resolution will allow us to obtain more detailed information
and potentially uncover additional structure. Due to the size of the
graphs, all the algorithms must be optimized in order to minimize the
resources used.
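
As a small illustration of the graph representation and of a modularity measure, the sketch below builds an unweighted graph by thresholding pairwise correlations between regions and scores a candidate partition with Newman's modularity. The abstract does not state how edges are defined or which modularity algorithm is used, so the thresholding step, the choice of Newman's formula, the correlation values, and all names are assumptions.

```python
# Hypothetical correlation values between four brain regions.
CORRELATION = {
    ("r1", "r2"): 0.9,
    ("r3", "r4"): 0.8,
    ("r2", "r3"): 0.6,
    ("r1", "r3"): 0.1,
    ("r1", "r4"): 0.2,
    ("r2", "r4"): 0.1,
}


def build_graph(correlation, threshold):
    """Keep an undirected edge for each pair whose correlation reaches the threshold."""
    return [pair for pair, value in correlation.items() if value >= threshold]


def modularity(edges, communities):
    """Newman modularity Q = sum over communities of e_c / m - (d_c / 2m)^2.

    e_c is the number of edges inside community c, d_c the sum of the degrees
    of its nodes, and m the total number of edges (unweighted graph).
    """
    m = len(edges)
    community_of = {node: i for i, nodes in enumerate(communities) for node in nodes}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(e / m - (d / (2 * m)) ** 2 for e, d in zip(internal, degree))


if __name__ == "__main__":
    edges = build_graph(CORRELATION, threshold=0.5)
    print(modularity(edges, [{"r1", "r2"}, {"r3", "r4"}]))
    print(modularity(edges, [{"r1", "r2", "r3", "r4"}]))
```

A partition that groups densely connected regions together scores higher than one that places all regions in a single module, which always scores zero.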

Computational prediction of microRNA targets in plant genomes

02/20/2014 - 14:30
02/20/2014 - 15:30

MicroRNAs (miRNAs) are important posttranscriptional regulators and
act by recognizing and binding to sites in their target messenger RNAs
(mRNAs). They are present in nearly all eukaryotes, in particular in
plants, where they play important roles in developmental and stress
response processes by targeting mRNAs for cleavage or translational
repression. MiRNAs have been shown to have a crucial role in gene
expression regulation, but so far only a few miRNA targets in plants
have been experimentally validated. Based on the number of identified
genes, on the number of experimentally validated miRNAs and on the
fact that one miRNA often regulates multiple genes, a long list of yet
unidentified targets is to be expected. Here, we present a novel miRNA
target prediction method for plants that incorporates an evolutionary
approach. With this approach, we intend to understand whether a
transcript shows evidence of exhibiting a sequence bias towards either
eliciting or avoiding target sites for a particular miRNA.
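
For readers unfamiliar with target site recognition, the following sketch scans an mRNA for windows that are complementary to a miRNA, allowing a limited number of mismatches. It is a deliberately simplified baseline and not the prediction method presented in the talk: it ignores G:U wobble pairs and position-specific rules, and the sequences, function names, and mismatch parameter are hypothetical.

```python
# Watson-Crick pairing for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the sequence that pairs perfectly with rna, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_target_sites(mirna, mrna, max_mismatches=0):
    """Return (position, mismatches) for mRNA windows that pair with the miRNA."""
    site = reverse_complement(mirna)
    hits = []
    for start in range(len(mrna) - len(site) + 1):
        window = mrna[start:start + len(site)]
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    # Hypothetical, unrealistically short sequences chosen for readability.
    print(find_target_sites("UGGAC", "AAGUCCAUU"))
```

Counting such sites in a transcript and comparing the count with what would be expected by chance is one possible way to look for a bias towards or against target sites, although the abstract does not specify how the evolutionary approach measures it.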