INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
-
technology from seed

kdbio

Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa
Home
 
 

Fully compressed Sufix Trees

05/08/2008 - 17:00
Etc/GMT

Suffix trees are by far the most important data structure in
stringology, with myriads of applications in fields like
bioinformatics and information retrieval. Classical representations of
suffix trees require O(n \log n) bits of space, for a string of size
n. This is considerably more than the n \log_2\sigma bits needed for
the string itself, where \sigma is the alphabet size. The size of
suffix trees has been a barrier to their wider adoption in practice.
Recent compressed suffix tree representations require just the space
of the compressed string plus \Theta(n) extra bits. This is already
spectacular, but still unsatisfactory when \sigma is small as in DNA
sequences.

In this talk we introduce the first compressed suffix tree
representation that breaks this linear-space barrier. Our
representation requires sublinear extra space and supports a large set
of navigational operations in logarithmic time. An essential
ingredient of our representation is the lowest common ancestor (LCA)
query. We reveal important connections between LCA queries and suffix
tree navigation.