Motif representation and discovery
An important part of gene regulation is mediated by specific proteins called transcription factors (TFs), which influence the transcription of a particular gene by binding to specific sites on the DNA, called transcription factor binding sites (TFBSs). These binding sites are relatively short stretches of DNA, typically 5 to 25 nucleotides long. A commonly used representation of TFBSs is the position-specific scoring matrix (PSSM), which assumes that the nucleotides in a binding site are independent of one another. Recent work has argued that protein-DNA interactions are non-additive, opening the way for more complex models that account for interactions between nucleotides. We propose to model TFBSs by representing nucleotide interactions with consistent k-graph Bayesian networks (where k represents the maximum number of interactions between nucleotides), jointly with a set of features, scored directly from each base sequence, that appear to be relevant for TFBS characterization. The model is flexible enough to incorporate any set of features scored from base sequences. We consider discriminative learning of such models, since it outperforms generative learning in the context of classification with a large set of features.
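To make the independence assumption behind the PSSM concrete, the following is a minimal sketch of a log-odds PSSM built from aligned binding sites. It illustrates only the baseline representation, not the proposed k-graph Bayesian network model. The function names (`build_pssm`, `score`, `best_hit`), the pseudocount, the uniform background frequency, and the toy sites are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from aligned, equal-length binding sites.

    Assumptions (not from the source): a uniform background frequency
    and an additive pseudocount applied to every base at every position.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")

    pssm = []
    for pos in range(width):
        # Count each base at this position, starting from the pseudocount.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of the observed frequency against the background.
        pssm.append(
            {base: math.log2(counts[base] / total / background) for base in ALPHABET}
        )
    return pssm


def score(pssm, seq):
    """Score a candidate site as a sum of per-position terms.

    The sum is exactly the independence assumption: each position
    contributes on its own, with no interaction terms between positions.
    """
    if len(seq) != len(pssm):
        raise ValueError("sequence length must match PSSM width")
    return sum(column[base] for column, base in zip(pssm, seq))


def best_hit(pssm, sequence):
    """Return (score, offset) of the highest-scoring window in a sequence."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical toy alignment, for illustration only.
toy_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
toy_pssm = build_pssm(toy_sites)
hit_score, hit_offset = best_hit(toy_pssm, "GGGTATAATCCC")
```

Because the score is a plain sum over positions, such a model cannot express that a base at one position changes the preference at another. That limitation is what models with explicit nucleotide interactions are meant to address.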