Design and Implementation of a Domain Specific Language for Next Generation Sequence Analysis
Next Generation Sequencing (NGS) is a set of molecular biology technologies
which generate, at low cost, many millions of short nucleotide reads. Typical
datasets consist of tens of millions of reads, with each read comprising 35-500
base pairs, depending on the technology used.
There are many tools for handling these datasets. However, they must still be
combined to build a full analysis pipeline. Current solutions for building these
pipelines are Make-like tools that handle text files and Unix-like
commands. Several GUI-based solutions allow users who are not comfortable with
the command line to build and run these pipelines. However, they still operate
at the semantic level of Make: file dependencies and transformation commands.
Because each problem and each variation on the technology requires a
different processing pipeline, it would be impossible to design a single
pipeline for every need. This paper describes a context-aware tool
that supports the first phase of NGS analysis.