As open-source big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data professionals are already beginning to make an impact.
Tim O’Reilly’s post, “Work on Stuff that Matters: First Principles”, is a great read that encourages readers to work on something that matters to them more than money, create more value than they capture, and take the long view. While ADAM is a very new project, we hope that it will soon be used to help save lives and enable researchers to ask questions that aren’t possible today. We are taking the long view and sharing all of this software as Apache-licensed open source.
If you’re interested in contributing, join our mailing list and introduce yourself. We also mark open issues that are good introductory projects with a “pick me up!” label. Of course, we welcome any contribution you’d like to make!
Current genomics data formats and processing pipelines are not designed to scale well to large datasets. The current Sequence/Binary Alignment/Map (SAM/BAM) formats were intended for single-node processing. There have been attempts to adapt BAM to distributed computing environments, but they see limited scalability past eight nodes. Additionally, because the formats lack an explicit data schema, there are well-known incompatibilities between libraries that implement SAM/BAM/Variant Call Format (VCF) data access.
To address these problems, we introduce ADAM, a set of formats, APIs, and processing stage implementations for genomic data. ADAM is fully open source under the Apache 2 license, and is implemented on top of Avro and Parquet for data storage. Our reference pipeline is implemented on top of Spark, a high-performance in-memory MapReduce system. This combination provides the following advantages:
- Avro provides explicit data schema access in C/C++/C#, Java/Scala, Python, PHP, and Ruby;
- Parquet allows access by database systems like Impala and Shark; and
- Spark improves performance through in-memory caching and reducing disk I/O.
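To make the idea of an explicit data schema concrete, here is a minimal sketch using only the Python standard library. The schema is written in Avro's JSON schema syntax, but the record name `AlignedReadSketch` and all of its fields are hypothetical and are not ADAM's actual schema. The `conforms` helper is a toy checker written for this example, not the Avro library's API, and it handles only a few primitive types and unions.

```python
import json

# Hypothetical, heavily simplified schema in Avro's JSON schema syntax.
# The record and field names are illustrative, NOT ADAM's real schema.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "AlignedReadSketch",
  "fields": [
    {"name": "readName", "type": "string"},
    {"name": "referenceName", "type": ["null", "string"]},
    {"name": "start", "type": ["null", "long"]},
    {"name": "duplicateRead", "type": "boolean"}
  ]
}
"""

# Toy type checks for the few Avro primitives used above.
PRIMITIVES = {
    "null": lambda v: v is None,
    "string": lambda v: isinstance(v, str),
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def matches(avro_type, value):
    """Return True if value fits a primitive type or a union (JSON list) of them."""
    if isinstance(avro_type, list):
        return any(matches(t, value) for t in avro_type)
    return PRIMITIVES[avro_type](value)


def conforms(schema, record):
    """Check that record has exactly the schema's fields, each with a matching type."""
    fields = {f["name"]: f["type"] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    return all(matches(fields[name], record[name]) for name in fields)


schema = json.loads(SCHEMA_JSON)

mapped = {"readName": "read1", "referenceName": "chr1",
          "start": 10000, "duplicateRead": False}
unmapped = {"readName": "read2", "referenceName": None,
            "start": None, "duplicateRead": False}
malformed = {"readName": "read3", "referenceName": "chr1",
             "start": "10000", "duplicateRead": False}

for record in (mapped, unmapped, malformed):
    print(record["readName"], conforms(schema, record))
```

Because the schema is plain JSON, a library in any language can read the same definition and agree on field names, types, and which fields may be null. That kind of ambiguity is what the incompatibilities between SAM/BAM/VCF libraries come from.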
In addition to improving the format’s cross-platform portability, these changes lead to significant performance improvements. On a single node, we are able to speed up sort and duplicate marking by 2×. More importantly, on a 250 gigabyte (GB), high-coverage (60×) human genome, this system achieves a 50× speedup on a 100-node computing cluster (see Table 1), fulfilling the promise of scalability of ADAM.
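As a rough way to read the cluster figure, parallel efficiency is speedup divided by node count. The sketch below assumes the 50× speedup is measured against a single-node run of the same workload. The text above does not state the baseline, so treat the result as illustrative only.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear scaling achieved: speedup / nodes."""
    return speedup / nodes


# Figures quoted in the text; the single-node baseline is an assumption.
cluster_speedup = 50.0
cluster_nodes = 100

efficiency = parallel_efficiency(cluster_speedup, cluster_nodes)
print(f"{efficiency:.0%} of ideal linear scaling")
```

Under that assumption, the cluster delivers half of ideal linear scaling at 100 nodes, well beyond the eight-node limit noted for earlier attempts to distribute BAM.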
You can read the full tech report at http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-207.html