ADAM 0.10.0 is now available.
Parquet 1.4.3 upgrade
As developers, you need to know about a small (but significant) change to the way Parquet projections work. In the past, any record fields that were not in the projection were returned as null values. With our upgrade to Parquet 1.4.3, fields that are not in your projection will be set to their default value (if one exists).
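To make the behavior change concrete, here is a hypothetical sketch of the two projection semantics. This is plain Python standing in for the Parquet read path, not ADAM's or Parquet's actual API; the `DEFAULTS` map stands in for the defaults declared in the Avro/Parquet schema.

```python
# Schema defaults for an imaginary read record (assumption for illustration).
DEFAULTS = {"readName": None, "sequence": None, "mapq": 0}

def project_pre_143(record, fields):
    # Before Parquet 1.4.3: every field outside the projection
    # comes back as null, regardless of its schema default.
    return {k: (v if k in fields else None) for k, v in record.items()}

def project_143(record, fields):
    # With Parquet 1.4.3: fields outside the projection are set to
    # their schema default, if one exists.
    return {k: (v if k in fields else DEFAULTS[k]) for k, v in record.items()}

read = {"readName": "r1", "sequence": "ACGT", "mapq": 37}
project_pre_143(read, {"readName"})  # mapq comes back as None
project_143(read, {"readName"})      # mapq comes back as its default, 0
```

Code that previously relied on unprojected fields being null should be updated to compare against the schema defaults instead.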
BQSR Performance and Concordance
This release adds the cycle covariate to BQSR. Our BQSR implementation can process the 1000 Genomes NA12878 high-coverage whole genome in under 45 minutes using 100 EC2 nodes. Our concordance with GATK is high (>99.9%) for the data we’ve tested. We’ll publish the scripts we use for concordance testing in an upcoming release to allow broader, automated testing. Additionally, ADAM now processes the OQ tag to make comparing original quality scores easier.
Lots of other bug fixes and features…
- ISSUE 242: Upgrade to Parquet 1.4.3
- ISSUE 241: Fixes to FASTA code to properly handle indices.
- ISSUE 239: Make ADAMVCFOutputFormat public
- ISSUE 233: Build up reference information during cigar processing
- ISSUE 234: Predicate to filter conversion
- ISSUE 235: Remove unused contiglength field
- ISSUE 232: Add
- ISSUE 230: Remove duplicate mdtag field
- ISSUE 231: Helper scripts to run an ADAM Console.
- ISSUE 226: Fix ReferenceRegion from ADAMRecord
- ISSUE 225: Change Some to Option to check for unmapped reads
- ISSUE 223: Use SparkConf object to configure SparkContext
- ISSUE 217: Stop using reference IDs and use reference names instead
- ISSUE 220: Update SAM to ADAM conversion
- ISSUE 213: BQSR updates
Frank Nothaft gave a talk at the Strata Big Data Science Meetup where he discussed the overall architecture and performance characteristics of ADAM.
Frank is a graduate student in the AMP and ASPIRE labs at UC Berkeley, advised by Prof. David Patterson. His current focus is on high-performance computer systems for bioinformatics, and he is involved in the ADAM, avocado, and FireBox projects.
As open source, big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data professionals are already beginning to make an impact.
The Tim O’Reilly post, “Work on Stuff that Matters: First Principles”, is a great read that encourages readers to work on something that matters to them more than money, to create more value than they capture, and to take the long view. While ADAM is a very new project, we hope that it will be used someday soon to help save lives and enable researchers to ask questions that aren’t possible today. We are taking the long view and sharing all this valuable software as Apache-licensed open source.
If you’re interested in contributing, join our mailing list and introduce yourself. We also mark open issues that are good introductory projects with a “pick me up!” label. Of course, we welcome any contribution you’d like to make!
Current genomics data formats and processing pipelines are not designed to scale well to large datasets. The current Sequence/Binary Alignment/Map (SAM/BAM) formats were intended for single node processing. There have been attempts to adapt BAM to distributed computing environments, but they see limited scalability past eight nodes. Additionally, due to the lack of an explicit data schema, there are well known incompatibilities between libraries that implement SAM/BAM/Variant Call Format (VCF) data access.
To address these problems, we introduce ADAM, a set of formats, APIs, and processing stage implementations for genomic data. ADAM is fully open source under the Apache 2 license, and is implemented on top of Avro and Parquet for data storage. Our reference pipeline is implemented on top of Spark, a high performance in-memory map-reduce system. This combination provides the following advantages:
- Avro provides explicit data schema access in C/C++/C#, Java/Scala, Python, PHP, and Ruby;
- Parquet allows access by database systems like Impala and Shark; and
- Spark improves performance through in-memory caching and reducing disk I/O.
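As an example of the explicit schema access Avro provides, a read record can be declared in a language-neutral schema like the sketch below. This is a simplified, hypothetical fragment for illustration, not ADAM's actual schema; note how each field carries a type and a default, which any of the supported language bindings can consume.

```json
{
  "type": "record",
  "name": "Read",
  "fields": [
    {"name": "readName", "type": ["null", "string"], "default": null},
    {"name": "sequence", "type": ["null", "string"], "default": null},
    {"name": "mapq",     "type": "int",              "default": 0}
  ]
}
```

Because the schema is explicit and self-describing, libraries in different languages read and write the same records without the ad hoc incompatibilities that plague SAM/BAM/VCF parsers.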
In addition to improving the format’s cross-platform portability, these changes lead to significant performance improvements. On a single node, we are able to speed up sort and duplicate marking by 2×. More importantly, on a 250 Gigabyte (GB) high (60×) coverage human genome, this system achieves a 50× speedup on a 100-node computing cluster (see Table 1), fulfilling ADAM’s promise of scalability.
You can read the full tech report at http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-207.html.