
ADAM 0.14.0 is now available. Special thanks to Arun Ahuja, Timothy Danford, Michael L Heuer, Uri Laserson, Frank Nothaft, Andy Petrella and Ryan Williams for their contributions to this release!

This release uses the newly released Apache Spark 1.1.0, which brings operational and performance improvements in Spark core. Two new scripts, adam-shell and adam-submit, allow you to use ADAM via the Spark shell or the Spark submit script in addition to the ADAM CLI.

The Hadoop-BAM team is now publishing their artifacts to Maven Central (yea!) so we no longer rely on snapshot releases. ADAM 0.14.0 uses the 7.0.0 release of Hadoop-BAM.

This release also adds a new Java plugin interface, improves MD tag processing, and fixes numerous bugs.

We hope that you enjoy this release. Drop by #adamdev on freenode.net, follow us on Twitter or subscribe to our mailing list to stay in touch.

ADAM 0.13.0 is now available!

This release includes genome visualization for viewing aligned reads and coverage information over a reference region. For example, run adam viz myreads.adam chr1 from the ADAM source directory, then open http://localhost:8080/ in your favorite web browser to view your data.

This release also includes a number of features and bug fixes, including an upgrade to Spark 1.0.1.

ADAM 0.12.0 is now available!

This release includes new Parquet utilities that are part of an effort to read/write Parquet directly on S3, eliminating the need to transfer data from S3 to HDFS for processing. This release also upgrades ADAM to Spark 1.0 and provides new schema definitions, bug fixes and features:

  • ISSUE 264: Parquet-related Utility Classes
  • ISSUE 259: ADAMFlatGenotype is a smaller, flat version of a genotype schema
  • ISSUE 266: Removed extra command ‘BuildInformation’
  • ISSUE 263: Added AdamContext.referenceLengthFromCigar
  • ISSUE 260: Modifying conversion code to resolve #112.
  • ISSUE 258: Adding an ‘args’ parameter to the plugin framework.
  • ISSUE 262: Adding reference assembly name to ADAMContig.
  • ISSUE 256: Upgrading to Spark 1.0
  • ISSUE 257: Adds toString method for sequence dictionary.
  • ISSUE 255: Add equals, canEqual, and hashCode methods to MdTag class
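
For readers unfamiliar with what a helper like referenceLengthFromCigar (ISSUE 263) computes, here is a minimal Python sketch of the idea. It is not ADAM's implementation, and the function name and error handling below are our own assumptions. It relies only on the SAM convention that the M, D, N, = and X operators consume reference bases, while I, S, H and P do not.

```python
import re

# CIGAR operators that consume reference bases, per the SAM format.
REFERENCE_CONSUMING_OPS = set("MDN=X")

# One CIGAR element: a length followed by a single operator character.
_CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_length_from_cigar(cigar):
    """Return the number of reference bases spanned by a CIGAR string.

    Hypothetical helper for illustration; not the ADAM API.
    """
    if cigar == "*":  # SAM uses "*" when no CIGAR is available
        return 0
    elements = _CIGAR_ELEMENT.findall(cigar)
    # Reject strings with characters the pattern did not account for.
    if "".join(length + op for length, op in elements) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return sum(int(length) for length, op in elements
               if op in REFERENCE_CONSUMING_OPS)


# 10 matched bases plus a 5-base deletion span 15 reference bases;
# the insertion and the soft clip do not advance along the reference.
span = reference_length_from_cigar("10M2I5D3S")
```

Adding this span to a read's start position gives the reference coordinate just past its last aligned base, which is what makes the quantity useful for building regions from reads.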

ADAM 0.11.0 is now available.

This release allows you not just to read but also to write SAM/BAM files, adds utilities for trimming reads, implements contig-to-RefSeq translation, refactors SequenceDictionary to include RefSeq information (and drop numeric IDs), prepares ADAMGenotype for incorporating reference model information, and fixes a bug in FASTA fragments.

For details see the following issues…

  • ISSUE 250: Adding ADAM to SAM conversion.
  • ISSUE 248: Adding utilities for read trimming.
  • ISSUE 252: Added a note about rebasing-off-master to CONTRIBUTING.md
  • ISSUE 249: Cosmetic changes to FastaConverter and FastaConverterSuite.
  • ISSUE 251: CHANGES.md is updated at release instead of per pull request
  • ISSUE 247: For #244, Fragments were incorrect order and incomplete
  • ISSUE 246: Making sample ID field in genotype nullable.
  • ISSUE 245: Adding ADAMContig back to ADAMVariant.
  • ISSUE 243: Rebase PR#238 onto master


This short screencast is meant to get someone new to Scala, IntelliJ, and the Big Data Genomics stack up and running with a configured development environment suitable for working with or on projects like ADAM and Avocado.

We’ll walk you through downloading the appropriate JDK, IntelliJ IDE, and plugins. Then we will set up the project (using ADAM as the example), generate sources, package the application, and build the project. Finally, we cover running tests, as well as some basic exploration and code navigation using the IDE.


Note: if you have trouble running mvn package on the command line, you may want to add the following to your .bashrc, or at least export these environment variables before running mvn package:

export MAVEN_OPTS="-Xmx512m -XX:MaxPermSize=128M"
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`


ADAM 0.10.0 is now available.

Parquet 1.4.3 upgrade

As developers, you need to know about a small (but significant) change to the way Parquet projections work. In the past, any record fields that were not in the projection were returned as null values. With our upgrade to Parquet 1.4.3, fields that are not in your projection will be set to their default value (if one exists).
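
To make the difference concrete, here is a toy model in Python. It does not use the Parquet or ADAM APIs; the schema, field names, and default values are made up for illustration, and we assume that a field with no declared default still comes back as null.

```python
# Toy model of the projection change. NOT the Parquet or ADAM API:
# the schema, field names, and defaults below are hypothetical.

NO_DEFAULT = object()  # marks a field that declares no default value

SCHEMA = {
    "readName": NO_DEFAULT,
    "mapq": 0,            # hypothetical default
    "readMapped": False,  # hypothetical default
}


def project_old(record, projection):
    """Old behavior: fields outside the projection come back as null."""
    return {field: (record[field] if field in projection else None)
            for field in SCHEMA}


def project_new(record, projection):
    """New behavior: fields outside the projection get their default."""
    result = {}
    for field, default in SCHEMA.items():
        if field in projection:
            result[field] = record[field]
        elif default is NO_DEFAULT:
            result[field] = None  # assumption: no default means still null
        else:
            result[field] = default
    return result


record = {"readName": "read1", "mapq": 37, "readMapped": True}
projection = {"readName"}

old_view = project_old(record, projection)
new_view = project_new(record, projection)
```

In this model, old_view reports mapq as null while new_view reports it as 0, which looks just like a genuinely stored 0. If your code checks for null to detect fields that were left out of a projection, it is worth reviewing after the upgrade.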

BQSR Performance and Concordance

This release adds the cycle covariate to BQSR. Our BQSR implementation can process the 1000g NA12878 High-Coverage Whole Genome in under 45 minutes using 100 EC2 nodes. Our concordance with GATK is high (>99.9%) for the data we’ve tested. We’ll publish the scripts we use for concordance testing in an upcoming release to allow broader, automated testing. Additionally, ADAM now processes the OQ tag to make comparing original quality scores easier.

Lots of other bug fixes and features…

  • ISSUE 242: Upgrade to Parquet 1.4.3
  • ISSUE 241: Fixes to FASTA code to properly handle indices.
  • ISSUE 239: Make ADAMVCFOutputFormat public
  • ISSUE 233: Build up reference information during cigar processing
  • ISSUE 234: Predicate to filter conversion
  • ISSUE 235: Remove unused contiglength field
  • ISSUE 232: Add -pretty and -o to the print command
  • ISSUE 230: Remove duplicate mdtag field
  • ISSUE 231: Helper scripts to run an ADAM Console.
  • ISSUE 226: Fix ReferenceRegion from ADAMRecord
  • ISSUE 225: Change Some to Option to check for unmapped reads
  • ISSUE 223: Use SparkConf object to configure SparkContext
  • ISSUE 217: Stop using reference IDs and use reference names instead
  • ISSUE 220: Update SAM to ADAM conversion
  • ISSUE 213: BQSR updates

Frank Nothaft gave a talk at the Strata Big Data Science Meetup where he discussed the overall architecture and performance characteristics of ADAM.

Frank is a graduate student in the AMP and ASPIRE labs at UC-Berkeley, and is advised by Prof. David Patterson. His current focus is on high-performance computer systems for bioinformatics, and he is involved in the ADAM, avocado, and FireBox projects.

ADAM was featured in an article posted to both the O’Reilly Strata and Forbes websites. The article opens with the following:

As open source, big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data professionals are already beginning to make an impact.

The Tim O’Reilly post, “Work on Stuff that Matters: First Principles”, is a great read that encourages readers to work on something that matters to them more than money, create more value than they capture, and take the long view. While ADAM is a very new project, we hope that it will be used someday soon to help save lives and enable researchers to ask questions that aren’t possible today. We are taking the long view and sharing all this valuable software as Apache-licensed open source.

If you’re interested in contributing, join our mailing list and introduce yourself. We also mark open issues that are good introductory projects with a “pick me up!” label. Of course, we welcome any contribution you’d like to make!