ADAM version 0.21.0 has been released!
Due to major changes between Spark versions 1.6 and 2.0, we now build for combinations of Apache Spark and Scala versions: Spark 1.x and Scala 2.10, Spark 1.x and Scala 2.11, Spark 2.x and Scala 2.10, and Spark 2.x and Scala 2.11. The Spark 2.x build-time dependency will be bumped to version 2.1.0 in the next release of ADAM (see issue #1330).
One focus of this release was documentation, both at the developer API level, including extensive javadoc and scaladoc source code comments, and at the user level (e.g., https://github.com/bigdatagenomics/adam/tree/master/docs/source). The user docs can be compiled to PDF or HTML with pandoc, but to be honest they look better rendered as Markdown on GitHub.
Another focus was to more closely follow the VCF specification(s) when reading from and writing to VCF. For this we made significant changes to our variant and variant annotation schema and added support for version 1.0 of the VCF INFO ‘ANN’ key specification. This work will continue for our genotype and genotype annotation schema in the next version of ADAM.
The full list of changes since version 0.20.0 is below.
ADAM version 0.20.0 has been released!
Due to major changes between Spark versions 1.6 and 2.0, we now build for combinations of Apache Spark and Scala versions: Spark 1.x and Scala 2.10, Spark 1.x and Scala 2.11, Spark 2.x and Scala 2.10, and Spark 2.x and Scala 2.11.
Since the last release, version 0.19.0, we have closed more than 180 issues and merged more than 120 pull requests.
We added a new pipe API, allowing for streaming alignment and variant records out to external applications and streaming the results back in. Several new region join implementations are now public API, including a broadcast inner join, broadcast right outer join, sort-merge inner join, sort-merge right outer join, sort-merge left outer join, sort-merge full outer join, sort-merge inner join followed by a group by, and a sort-merge right outer join followed by a group by.
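The sort-merge joins listed above share one core technique: with both inputs sorted by position, stream through them once while caching only the right-side records that can still overlap an upcoming left record. Here is a minimal single-contig sketch of that idea in plain Scala; the names and types are illustrative stand-ins, not ADAM's actual API.

```scala
// Sketch of a sort-merge inner join over genomic intervals on one contig.
// Simplified stand-in for the technique, not ADAM's implementation.
case class Interval(start: Long, end: Long) {
  def overlaps(that: Interval): Boolean = start < that.end && that.start < end
}

object SortMergeJoin {
  // Both inputs must be sorted by start position; emits every overlapping pair.
  def innerJoin[A, B](left: Seq[(Interval, A)],
                      right: Seq[(Interval, B)]): Seq[(A, B)] = {
    val out   = scala.collection.mutable.ArrayBuffer.empty[(A, B)]
    var cache = List.empty[(Interval, B)]
    var ri    = 0
    for ((li, la) <- left) {
      // Evict cached right intervals that end before this left interval starts;
      // since left is sorted, they cannot overlap any later left interval either.
      cache = cache.filter(_._1.end > li.start)
      // Admit right intervals that start before this left interval ends.
      while (ri < right.length && right(ri)._1.start < li.end) {
        cache ::= right(ri)
        ri += 1
      }
      for ((rIv, rb) <- cache if li.overlaps(rIv)) out += ((la, rb))
    }
    out.toSeq
  }
}
```

The outer joins and group-by variants follow the same streaming pattern, differing only in what is emitted for unmatched records.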
Alignment records can now be read from and written to CRAM format. We updated upstream dependencies on Hadoop-BAM and htsjdk to fix various alignment record header bugs and to add support for gzip and BGZF compressed VCF.
Our sequence feature schema now more closely follows the GFF3 specification, while still supporting BED, GFF2/GTF, IntervalList, and NarrowPeak formats. We also added a new sample schema for, e.g., SRA sample metadata.
With this version the core ADAM APIs are undergoing a major refactoring. We changed many method names on ADAMContext to make the API more consistent. We also added RDD wrapper classes to increase performance by serializing metadata (such as record groups, samples, and sequence dictionaries) to disk separate from primary data in Parquet. API incompatibilities between ADAM releases will settle down by the 1.0 release, currently targeted for early 2017.
The full list of changes since version 0.19.0 is below.
The 0.19.0 release contains various concordance fixes and performance improvements for accessing read metadata. Schema changes, including a bump to version 0.7.0 of the Big Data Genomics Avro data formats, were made to support the read metadata performance improvements. Additionally, exporting a single BAM file was made faster and is now guaranteed to be correct for sorted data.
ADAM now targets Apache Spark 1.5.2 and Apache Hadoop 2.6.0 as the default build environment. ADAM and applications built on ADAM should run on a wide range of Apache Spark (1.3.1 up to and including the most recent, 1.6.0) and Apache Hadoop (currently 2.3.0 and 2.6.0) versions. A compatibility matrix of Spark, Hadoop, and Scala version builds in our continuous integration system verifies this. Please note, as of this release, support for Apache Spark 1.2.x and Apache Hadoop 1.0.x has been dropped.
The full list of changes since version 0.18.2 is below.
A few ADAM releases have been made since the last announcement; we’ll attempt to catch up here.
Prior to version 0.18.2, we made significant changes to support version 0.6.0 of the Big Data Genomics Avro data formats. We also improved performance on core transforms (markdups, indel realignment, bqsr) by using finer-grained projection. Some issues in 2bit file handling when dealing with gaps and masked regions were fixed. Round-trip transformations from native formats (e.g., FASTA, FASTQ, SAM, BAM) to ADAM and back have been improved. We made extending ADAM more straightforward.
ADAM now runs on a wide range of Apache Spark (1.2.1 up to and including the most recent, 1.5.1) and Apache Hadoop (currently 1.0.4, 2.3.0 and 2.6.0) versions. This is verified by a compatibility matrix of Spark, Hadoop, and Scala version builds in our continuous integration system.
The full list of changes since version 0.17.0 is below.
Special thanks to Neil Ferguson for this blog post on genomic analysis using ADAM, Spark, and deep learning.
Can we use deep learning to predict which population group you belong to, based solely on your genome?
Yes, we can – and in this post, we will show you exactly how to do this in a scalable way, using Apache Spark. We will explain how to apply deep learning using artificial neural networks to predict which population group an individual belongs to – based entirely on his or her genomic data.
This is a follow-up to an earlier post, Scalable Genomes Clustering With ADAM and Spark, and attempts to replicate its results. However, we will use a different machine learning technique: where the original post used k-means clustering, we will use deep learning.
The 0.17.0 release of ADAM includes a release for Scala 2.10 and a release for Scala 2.11. We’ve been working to clean up APIs and simplify ADAM for developers. Code that isn’t useful has been removed. Code that belongs in other downstream or upstream projects has been moved. Parquet and HTSJDK have been upgraded.
There are also some new features: you can now transform all the SAM/BAM files in a directory by specifying the directory, and there’s a new flatten command that allows you to flatten the schema of ADAM data for processing in Impala, Hive, Spark SQL, etc. There are also many bug fixes.
ADAM 0.16.0 is now available.
This release improves the performance of Base Quality Score Recalibration (BQSR) by 3.5x, adds support for multiline FASTQ input and visualization of variants from VCF input, includes a new shuffle-based RegionJoin implementation, and adds new methods for region coverage calculations.
Drop into our Gitter channel to talk with us about this release.
In this post, we will detail how to perform simple scalable population stratification analysis, leveraging ADAM and Spark MLlib, as previously presented at scala.io.
The data source is the set of genotypes from the 1000 Genomes Project, resulting from whole genome sequencing of samples taken from about 1000 individuals with a known geographic and ethnic origin.
This dataset is rather large, which allows us to test the scalability of the methods we present here and opens the door to interesting machine learning. Based on the data we have, we can for example:
- build models to classify genomes by population
- run unsupervised learning (clustering) to see if populations are reconstructed in the model
- build models to infer missing genotypes
We’ve gone the second way (clustering), the line-up being the following:
- Setup the environment
- Collection and extraction of the original data
- Distribute the original data and convert it to the ADAM model
- Collect metadata (sample labels and completeness)
- Filter the data to match our cluster capacity (number of nodes, CPUs, memory, and wall clock time…)
- Read and prepare the ADAM-formatted, distributed genotypes so that they live in a separable high-dimensional space (a distance metric is needed)
- Apply KMeans (train/predict)
- Assess performance
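The KMeans step in the line-up above is done with Spark MLlib in the post itself; to make the train/predict idea concrete, here is a tiny self-contained sketch of Lloyd's algorithm in plain Scala, run on toy genotype vectors (e.g., alternate-allele counts per variant). All names here are illustrative, not MLlib's API.

```scala
// Toy k-means (Lloyd's algorithm) on small dense vectors.
// Illustrative sketch only; the post uses Spark MLlib's distributed KMeans.
object TinyKMeans {
  type Vec = Array[Double]

  // Squared Euclidean distance between two vectors of equal length.
  def dist2(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Predict: index of the nearest cluster center.
  def predict(centers: Seq[Vec], v: Vec): Int =
    centers.indices.minBy(i => dist2(centers(i), v))

  // Train: assign points to nearest center, recompute centers, repeat.
  def train(points: Seq[Vec], k: Int, iters: Int): Seq[Vec] = {
    var centers: Seq[Vec] = points.take(k) // naive init for the sketch
    for (_ <- 1 to iters) {
      val clusters = points.groupBy(p => predict(centers, p))
      centers = centers.indices.map { i =>
        clusters.get(i) match {
          case Some(ps) => ps.transpose.map(col => col.sum / ps.size).toArray
          case None     => centers(i) // keep empty clusters in place
        }
      }
    }
    centers
  }
}
```

The assessment step then amounts to comparing predicted cluster indices against the known population labels.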
We’re proud to announce the release of ADAM 0.15.0!
This release includes important memory and performance improvements, better documentation, new features and many bug fixes.
We have upgraded to Parquet 1.6.0 in order to dramatically reduce our memory footprint. For string columns with dictionary encoding, the amount of memory used will now be proportional to the number of dictionary entries instead of the number of records materialized. Parquet 1.6.0 also provides improved column statistics and the ability to store custom metadata. We will use these features in subsequent ADAM releases to improve random access performance. Note that ADAM 0.14.0 had a serious memory regression, so upgrading to 0.15.0 as soon as possible is recommended.
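The memory effect described above falls straight out of how dictionary encoding works: each distinct string is stored once, and rows hold small integer codes. A toy sketch of the idea (illustrative only, not Parquet's actual implementation):

```scala
// Toy dictionary encoding for a string column, e.g. a contig-name column
// where millions of rows share a handful of distinct values.
object DictionaryEncoding {
  def encode(column: Seq[String]): (Vector[String], Seq[Int]) = {
    val dict  = column.distinct.toVector      // one entry per distinct value
    val index = dict.zipWithIndex.toMap
    (dict, column.map(index))                 // rows become integer codes
  }

  def decode(dict: Vector[String], codes: Seq[Int]): Seq[String] =
    codes.map(dict)
}
```

For a contig-name column with, say, 25 distinct values over millions of reads, memory for the strings scales with 25, not with the row count.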
We are unhappy with the quality of the documentation we have been providing ADAM users and are working to improve it. With this release, all documentation has been centralized into the ./docs directory and we’re using pandoc to convert the Markdown source into both PDF and HTML formats. We are committed to improving the content of the docs over time and welcome your pull requests!
This release includes binary distributions to make it easier for you to get up and running with ADAM. We do not include any Spark or Hadoop artifacts in order to prevent versioning conflicts. For application developers, we have also changed our Spark and Hadoop dependencies to provided. This means that you can more easily run ADAM with your preferred Spark and Hadoop versions and configuration. We want to make deployment as easy as possible.
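For an sbt-based application depending on ADAM, this typically amounts to marking the Spark and Hadoop dependencies as provided; the coordinates and version numbers below are illustrative, not prescriptive.

```scala
// Illustrative sbt build settings: "provided" keeps Spark and Hadoop off the
// application jar's classpath, so the versions on your cluster are used at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.1.0" % "provided",
  "org.apache.hadoop"  % "hadoop-client" % "2.3.0" % "provided"
)
```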
This release includes numerous features and bug fixes that are detailed below:
We are at a time where biotech allows us to get personal genomes for $1000. Tremendous progress in DNA sequencing has been made since the 70s, e.g. more samples in an experiment and greater genomic coverage at higher speed. Genomic analysis standards that have been developed over the years weren’t designed with scalability and adaptability in mind. In this talk, we’ll present a game-changing technology in this area, ADAM, initiated by the AMPLab at Berkeley. ADAM is a framework based on Apache Spark and the Parquet storage format. We’ll see how it can speed up a sequence reconstruction by a factor of 150.
Andy and Xavier’s talk included a demo: using Spark’s MLlib to do population stratification across 1000 Genomes in just a few minutes in the cloud using Amazon Web Services (AWS). Their talk highlights the advantages of building on open-source technologies, like Apache Spark and Parquet, designed for performance and scale.