Solutions architects Christopher Crosbie and Ujjwal Ratan of Amazon Web Services (AWS) detailed their precision medicine modeling with Spark on Amazon Elastic MapReduce (EMR) and the recently open-sourced ADAM project, a genomics data compute platform built on Apache Avro and Parquet.
On why Spark with EMR is appealing to healthcare researchers, Crosbie and Ratan noted that:
Most treatment plans are deeply subjective, due to the many disparate pieces of information that physicians must tie together to make a health plan based on the individual's specific conditions.
The large volume of data, coupled with the complexity of the data sets themselves, requires existing models to execute large numbers of permutations, which can be parallelized with Spark on EMR.
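The parallelism described above can be illustrated with a minimal sketch in plain Python. This is not Spark code: the stdlib thread pool stands in for Spark's map-style distribution of independent tasks across a cluster, and the `score_model` function and its parameters are hypothetical placeholders for a real model evaluation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def score_model(params):
    # Placeholder for an expensive model evaluation over one parameter
    # combination; in practice this would be a genomics model run.
    threshold, weight = params
    return threshold * weight

# Every permutation is independent of the others, so the grid can be
# evaluated with a map-style parallel operation, which is the pattern
# that transfers directly to Spark on a cluster.
grid = list(product([0.1, 0.5, 0.9], [1, 2, 3]))
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(score_model, grid))

best_score, best_params = max(zip(scores, grid))
```

In Spark the same shape becomes a `map` over a distributed collection of parameter combinations, so scaling to larger grids is a matter of adding nodes rather than rewriting the loop.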
To properly collect and curate biological data for comparison, correlate outcomes, and track biomarkers across varying patient populations, researchers face challenges scaling algorithms and code that weren't necessarily engineered for cloud scale. On these challenges, Crosbie and Ratan noted that:
The anachronism for many of the most common genomics algorithms today is the failure to properly optimize for cloud technology... Memory requirements are often limited to a single compute node or expect a POSIX file system... a shift to cloud computing will be necessary as we move to the full-scale population studies that will be required to develop novel methods in precision medicine... As with most areas of science, building knowledge often goes hand in hand with legacy analysis features that turn out to be outdated as the discipline evolves.
Parquet's columnar storage format is reportedly well suited to reducing the I/O load of retrieving more genomic data than a given computation needs, and to improving storage compression ratios, since the dimensions of the data fit well into a columnar layout.
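The I/O saving can be sketched in plain Python without Parquet itself: in a row-oriented layout, reading one field still drags every full record through I/O, while a columnar layout lets a query touch only the bytes of the columns it needs. The genotype record fields below are hypothetical, and string length stands in for bytes read.

```python
# Conceptual sketch (not actual Parquet): hypothetical genotype records
# with several fields per record.
records = [
    {"sample": f"S{i}", "chrom": "1", "pos": 1000 + i,
     "genotype": "0/1", "quality": 99}
    for i in range(1000)
]

# Row-oriented layout: fetching one field still scans every full record.
row_bytes_scanned = sum(len(str(r)) for r in records)

# Columnar layout: each field is stored contiguously, so a query that
# only needs `pos` reads just that column's bytes.
columns = {key: [r[key] for r in records] for key in records[0]}
col_bytes_scanned = sum(len(str(v)) for v in columns["pos"])
```

Since genomic computations often touch a handful of fields out of many per record, the pruned column scan is a small fraction of the full-record scan; storing each column contiguously also groups similar values together, which is what improves compression ratios.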
Parquet with HDFS- and S3-backed storage, a new Avro schema and file format (bdg), and Spark's in-memory caching drove a reported 25% decrease in file storage size compared to platforms built on BAM files like Hadoop-BAM, and a reported 50x performance gain on a 100-node cluster. Crosbie and Ratan cited these performance gains from a 2013 University of California, Berkeley EECS paper, and went on to demonstrate ADAM on Spark on EMR using data from the 1000 Genomes Project. They also noted AWS's public data sets program, which provides public access to popular genomics data sets like TCGA and ICGC without researchers having to pay for raw storage.
Spark doesn't readily address every bioinformatics problem, though. Some processes don't lend themselves to being parallelized into small tasks. Crosbie and Ratan cited gene sequence assembly as one task that won't necessarily transition well to Spark.
Questions around the need for ADAM came up in reference to the Broad Institute's recent collaboration with Cloudera and the release of the GATK4 pipeline running on Spark. Crosbie and Ratan cited ADAM as an example of where Spark on EMR could be applied in bioinformatics but did not endorse it over other solutions. They mentioned the VariantSpark project as a similarly architected platform in the space, but picked ADAM for demonstration purposes because of its stable release and its status as an early and well-known Spark project in bioinformatics.