The NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance data analytics because the framework runs on commodity clusters and combines distributed storage of large data sets with parallel computation. Glenn Tamkin of the NASA team spoke at the ApacheCon conference last month and shared the details of the platform the team built for climate data analysis with Hadoop.
Scientific data services are a critical part of the NCCS group's mission: it provides the IT infrastructure for developing and evaluating next-generation climate data analysis capabilities. The team's goal is to provide a test bed for experimental development of high-performance analytics as well as an architectural approach to climate data services.
Hadoop is best known for text-based problems, but the NCCS datasets and use cases involve binary data, so the team created custom Java applications to read and write the data during the MapReduce process. The solution uses a custom composite key design for fast data access and also leverages the Hadoop Bloom filter, a probabilistic data structure that can rapidly and memory-efficiently test whether an element is present in a set. In Hadoop terms, the BloomMapFile can be thought of as an enhanced MapFile: in addition to the regular index, it carries a Bloom filter that lets lookups for absent keys be rejected quickly when seeking data.
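As a rough illustration of the idea (not the NCCS code itself), the sketch below opens a BloomMapFile with Hadoop's standard I/O classes and consults the Bloom filter before doing the index lookup; the directory path and the key are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

public class BloomLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // "/data/merra/bloom" is a placeholder path to a BloomMapFile directory.
        BloomMapFile.Reader reader =
                new BloomMapFile.Reader(fs, "/data/merra/bloom", conf);

        Text key = new Text("1979-01-01T00:00:00_T");   // hypothetical composite key
        BytesWritable value = new BytesWritable();

        // The Bloom filter answers "definitely absent" cheaply; only probable
        // hits fall through to the regular MapFile index lookup on disk.
        if (reader.probablyHasKey(key) && reader.get(key, value) != null) {
            System.out.println("Found record of " + value.getLength() + " bytes");
        } else {
            System.out.println("Key not present");
        }
        reader.close();
    }
}
```

Because a Bloom filter can only produce false positives, a negative answer from probablyHasKey() means the on-disk lookup can be skipped entirely.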
The original MapReduce application used standard Hadoop sequence files; it was later modified to support three different file formats, called Sequence, Map, and Bloom. Performance increases of about 30-80% were observed after the Bloom filter was added.
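To make the difference between the three formats concrete, here is a minimal sketch (paths and record content invented for illustration, not taken from the NCCS application) that writes the same key/value record with the SequenceFile, MapFile, and BloomMapFile writers from Hadoop's I/O package.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class FormatWriters {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Text key = new Text("1979-01-01T00:00:00_T");          // hypothetical key
        BytesWritable value = new BytesWritable(new byte[]{1, 2, 3});

        // Sequence: plain key/value records, read by scanning sequentially.
        SequenceFile.Writer seq = SequenceFile.createWriter(fs, conf,
                new Path("/tmp/merra.seq"), Text.class, BytesWritable.class);
        seq.append(key, value);
        seq.close();

        // Map: a sorted sequence file plus an index for random access by key.
        MapFile.Writer map = new MapFile.Writer(conf, fs, "/tmp/merra.map",
                Text.class, BytesWritable.class);
        map.append(key, value);
        map.close();

        // Bloom: a MapFile plus a Bloom filter for fast membership tests.
        BloomMapFile.Writer bloom = new BloomMapFile.Writer(conf, fs,
                "/tmp/merra.bloom", Text.class, BytesWritable.class);
        bloom.append(key, value);
        bloom.close();
    }
}
```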
Tamkin also discussed the MapReduce workflow of the use case. The team had three options for ingesting the data into the HDFS store:
- Option 1: Put the MERRA data into Hadoop with no changes. This would require the team to write a custom mapper to parse the files.
- Option 2: Write a custom NetCDF-to-Hadoop sequencer and keep the files together. This puts indexes into the files so that Hadoop can parse them by key, and it maintains the NetCDF metadata for each file.
- Option 3: Write a custom NetCDF-to-Hadoop sequencer and split the files apart (allowing smaller block sizes). However, this approach breaks the connection between the NetCDF metadata and the data.
The team chose option 2 and used the sequence file format. During sequencing, the data is partitioned by time, so that each record in the sequence file contains the timestamp and the name of the parameter (e.g., temperature) as the composite key, and the value of that parameter (which can have one to three spatial dimensions).
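The talk does not include source code, but a composite key along those lines might look like the following minimal WritableComparable sketch; the class name, fields, and ordering are assumptions made for illustration, not the actual NCCS implementation.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

/** Illustrative composite key: a time slice plus a parameter name. */
public class TimestampParameterKey implements WritableComparable<TimestampParameterKey> {

    private long timestamp;      // e.g., epoch seconds of the time slice
    private String parameter;    // e.g., "T" for temperature

    public TimestampParameterKey() { }            // required by Hadoop serialization

    public TimestampParameterKey(long timestamp, String parameter) {
        this.timestamp = timestamp;
        this.parameter = parameter;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);
        out.writeUTF(parameter);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();
        parameter = in.readUTF();
    }

    @Override
    public int compareTo(TimestampParameterKey other) {
        // Sort by time first, then by parameter name, so records for the
        // same time slice are stored together and can be located quickly.
        int cmp = Long.compare(timestamp, other.timestamp);
        return cmp != 0 ? cmp : parameter.compareTo(other.parameter);
    }

    @Override
    public int hashCode() {
        return Long.hashCode(timestamp) * 31 + parameter.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TimestampParameterKey)) return false;
        TimestampParameterKey k = (TimestampParameterKey) o;
        return timestamp == k.timestamp && parameter.equals(k.parameter);
    }

    @Override
    public String toString() {
        return timestamp + "_" + parameter;
    }
}
```

The parameter values themselves (one to three spatial dimensions) could then be flattened into the record value, for example as a BytesWritable.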
Tamkin also described the datasets used in the analytics project. The data products are divided into 25 collections (18 standard and 7 chemistry products). The datasets comprise monthly means files and daily files at six-hour intervals running from 1979 to 2012, resulting in a total data size of about 80 TB. One file is produced per month or per day, with file sizes ranging from 20 MB to 1.5 GB.