VMware have announced the availability of Spring Hadoop, which integrates the Spring Framework with the Apache Hadoop platform. The project provides a convenient mechanism for configuring, creating, and executing Hadoop jobs and utilities, including MapReduce, Hive, Pig, and Cascading jobs, via the Spring container. In addition, the project provides HDFS data access support through JVM scripting languages such as Groovy, JRuby, Jython, and Rhino; declarative configuration support for HBase; and declarative and programmatic support for Hadoop Tools, including FS Shell and DistCp.
Perhaps more interestingly, the project also provides a convenient way for Spring-based applications to use Hadoop as an analysis tool for data coming from multiple sources, such as Spring Integration and Spring Batch, as well as conventional relational databases. "You could, for example, have a Hadoop job be a tasklet inside a Spring Batch environment, so we can start to coordinate it and have triggers when a job completes," SpringSource CTO Adrian Colyer told InfoQ.
"Or we might have Spring Integration watching on a directory for a file to arrive, and then use that as a trigger to initialize a Hadoop job. You can really start to integrate this into the Spring world, and use all the other components to stitch Hadoop and its various data processing facilities into your standard enterprise toolkit."
This project, and the Spring Data initiative more generally, reflect the growing importance of both NoSQL and Big Data within enterprise applications. Colyer explained:
"After a long period of time, maybe a decade, when data from an enterprise application perspective meant, 'How am I going to talk to my relational database?' and the solutions were fairly obvious, enterprise data is now starting to look really quite different. We're seeing a whole range of different stores and approaches, and it has become increasingly obvious that one very important, and growing, part of the enterprise data story is Big Data and batch data processing."
Thus, beyond the newly announced Hadoop project, the broader objective of Spring Data is to provide first-class support for each of the different styles of SQL and NoSQL store (relational databases, graph databases, document databases, key/value stores, and so on) and also to explicitly support some of the more popular products of each type. Currently this includes support for JPA, as well as MongoDB, Redis, and Neo4j, with Cassandra also in the pipeline. Colyer suggested that this list, in turn, reflects the front-runners that SpringSource are currently seeing amongst their enterprise customers, though it should be emphasized that the adoption of NoSQL data stores is still in the very early stages in most large enterprises.
VMware will be hosting a session to introduce the Spring Hadoop project today at the O’Reilly Strata Conference in Santa Clara, California.