InfoQ interviewed David Smith, VP of Community for Revolution Analytics, at the Strata big data conference. Revolution provides commercial extensions to the open source R statistics project, and at the conference announced the R Enterprise v4.2 Suite along with tools to help SAS users migrate to R. David discussed the migration offering, how their Enterprise R complements Hadoop and big data processing, and the shift in customer needs from black box models toward data exploration.
InfoQ asked why customers are considering using Revolution Analytics instead of SAS:
Revolution Analytics is taking R from dynamic startup companies to the broader business world. The larger businesses they talk to see themselves being overtaken somewhat by the startups, and are looking for modern analytic methods rather than the traditional approach of building black box models. This means doing individual-level data analysis and investigation, rather than building top-down models on heavily sampled data sets.
We asked how Revolution Analytics works with or replaces Hadoop:
We don't see R and what we've produced as being in competition with Hadoop. Hadoop is a great way to store large amounts of unstructured data and a way to process it. What we are doing is building predictive models with it. We've built some interesting predictive algorithms that work out of memory by streaming (data into RAM from files), and we allow processing in parallel so you can throw lots of cores at the problem, all integrated into the R environment. This is coupled with a big data store we've built, XDF, which is not a replacement for Hadoop or a database. It's more like a NoSQL (engine) - it's an efficient local file store to hold data.
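As a rough illustration of that workflow, the sketch below assumes the commercial RevoScaleR package that ships with Revolution R Enterprise (the component providing XDF and the external-memory algorithms); the file and column names are hypothetical.

```r
# Hypothetical sketch using the commercial RevoScaleR package from
# Revolution R Enterprise; file and column names are made up.
library(RevoScaleR)

# Stream a large CSV into the XDF file store; the data never needs to fit in RAM.
rxImport(inData = "transactions.csv", outFile = "transactions.xdf",
         overwrite = TRUE)

# Fit a linear model by streaming chunks of the XDF file through the
# external-memory algorithm, which can also use multiple cores in parallel.
fit <- rxLinMod(amount ~ region + customer_age, data = "transactions.xdf")
summary(fit)
```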
We asked if Revolution Analytics provides more of a scale-up architecture for a single machine instead of clustered calculations:
It depends on the application. It's rare in stats to do a regression on billions of rows. The application we've focused on first is that you've got your data in Hadoop or a database, and you put information in a local filesystem, aggregated for analysis. The next step for us is to push analytics into the database. We are talking with the likes of Cloudera to put the analytics into (Hadoop's) HDFS, so you can process locally and aggregate (the results) into map-reduce. We are also talking to Netezza and various relational database vendors, as well as some open source data vendors like Talend.
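To make the "process locally, then aggregate" pattern concrete, here is a minimal sketch in plain open source R that splits work across blocks and recombines the partial results; it only illustrates the map-reduce shape of the computation, not Revolution's actual HDFS or database integration.

```r
# Plain-R illustration of the map-reduce pattern: process each block where it
# lives, then combine the partial aggregates. The Hadoop/HDFS wiring is not shown.
blocks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))  # stand-in for HDFS blocks

# "Map": each block independently computes a partial sum and row count.
partials <- lapply(blocks, function(d) c(total = sum(d$mpg), n = nrow(d)))

# "Reduce": add the partial results and finish the global calculation.
combined <- Reduce(`+`, partials)
combined["total"] / combined["n"]   # matches mean(mtcars$mpg)
```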
InfoQ asked if this approach includes (distributed local-state) algorithms like Gibbs Sampling?
It allows for any algorithm you can break down into independent parts and recombine. We include implementations of standard algorithms such as linear regression and cross-tabulation - they are commercial, not open source.
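The commercial implementations aren't public, but the decompose-and-recombine idea itself can be sketched in plain R: for linear regression, each chunk only needs to contribute its X'X and X'y matrices, which simply add up across chunks.

```r
# Plain-R sketch of a decomposable linear regression: each chunk contributes
# X'X and X'y, the partial results add up, and one solve() gives the coefficients.
fit_chunked_lm <- function(chunks, formula) {
  parts <- lapply(chunks, function(d) {
    mf <- model.frame(formula, d)
    X  <- model.matrix(formula, mf)
    y  <- model.response(mf)
    list(xtx = crossprod(X), xty = crossprod(X, y))
  })
  xtx <- Reduce(`+`, lapply(parts, `[[`, "xtx"))
  xty <- Reduce(`+`, lapply(parts, `[[`, "xty"))
  solve(xtx, xty)   # same coefficients as lm() on the full data set
}

chunks <- split(mtcars, rep(1:3, length.out = nrow(mtcars)))
fit_chunked_lm(chunks, mpg ~ wt + hp)
```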
InfoQ asked how Revolution Analytics' Hadoop integration differs from the open source RHIPE. David indicated it is a new development by the author of RHIPE, similar but designed to support optimized execution of parallel algorithms. David also explained that Enterprise R extends open source R by:
- optimizing the build for multi-threaded execution
- integrating with Intel's high performance matrix libraries (see the timing sketch after this list)
- including visualization tools
- providing access to big data models
- offering customer support
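The first two points are about the linear algebra underneath R. The snippet below is a generic, plain-R way to time the kind of matrix operations that a multi-threaded library such as Intel's MKL accelerates; it is illustrative, not specific to Revolution's build.

```r
# Generic R timing of BLAS-bound operations; a multi-threaded matrix library
# (such as Intel MKL, which Revolution R Enterprise links against) speeds
# these up relative to R's single-threaded reference BLAS.
set.seed(42)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)

system.time(crossprod(m))             # t(m) %*% m, handled by the BLAS
system.time(svd(m, nu = 0, nv = 0))   # singular values only, BLAS/LAPACK-bound
```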
We have a number of users in the finance industry, including hedge funds and banks, who use it for modeling mortgage default risk. SAS has been the standard for commercial modeling for the past thirty years. Customers are interested in alternatives to save money, and in doing more with their data: it's one thing to have the data, it's another thing to explore the data and figure out the best way to aggregate it. For example, how do you deal with missing data, or take a variable like job title and boil it down to something you can analyze? We're interested, especially at this (Strata) conference, in providing tools to work with the data and get it ready for analysis. People are turning away from traditional tools like SAS, where you had a data set that you pushed through a black box and got some answers. The cadre of graduates coming out of school is more interested in looking at the data, how to transform it, and looking at variables ...
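As a hypothetical illustration of the kind of preparation David describes, the plain-R sketch below imputes a missing value and collapses free-text job titles into a coarser category; the data frame and column names are invented.

```r
# Hypothetical data-preparation sketch in plain R; the data and column names
# are invented for illustration.
emp <- data.frame(
  job_title = c("Sr. Software Engineer", "Software Engineer II",
                "VP Marketing", "Marketing Analyst", "Office Manager"),
  salary    = c(120000, 95000, NA, 70000, 64000),
  stringsAsFactors = FALSE
)

# Missing data: impute the median salary, but keep a flag so the analysis can
# still "see" which rows were imputed.
emp$salary_missing <- is.na(emp$salary)
emp$salary[emp$salary_missing] <- median(emp$salary, na.rm = TRUE)

# Boil a free-text variable like job title down to something you can analyze.
emp$job_family <- ifelse(grepl("engineer", emp$job_title, ignore.case = TRUE),
                         "Engineering",
                         ifelse(grepl("marketing", emp$job_title, ignore.case = TRUE),
                                "Marketing", "Other"))
emp
```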
In a black box, long tail events become a bad thing. With a more exploratory approach, by contrast, a variable you never thought would matter can turn out to be highly significant. You can discover the outliers; they can be important in terms of business strategy, reacting, and learning.
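As a small, generic example of that exploratory stance, the plain-R sketch below surfaces long-tail points instead of discarding them; the model and data are just R's built-in mtcars.

```r
# Generic exploratory sketch: surface the long-tail points rather than treating
# them as noise, using R's built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Rows with the largest standardized residuals are the outliers worth a closer look.
resid_z <- rstandard(fit)
head(mtcars[order(abs(resid_z), decreasing = TRUE), c("mpg", "wt", "hp")], 5)

# A simple boxplot shows the tail directly.
boxplot(mtcars$mpg, main = "mpg: points beyond the whiskers are worth investigating")
```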