Ayasdi announced last month a partnership with Cloudera, the biggest distributor of Apache Hadoop. The partnership will ensure the compatibility of their solution with Cloudera Enterprise 5, the latest version of Cloudera’s big data platform based on Apache Hadoop.
Ayasdi, which means "to seek" in Cherokee, is a data analysis start-up founded in 2008 by three mathematicians to commercialize a novel approach to discover insight from high-dimensional and large data sets. Initially funded by the Defense Advanced Research Project (DARPA) and the National Science Foundation (NSF), this approach is based on a new area of mathematics called Topological Data Analysis (TDA) and allows customers such as General Electric, Merck, the US Food and Drug Administration (FDA) or the Centers for Disease Control and Prevention (CDC) to highlight the geometric structures of their data and build compact summaries that are easier to visualize and easier to explore interactively, without the need to write query or algorithm.
The fundamental idea behind is that data has shape and shape has meaning. In the context of TDA, data are typically represented as a large finite set of points in space, and shapes are used to describe how points are related to each other. For example, a simple aspect of shape "how many pieces do they break into", or "how do they break into cluster" can reveal conceptually different parts of a phenomena. Another aspect "is there any loop" can indicate a recurrent or periodic behavior. Topology is precisely that branch of mathematics that studies this notion of shape and TDA aims to extend that mathematical formalism for defining and measuring qualitative geometric information to large and noisy cloud points.
From a technical perspective, Ayasdi Platform leverages CDH 5's Hadoop Distributed File System (HDFS) for managing large amounts of customer data and uses HBase for storing some of its operational metadata that is randomly accessed and frequently updated. According to Lawrence Spracklen, Chief Architect at Ayasdi, the company also leverages Spark for various ETL activities and is partnering with Intel to help drive performance, robustness and minimal overhead security but it uses a custom non-hadoop stack to compute and distribute topological networks.
Asked about the size of the current largest Cloudera cluster used by Ayasdi's customers, Lawrence explains us that scale is only one part of the story.
The Ayasdi Data Platform is horizontally scalable, and readily scales to many tens of nodes. However, the size of the Cloudera cluster is far less interesting than the overall size of the data set coupled with the complexity of the data. As a frame of reference, advancements in data analysis technology haven’t kept up with this fast paced data environment of change and growth. The lack of development around data analysis technology stems from the inherent challenges posed by the computations and techniques needed in order to analyze extremely complex, high dimensional data. Ayasdi’s solution was built to find insights in extremely complex data. Examples include data sets of medical disorders and disease e.g. cancer, PTSD, and diabetes. These data sets are not always big but tend to be extremely complex.
YouTube hosts videos (one, two) that explore TDA in more detail. The International Conference on Machine Learning (ICML) offers a PDF to the same end, as has a Topological Data Analysis and Machine Learning Theory workshop held in 2012. For a more concrete example, the Institute for Computational and Mathematical Engineering at Stanford University (ICME) presents a computational method for extracting simple descriptions of high dimensional data.