Corporations are increasingly using social media to learn what their customers are saying about their products and their company. This presents unique challenges, as unstructured content requires analytic techniques that associate quantitative metrics with words in order to interpret the sentiment embodied in them.
Subramanian Kartik and the Greenplum team at EMC worked on a research project analyzing blog posts with MapReduce and the Python Natural Language Toolkit, in combination with SQL analytics in an EMC Greenplum Database, using sparse vectors and the k-means clustering algorithm.
Subramanian spoke about this research at the NoSQL Now 2011 conference. InfoQ caught up with him to learn more about the project and the architecture behind the solution his team developed.
InfoQ: Can you talk about the Greenplum research project on blog sentiment analysis and the architecture behind it?
Kartik: Sentiment analysis refers to the application of natural language processing, computational linguistics and text analytics to extract subjective information from source materials, in this case, from a set of blog posts. This kind of analysis is becoming very popular in the analysis of social media data as well as for more conventional use cases such as e-Discovery among many of our customers. The end goal is to discover patterns of dominant concepts in a collection of documents, and to associate subsets of these documents with these concepts. An example of this, described here, would be to discover major themes in what bloggers write about concerning a company's products and services.
The architecture spans natural language processing (the realm of unstructured data) on one hand, and numerical quantification with machine learning algorithms for pattern recognition (typically done with structured data) on the other. The intriguing part of this architecture is how we use "Not Only SQL" approaches to solve the problem, combining Map/Reduce (typically used with Hadoop) with SQL-based analytics in a traditional RDBMS.
InfoQ: What technologies, programming techniques and algorithms were used in the solution?
Kartik: We used EMC Greenplum database technologies to explore a corpus of blogs for such patterns. Although primarily an MPP relational database, Greenplum is very well suited to this sort of analysis because it can process structured and unstructured data with user code written in high-level languages such as Perl and Python. The code is executed in-database, either through the Map/Reduce interface to Greenplum or through SQL-callable user-defined functions. The data-processing pipeline starts with the ingestion and parsing of the blog files (which are HTML files), then defines quantitative text metrics that characterize each blog with a numerical vector, and finally applies machine learning algorithms to cluster the blogs around dominant themes. The output is a set of clusters centered on key themes, or "sentiments", present in the corpus of blogs.
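As an aside, a minimal sketch of what a SQL-callable Python UDF in Greenplum might look like is shown below. The blogs(id, html) table, the function name and the deliberately simplified body are illustrative assumptions, not the project's actual code, which used NLTK as described in the next answer.

```sql
-- Hypothetical table: blogs(id int, html text). PL/Python ('plpythonu')
-- lets the function body be ordinary Python, executed inside the database.
CREATE OR REPLACE FUNCTION extract_terms(html text)
RETURNS text AS $$
    import re
    # Naive tag stripping and tokenization, purely for illustration;
    # the real pipeline used NLTK for parsing, stemming and stop-word removal.
    plain = re.sub(r'<[^>]+>', ' ', html).lower()
    return ' '.join(re.findall(r'[a-z]+', plain))
$$ LANGUAGE plpythonu;

-- Called from plain SQL, so the documents never leave the database:
SELECT id, extract_terms(html) AS terms FROM blogs;
```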
We used the support for Python in Greenplum to leverage the Python Natural Language Toolkit (NLTK) for HTML parsing, tokenization, stemming and stop-word removal from the original text, and then loaded the resulting term lists into a relational table. The beauty of this is that we could leverage the power of the open-source NLTK code base without modification and invoke it directly inside Greenplum through our Map/Reduce interface. Once we have the term lists in a database table, we cull them to form a well-defined dictionary of terms across the corpus, and use an industry-standard computational linguistics metric, term frequency-inverse document frequency (tf-idf), to create vectors that uniquely describe each blog document. This gives us a vector space on which we can define a "distance", and we use a well-known unsupervised learning algorithm, k-means clustering, to obtain the final result. Once again, all of these operations are done in-database; that is, the data is processed in place without ever being pulled out of the RDBMS. Both tf-idf and NLTK, as well as k-means clustering, are described elsewhere (e.g. Wikipedia) for the curious reader.
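Outside the database, the same preprocessing, tf-idf vectorization and clustering steps can be sketched with NLTK and scikit-learn. This is only an illustration of the techniques named above; the sample documents and the use of scikit-learn are assumptions, not the in-database implementation.

```python
# Requires nltk and scikit-learn, plus the 'punkt' and 'stopwords' NLTK data.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

stemmer = PorterStemmer()
stops = set(stopwords.words('english'))

def to_terms(text):
    # Tokenize, drop stop words and non-alphabetic tokens, then stem.
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]

# Stand-ins for the plain text extracted from each blog post.
docs = ["battery life on this phone is terrible",
        "great camera but poor battery life",
        "customer support never answered my call"]

# tf-idf turns each term list into a sparse numeric vector over the dictionary.
vectorizer = TfidfVectorizer(tokenizer=to_terms, lowercase=False)
X = vectorizer.fit_transform(docs)

# k-means then groups the vectors around k dominant themes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(list(labels))
```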
InfoQ: What were the main challenges in parsing and analyzing the blog data compared to the data analysis of traditional relational data?
Kartik: It's human language! Language is inherently ambiguous to machines. Beyond the English language itself, every industry has its own ontology of terms and abbreviations: mobile phone blogs look very different from health care blogs, with all kinds of special words and special meanings for words, not to mention the plethora of social media abbreviations that have to be accounted for. The specific metric used here, tf-idf, is one way to characterize a document, and is acceptable for characterizing a whole blog entry. In reality, subsections of a post may refer to different topics or express different sentiment. More sophisticated machine learning algorithms such as Topic Models (Latent Dirichlet Allocation) may be needed for more granular analysis. Some practitioners assign sentiment on a sentence-by-sentence basis to account for the variation present in a post. This is a vast and growing field of research, and the work done here just scratches the surface of a very complex area. Relational data, in contrast, is usually numerical, and we have 30 years of SQL-based analytics and mature machine learning tools and techniques to fall back on, so those problems are much better understood and easier to implement.
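To make the sentence-by-sentence idea concrete, a minimal sketch might split a post into sentences and score each one separately. The VADER model bundled with NLTK is used here purely as an illustrative stand-in, not as something the project described above employed.

```python
# Requires the 'punkt' and 'vader_lexicon' NLTK data packages.
from nltk.tokenize import sent_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

post = ("The camera on this phone is fantastic. "
        "The battery, however, barely lasts half a day.")

sia = SentimentIntensityAnalyzer()
for sentence in sent_tokenize(post):
    # Each sentence gets its own polarity score, so a mixed post is not
    # collapsed into a single document-level label.
    print(sia.polarity_scores(sentence)['compound'], sentence)
```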
InfoQ: What are some NoSQL data patterns or best practices that Greenplum learned from work on this project?
Kartik: The most important lesson is the realization that one must keep an open mind about which techniques to use to solve a problem. One could certainly do this work completely in Hadoop, but the elegance of using both Map/Reduce and SQL to solve such a problem highlights the strength of both paradigms. As the amount of data increases, the dominant data pattern we see emerging is the need to process data in place and to harness the computational power of the MPP architecture underlying Greenplum for in-database computation. As data volumes grow, we have to bring compute close to the data, so choosing a technology that supports this is critical. The ability to bring open-source tools to bear is very important as well - it lets us leverage a lot of intellectual capital without re-inventing the wheel. Once again, the underlying platform architecture and its ability to run user code in-database was critical in enabling this.
InfoQ: What is the future road map of this project?
Kartik: This effort paints a picture of capabilities that can be extended for a variety of use cases by our customers and internally. I would love to use these techniques to analyze live Twitter feeds during a major industry event such as EMC World or VMWorld, and associate some visualization technologies with this as well. Theoretically, Topic Models (such as LDA) are interesting to study in this context as an alternative to k-means clustering - Greenplum can do both in-database. Scale is always an interesting frontier to explore as well, and ultimately multiple-language analysis may require even more innovation.