Key Takeaways
- A recent trend in text analysis goes beyond topic detection and tries to identify the emotion behind a text. This is called sentiment analysis, also known as opinion mining or emotion AI.
- Sentiment analysis is widely applied in voice of the customer (VOC) applications, such as analyzing responses to a questionnaire or free-text comments in a review.
- Extracting sentiment from a text can be done using techniques like natural language processing, computational linguistics, and text mining.
- Text mining can be performed using a Machine Learning (ML) approach or a lexicon-based approach.
- The lexicon-based approach relies on the words in the text and the sentiment they carry. This technique uses NLP concepts and a dictionary to extract the tone of the conversation.
- The ML-based approach needs a sentiment-labeled collection of documents, that is, a collection in which each document has been manually evaluated and labeled in terms of sentiment. After some preprocessing, a supervised ML algorithm is trained to recognize the sentiment in each text.
Besides understanding what people are talking about, it is sometimes important to understand the tone of the conversation.
Sentiment Analysis
A relatively recent trend in text analysis goes beyond topic detection and tries to identify the emotion behind a text. This is called sentiment analysis, also known as opinion mining or emotion AI.
For example, the sentence “I love chocolate” is very positive with regard to chocolate as food. “I hate this new phone” also gives a clear indication of the customer's preference about the product. In these two particular cases, the words “love” and “hate” carry a clear sentiment polarity. A more complex case is the sentence “I do not like the new phone”, where the positive polarity of “like” is reversed into a negative polarity by the negation. The same holds for “I do not dislike chocolate”, where the negation of a negative word like “dislike” produces a positive sentence.
Sometimes the polarity (i.e., positivity or negativity) of a word is context dependent. “These mushrooms are edible” is a positive sentence with regard to health. However, “This steak is edible” is a negative sentence with regard to a restaurant. Sometimes the polarity of a word is limited in time, as in “I like to travel, sometimes.”, where “sometimes” weakens the positive polarity of the word “like”. And so on to even subtler examples like “I do not think this chocolate is really great” or, even worse, “Do you really think this concert was so fantastic?”.
We have talked here about positive and negative sentiment. However, POSITIVE and NEGATIVE are not the only labels you can use to define the sentiment in a sentence. Usually the whole range VERY NEGATIVE, NEGATIVE, NEUTRAL, POSITIVE, and VERY POSITIVE is used. Sometimes, however, additional less obvious labels are also used, like IRONY, UNDERSTATEMENT, UNCERTAINTY, etc.
How can we extract sentiment from a text? Sometimes even humans are not quite sure of the real emotion between the lines. Even if we manage to extract the feature associated with sentiment, how can we measure it? There are a number of approaches to do that, involving natural language processing, computational linguistics, and text mining. We will concern ourselves here with the text mining approaches, of which there are mainly two: a Machine Learning (ML) approach and a lexicon-based approach.
The lexicon-based approach relies on the words in the text and the sentiment they carry. This technique uses NLP concepts and a dictionary to extract the tone of the conversation.
The ML-based approach needs a sentiment-labeled collection of documents; this is a collection in which each document has been manually evaluated and labeled in terms of sentiment. After some preprocessing, a supervised ML algorithm is trained to recognize the sentiment in each text.
What You Need
KNIME Analytics Platform – We’ll use KNIME data analysis tools to show how to develop a sentiment analysis solution.
KNIME Analytics Platform is open source software for data science, aimed at data scientists, data analysts, big data users, and business analysts. It covers all your data needs, from data ingestion and data blending to data visualization, from machine learning algorithms to data wrangling, from reporting to deployment, and more. It is based on a Graphical User Interface for visual programming, which makes it very intuitive and easy to use, considerably reducing the learning time.
It has been designed to be open to different data formats, data types, data sources, and data platforms, as well as external tools, such as the Apache Tika libraries, Keras, and Python. It also includes a number of extensions for the analysis of unstructured data, like texts or graphs.
For text processing, the KNIME Text Processing extension offers a wide variety of nodes for I/O, cleaning, processing, stemming, keyword extraction, and other text processing tasks.
Given all those characteristics - open source, visual programming, and codeless Text Processing integration - we have selected it to implement a sentiment analysis application.
Computing units in KNIME Analytics Platform are small colorful blocks, named “nodes”. Assembling nodes in a pipeline, one after the other, implements a data processing application. Such a pipeline is called a “workflow” (Figure 1).
KNIME Analytics Platform is open source. It can be downloaded and used for free.
Download the installation package for your operating system from the KNIME website, then install it following the video instructions.
The IMDb data set – To evaluate sentiment in sentences or texts, we need some examples, of course. We used here the data set of movie reviews provided by IMDb. This data set includes 50,000 movie reviews, each one manually labeled for sentiment. Sentiment classes are balanced: 25,000 are negative reviews and 25,000 are positive reviews.
If we pursue the NLP lexicon-based approach, we also need a dictionary of words with the sentiment they carry; that is, at least a list of negative words and a list of positive words. We got those lists from the MPQA Corpus.
If we pursue the machine learning-based approach, we need a sentiment label for each of our text examples. The IMDb data set provides a positive vs. negative label, manually evaluated for each review.
The Workflows
NLP-based Sentiment Analysis – The workflow for the lexicon-based sentiment analysis needs to:
- Clean and standardize the texts in the document collection.
- Tag all words as positive or negative according to the dictionary lists provided by the MPQA Corpus. (To do that, we use the Dictionary Tagger node twice, once for each list. All other words are removed.)
- Extract all remaining words from each document with a Create BoW node.
- Calculate the sentiment score for each document as:
- Sentiment score = (# positive words - # negative words) / (# words in document)
- Define a threshold value as the average sentiment score.
- Subsequently classify documents as:
- positive if sentiment score > threshold
- negative otherwise
- If you want to be more cautious, you can define positive and negative thresholds as:
- thresholds = avg (sentiment score) ± stddev (sentiment score)
Thus, all documents with a sentiment score between the two thresholds can be classified as neutral, as sketched in the code below.
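For readers who prefer to see the scoring logic spelled out, here is a minimal plain-Python sketch of the calculation described above, implementing the more cautious three-class variant. It assumes the documents are already cleaned and tokenized and that positive_words and negative_words are sets of words taken from the MPQA lists; it is not the KNIME implementation itself, which is built entirely from nodes.

```python
import statistics

def sentiment_score(tokens, positive_words, negative_words):
    """Sentiment score = (# positive words - # negative words) / (# words in document)."""
    n_pos = sum(token in positive_words for token in tokens)
    n_neg = sum(token in negative_words for token in tokens)
    return (n_pos - n_neg) / len(tokens) if tokens else 0.0

def classify(documents, positive_words, negative_words):
    """Label each tokenized document as positive, negative, or neutral."""
    scores = [sentiment_score(doc, positive_words, negative_words) for doc in documents]
    mean, std = statistics.mean(scores), statistics.pstdev(scores)
    labels = []
    for score in scores:
        if score > mean + std:        # above the upper threshold
            labels.append("positive")
        elif score < mean - std:      # below the lower threshold
            labels.append("negative")
        else:                         # in between the two thresholds
            labels.append("neutral")
    return labels
```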
Figure 1. NLP lexicon-based approach to sentiment analysis. Here you need a set of NLP-based rules, which in the simplest case is based on just two lists of words: one list of positive words and one list of negative words. Here the two lists of words are taken from the MPQA Corpus.
If we assign colors to the IMDb reviews according to the predicted sentiment — green for positive and red for negative sentiment — we get the result table in Figure 2.
Note: This is a very crude calculation of the sentiment score. Of course, more complex rules could be applied, for example, inverting the word sentiment polarity after a negation and taking into account the time evolution of the sentence.
Figure 2. Movie reviews with predicted sentiment from NLP-based approach: red for negative and green for positive sentiment.
The workflow in Figure 1, with just a sample of the original IMDb data set, is downloadable for free from the KNIME examples under
Other Analytics Types/Text Processing/Sentiment Analysis Lexicon based Approach
or within KNIME Analytics Platform in the EXAMPLES list under:
08_Other_Analytics_Types/01_Text_Processing/26_Sentiment_Analysis_Lexicon_Based_Approach (see video on how to connect and download workflows from the KNIME EXAMPLES Server).
ML-based Sentiment Analysis – The application implementing the machine learning-based approach to sentiment analysis needs to:
- Again clean and standardize the texts in the documents.
- Extract all words from the documents with a Create BoW node.
- Produce a text vectorization of the document with a Document Vector node.
- Train an ML algorithm to recognize positive vs. negative texts.
- Evaluate the created model on test documents.
Note: The training and test phase is implemented exactly as in any other machine learning-based analysis. Here we used a decision tree because the data set is quite small, but any other supervised ML algorithm could be used: a deep learning network, a random forest, an SVM, and so on. A rough Python sketch of these steps follows.
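For orientation, the steps above translate roughly into the following scikit-learn sketch (an illustrative assumption, not the KNIME workflow itself): one-hot bag-of-words vectorization followed by a decision tree, evaluated on a stratified 70/30 split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

def train_and_evaluate(texts, labels):
    # binary=True produces a 0/1 (one-hot style) document vector per review
    vectorizer = CountVectorizer(binary=True, stop_words="english")
    vectors = vectorizer.fit_transform(texts)
    # stratified 70/30 split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(
        vectors, labels, test_size=0.3, stratify=labels, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
    predictions = model.predict(x_test)
    return accuracy_score(y_test, predictions), cohen_kappa_score(y_test, predictions)
```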
Figure 3. Machine learning-based approach for sentiment analysis. Here we train a decision tree, but any other supervised ML model for classification could be used.
Instead of extracting all words from the text, in the interest of execution speed and input table size, we could extract only the main keywords with one of the keyword extractor nodes. Keyword extraction, however, could limit the original word set and cut out important sentiment-related words, which might lead to diminished sentiment classification performance. If you decide to go with a smaller set of words, just make sure that performance is not severely affected.
The Document Vector node, following the preprocessing stage, performs a one-hot encoding of the input texts, without preserving the order of the words in the sentence.
The Scorer node at the end of the workflow calculates a number of accuracy measures on the test set produced earlier by the Partitioning node. The final accuracy is around 71 percent and Cohen’s kappa is 0.42, based on a random stratified partition of the original data into 70 percent training set and 30 percent test set.
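As a reminder, Cohen's kappa measures how much better the classifier agrees with the true labels than a classifier guessing according to the class distribution alone:

Cohen's kappa = (observed accuracy - chance accuracy) / (1 - chance accuracy)

A kappa of 0.42 therefore indicates a moderate improvement over chance agreement.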
Usually, the machine learning-based approach performs better than the dictionary-based approach, especially when using the simple sentiment score adopted in our NLP approach. However, sometimes there is no choice because a sentiment-labeled data set is not available.
This workflow is downloadable for free from the KNIME examples under
Other Analytics Types/Text Processing/Sentiment Classification
or within KNIME Analytics Platform in the EXAMPLES list under:
08_Other_Analytics_Types/01_Text_Processing/03_Sentiment_Classification
(see video on how to connect and download workflows from the KNIME EXAMPLES Server)
Deployment
The deployment workflow for an NLP-based sentiment analysis is practically the same as the training workflow in Figure 1. The only difference occurs in the threshold calculation: the threshold is calculated only in the training workflow, on the training set, and is then simply reused in the deployment workflow.
The deployment workflow for a machine learning-based sentiment analysis looks like any other ML-based deployment workflow. Data are imported and preprocessed as needed, the model is acquired, and data are fed into the model to produce predictions that are presented to the end user.
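As a concrete illustration, here is a minimal Python sketch of such a deployment step, assuming the vectorizer and decision tree from the training sketch above were previously saved with joblib; the file names are hypothetical.

```python
import joblib

# Acquire the preprocessing model and the trained classifier
vectorizer = joblib.load("vectorizer.joblib")
model = joblib.load("sentiment_tree.joblib")

# Import new data, preprocess it, and feed it to the model to produce predictions
new_reviews = ["I do not like the new phone.", "I love this movie!"]
predictions = model.predict(vectorizer.transform(new_reviews))
for review, label in zip(new_reviews, predictions):
    print(review, "->", label)
```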
Word Sequences and Deep Learning
The text vectorization used in the ML-based approach transforms words into vectors of 0/1 (one-hot encoding), where 1 shows the presence of a word and 0 its absence. The time sequence of the words in the sentence is not necessarily preserved. This is also acceptable as long as we use ML algorithms that do not take the sequence order into account, for example, a decision tree.
A variation of the one-hot encoding is the frequency-based encoding, where instead of using 0/1 for absence/presence of a word, the word frequency is used for word presence and 0 again for word absence.
Another type of encoding is index-based. In this case a word is coded by means of an ID, usually a progressive integer index.
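The three encodings are easy to see side by side on a toy example. The tiny vocabulary below is purely hypothetical; in the real workflow, the KNIME nodes build these representations for you.

```python
vocabulary = ["chocolate", "hate", "love", "phone"]   # word index = position in the list
tokens = ["love", "love", "chocolate"]                # one (very short) document

one_hot   = [1 if word in tokens else 0 for word in vocabulary]   # [1, 0, 1, 0]
frequency = [tokens.count(word) for word in vocabulary]           # [1, 0, 2, 0]
index_seq = [vocabulary.index(word) for word in tokens]           # [2, 2, 0] - word order preserved
```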
One special machine learning algorithm that works well for sentiment analysis is a deep learning network with a Long Short-Term Memory (LSTM) layer. Indeed, Recurrent Neural Networks (RNN), and especially LSTM networks, have recently been used to explore the dynamics of time series. We could use them to explore the dynamics of word sequences to better predict text sentiment as well.
LSTM units and layers are available in the KNIME Analytics Platform through the KNIME Deep Learning Extension – Keras Integration.
The KNIME Deep Learning Extension integrates deep learning functionalities, networks and architectures from TensorFlow and Keras in Python. Even though this extension allows you to write Python code to run the TensorFlow/Keras libraries, it also allows you to assemble, train and apply Keras networks through the traditional KNIME Graphical User Interface, based on nodes, drag and drops, clicks, configuration windows, and traffic light status. This last option makes the whole assembling and training process codeless and, therefore, much easier and faster, especially for prototyping and experimentation. A mix and match approach is also possible.
Indeed, to build a deep learning network with an input layer, an embedding layer, an LSTM layer, and a dense layer, we just need four nodes (a rough Keras code equivalent is sketched after the list):
- Keras Input Layer,
- Keras Embedding Layer,
- Keras LSTM Layer, and
- Keras Dense Layer nodes (Figure 4).
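In Keras code terms, this layer stack corresponds roughly to the sketch below. It is an illustrative assumption, since the workflow builds and trains the network codelessly with the nodes above; vocabulary size, sequence length, layer sizes, and training parameters are all hypothetical.

```python
import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

max_words = 20000      # assumed size of the index-encoded vocabulary
seq_length = 80        # assumed review length after truncation/zero padding

model = Sequential([
    Input(shape=(seq_length,)),                        # Keras Input Layer
    Embedding(input_dim=max_words, output_dim=128),    # Keras Embedding Layer
    LSTM(64),                                          # Keras LSTM Layer
    Dense(1, activation="sigmoid"),                    # Keras Dense Layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stand-in data, only to show the expected shapes: index-encoded, zero-padded
# word sequences as input and 0/1 sentiment labels as target.
x_train = np.random.randint(0, max_words, size=(1000, seq_length))
y_train = np.random.randint(0, 2, size=(1000,))
model.fit(x_train, y_train, epochs=2, batch_size=64, validation_split=0.1)
predictions = (model.predict(x_train[:5]) > 0.5).astype(int)   # 1 = positive, 0 = negative
```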
The network training is obtained via the Keras Network Learner node (Figure 4), which also includes the conversion to the appropriate encoding, as required by the first layer, and the selection of the loss function (binary cross-entropy).
Finally, the network application to new data is implemented with the generic DL Network Executor node, which takes any trained Keras or TensorFlow network and executes it on the new data.
Notice that the DL Network Executor node is not the only node able to execute a TensorFlow/Keras deep learning network. The DL Python Network Executor node can also do that. The difference between the two nodes is in the usage of GUI or script. Configuration settings in the first node are passed via GUI, while configuration in the second node is just a Python script. The first node is then easier to use, but, of course, less flexible. The second node requires Python code knowledge but can customize the network execution in more detail.
And here we are! We have assembled, trained and applied a four-layer neural network, including an LSTM layer, with just six Keras deep learning nodes.
Since LSTM units are capable of learning evolution over time — that is evolution over an ordered sequence of input vectors — this might turn out to be useful when learning negations or other language structures where word order in the sequence is important.
Figure 4. Keras-based deep learning network for sentiment analysis. The first four gray metanodes in the workflow preprocess the texts to build same-length sequences of index-encoded words. Truncation and zero padding are applied to fill up or cut sentences that are too short or too long, respectively. The result of this preprocessing, which is the input to the neural network, can be seen in Figure 5. The network consists of four layers: an input layer, an embedding layer, an LSTM layer, and a dense layer, as you can see from the top brown nodes. This network structure is then trained by the Keras Network Learner node and applied via the DL Network Executor node.
Figure 5. Input index-encoded sequences of words to the deep learning network.
Accuracy and Cohen’s kappa for this deep learning network have been evaluated on the test set as 81 percent and 0.62, respectively, for the same partition used with the decision tree.
In this example, we worked on a small data set of only 50,000 reviews. On such a data set, the decision tree already performs quite satisfactorily (71 percent accuracy), but the deep learning network adds a 10-percentage-point performance increase (81 percent accuracy). In any case, we hope that it was useful to see the practical steps of the machine learning approach implemented with both algorithms. The deep learning workflow, with the entire IMDb dataset, is downloadable for free from the KNIME public EXAMPLES server under:
04_Analytics/14_Deep_Learning/02_Keras/08_Sentiment_Classification_with_Deep_Learning_KNIME_nodes
Conclusions
We have described two basic techniques for sentiment analysis.
The first one is NLP-based and requires a dictionary of sentiment-labeled words and a set of more or less complex rules to determine the sentiment of a sentence from the words and grammar in it. Rules might be complex to establish (negations, sarcasm, dependent sentences, etc.), but in the absence of a sentiment-labeled data set, this is sometimes the only viable option.
The second one is ML-based. Here we train an ML model to recognize sentiment in a sentence based on the words in it and a sentiment-labeled training set. This approach depends heavily on the ML algorithm used and the document numeric representation. One current trend is to use time sequence algorithms to recognize not just the words but also their respective order in the sentence. As an example of this particular approach, we showed a deep learning neural network including an LSTM layer.
All workflows used in this article are available for free on the KNIME EXAMPLES public server under:
08_Other_Analytics_Types/01_Text_Processing and 04_Analytics/14_Deep_Learning/02_Keras
(see video on how to access and download workflows from the KNIME EXAMPLES Server)
About the Authors
Rosaria Silipo, Ph.D., principal data scientist at KNIME, is the author of 13 technical publications, including her most recent book "Practicing Data Science: A Collection of Case Studies". She will be conducting a free webinar, “Sentiment Analysis: Deep Learning, Machine Learning, Lexicon Based,” on November 27, 2018, at 9 a.m. Pacific time/12 p.m. Eastern time/6 p.m. Central European time. She holds a doctorate degree in bio-engineering and has spent most of her professional life working on data science projects for companies in a broad range of fields, including IoT, customer intelligence, the financial industry, and cybersecurity. Follow Rosaria on Twitter, LinkedIn and the KNIME blog.
Kathrin Melcher is a data scientist at KNIME. She holds a master's degree in mathematics from the University of Konstanz, Germany. She enjoys teaching and applying her knowledge to data science, machine learning and algorithms. Follow Kathrin on LinkedIn and the KNIME blog.