Key Takeaways
- Many customer support platforms are now equipped with AI to reduce human labor and enhance user experience.
- Historical conversational data grows over time. Storing such data on an Apache Hadoop cluster is a scalable solution for data management and sharing.
- Analytics Zoo is a unified Analytics + AI platform for distributed TensorFlow, Keras and BigDL on Apache Spark, open sourced by Intel.
- Analytics Zoo provides rich support for reading and preprocessing text data from clusters. Users can easily use built-in models to perform natural language processing tasks in a distributed fashion.
- This article demonstrates how to build an end-to-end QA ranker application using Analytics Zoo step by step. The solution has been successfully adopted by Microsoft Azure to serve its customers.
This series of articles describes the Microsoft Azure China team's practice of building an AI-powered customer support platform on Azure using Intel Analytics Zoo.
In our previous article, we shared in detail our experience in building a text classification module to handle customer requests more efficiently. In this article, we describe another important AI module in our customer service platform, the QA ranker, which is used to rank and select the best answer(s) from a large set of candidates in the QA module.
The figure below shows the overall architecture of our customer support platform, with the Question Answering (QA) component highlighted in orange. More background and architecture details of the platform are given in the previous article as well.
Many of our Azure China customers constantly ask for help with the technical problems they encounter (described in Chinese), and they expect timely support. We therefore designed a QA module that aims to provide as many accurate answers (also in Chinese) as possible to customers' questions, with the least amount of intervention from human agents. In our initial implementation, answers were given to users according to pre-defined dialogue flows as well as information-retrieval-based document search, indexing and weighting.
Unfortunately, when we started working on this problem, the results returned by the QA module were not satisfactory. If a customer's question falls into a pre-defined dialogue flow, the provided answer is likely to be useful. However, most of the time the pre-defined dialogue flows cannot capture the questions asked by customers, and the provided answers are not what users expect.
To improve the results for a better user experience, we decided to apply AI to this task. Approaches that combine NLP techniques with deep learning are a natural choice here: they allow for incremental training and evolve as data accumulates. We decided to add a deep learning QA ranker module to choose the best answers from a shortlist of candidate answers provided by the search engine.
We adopted the built-in text matching model provided by Analytics Zoo for our scenario and integrated it into our service platform. With the newly added QA ranker module, we have seen significant performance improvement according to both our benchmark results and customer feedback. In the rest of this article, we will share the general steps and our practical experience of adding a QA ranker with Intel Analytics Zoo.
What is Analytics Zoo?
Analytics Zoo is an open source unified analytics + AI platform developed by Intel for distributed TensorFlow, Keras and BigDL on Apache Spark. The platform provides a rich set of functionality, including high-level pipeline APIs, pre-defined models, pre-trained models on public datasets, reference use cases, etc. With the successful integration of the text classifier module using Analytics Zoo described previously, we believe that Analytics Zoo is a good choice for us as well as other Azure big data users to build end-to-end deep learning applications on Azure. For a more detailed introduction to Analytics Zoo, you can refer to this article.
What is Question Answering (QA) Ranking?
Question Answering (QA) is a common type of Natural Language Processing task that tries to automatically answer questions posed by humans in a natural language. In our scenario, the customer support platform has a collection of FAQ texts and documentation articles available as answer corpuses, and it tries to find the best related answer from these corpuses for each user question. Such a problem can be regarded as a text matching problem: we create a model that predicts the relevance score between a question and each candidate answer within a shortlist, then rank the candidate answers and return those with the top scores to the customer.
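As a minimal sketch of this ranking step (the score function below is a hypothetical stand-in for the trained matching model, not an Analytics Zoo API):

# Rank a shortlist of candidate answers for one question by predicted relevance
# and return the top-k. score(question, answer) is assumed to return a float.
def rank_answers(question, candidates, score, k=3):
    scored = [(answer, score(question, answer)) for answer in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]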
Similar to text classification, training a text matching model involves data collection, preparation of training and validation datasets, data cleaning and preprocessing, followed by model training, validation, and tuning. Analytics Zoo provides a built-in text matching model and reference examples in both Python and Scala for us to start with. See here for more detailed documentation on the text matching APIs and functionalities.
Data collection and preprocessing
We maintain a collection of clean and organized candidate answers and articles (all in Chinese), each with a distinct ID. We also have a collection of user questions (also in Chinese), each assigned a distinct ID, collected from various sources. Human agents then label the best matching answer for each question. We use these data to train a text matching model for QA ranking.
Sample question and answer corpuses look like the following:
Remark: The actual contents are in Chinese. We translate them into English here for better understanding.
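As an illustration of the expected layout only (the rows below are made up, not real corpus entries), each corpus csv is assumed to hold an ID followed by the text on every line:

question_corpus.csv:
Q1,How do I renew my Azure subscription?
Q2,Why does my virtual machine fail to start?

answer_corpus.csv:
A1,You can renew your subscription from the account portal by ...
A2,Check whether the VM quota of the target region has been exceeded ...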
For data loading, we first use the TextSet API in Analytics Zoo to load the question and answer corpuses in csv format into a TextSet backed by a Resilient Distributed Dataset (RDD) of texts for distributed preprocessing, as shown below:
from zoo.common.nncontext import init_nncontext
from zoo.feature.text import TextSet

sc = init_nncontext()
q_set = TextSet.read_csv("question_corpus.csv", sc, partition_num)
a_set = TextSet.read_csv("answer_corpus.csv", sc, partition_num)
Next we need to prepare relation files indicating the relevance between pairs of questions and answers. A question-answer pair is labeled 1 (positive) if the answer matches the question and 0 (negative) otherwise. Since the original labeled data only contains positive labels, we generate a collection of negative samples for each question by randomly sampling from all non-matching answers.
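Below is a minimal sketch of this negative sampling step, assuming the labeled positives are held as simple Python lists of IDs (the function and variable names are illustrative, not part of the Analytics Zoo API):

import random

def build_relations(positive_pairs, all_answer_ids, neg_per_question=5):
    # positive_pairs: list of (question_id, answer_id) labeled by human agents
    relations = []
    for q_id, pos_a_id in positive_pairs:
        relations.append((q_id, pos_a_id, 1))  # positive relation
        non_matching = [a_id for a_id in all_answer_ids if a_id != pos_a_id]
        for neg_a_id in random.sample(non_matching, neg_per_question):
            relations.append((q_id, neg_a_id, 0))  # randomly sampled negative relation
    return relations  # rows of (question ID, answer ID, label)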
We construct separate relation files for training, validation and testing both manually and semi-automatically. Each relation record contains a question ID, an answer ID and a label (0/1). Sample relations look like the following:
Relations in csv format can also be easily read as an RDD using the following API:
from zoo.feature.common import Relations

train_relations = Relations.read("relation_train.csv", sc, partition_num)
validate_relations = Relations.read("relation_validate.csv", sc, partition_num)
The subsequent preprocessing steps are quite similar to those in the text classifier module. Each input needs to go through tokenization, transformation from word to index, and sequence aligning. You can refer to the corresponding section in our previous article for more details.
TextSet in Analytics Zoo provides built-in operations that make it easy to construct the preprocessing pipeline. The original implementation provided by Analytics Zoo handles English only. Since our data are in Chinese, we made adaptations and use jieba with customized tokens to break Chinese sentences into words. The preprocessing code looks like this:
transformed_q_set = q_set.tokenize().word2idx().shape_sequence(q_len)
transformed_a_set = a_set.tokenize().word2idx(existing_map=transformed_q_set.get_word_index()) \
    .shape_sequence(a_len)
word_index = transformed_a_set.get_word_index()
Internally, the above procedure first goes through the preprocessing steps for the question corpus. Then for the answer corpus, it preprocesses similarly, except that it will add new words to the word index map obtained by the question corpus so that both corpuses share the same word index. The above operations are based on RDD and thus can be easily scaled out and performed on huge question and answer datasets in a distributed fashion.
Model definition and construction
For the text matching model we use the built-in K-NRM model in Analytics Zoo, which takes advantage of a kernel-pooling technique to learn ranking efficiently. Below is the architecture of the K-NRM model:
The input query and document first go through a shared embedding layer, which typically uses pre-trained word embeddings as its initial weights. A subsequent dot layer then generates the translation matrix, in which each entry represents the similarity between a word in the query and a word in the document. RBF kernels are used to extract multi-level soft match features, followed by a learning-to-rank layer, which combines these soft-TF features into a final ranking score. You may refer to the paper "End-to-End Neural Ad-hoc Ranking with Kernel Pooling" for more details.
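To make the kernel-pooling step more concrete, here is a small numpy sketch of the soft-TF feature computation described in the paper (an illustration of the math only, not code from the Analytics Zoo implementation):

import numpy as np

def kernel_pooling(M, mus, sigmas):
    # M: translation matrix of shape (query_len, doc_len) holding word-word similarities
    # mus, sigmas: centers and widths of the RBF kernels
    features = []
    for mu, sigma in zip(mus, sigmas):
        k = np.exp(-((M - mu) ** 2) / (2 * sigma ** 2))  # RBF kernel applied entry-wise
        soft_tf = k.sum(axis=1)                          # pool over document words
        features.append(np.log(np.clip(soft_tf, 1e-10, None)).sum())  # pool over query words
    return np.array(features)  # one soft-TF feature per kernel, fed to the learning-to-rank layer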
The K-NRM model can be constructed out of the box using the API below:

from zoo.models.textmatching import KNRM

knrm = KNRM(text1_length, text2_length, embedding_file, word_index=None,
            train_embed=True, kernel_num=21, sigma=0.1, exact_sigma=0.001,
            target_mode="ranking")
This model expects a sequence of question word indices concatenated with answer word indices as input and outputs a score between them. Users need to specify the lengths of the question and the answer, q_len and a_len respectively. Note that for word embeddings, the model supports pre-trained GloVe embeddings for English. Again, since we are dealing with Chinese, we made modifications and chose FastText for Chinese word embeddings. The argument word_index is the map from each word to its ID, generated by the answer corpus TextSet described above.
Whether to train the embedding layer is configurable, and our experiments show that slightly adjusting the word embeddings according to the results of kernel pooling leads to better performance. You can also specify how many kernels to use and the kernel width. In practice, the default parameters work well enough for our dataset.
This is actually a multi-purpose model whose target_mode can either be ‘ranking’ or ‘classification’. You can see the documentation for more details.
Model training, evaluation, and tuning
Now we have all the ingredients to start training! The training and validation relations, together with the preprocessed TextSets for the question and answer corpuses, are essentially all we need to train our text matching model.
As the target_mode argument described above indicates, the model can be trained in two ways. One is to train on each relation record separately as a binary classification problem, where the output is the probability that the question is related to the answer. The other is to train on pairs of records jointly, with each pair consisting of a positive relation (label 1) and a negative relation (label 0) of the same question, and optimize the margin within the pair. We have tried both ways and found that the latter performs better.
Pairwise training takes a pair of relations of the same question as an input. Each pair of relations consists of one relation with label 1 and the other with label 0. Thus we wrap a TimeDistributed wrapper outside the K-NRM model in this case:
from zoo.pipeline.api.keras.models import Sequential
from zoo.pipeline.api.keras.layers import TimeDistributed

model = Sequential().add(TimeDistributed(knrm, input_shape=(2, q_len + a_len)))
Analytics Zoo also provides an API to directly generate all relation pairs given relations and preprocessed corpuses, the result of which can be directly fed into the model.
train_set = TextSet.from_relation_pairs(train_relations, transformed_q_set, transformed_a_set)
Then we use the convenient Keras-style API to compile and train the K-NRM model. A RankHinge loss is provided especially for pairwise training. The hinge loss is used for maximum-margin classification, and RankHinge is its variant that aims at maximizing the margin between a positive sample and a negative one.
from bigdl.optim.optimizer import SGD

model.compile(optimizer=SGD(learning_rate), loss="rank_hinge")
model.fit(train_set, batch_size, nb_epoch)
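For reference, the rank hinge loss for one (positive, negative) pair is commonly defined as max(0, margin - (positive score - negative score)); below is a tiny sketch of that formulation (the margin value is a typical default, not necessarily the one used internally):

def rank_hinge_loss(pos_score, neg_score, margin=1.0):
    # Penalize pairs where the positive answer does not beat the negative one by at least the margin.
    return max(0.0, margin - (pos_score - neg_score))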
Tunable parameters include the number of epochs, batch size, learning rate, etc. Users can also take snapshots during training and resume training from a snapshot later, as sketched below.
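A sketch of how such snapshots might be configured with the Keras-style API, assuming the set_checkpoint method available in BigDL/Analytics Zoo (please verify the exact signature against the version you use; the path is a placeholder):

# Assumption: set_checkpoint periodically saves model snapshots during training.
model.set_checkpoint("/tmp/knrm_checkpoints")
model.fit(train_set, batch_size, nb_epoch)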
The evaluation of a ranking model differs a bit from training. Basically, for every validation question we prepare one correct answer together with a number of wrong answers, and we rank all the candidate answers in descending order of their output scores. The higher the answers with actual label 1 are ranked, the better. NDCG and MAP are common metrics for evaluating ranking tasks. Example code for listwise evaluation looks like the following:
validate_set = TextSet.from_relation_lists(validate_relations, transformed_q_set, transformed_a_set)

knrm.evaluate_ndcg(validate_set, k=3)
knrm.evaluate_map(validate_set)
You can find the evaluation results in the console log. NDCG and MAP both give values between 0 and 1; if the metrics are close to 1, the most related answers are ranked foremost. You can also save summaries during training and use TensorBoard to visualize the loss curve. If the metrics are relatively low or the model is not converging as expected, the model performance is not good enough and we have to tune the model. This is generally an iterative process of checking data quality, selecting proper training hyperparameters, or adjusting model arguments until we reach a satisfactory result, after which the trained model can go into production.
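As mentioned above, training summaries can be saved and visualized with TensorBoard; here is a sketch using the set_tensorboard method in Analytics Zoo (the log directory and application name below are placeholders):

# Call before fit so that training and validation summaries are collected.
model.set_tensorboard("/tmp/qa_ranker_logs", "knrm")
# After training, run: tensorboard --logdir /tmp/qa_ranker_logs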
Model serving and publishing
This part is pretty much the same as in the text classifier module, illustrated in detail in our previous article. We use the POJO-like Java inference API for our service (see here for more details). Since the preprocessing of each question and answer in the QA ranker is basically the same as in the text classifier, the two modules share this part of the code for ease of maintenance. Analytics Zoo also provides web service examples (including text classification and recommendation) for reference.
As we continuously collect user feedback, we will have more and more relations with which to periodically re-train and publish an updated ranking model.
Conclusion
This brings us to the end of this article. To conclude, we have demonstrated the process of successfully building a QA ranker module on the Azure big data platform using Intel Analytics Zoo. You can follow the steps above and refer to the guidance and examples provided by Analytics Zoo to add a similar module to your own application or service. We will continue to share more practical experience in building our customer support platform in the following articles of this series.
For more information, please visit the Analytics Zoo project homepage on GitHub. You can also download and try the image preinstalled with Analytics Zoo and BigDL on the Azure Marketplace.
About the Authors
Chen Xu is a senior software engineer at Microsoft. He leads the Mooncake Support Chatbot AI component design and development.
Yuqing Wei is a software engineer at Microsoft, focusing on Big Data Platform and related technologies. She contributes to Mooncake Support Chatbot AI development.
Shengsheng (Shane) Huang is a senior software architect at Intel and an Apache Spark committer and PMC member. She has 10+ years of experience in Big Data and now serves in a leading role in the development of distributed deep learning infrastructure and applications on Apache Spark.
Kai Huang is a software engineer at Intel. His work mainly focuses on developing deep learning frameworks on Apache Spark and helping customers work out end-to-end deep learning solutions on big data platforms.