In a recent blog post, Google announced it has open-sourced BERT, its state-of-the-art technique for pre-training natural language processing (NLP) models. Google released it in part because of the shortage of public datasets available to developers. BERT also introduces a new bidirectional pre-training technique that improves its effectiveness on NLP tasks. To cut the time developers and researchers spend training their NLP models, Google has also optimized BERT for Cloud Tensor Processing Units (TPUs), reducing training time to around 30 minutes on a Cloud TPU, versus a few hours on a single GPU.
Google feels there is a shortage of NLP training data available to developers. Jacob Devlin and Ming-Wei Chang, research scientists at Google, explain the challenge:
One of the biggest challenges in natural language processing (NLP) is the shortage of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples. However, modern deep learning-based NLP models see benefits from much larger amounts of data, improving when trained on millions, or billions, of annotated training examples.
Researchers at Google have also developed a new technique, called Bidirectional Encoder Representations from Transformers (BERT), for training general-purpose language representation models on a very large dataset of generic text from the web, a step referred to as pre-training. Devlin explains why this pre-training is important:
The pre-trained model can then be fine-tuned on small-data NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy improvements compared to training on these datasets from scratch.
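As an illustration of that workflow, the sketch below fine-tunes a pre-trained BERT model on a tiny sentiment example. It uses the third-party Hugging Face transformers library rather than Google's TensorFlow release, and the model name, example texts, and hyperparameters are placeholders chosen for illustration, not values from the paper.

```python
# Minimal sketch of fine-tuning a pre-trained BERT model on a small
# sentiment task, using the third-party Hugging Face "transformers"
# library (not Google's TensorFlow release) purely for illustration.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# A tiny, hypothetical "small-data" task: binary sentiment labels.
texts = ["the movie was great", "the movie was terrible"]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step: only the small classification head is new;
# the rest of the network starts from the pre-trained weights.
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```

In practice the same loop would run for a few epochs over the full task dataset, but the structure stays the same: a pre-trained encoder plus a small task-specific head trained on limited labeled data.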
BERT includes source code built on TensorFlow, an open-source machine learning framework, along with a series of pre-trained language representation models. Google has also published an associated paper demonstrating state-of-the-art results on 11 NLP tasks, including its performance on the Stanford Question Answering Dataset (SQuAD v1.1).
BERT builds upon prior work in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. What really sets BERT apart is that it is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. Devlin explains why this approach is unique:
Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word "bank" would have the same context-free representation in "bank account" and "bank of the river."
Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence "I accessed the bank account," a unidirectional contextual model would represent "bank" based on "I accessed the" but not "account." However, BERT represents "bank" using both its previous and next context — "I accessed the ... account" — starting from the very bottom of a deep neural network, making it deeply bidirectional.
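The distinction can be seen directly by extracting the vector a contextual model produces for "bank" in two different sentences. The sketch below again uses the third-party transformers library as a stand-in for illustration; a context-free model such as word2vec would return the same vector in both cases.

```python
# Sketch: the contextual representation of "bank" differs between two
# sentences, whereas a context-free embedding table would not.
# Uses the third-party "transformers" library for illustration.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    """Return the hidden state the model assigns to the token 'bank'."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = enc.input_ids[0].tolist().index(bank_id)
    return hidden[position]

v_account = bank_vector("I accessed the bank account.")
v_river = bank_vector("I sat down on the bank of the river.")

# The two vectors are not identical: "bank" is represented differently
# depending on both its left and right context.
print(torch.cosine_similarity(v_account, v_river, dim=0).item())
```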
Unidirectional models work by predicting each word based upon the previous words in the sentence. Historically, bidirectional training was difficult because, in a multi-layer model, a word would inevitably be able to see itself: the 'next' predicted word looks back to the 'previous' predicted word. BERT addresses this challenge by masking out some of the input words and training the model to predict only those masked words.
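A simplified sketch of that masking step is shown below. The 15% masking rate matches the figure reported in the BERT paper, but the whitespace-split input and the helper function are simplifications for illustration; the actual procedure also sometimes keeps the original word or substitutes a random one.

```python
# Simplified sketch of BERT's masked-word training objective: a fraction
# of the input tokens is hidden, and the model only has to predict those
# hidden positions, so no word can "see itself" through the
# bidirectional context.
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Replace roughly mask_prob of the tokens with [MASK] and record targets."""
    masked = list(tokens)
    targets = {}                       # position -> original word to predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            masked[i] = MASK_TOKEN     # the real procedure also sometimes keeps
                                       # the word or swaps in a random one
    return masked, targets

tokens = "I accessed the bank account".split()
masked, targets = mask_tokens(tokens)
print(masked)   # e.g. ['I', 'accessed', 'the', '[MASK]', 'account']
print(targets)  # e.g. {3: 'bank'} -- only these positions are predicted
```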
Image Source: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
In addition to the new techniques included in BERT, Google has made enhancements to Cloud TPUs that allow developers and researchers to quickly experiment with, debug, and tweak models. These investments allowed Google to exceed the capabilities of existing pre-training models.
BERT is available on GitHub, in addition to the tensor2tensor library.