Meta AI’s research team recently developed a neural-network-based system, called SIDE, that can scan hundreds of thousands of Wikipedia citations at once and check whether they actually support the corresponding content.
Wikipedia is a multilingual free online encyclopedia written and maintained by volunteers through open collaboration and a wiki-based editing system, and it currently hosts some 6.5 million articles. Because Wikipedia is crowdsourced, it generally requires that facts be corroborated: quotations, controversial statements, and contentious material about living people must all carry a citation. Volunteers double-check Wikipedia’s footnotes, but with more than 17,000 new articles added each month, it is increasingly difficult for them to keep pace as the site grows, and readers commonly wonder about the accuracy of the entries they read. Human editors need help from technology to identify gibberish or statements that lack citations, but determining whether or not a source actually backs up a claim is a complex task for AI, because it requires a deep understanding of the cited source to perform an accurate analysis.
For this purpose, the Meta AI research team created a new dataset of 134 million public webpages (split into 906 million passages of 100 tokens each), an order of magnitude more data than the knowledge sources considered in current NLP research and significantly more intricate than any previously used for this kind of work. For comparison, the next-largest dataset in terms of passages/documents backs the Internet-Augmented Dialog generator, which pulls data from 250 million passages and 109 million documents.
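For a rough idea of how such a corpus is prepared, the sketch below splits a document into fixed-length passages. It uses plain whitespace tokenization for simplicity; the tokenizer actually used to produce the 100-token passages is not detailed here.

```python
def split_into_passages(text: str, passage_len: int = 100) -> list[str]:
    """Split a document into consecutive passages of `passage_len` tokens.

    Whitespace tokenization is an illustrative simplification; the real
    pipeline's tokenizer may segment text differently.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + passage_len])
        for i in range(0, len(tokens), passage_len)
    ]

doc = "Wikipedia is a multilingual free online encyclopedia " * 50
passages = split_into_passages(doc)
print(len(passages), "passages;", "first passage:", passages[0][:60], "...")
```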
This new dataset is the knowledge source of the neural network model, which finds citations that seem irrelevant and suggests a more applicable source instead, pointing to the specific passage that supports the claim. Natural-language-understanding (NLU) techniques are used to perform the tasks that allow the system to evaluate a citation. In NLU, a model translates human sentences (or words, phrases, or paragraphs) into complex mathematical representations. The tool is designed to compare these representations in order to determine whether one statement supports or contradicts another.
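As a minimal sketch of what comparing such representations looks like in practice, the snippet below encodes a claim and a passage into dense vectors and measures their similarity. It uses the open-source sentence-transformers library and an off-the-shelf checkpoint as stand-ins; these are not the paper's own encoders, and the example sentences are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Off-the-shelf encoder used purely for illustration, not SIDE's model.
model = SentenceTransformer("all-MiniLM-L6-v2")

claim = "Blackpink is a South Korean girl group formed by YG Entertainment."
passage = ("Blackpink debuted in August 2016 under YG Entertainment "
           "with the single album Square One.")

# Translate both statements into dense vector representations.
emb_claim, emb_passage = model.encode([claim, passage])

# A higher cosine similarity suggests the passage is more likely relevant.
score = util.cos_sim(emb_claim, emb_passage)
print(f"similarity: {score.item():.3f}")
```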
The new dataset also powers one of the system’s main components: Sphere, a web-scale retrieval library that is already open-sourced.
The decision flow of SIDE, from a claim on Wikipedia to a suggestion for a new citation, works as follows:
SIDE workflow. From paper: Improving Wikipedia Verifiability with AI
The claim is sent to the Sphere retrieval engine, which produces a list of candidate documents from the Sphere corpus. The sparse retrieval sub-system uses a seq2seq model to translate the citation context into query text, then matches the resulting query (a sparse bag-of-words vector) against a BM25 index of Sphere. The seq2seq model is trained on data from Wikipedia itself: the target queries are set to the web-page titles of existing Wikipedia citations. The dense retrieval sub-system is a neural network that learns from Wikipedia data to encode the citation context into a dense query vector. This vector is then matched against the vector encodings of all passages in Sphere, and the closest ones are returned.
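The two retrieval routes can be illustrated on a toy corpus as follows. The rank_bm25 and sentence-transformers libraries stand in for the production BM25 index and dense retriever over Sphere, and the query here is the raw claim rather than the output of the seq2seq query generator.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Blackpink debuted in August 2016 under YG Entertainment.",
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "Python is a widely used programming language.",
]
query = "Blackpink is a South Korean girl group formed by YG Entertainment."

# Sparse route: match a bag-of-words query against a BM25 index.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_scores = bm25.get_scores(query.lower().split())

# Dense route: encode query and passages, match by vector similarity.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint
dense_scores = util.cos_sim(encoder.encode(query), encoder.encode(corpus))[0]

for doc, s, d in zip(corpus, sparse_scores, dense_scores):
    print(f"bm25={s:5.2f}  dense={d.item():.2f}  {doc[:50]}")
```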
The verification engine then ranks the candidate documents and the original citation with respect to the claim. A neural network takes the claim and a document as input and predicts how well the document supports the claim. For efficiency reasons, it operates at the passage level and computes a document's verification score as the maximum over its per-passage scores. The scores themselves come from a fine-tuned BERT transformer that takes the concatenated claim and passage as input.
In other words, the model creates and compares mathematical representations of the meanings of entire statements rather than of individual words. Because webpages can contain long stretches of text, the models assess content in chunks and consider only the most relevant passage when deciding whether to recommend a URL.
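The sketch below shows this per-passage scoring with max aggregation. The NLI cross-encoder checkpoint is an open stand-in for SIDE's fine-tuned verifier, not the paper's actual model, and the passages are illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Open NLI cross-encoder used as a stand-in for SIDE's verifier.
name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

claim = "Blackpink is a South Korean girl group formed by YG Entertainment."
passages = [  # chunks of one candidate web document
    "Blackpink debuted in August 2016 under YG Entertainment.",
    "The group consists of members Jisoo, Jennie, Rose, and Lisa.",
]

# Concatenate the claim with each passage and score every pair in one batch.
inputs = tokenizer([claim] * len(passages), passages,
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

# Use the probability of the "entailment" class as the support score;
# the label-to-index mapping is specific to this checkpoint.
entail_idx = model.config.label2id["entailment"]
per_passage = probs[:, entail_idx]
doc_score = per_passage.max().item()  # document score = max over passages
print(f"document verification score: {doc_score:.3f}")
```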
The indices pass potential sources to an evidence-ranking model, which compares the new text with the original citation. Using fine-grained language comprehension, the model ranks the cited source and the retrieved alternatives by how likely they are to support the claim. If the original citation does not rank above all the candidate documents, a new citation is suggested from the retrieved candidates.
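That decision rule fits in a few lines. In the sketch below, `verify` is a placeholder for the verification engine's document-scoring function from the previous step; the function and argument names are illustrative.

```python
def suggest_citation(claim, original_doc, candidate_docs, verify):
    """Return the document to recommend citing for `claim`.

    `verify(claim, doc)` is assumed to return a document-level
    verification score (max over per-passage scores).
    """
    best = max(candidate_docs, key=lambda d: verify(claim, d))
    # Keep the existing citation unless a retrieved candidate outranks it.
    if verify(claim, original_doc) >= verify(claim, best):
        return original_doc
    return best
```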
Sphere was tested on the Knowledge Intensive Language Tasks (KILT) benchmark and surpassed the state of the art on two of its tasks.
No computer system yet has human-level comprehension of language, but projects like this one, which teach algorithms to understand dense material with an ever-higher degree of sophistication, help AI make sense of the real world. Meta AI's research team says the goal of this work is to build a platform that helps Wikipedia editors systematically spot citation issues and quickly fix the citation, or correct the content of the corresponding article, at scale. SIDE is open-sourced and can be tested here.