- MIT researchers develop an automated text-generating system to spot and replace outdated information in Wikipedia pages.
- The tool rewrites sentences while retaining humanlike grammar and structure.
Online encyclopedias contain millions of sentences that need frequent updates and corrections. Wikipedia, for example, consists of more than 40 million articles in over 300 languages. The English Wikipedia alone contains over 3.5 billion words across 6 million articles.
Millions of those articles are time-sensitive, meaning they must be continually updated. Some updates involve modifying existing content, while others require expanding articles with new material.
MIT researchers have focused on the former scenario, in which new information contradicts what an article currently says. They have developed a text-generating tool that can automatically rewrite outdated sentences and fix factual inconsistencies in Wikipedia articles.
It can spot and replace specific information in the relevant articles. More impressively, the tool rewrites sentences in the same manner humans write and edit.
This will reduce the effort and time human editors currently spend on routine modifications such as updating names, locations, dates, and numbers. Instead of hundreds of people modifying each Wikipedia article, only a few will be needed.
How Does It Work?
The tool provides an interface for humans to type unstructured sentences with updated data, without worrying about grammar or style. It then automatically pinpoints the relevant Wikipedia page and outdated sentence.
The output must be consistent with the new data and fit the rest of the existing article. The researchers proposed a two-step approach to this constrained generation task. The model takes two inputs: an outdated sentence from a Wikipedia page and a separate claim sentence containing the updated, conflicting information.
- Detect and delete the contradicting elements in the target text for a given claim.
- Expand the remaining text to be consistent with the given claim.
The first step is performed by a neutralizing stance model; the second uses a novel two-encoder sequence-to-sequence model with copy attention.
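The two steps above can be sketched in code. This is a deliberately simplified, illustrative stand-in, not the researchers' actual system: the learned neutralizing masker is replaced by a heuristic that masks numbers absent from the claim, and the two-encoder seq2seq model with copy attention is replaced by a rule that copies the claim's numbers into the masked slots. The function names and the `[MASK]` token are assumptions for illustration only.

```python
import re

def mask_contradictions(outdated: str, claim: str) -> str:
    """Step 1 (toy stand-in for the learned neutralizing masker):
    mask tokens in the outdated sentence that conflict with the claim.
    Here we simply mask numbers that do not appear in the claim."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    out = []
    for tok in outdated.split():
        core = re.sub(r"\W", "", tok).lower()
        if core.isdigit() and core not in claim_tokens:
            out.append("[MASK]")
        else:
            out.append(tok)
    return " ".join(out)

def fill_from_claim(masked: str, claim: str) -> str:
    """Step 2 (toy stand-in for the two-encoder seq2seq generator
    with copy attention): copy updated facts from the claim into
    the masked slots, preserving the rest of the sentence."""
    claim_numbers = [t for t in re.findall(r"\w+", claim) if t.isdigit()]
    out = []
    for tok in masked.split():
        if tok == "[MASK]" and claim_numbers:
            out.append(claim_numbers.pop(0))
        else:
            out.append(tok)
    return " ".join(out)

outdated = "The company employs 100 people worldwide."
claim = "The company now employs 150 people."
masked = mask_contradictions(outdated, claim)
print(masked)                        # The company employs [MASK] people worldwide.
print(fill_from_claim(masked, claim))  # The company employs 150 people worldwide.
```

The real system learns both steps from data, so it handles names, dates, and locations as well as numbers, and it rephrases surrounding words when needed rather than doing pure slot-filling.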
The fact-guided update pipeline
The researchers evaluated the model using the SARI score, which measures how well machines add, keep, and delete words compared with the way human editors rewrite sentences. They found that the new model updated factual information accurately: it performed significantly better than existing text-generating techniques, and its outputs more closely resembled humanlike grammar and style.
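To make the add/keep/delete idea behind SARI concrete, here is a rough bag-of-words approximation. The real metric (Xu et al., 2016) averages n-gram F1 scores over multiple references and treats the delete component slightly differently; this sketch uses unigrams and a single reference, purely for intuition.

```python
def sari_components(source: str, prediction: str, reference: str):
    """Toy, unigram approximation of the SARI metric: score how well
    an edit adds, keeps, and deletes words relative to a human-written
    reference edit of the same source sentence."""
    src, pred, ref = set(source.split()), set(prediction.split()), set(reference.split())

    def f1(retrieved, relevant):
        if not retrieved or not relevant:
            return 1.0 if retrieved == relevant else 0.0
        tp = len(retrieved & relevant)
        p, r = tp / len(retrieved), tp / len(relevant)
        return 2 * p * r / (p + r) if p + r else 0.0

    add    = f1(pred - src, ref - src)   # words the edit introduced
    keep   = f1(pred & src, ref & src)   # words the edit retained
    delete = f1(src - pred, src - ref)   # words the edit removed
    return add, keep, delete

src = "The company employs 100 people"
ref = "The company employs 150 people"
print(sari_components(src, ref, ref))  # (1.0, 1.0, 1.0) for a perfect edit
```

An edit that matches the reference exactly scores 1.0 on all three components, while a model that copies the source unchanged scores 0.0 on both add and delete, which is why SARI rewards genuine, correct edits rather than conservative copying.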
The model's outputs were also rated by crowdsourced human evaluators, achieving an average score of 3.85 out of 5 for grammar and 4.0 out of 5 for factual accuracy.
The findings also show that the model can be used to augment datasets and reduce bias when training 'fake news' detectors. In this study, the researchers cut the error rate of a standard fake news detector by 13% using their augmented dataset, without manually gathering additional data.
In the coming years, the researchers aim to build a fully automated model: one that can identify the latest information on the internet and use it to rewrite the corresponding sentences in the relevant Wikipedia pages.