In a recent press release, Amazon announced the general availability of Amazon Textract, a fully managed machine learning service that extracts text and structured data from documents. Using Amazon Textract, customers can automate document workflows and index and catalog important information for use in downstream applications. The service is capable of processing millions of document pages in a few hours.
Amazon is looking to democratize intelligent document extraction to drive positive business outcomes. Swami Sivasubramanian, vice president, Amazon Machine Learning, explains:
The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document with no machine learning experience required. In addition to the integration with other AWS services, the rich partner community developing around Amazon Textract makes it possible for customers to gain real meaning from their file collections, operate more efficiently, improve security compliance, automate data entry, and facilitate faster business decisions.
Amazon Textract goes beyond traditional optical character recognition (OCR) techniques by identifying key fields and content. Text and tables can be extracted from files such as PDFs and images using Textract APIs and then passed to Amazon machine learning services such as Amazon Comprehend, Amazon Comprehend Medical, and Amazon Translate to analyze the content in a more intelligent manner.
The data extracted by Textract is returned in JSON format and includes metadata such as the page number, section, label and data type. Both the content and metadata can then be loaded into databases and analytics services, including Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena, for consumption by other applications in areas such as accounting, auditing and compliance.
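As a rough illustration of what working with that JSON might look like, the sketch below flattens a response into simple records that could then be loaded into a database or search index. The sample response is hand-written with invented values; it follows the general shape of a Textract response (a `Blocks` list whose entries carry `BlockType`, `Id`, `Text` and `Page`), and the helper name `blocks_to_records` is hypothetical, not part of any AWS SDK.

```python
import json

# Hand-written sample shaped like a Textract response; all values are invented.
sample_response = json.loads("""
{
  "DocumentMetadata": {"Pages": 1},
  "Blocks": [
    {"BlockType": "PAGE", "Id": "p1", "Page": 1},
    {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Invoice 1042", "Confidence": 99.1},
    {"BlockType": "WORD", "Id": "w1", "Page": 1, "Text": "Invoice", "Confidence": 99.4},
    {"BlockType": "WORD", "Id": "w2", "Page": 1, "Text": "1042", "Confidence": 98.8}
  ]
}
""")


def blocks_to_records(response, block_type="LINE"):
    """Flatten blocks of one type into plain dicts (hypothetical helper)."""
    records = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != block_type:
            continue
        records.append({
            "id": block["Id"],
            # Assumption: default to page 1 when the field is absent.
            "page": block.get("Page", 1),
            "type": block["BlockType"],
            "text": block.get("Text", ""),
        })
    return records


records = blocks_to_records(sample_response)
print(records)
```

Each record is a flat dictionary, which keeps the loading step into a downstream store independent of the nested response format.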
To gauge the accuracy of a data extraction process, Textract returns a confidence score, represented as a percentage, for each data attribute that it identifies. This allows a developer to flag possible inaccuracies and route the affected data to a human for further validation. In addition, bounding box coordinates are provided to identify specifically where in the document the data was extracted from.
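The review workflow described above could be sketched as follows. This assumes each extracted block carries a `Confidence` value between 0 and 100 and a `Geometry.BoundingBox` whose `Left`, `Top`, `Width` and `Height` are ratios of the page dimensions. The 90 percent threshold, the sample blocks and the helper names are illustrative assumptions, not recommendations from Amazon.

```python
# Invented sample blocks; only the field names follow the Textract response shape.
sample_blocks = [
    {"Id": "w1", "Text": "Total", "Confidence": 99.2,
     "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.5, "Height": 0.05}}},
    {"Id": "w2", "Text": "1O42", "Confidence": 71.5,
     "Geometry": {"BoundingBox": {"Left": 0.7, "Top": 0.2, "Width": 0.1, "Height": 0.05}}},
]


def flag_for_review(blocks, threshold=90.0):
    """Return blocks whose confidence falls below the (assumed) threshold."""
    return [b for b in blocks if "Confidence" in b and b["Confidence"] < threshold]


def box_to_pixels(bounding_box, image_width, image_height):
    """Convert a ratio-based bounding box into pixel coordinates."""
    return {
        "left": round(bounding_box["Left"] * image_width),
        "top": round(bounding_box["Top"] * image_height),
        "width": round(bounding_box["Width"] * image_width),
        "height": round(bounding_box["Height"] * image_height),
    }


needs_review = flag_for_review(sample_blocks)
for block in needs_review:
    # Pixel region a reviewer would be shown, assuming a 1000x800 page image.
    region = box_to_pixels(block["Geometry"]["BoundingBox"], 1000, 800)
    print(block["Id"], block["Text"], block["Confidence"], region)
```

In practice the threshold would be tuned per document type, since a score that is acceptable for free text may be too low for financial figures.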
Amazon already has customers using the Textract service, including PwC, Healthfirst, Informed Inc, UiPath and The Globe and Mail. The Globe and Mail has used Textract to improve the productivity of its journalists and to take advantage of vast datasets that were previously underutilized. Michael O’Neill, managing director of Digital and Data Science at The Globe and Mail, explains:
As a news media company, we rely on many PDF or scanned-source documents such as FOIs (freedom of information requests) that have important information contained in tables that we previously couldn't access. These documents have been under-utilized because journalists were not able to access them easily or didn't know they existed. Using Amazon Textract, we are able to extract information from tables in PDFs and easily output that data to CSV and offer easy access to these documents by making them available for search queries by our journalists. This increases efficient access to information for our journalist by tenfold.
For additional information about Amazon Textract, please refer to the in-product documentation.