Google has open-sourced its AI model for converting sequences of natural-language instructions into actions in a mobile device UI. The model is based on the Transformer deep-learning architecture and achieves 70% accuracy on a new benchmark dataset created for the project.
A team of scientists from Google Research published a paper describing the model at the recent Association for Computational Linguistics (ACL) conference. The goal of the project is to help develop natural-language interfaces for mobile device users who are visually impaired or who temporarily need a "hands-free" mode. The system uses two Transformer models in sequence: the first to convert natural-language instructions to a series of "action phrases," and the second to "ground" the action phrases by matching them with on-screen UI objects. As research scientist Yang Li wrote in a blog post describing the project,
This work lays the technical foundation for task automation on mobile devices that would alleviate the need to maneuver through UI details, which may be especially valuable for users who are visually or situationally impaired
The Transformer is a deep-learning architecture for mapping input sequences to output sequences, developed by Google in 2017. It has several advantages over other sequence-learning architectures, such as recurrent neural networks (RNNs), including more stable training and faster inference; consequently, most state-of-the-art natural-language processing (NLP) systems are Transformer-based. The key operation in a Transformer is attention, which learns relationships between different parts of the input and output sequences. For example, in a Transformer trained to translate from one language to another, attention often learns the mapping of words in the source language to words in the target language.
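To make the attention operation concrete, here is a minimal Python/NumPy sketch of scaled dot-product attention, the core operation inside a Transformer. The function and variable names are illustrative and are not taken from Google's code.

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Minimal scaled dot-product attention.

    queries: (num_queries, d_k)
    keys:    (num_keys, d_k)
    values:  (num_keys, d_v)
    Returns a weighted sum of values for each query, where the weights
    reflect how strongly each query "attends" to each key.
    """
    d_k = queries.shape[-1]
    # Similarity between every query and every key, scaled for numerical stability.
    scores = queries @ keys.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Toy example: 2 output positions attending over 3 input positions.
q = np.random.randn(2, 8)
k = np.random.randn(3, 8)
v = np.random.randn(3, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # (2, 8)
```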
In Google's new AI, one Transformer uses a form of attention called area attention to identify spans of adjacent words in the input instructions that map to discrete actions: for example, "navigate to." This Transformer converts a sequence of natural-language input instructions into a sequence of tuples that represent UI actions. Each tuple consists of an operation (such as "open" or "click"), a description of an object to operate on (such as "Settings" or "App Drawer"), and an optional parameter (for example, text that should be typed into a text box). Before these actions can be executed, they must be grounded by identifying the correct UI object. This is done by a second Transformer; the inputs to this Transformer include both an action-phrase tuple and the set of UI objects currently on the device's screen. The Transformer learns to select an object based on the description from the action-phrase tuple.
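The two-stage pipeline can be sketched in code. The Python snippet below is a hypothetical illustration of the action-phrase tuple and the grounding step; the class names, fields, and the word-overlap matcher are stand-ins chosen for clarity, since the real grounding Transformer scores UI objects with learned representations rather than word overlap.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ActionPhrase:
    """Illustrative version of the paper's action tuple."""
    operation: str                  # e.g. "click", "open", "input"
    object_description: str         # e.g. "Settings", "App Drawer"
    argument: Optional[str] = None  # e.g. text to type into a text box

@dataclass
class UIObject:
    object_id: int
    text: str  # label or content description shown on screen

def ground(action: ActionPhrase, screen: List[UIObject]) -> UIObject:
    """Toy stand-in for the grounding model: pick the on-screen object
    whose text best matches the action's object description."""
    def overlap(obj: UIObject) -> int:
        return len(set(obj.text.lower().split()) &
                   set(action.object_description.lower().split()))
    return max(screen, key=overlap)

# Example: ground the action "open Settings" against two on-screen objects.
screen = [UIObject(0, "App Drawer"), UIObject(1, "Settings")]
action = ActionPhrase(operation="open", object_description="Settings")
print(ground(action, screen).object_id)  # 1
```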
To train the model, Google created two datasets, one for each Transformer. A dataset called AndroidHowTo for training the action-phrase extraction Transformer was collected by scraping the web for answers to "how-to" questions related to Android devices. Human annotators labelled the data by identifying action-phrase tuples in the answer instructions. The final dataset contains nearly 10k labelled instructions, representing 190k actions. For the grounding Transformer, the team generated a synthetic dataset called RicoSCA. Starting with a publicly-available dataset called Rico, which contains 72k UI screens for Android apps, the team randomly selected UI elements from screens and generated commands for them, such as "tap" or "click." The resulting dataset contains nearly 300k commands.
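The synthetic-data idea behind RicoSCA can be illustrated with a short sketch. The snippet below shows one plausible way to template commands from randomly chosen UI elements; the templates, element fields, and function names are assumptions made for illustration and do not reflect Google's actual generation pipeline.

```python
import random

# Hypothetical command templates keyed by operation type.
OPERATIONS = {
    "click": ["tap the {name} button", "click on {name}"],
    "input": ["type '{text}' into the {name} field"],
}

def synthesize_command(screen_elements):
    """Pick a UI element from a screen and template a command for it.
    The command, the operation, and the element id together form one
    labelled training example for a grounding model."""
    element = random.choice(screen_elements)
    op = "input" if element.get("editable") else "click"
    template = random.choice(OPERATIONS[op])
    command = template.format(name=element["name"], text="hello")
    return command, op, element["id"]

# Illustrative screen with two elements.
screen = [
    {"id": 0, "name": "Settings", "editable": False},
    {"id": 1, "name": "Search", "editable": True},
]
print(synthesize_command(screen))
```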
To evaluate the overall performance of the system, the researchers created a dataset called PixelHelp, compiled from Pixel phone help pages. Human operators used Pixel phone emulators to perform the tasks described in the pages, and a logger recorded their actions, creating a mapping from natural-language instructions to UI actions. The resulting dataset contains 187 multi-step instructions. The new AI was evaluated on this dataset and achieved an accuracy of 70.59%.
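For context, a natural way to score multi-step instructions is sequence-level accuracy, where an instruction counts as correct only if every predicted action in its sequence matches the ground truth. The sketch below computes such a metric; this is an assumption about how the reported figure might be measured, not a description of Google's evaluation code.

```python
def complete_match_accuracy(predicted_sequences, gold_sequences):
    """Fraction of instructions whose entire predicted action sequence
    matches the reference sequence exactly (hypothetical metric sketch)."""
    correct = sum(pred == gold
                  for pred, gold in zip(predicted_sequences, gold_sequences))
    return correct / len(gold_sequences)

# Two instructions: the first is fully correct, the second has one wrong step.
preds = [[("click", 3)], [("click", 1), ("input", 2)]]
golds = [[("click", 3)], [("click", 1), ("input", 5)]]
print(complete_match_accuracy(preds, golds))  # 0.5
```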
Google's new AI is one of many efforts in natural-language automation of mobile devices. Apple introduced Siri Shortcuts in 2018, allowing users to define sequences of actions that can be triggered by a voice command. Amazon's Alexa recently introduced the ability to automate apps that support deep-linking. Both the Siri and Alexa solutions require apps to explicitly support them. By contrast, Google's AI learns to operate directly on the device UI, allowing it to be used with any app.
Google's model and dataset generation code are available on GitHub.