Google Research recently developed ScreenAI, a multimodal AI model for understanding infographics and user interfaces. ScreenAI is based on the PaLI architecture and achieves state-of-the-art performance on several tasks.
ScreenAI is pre-trained on a dataset of screenshots generated by crawling the web and automatically interacting with apps. The researchers used several off-the-shelf AI models to generate synthetic training data, including OCR to annotate the screenshots and an LLM to generate user questions about the screenshots. After pretraining and fine-tuning, the result is a five billion parameter model that can answer questions about UI screens and infographics, as well as summarize or navigate screens. ScreenAI set new performance records on the WebSRC and MoTIF benchmarks, and outperformed other similarly-sized models on the ChartQA, DocVQA, and InfographicVQA benchmarks. To help the wider research community develop and evaluate similar models, Google released three new evaluation datasets for screen-based question-answering (QA) models. According to Google:
While our model is best-in-class, we note that, on some tasks, further research is needed to bridge the gap with models like GPT-4 and Gemini, which are orders of magnitude larger. To encourage further research, we release a dataset with this unified representation, as well as two other datasets to enable more comprehensive benchmarking of models on screen-related tasks.
ScreenAI is based on the Pathways Language and Image model (PaLI) architecture, which combines a Vision Transformer (ViT) with an encoder-decoder Large Language Model (LLM), such as T5. The Google team made a key modification to this base architecture. Because UIs and infographics often have a "wide variety of resolutions and aspect ratios," they modified the image patching step of the ViT to use the patching strategy from the Pix2Struct model. This allows the model to adjust the patch grid according to the input image's shape.
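To make the flexible patching idea concrete, here is a minimal sketch of a Pix2Struct-style patch-grid computation: instead of resizing every screenshot to a fixed square, the image is scaled so that as many fixed-size patches as possible fit within a sequence-length budget while preserving its aspect ratio. The function name, patch size, and patch budget below are illustrative assumptions, not ScreenAI's published configuration.

```python
import math

def flexible_patch_grid(height: int, width: int,
                        patch_size: int = 16,
                        max_patches: int = 2048) -> tuple[int, int]:
    """Choose a (rows, cols) patch grid that preserves the image's aspect ratio
    while keeping rows * cols within the max_patches budget."""
    # Largest uniform scale factor such that rows * cols <= max_patches.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(1, min(max_patches, math.floor(scale * height / patch_size)))
    cols = max(1, min(max_patches, math.floor(scale * width / patch_size)))
    return rows, cols

# A tall phone screenshot and a wide infographic get differently shaped grids.
print(flexible_patch_grid(2400, 1080))   # portrait UI  -> (67, 30): more rows than columns
print(flexible_patch_grid(1080, 1920))   # landscape    -> (33, 60): more columns than rows
```

The point of the adjustment is visible in the two calls: the same patch budget is spent very differently on a portrait app screen versus a landscape infographic, rather than distorting both into one fixed resolution.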
To generate pretraining data, the researchers first created an automated annotation pipeline. This system, given a screenshot image, can detect and classify UI and infographic elements such as images, pictograms, text, and buttons. The result is a screen schema annotation, which lists the UI elements along with bounding boxes indicating their location within the screen.
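The sketch below illustrates what such a screen schema annotation might look like as a data structure: a list of typed elements, each with its OCR text and a bounding box. The class and field names, element types, and the example login screen are illustrative assumptions and do not reproduce the format of Google's released Screen Annotation dataset.

```python
from dataclasses import dataclass, field

@dataclass
class UIElement:
    """One detected element in a screenshot. The bounding box is
    (left, top, right, bottom) in fractions of the screen's size."""
    element_type: str                       # e.g. "TEXT", "BUTTON", "PICTOGRAM", "IMAGE"
    text: str                               # OCR output; empty for non-text elements
    bbox: tuple[float, float, float, float]
    children: list["UIElement"] = field(default_factory=list)

@dataclass
class ScreenSchema:
    """Schema annotation for a single screenshot."""
    screenshot_id: str
    elements: list[UIElement]

# A hand-written example of what the annotation pipeline might emit
# for a simple login screen.
login_screen = ScreenSchema(
    screenshot_id="example_0001",
    elements=[
        UIElement("TEXT", "Sign in to your account", (0.10, 0.08, 0.90, 0.14)),
        UIElement("INPUT", "Email", (0.10, 0.20, 0.90, 0.28)),
        UIElement("INPUT", "Password", (0.10, 0.32, 0.90, 0.40)),
        UIElement("BUTTON", "Sign in", (0.30, 0.46, 0.70, 0.54)),
    ],
)
```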
The screen schema data is then used to generate synthetic training data. The team fed each schema to an LLM, along with a prompt explaining that the schema describes a screenshot and asking the model to generate questions a human user might ask about that screen. The researchers also had the LLM generate summaries of the screenshots. Overall, the final dataset contained around 400M samples. A rough sketch of this synthetic-data step is shown below.
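The following is a self-contained sketch of how a schema could be serialized into such a prompt. The prompt wording, dictionary keys, and element names are illustrative assumptions, not the prompts or schema format actually used by the ScreenAI team.

```python
def build_qa_prompt(elements: list[dict], num_questions: int = 5) -> str:
    """Turn a screen schema into an LLM prompt that asks for plausible
    user questions (and answers) about the screenshot it describes."""
    lines = [
        f'{el["type"]} "{el["text"]}" at '
        f'({el["bbox"][0]:.2f}, {el["bbox"][1]:.2f}, '
        f'{el["bbox"][2]:.2f}, {el["bbox"][3]:.2f})'
        for el in elements
    ]
    return (
        "The following schema lists the elements of a screenshot, with "
        "bounding boxes as (left, top, right, bottom) fractions of the screen:\n"
        + "\n".join(lines)
        + f"\n\nWrite {num_questions} questions a real user might ask about this "
        "screen, each followed by an answer grounded only in the schema."
    )

# A tiny schema for a login screen, flattened into plain dicts.
schema = [
    {"type": "TEXT", "text": "Sign in to your account", "bbox": (0.10, 0.08, 0.90, 0.14)},
    {"type": "BUTTON", "text": "Sign in", "bbox": (0.30, 0.46, 0.70, 0.54)},
]
print(build_qa_prompt(schema))  # this string would then be sent to an LLM
```

Because the prompt is grounded in the detected elements rather than the raw pixels, a text-only LLM can produce question-answer pairs and summaries at the scale needed for pretraining.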
To evaluate the model, the researchers fine-tuned it on several publicly available datasets for navigation, summarization, and QA. They compared its performance both to the state-of-the-art (SOTA) and to other models with 5B parameters or fewer. ScreenAI set new SOTA results on two benchmarks, outperformed other sub-5B-parameter models on three more, and was "competitive" on two additional benchmarks.
Several users of X posted their thoughts about ScreenAI. One wondered whether Google might use the model for ranking search results. Another wrote:
The competition is heating up. GPT-4 Vision already faces strong competition from Qwen-VL-Max, and now it seems Google is entering the arena with ScreenAI. Google's entry better impress!
Although Google has not released the model code or weights, they have open-sourced their evaluation datasets ScreenQA and Screen Annotation on GitHub.