A group of researchers from the Chinese Academy of Sciences and Monash University has presented a new approach to text input generation for mobile app testing based on a pre-trained large language model (LLM). Dubbed QTypist, the approach was evaluated on 106 Android apps in combination with automated test tools, showing a significant improvement in testing performance.
One key hindrance to automating mobile app testing, say the researchers, is the need to generate text input, which can be challenging even for human testers. The difficulty stems from the variety of input categories that may be required, including geolocation, addresses, and health measures, as well as from the relationships that may exist between inputs on successive pages, which can introduce validation constraints. Furthermore, as one of the paper's authors explains on Twitter, the input provided on one app view determines which other views will be presented.
Large language models (LLMs) such as BERT and GPT-3 have been shown to be able to write essays, answer questions, and generate source code. QTypist attempts to leverage the ability of LLMs to understand prompts derived from a mobile app's UI and to generate meaningful output that can be used as text input for the app.
Given a GUI page with text input and its corresponding view hierarchy file, we first extract the context information for the text input, and design linguistic patterns to generate prompts for inputting into the LLM. To boost the performance of LLM in mobile input scenarios, we develop a prompt-based data construction and tuning method, which automatically builds the prompts and answers for model tuning.
As a first step, QTypist extracts context information for a GUI view using a GUI testing tool, including metadata associated with input widgets, such as user hints; local context from nearby widgets; and global context such as the activity name.
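As a rough illustration, extracting such context from an Android view hierarchy dump could look like the following sketch. The XML fragment and the `extract_input_context` helper are invented for illustration and are not QTypist's actual code; the attribute names loosely follow the UIAutomator dump format.

```python
import xml.etree.ElementTree as ET

# A fragment of a UIAutomator-style view hierarchy dump (hypothetical example).
HIERARCHY = """
<hierarchy>
  <node class="android.widget.TextView" text="Email address" bounds="[0,100][400,150]"/>
  <node class="android.widget.EditText" text="" hint="Enter your email"
        resource-id="com.example.app:id/email_field" bounds="[0,160][400,220]"/>
</hierarchy>
"""

def extract_input_context(xml_dump, activity_name):
    """Collect widget-level, local, and global context for each text input."""
    root = ET.fromstring(xml_dump)
    nodes = list(root.iter("node"))
    contexts = []
    for i, node in enumerate(nodes):
        if node.get("class") == "android.widget.EditText":
            contexts.append({
                # Widget-level context: the input's own hint and resource id.
                "hint": node.get("hint", ""),
                "resource_id": node.get("resource-id", ""),
                # Local context: text of the preceding nearby widget, e.g. a label.
                "nearby_text": nodes[i - 1].get("text", "") if i > 0 else "",
                # Global context: the activity hosting the page.
                "activity": activity_name,
            })
    return contexts

contexts = extract_input_context(HIERARCHY, "LoginActivity")
print(contexts[0]["hint"])         # Enter your email
print(contexts[0]["nearby_text"])  # Email address
```

The three kinds of context collected here correspond to the widget-level, local, and global categories the researchers describe.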
The prompt generation step relies on the three categories of extracted information to build a prompt, based on a number of linguistic patterns that the authors defined by working on a set of 500 reference apps.
This process comes out with 14 linguistic patterns respectively related to input widget, local context and global context [...]. The patterns of the input widget explicitly specify what should be input into the widget, and we employ the keywords like noun (widget[n]), verb (widget[v]) and preposition (widget[prep]) for designing the pattern.
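As a rough sketch of how such patterns could turn extracted context into a natural-language prompt, consider the following; the templates below are invented for illustration and are not the paper's actual 14 patterns.

```python
# Hypothetical prompt templates keyed by the kind of context available.
# These are illustrative stand-ins for QTypist's linguistic patterns.
PATTERNS = {
    "widget_hint": 'On the "{activity}" page, the input field with hint "{hint}" expects: ',
    "nearby_label": 'On the "{activity}" page, the field labeled "{label}" expects: ',
}

def build_prompt(context):
    """Pick a pattern based on which context is present and fill it in."""
    if context.get("hint"):
        return PATTERNS["widget_hint"].format(
            activity=context["activity"], hint=context["hint"])
    return PATTERNS["nearby_label"].format(
        activity=context["activity"], label=context.get("nearby_text", ""))

prompt = build_prompt({"hint": "Enter your email", "activity": "LoginActivity"})
print(prompt)
# On the "LoginActivity" page, the input field with hint "Enter your email" expects:
```

The resulting prompt phrases the testing problem as a natural-language completion task, which is what lets a general-purpose LLM produce plausible field values.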
The prompt dataset is finally fed to GPT-3, whose output is used as the input content. The effectiveness of this approach was evaluated by comparing it against a number of baseline approaches, including DroidBot, Humanoid, and others, as well as through a human assessment of the quality of the generated input. Additionally, the researchers carried out a usefulness evaluation on 106 Android apps available on Google Play by integrating QTypist with automated test tools. In all cases, they say, QTypist was able to improve the performance of existing approaches.
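The final step, sending the prompt to the model and using its completion as the field's input text, might be sketched as follows. The `complete` function below is a stand-in for a real GPT-3 API call, whose details the article does not cover; the post-processing shown is an assumption, not QTypist's documented behavior.

```python
def complete(prompt):
    """Stand-in for a GPT-3 completion call; a real system would query the API here."""
    # Canned answer for illustration only.
    return "alice@example.com\nSome trailing text"

def generate_input_text(prompt):
    """Query the model and post-process the completion into a single input value."""
    completion = complete(prompt)
    # Keep only the first line and strip whitespace, since a GUI text
    # field typically expects one short string.
    return completion.splitlines()[0].strip()

text = generate_input_text(
    'On the "LoginActivity" page, the input field with hint "Enter your email" expects: ')
print(text)  # alice@example.com
```

In a live setup, the returned string would be typed into the widget by the GUI testing tool, allowing the test run to pass the input page and reach views behind it.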
While the initial work carried out by the team of researchers behind QTypist shows promise, more work is required to extend it to cases where the app does not provide enough context information, as well as to apply it to cases going beyond GUI testing.