NIST Launches Program to Discriminate How Far from "Human-Quality" are Gen AI Generated Summaries

The US National Institute of Standards and Technology (NIST) launched a public generative AI evaluation program developed by the international research community. The pilot program focuses on the text-to-text and text-to-image modalities. The general objectives include, but are not limited to, evolving benchmark dataset creation, multi-modal authenticity detection, comparative analysis, and fake or misinformation source detection. The first-round submission deadline is August.

The pilot aims to measure and understand system behaviours for discriminating between synthetic and human-generated content in the text-to-text (T2T) and text-to-image (T2I) modalities, mainly to answer two questions: "How does human content differ from synthetic content?" and "How can users differentiate between the two?"
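To make the discriminator side more concrete, here is a minimal, purely hypothetical sketch of a stylometric baseline detector. NIST does not prescribe any particular detection approach; the features, training examples, and pipeline below are all assumptions for illustration.

```python
# Toy baseline for scoring whether a text is synthetic. Purely illustrative:
# real systems would be trained on large labelled corpora rather than the
# placeholder examples below.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: label 1 = LLM-generated, 0 = human-written.
texts = [
    "The committee reviewed the findings and issued a short statement.",
    "In conclusion, it is important to note that the aforementioned factors interact.",
    "Rain delayed the match, so the final was moved to Sunday afternoon.",
    "Overall, these results underscore the significance of the proposed framework.",
]
labels = [0, 1, 0, 1]

# Character n-gram TF-IDF with logistic regression is a common stylometric baseline.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

# predict_proba yields a probability in [0, 1]; higher means "more likely
# synthetic", matching the confidence-score convention used in the pilot.
score = detector.predict_proba(["A new summary to classify."])[0][1]
print(f"synthetic-likelihood: {score:.3f}")
```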

Teams can act as generators, discriminators, or both. Generator teams will be evaluated on their system’s ability to generate synthetic content that is as close as possible to human-generated content. Discriminator teams will be evaluated on their system’s ability to detect synthetic content created by generative AI (LLMs and deepfake tools).

The Text-to-Text Discriminators (T2T-D) task requires systems to detect whether a targeted output summary was generated using generative AI. Each trial consists of a single summary, for which the T2T-D detection system must render a confidence score (any real number), with higher numbers indicating that the target summary is more likely to have been generated by an LLM-based model. The primary metric for measuring detection performance will be the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC).
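Under this setup, detection performance can be summarised with scikit-learn's roc_auc_score. The trial labels and confidence scores below are made-up placeholders, not NIST data.

```python
# Sketch of the T2T-D evaluation: each trial yields one real-valued confidence
# score, and overall performance is summarised as the area under the ROC curve
# (AUC). The trial data below is invented for illustration.
from sklearn.metrics import roc_auc_score

# Ground truth per trial: 1 = LLM-generated summary, 0 = human-written.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]

# Detector confidence scores: any real number is allowed, with higher values
# indicating the summary is more likely to be LLM-generated.
scores = [2.3, -0.7, 1.1, -0.9, -1.5, 0.2, 3.0, -0.1]

# AUC of 1.0 means perfect separation; 0.5 is chance-level detection.
print(f"AUC: {roc_auc_score(y_true, scores):.3f}")
```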

The Text-to-Text Generators (T2T-G) task is designed to automatically generate high-quality summaries based on a "topic" (a statement of an information need) and a set of targeted documents (about 25). The summary must answer the information need expressed in the topic statement. Participants should assume that the target audience of the summary is a supervisory information analyst who will use it to inform decision-making. Submissions must adhere to the following rules:

  • All processing of documents and generation of summaries must be automatic
  • The summary can be no longer than 250 words (whitespace-delimited tokens)
  • Summaries longer than the size limit will be truncated (a sketch of this check follows the list)
  • No bonus will be given for creating shorter summaries
  • No formatting other than linear plain text is allowed
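Because over-length output is simply truncated, a generator team would likely enforce the 250-token cap before submission. A minimal sketch, assuming only the whitespace-delimited token rule stated above:

```python
# Enforce the pilot's length rule: summaries are capped at 250
# whitespace-delimited tokens, and anything longer is truncated.
WORD_LIMIT = 250

def truncate_summary(summary: str, limit: int = WORD_LIMIT) -> str:
    """Cut a summary to at most `limit` whitespace-delimited tokens."""
    tokens = summary.split()
    return " ".join(tokens[:limit])

# Example: a 300-token draft comes back at exactly 250 tokens.
draft = " ".join(f"word{i}" for i in range(300))
assert len(truncate_summary(draft).split()) == 250
```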

The test data for generator teams will include about 45 topics. The summaries produced by all generator teams for these topics will then serve as the testing data for discriminator teams. Generator output will be evaluated by determining how easy or difficult it is to distinguish AI-generated summaries from human-generated summaries: the goal of generators is to output summaries that are indistinguishable from human-generated ones, which would leave detectors performing at chance level (an AUC close to 0.5).

Participants cannot use the test dataset for training, modelling, or tuning their algorithms. All machine learning or statistical analysis algorithms must complete training, model selection, and tuning before the system is run on the available test data; learning or adaptation during processing is not permissible. Each participant can submit system output for evaluation only once per 24-hour period.

The first pilot is focused on text-to-text and will run throughout 2024. The platform is designed to support multiple modalities and technologies for teams from academia, industry, and other research labs. Those interested in participating can register on the program’s website until May 2025. The test phases are scheduled for June, September, and November. After the evaluation closes in January 2025, the results will be released in February 2025, and a GenAI evaluation workshop will be organised in March 2025.

Other similar contests include the generative AI hackathon organised by Google, the RTX developer challenge proposed by Nvidia, the generative AI competition organised by members from Harvard, and AI for Life Sciences, organised with support from the University of Vienna.
