Google AI has just released a new version (V6) of its image dataset Open Images, which now includes an entirely new type of annotation called localized narratives. These multimodal image descriptions combine synchronized voice, text, and mouse-trace annotations, providing richer training data for what is already one of the largest open-source annotated image datasets in the world.
Since its initial release in 2016, the Open Images dataset has grown to hold over 9 million annotated images, and V6 adds millions of new data points to the already impressive corpus. It has been widely adopted by the computer vision community, especially by researchers working on object recognition and autonomous driving. The ultimate goal of Open Images V6 is to aid progress toward genuine scene understanding by unifying annotations for image classification, object detection, visual relationship detection, instance segmentation, and multimodal image descriptions in a single dataset. Alongside the data, yearly Open Images Challenges push the boundaries of common computer vision tasks. Google research scientist Jordi Pont-Tuset notes:
"Along with the data set itself, the associated Open Images challenges have spurred the latest advances in object detection, instance segmentation, and visual relationship detection".
Localized narratives are spoken descriptions of images, grounded by the annotator's mouse as it hovers over each region of the image being described. Because the voice and the mouse pointer are synchronized, every word in the description can be localized. Mouse traces are seen as a more natural way for humans to provide a sequence of grounding locations than the current standard of listing bounding boxes.
Here is an example of an annotator in action.
V6 provides 507,000 localized narratives like the one shown in the example above. The mouse traces total roughly 6,400 kilometres in length, and if the narratives were read aloud continuously, they would take about 1.5 years to listen to.
The release also includes 1,400 more visual relationship annotations (e.g. "man riding a skateboard") and 25 million annotations of humans performing standalone actions.
The process for obtaining the localized narrative data is slightly different from that of the other Open Images annotation types. The annotations are provided in JSON Lines format. Here is a sample annotation:
{
  dataset_id: 'mscoco_val2017',
  image_id: '137576',
  annotator_id: 93,
  caption: 'In this image there are group of cows standing and eating th...',
  timed_caption: [{'utterance': 'In this', 'start_time': 0.0, 'end_time': 0.4}, ...],
  traces: [[{'x': 0.2086, 'y': -0.0533, 't': 0.022}, ...], ...],
  voice_recording: 'coco_val/coco_val_137576_93.ogg'
}
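As a rough illustration of how these fields fit together, the Python sketch below reads annotations from a JSON Lines file and aligns each timed utterance with the mouse-trace points recorded while it was spoken. The file name and the alignment logic are illustrative assumptions, not part of the official tooling.

import json

# Illustrative sketch: the file name below is a placeholder for one of the
# downloaded localized-narratives JSON Lines files.
with open('localized_narratives.jsonl') as f:
    for line in f:
        annotation = json.loads(line)

        # Flatten the trace segments into a single list of points;
        # each point has normalized x/y coordinates and a timestamp t (seconds).
        points = [p for segment in annotation['traces'] for p in segment]

        # For every timed utterance, collect the mouse positions recorded
        # while that utterance was being spoken.
        for utterance in annotation['timed_caption']:
            grounded = [
                (p['x'], p['y']) for p in points
                if utterance['start_time'] <= p['t'] <= utterance['end_time']
            ]
            print(annotation['image_id'], utterance['utterance'], grounded[:3])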
To get started with V6, navigate to the project's GitHub repository to download the data and the accompanying code.
Google envisions that a wide range of research will benefit from the improved dataset. Some areas include assistive technology for the visually impaired, supervised computer vision training, image generation, image retrieval, grounded speech recognition, and voice-driven environment navigation.