Meta AI has introduced the Segment Anything project, which aims to democratize image segmentation with a new task, dataset, and model. The project features the Segment Anything Model (SAM) and the Segment Anything 1-Billion mask dataset (SA-1B), the most extensive segmentation dataset to date.
Using an efficient model within a data collection loop, Meta AI researchers constructed SA-1B to contain over 1 billion masks on 11 million licensed and privacy-respecting images.
The model has been purposefully designed and trained to be promptable, enabling zero-shot transfer to new image distributions and tasks. Evaluated across numerous tasks, its zero-shot performance is impressive, often matching or even surpassing previous fully supervised results.
Until now, two categories of methods have been available for solving segmentation problems. The first is interactive segmentation, which can segment any object category but relies on a human to iteratively refine a mask. The second is automatic segmentation, which can segment predefined object categories (such as chairs or cats) but requires a large number of manually annotated examples for training (sometimes thousands or tens of thousands of segmented cats), along with substantial compute resources and technical expertise to train the model. Neither approach offers a universal, fully automatic solution to segmentation.
SAM represents a synthesis of these two approaches. It is a single model that can handle both interactive and automatic segmentation. Its promptable interface makes it versatile: a wide range of segmentation tasks can be addressed simply by engineering the right prompt, such as clicks, boxes, or text. Additionally, SAM is trained on a diverse, high-quality dataset of over 1 billion masks collected as part of the project, which enables it to generalize to new types of objects and images beyond what it saw during training. This ability to generalize significantly reduces the need for practitioners to collect their own segmentation data and fine-tune a model for their specific use case.
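To make the promptable interface concrete, here is a minimal sketch using Meta's released segment-anything Python package; the checkpoint filename, image path, and prompt coordinates are illustrative assumptions, not values taken from the announcement.

```python
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM backbone; the checkpoint path is an assumption --
# substitute whichever ViT-H/L/B checkpoint you have downloaded.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 array of shape (H, W, 3).
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # compute the image embedding once

# Interactive-style prompt: a single foreground click
# (label 1 = foreground, label 0 = background).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixels, illustrative
    point_labels=np.array([1]),
    multimask_output=True,  # return candidate masks for an ambiguous prompt
)

# A box prompt works the same way: (x1, y1, x2, y2) pixel coordinates.
masks_box, _, _ = predictor.predict(
    box=np.array([100, 100, 400, 400]),
    multimask_output=False,
)
```

Setting `multimask_output=True` asks SAM for several candidate masks, which helps when a single click is ambiguous (a click on a shirt could mean the shirt or the whole person), with `scores` indicating the model's confidence in each.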
According to Meta, their goal is to facilitate further advancements in segmentation and broader image and video understanding by sharing their research and dataset. Because the model is promptable, it can act as a component in a larger system, performing segmentation subtasks on demand. This composition approach enables a single model to be used in a variety of extensible ways, potentially accomplishing tasks that were unknown at the time the model was designed.
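As one example of composing SAM into a larger pipeline, the same package ships a SamAutomaticMaskGenerator wrapper that prompts the model with a grid of points and post-filters the results, so a downstream component receives ready-made mask records with no human in the loop. Again, the checkpoint and image paths below are placeholder assumptions.

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Checkpoint path is illustrative; use any downloaded SAM checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# The generator prompts SAM with a grid of points and filters the output,
# yielding every mask it can find with no human interaction.
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts, one per mask

# Each record carries the binary mask plus metadata a downstream component
# can filter on: bounding box, area, predicted IoU, stability score, etc.
for m in masks[:3]:
    print(m["bbox"], m["area"], round(m["predicted_iou"], 3))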
With prompt engineering techniques, SAM is anticipated to serve as a powerful component in a variety of domains, including AR/VR, content creation, scientific research, and more general AI systems.