March 14, 2025


SAM 2: Meta’s Next-Gen Model for Video and Image Segmentation


The era has arrived where your phone or computer can understand the objects in an image, thanks to technologies like YOLO and SAM.

Meta’s Segment Anything Model (SAM) can instantly identify and segment objects in images without needing to be trained on those specific images. It’s like a digital magician, able to pick out each object in an image with just a wave of its virtual wand. Following the successful release of Llama 3.1, Meta announced SAM 2 on July 29, 2024: a unified model for real-time, promptable object segmentation in images and videos that achieves state-of-the-art performance.

SAM 2 offers numerous real-world applications. For instance, its outputs can be integrated with generative video models to create innovative video effects and unlock new creative possibilities. Additionally, SAM 2 can enhance visual data annotation tools, speeding up the development of more advanced computer vision systems.


SAM 2 involves a task, a model, and data (Image Source)

Segment Anything (SAM) introduces an image segmentation task where a segmentation mask is generated from an input prompt, such as a bounding box or point indicating the object of interest. Trained on the SA-1B dataset, SAM supports zero-shot segmentation with flexible prompting, making it suitable for various applications. Recent advancements have improved SAM’s quality and efficiency. HQ-SAM enhances output quality using a High-Quality output token and training on fine-grained masks. Efforts to increase efficiency for broader real-world use include EfficientSAM, MobileSAM, and FastSAM. SAM’s success has led to its application in fields like medical imaging, remote sensing, motion segmentation, and camouflaged object detection.
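To make the prompting workflow concrete, here is a minimal sketch of single-image, point-prompted segmentation using the publicly released segment_anything Python package. The checkpoint filename, image path, and click coordinates are placeholders for illustration.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (path is a placeholder; download the weights separately)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB image; the image embedding is computed once per image
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single positive click (label 1) marking the object of interest
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks for an ambiguous prompt
)
best_mask = masks[scores.argmax()]  # keep the highest-scoring candidate
```

A box prompt can be passed in the same call, and additional clicks can be supplied to refine the mask, which is the flexible prompting the paragraph above describes.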

Many datasets have been developed to support the video object segmentation (VOS) task. Early datasets feature high-quality annotations but are too small for training deep learning models. YouTube-VOS, the first large-scale VOS dataset, covers 94 object categories across 4,000 videos. As algorithms improved and benchmark performance plateaued, researchers increased the VOS task difficulty by focusing on occlusions, long videos, extreme transformations, and both object and scene diversity. Current video segmentation datasets lack the breadth needed to “segment anything in videos,” as their annotations typically cover entire objects within specific classes like people, vehicles, and animals. In contrast, the recently introduced SA-V dataset focuses not only on whole objects but also extensively on object parts, and contains over an order of magnitude more masks. In total, the collected SA-V dataset comprises 50.9K videos with 642.6K masklets.


Example videos from the SA-V dataset with masklets (Image Source)

The model extends SAM to work with both videos and images. SAM 2 can use point, box, and mask prompts on individual frames to define the spatial extent of the object to be segmented throughout the video. When processing images, the model operates similarly to SAM. A lightweight, promptable mask decoder takes a frame’s embedding and any prompts to generate a segmentation mask. Prompts can be added iteratively to refine the masks.
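As a rough illustration of how this looks in practice for video, the sketch below follows the usage pattern of the public sam2 repository: a single click on one frame defines the object, and the predictor then propagates the masklet through the video. The config and checkpoint paths, the frame directory, and the method names (init_state, add_new_points_or_box, propagate_in_video) reflect that repository at the time of writing and should be treated as assumptions to verify against the current code.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Paths are placeholders; see the sam2 repository for released configs and checkpoints
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # The video is provided as a directory of JPEG frames
    state = predictor.init_state(video_path="./video_frames")

    # One positive click (label 1) on frame 0 defines object 1
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the rest of the video to obtain a masklet
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits > 0.0).cpu().numpy()
```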

Unlike SAM, the frame embedding used by the SAM 2 decoder isn’t taken directly from the image encoder. Instead, it’s conditioned on memories of past predictions and prompts from previous frames, including those from “future” frames relative to the current one. The memory encoder creates these memories based on the current prediction and stores them in a memory bank for future use. The memory attention operation uses the per-frame embedding from the image encoder and conditions it on the memory bank to produce an embedding that is passed to the mask decoder.


SAM 2 Architecture. In each frame, the segmentation prediction is based on the current prompt and any previously observed memories. Videos are processed in a streaming manner, with frames being analyzed one at a time by the image encoder, which cross-references memories of the target object from earlier frames. The mask decoder, which can also use input prompts, predicts the segmentation mask for the frame. Finally, a memory encoder transforms the prediction and image encoder embeddings (not shown in the figure) for use in future frames. (Image Source)
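To summarize the data flow in the figure, here is a schematic sketch of one streaming pass over a video. All of the module names are hypothetical stand-ins for SAM 2’s components; this illustrates the control flow described above, not the actual implementation.

```python
def segment_video(frames, prompts, image_encoder, memory_attention,
                  prompt_encoder, mask_decoder, memory_encoder):
    """Schematic only: the callables are hypothetical stand-ins for SAM 2's modules."""
    memory_bank = []  # memories built from past predictions and prompts
    for t, frame in enumerate(frames):
        frame_emb = image_encoder(frame)          # unconditioned per-frame embedding

        # Condition the current frame on memories of the target object seen so far
        conditioned_emb = memory_attention(frame_emb, memory_bank)

        # Optional point/box/mask prompt supplied on this particular frame
        prompt_emb = prompt_encoder(prompts.get(t))

        # Lightweight promptable decoder predicts the frame's segmentation mask
        mask = mask_decoder(conditioned_emb, prompt_emb)

        # Encode the prediction (plus image embeddings) as a memory for future frames
        memory_bank.append(memory_encoder(frame_emb, mask))
        yield mask
```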

Here’s a simplified explanation of the main components shown in the figure:

Image Encoder: processes the video in a streaming manner, one frame at a time, producing an unconditioned embedding for each frame.

Memory Attention: conditions the current frame’s embedding on the memory bank of past predictions and prompts before it reaches the mask decoder.

Prompt Encoder and Mask Decoder: encode point, box, and mask prompts and predict the segmentation mask for the current frame, which can be refined with additional prompts.

Memory Encoder and Memory Bank: transform each prediction (together with the image encoder embeddings) into a memory and store it for use on future frames.

Training: the procedure used to train SAM 2 jointly on image and video data.

