Auto labeling with Segment Anything (SAM / SAM2 / SAM2.1 / SAM3 / MobileSAM)
AnyLabeling supports the full Segment Anything family:
- Segment Anything Model (SAM) is a foundation segmentation model from Meta. Trained on 11M images and 1B segmentation masks, it can segment objects in an image without being trained on those specific objects. This makes Segment Anything a good candidate for auto labeling, even for new object classes. Available as ViT-B, ViT-L, and ViT-H (and quantized variants).
- Segment Anything Model 2 (SAM 2) is Meta's follow-up, building on the success of its predecessor. This foundation model tackles promptable visual segmentation in both images and videos. Available in Hiera-Tiny, Hiera-Small, Hiera-Base+, and Hiera-Large sizes.
- Segment Anything Model 2.1 (SAM 2.1) is an improved version of SAM 2 with better accuracy and robustness. Available in the same four sizes as SAM 2.
- Segment Anything Model 3 (SAM 3) extends the SAM family with open-vocabulary, text-driven segmentation. In addition to point and rectangle prompts, SAM 3 accepts text prompts (e.g., "truck", "person") to detect and segment objects without any prior training on those classes. Available as ViT-H.
- MobileSAM is the lightweight variant introduced in Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. It is optimized for speed on CPU.
[Video: auto labeling demo in AnyLabeling v0.2.6]
Supported Models
| Model | Prompt Types | Notes |
|---|---|---|
| SAM ViT-B / ViT-L / ViT-H | Point, Rectangle | Original SAM; ViT-H is most accurate |
| SAM ViT-B Quant / ViT-L Quant / ViT-H Quant | Point, Rectangle | Quantized (faster, smaller) variants |
| MobileSAM | Point, Rectangle | Lightweight; fast on CPU |
| SAM 2 Hiera-Tiny / Small / Base+ / Large | Point, Rectangle | Meta SAM 2 |
| SAM 2.1 Hiera-Tiny / Small / Base+ / Large | Point, Rectangle | Improved SAM 2 |
| SAM 3 ViT-H | Text, Point, Rectangle | Open-vocabulary; text drives detection |
All models are downloaded automatically on first use from Hugging Face.
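The download-on-first-use behavior can be sketched as a simple file cache. The function name, URL, and cache layout below are illustrative, not AnyLabeling's actual implementation:

```python
import os
import urllib.request

def ensure_model(url: str, cache_dir: str) -> str:
    """Download a model file on first use; reuse the cached copy afterwards.

    Hypothetical sketch: the real download location and naming scheme
    are defined by AnyLabeling, not by this function.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # First use: fetch the model from the hub; later calls skip this.
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

Subsequent sessions then resolve the same path instantly without touching the network.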
Instructions
- Click the Brain button on the left side to activate auto labeling.
- Select one of the Segment Anything models from the Model dropdown menu. Accuracy and speed differ depending on the variant:
  - MobileSAM: fastest, lowest accuracy
  - SAM ViT-B: fast, good for most use cases
  - SAM ViT-H: slowest, highest accuracy
  - SAM 2 / SAM 2.1: improved segmentation quality
  - SAM 3: open-vocabulary, supports text prompts
  - A Quant suffix indicates a quantized model (faster, smaller file size)
- Use the auto segmentation marking tools to mark the object:
  - +Point: add a point that belongs to the object.
  - -Point: add a point on a region you want to exclude from the object.
  - +Rect: draw a rectangle that contains the object; Segment Anything segments it automatically.
  - Text (SAM 3 only): type a text description (e.g., "truck") to detect and segment matching objects automatically.
  - Clear: clear all auto segmentation markings.
- Finish Object (f): finish the current marking. After finishing the object, you can enter the label name and save it.
Note
- The first time a model is used, AnyLabeling needs to download it from the server, which may take a while depending on your network speed.
- The first AI inference also takes some time; please be patient.
- A background task precomputes the encoder output for Segment Anything on upcoming images, so auto segmentation on the next images is usually faster.
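The embedding-reuse idea in the last note can be sketched as a small cache: the heavy encoder runs once per image, and later prompt edits hit the cache. The class name and `encoder_fn` callable are hypothetical stand-ins, not AnyLabeling's real API:

```python
class EmbeddingCache:
    """Cache one embedding per image so the heavy encoder runs only once.

    `encoder_fn` stands in for the real SAM encoder; any callable mapping
    an image path to an embedding works for this sketch.
    """

    def __init__(self, encoder_fn, max_items=8):
        self.encoder_fn = encoder_fn
        self.max_items = max_items
        self._store = {}  # image path -> embedding

    def get(self, image_path):
        if image_path not in self._store:
            if len(self._store) >= self.max_items:
                # Evict the oldest entry to bound memory use.
                self._store.pop(next(iter(self._store)))
            self._store[image_path] = self.encoder_fn(image_path)
        return self._store[image_path]
```

With this in place, every prompt change on the same image skips the encoder entirely and only reruns the lightweight decoder.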
Integration of SAM into AnyLabeling
Segment Anything Model is divided into two parts: a heavy Encoder and a lightweight Decoder. The Encoder extracts an image embedding from the input image. Given that embedding and an input prompt (points, boxes, masks), the Decoder produces the output mask(s). The Decoder can run in single-mask or multi-mask mode.
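The decoder side of this split can be illustrated by how point and box prompts are packed into arrays. The function below is a sketch, not AnyLabeling's actual code; it follows the label convention used in Meta's SAM ONNX export (1 = foreground click, 0 = background click, 2/3 = box corners, -1 = padding point when no box is given):

```python
import numpy as np

def build_prompt(points, labels, box=None):
    """Pack clicks and an optional box into the (1, N, 2) coords and
    (1, N) labels arrays the SAM ONNX decoder expects.

    Hypothetical helper for illustration; preprocessing such as
    rescaling coordinates to the model's input size is omitted.
    """
    coords = [list(p) for p in points]
    labs = list(labels)
    if box is not None:
        x0, y0, x1, y1 = box
        coords += [[x0, y0], [x1, y1]]  # box corners use labels 2 and 3
        labs += [2, 3]
    else:
        coords.append([0.0, 0.0])  # padding point required when no box
        labs.append(-1)
    return (np.array([coords], dtype=np.float32),
            np.array([labs], dtype=np.float32))
```

Because only these small arrays change between clicks, rerunning the decoder per prompt edit is cheap compared with re-encoding the image.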
Segment Anything in AnyLabeling
In Meta's web demo, the Encoder runs on their server while the Decoder runs in real time in the user's browser, so users can add points and boxes and see the result immediately. In AnyLabeling, we likewise run the Encoder only once per image; after that, whenever the user changes the prompt (points, boxes), only the Decoder is rerun to produce an updated mask. We added a postprocessing step that finds contours in the mask and produces shapes (polygons, rectangles, etc.) for labeling.
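A much simplified stand-in for that postprocessing step is shown below: it reduces a decoder mask to a bounding rectangle, whereas the real code traces polygon contours. The function name and threshold are illustrative only:

```python
import numpy as np

def mask_to_rectangle(mask, threshold=0.0):
    """Turn a decoder mask into a bounding rectangle (x0, y0, x1, y1).

    Simplified sketch of mask postprocessing: instead of tracing
    contours into polygons, take the extent of the positive region.
    """
    ys, xs = np.nonzero(mask > threshold)
    if ys.size == 0:
        return None  # empty mask: nothing to label
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

For polygon shapes, a contour-tracing routine (e.g., OpenCV's findContours) would replace the min/max step, but the mask-in, shape-out flow is the same.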
To reduce dependencies, instead of using Meta's segment_anything package, we rewrote the code to use only ONNX Runtime and NumPy. The source code for running the ONNX models can be found here, and the ONNX models were uploaded to AnyLabeling Assets.
Originally from: https://www.vietanh.dev/blog/2023-04-22-create-a-segment-anything-labeling-tool-any-labeling