Segmentation is a popular task in computer vision, with the goal of partitioning an input image into multiple regions, where each region represents a separate object.
Several classic approaches from the past involved taking a model backbone (e.g., U-Net) and fine-tuning it on specialized datasets. While fine-tuning works well, the emergence of GPT-2 and GPT-3 prompted the machine learning community to gradually shift focus toward the development of zero-shot learning solutions.
Zero-shot learning refers to the ability of a model to perform a task without having explicitly received any training examples for it.
The zero-shot concept plays an important role by allowing the fine-tuning phase to be skipped, with the hope that the model is intelligent enough to solve any task on the go.
In the context of computer vision, Meta released the widely known general-purpose “Segment Anything Model” (SAM) in 2023, which enabled segmentation tasks to be performed with decent quality in a zero-shot manner.
While the large-scale results of SAM were impressive, several months later, the Chinese Academy of Sciences Image and Video Analysis (CASIA IVA) group released the FastSAM model. As the adjective “fast” suggests, FastSAM addresses the speed limitations of SAM by accelerating the inference process by up to 50 times, while maintaining high segmentation quality.
In this article, we will explore the FastSAM architecture, possible inference options, and examine what makes it “fast” compared to the standard SAM model. In addition, we will look at a code example to help solidify our understanding.
As a prerequisite, it is highly recommended that you are familiar with the basics of computer vision, the YOLO model, and understand the goal of segmentation tasks.
Architecture
The inference process in FastSAM takes place in two steps:
- All-instance segmentation. The goal is to produce segmentation masks for all objects in the image.
- Prompt-guided selection. After obtaining all possible masks, prompt-guided selection returns the image region corresponding to the input prompt.

Let us start with all-instance segmentation.
All-instance segmentation
Before visually examining the architecture, let us refer to the original paper:
“FastSAM architecture is based on YOLOv8-seg, an object detector equipped with the instance segmentation branch, which utilizes the YOLACT method” (Fast Segment Anything paper)
The definition might seem complex for those who are not familiar with YOLOv8-seg and YOLACT. To clarify the meaning behind these two models, I will provide a simple intuition about what they are and how they are used.
YOLACT (You Only Look At CoefficienTs)
YOLACT is a real-time convolutional model for instance segmentation that focuses on high-speed detection. It is inspired by the YOLO model and achieves performance comparable to the Mask R-CNN model.
YOLACT consists of two main modules (branches):
- Prototype branch. YOLACT creates a set of segmentation masks called prototypes.
- Prediction branch. YOLACT performs object detection by predicting bounding boxes and then estimates mask coefficients, which tell the model how to linearly combine the prototypes to create a final mask for each object.

To extract initial features from the image, YOLACT uses ResNet, followed by a Feature Pyramid Network (FPN) to obtain multi-scale features. Each of the P-levels (shown in the image) processes features of different sizes using convolutions (e.g., P3 captures the smallest, most fine-grained features, while P7 captures higher-level image features). This approach helps YOLACT account for objects at various scales.
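To make the idea of multi-scale features more concrete, here is a small arithmetic sketch of how the feature map size shrinks from P3 to P7. The 550x550 input size and the strides of 8 to 128 are assumptions chosen for illustration, and the ceiling division only approximates the real padding arithmetic of the convolutions.

```python
import math

# Assumed values for illustration: a 550x550 input and strides 8..128 for P3..P7.
INPUT_SIZE = 550
STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}


def fpn_level_sizes(input_size, strides):
    """Return the approximate side length of the square feature map per level."""
    return {level: math.ceil(input_size / stride) for level, stride in strides.items()}


sizes = fpn_level_sizes(INPUT_SIZE, STRIDES)
for level, side in sizes.items():
    print(f"{level}: {side}x{side}")
```

Under these assumptions, P3 keeps the largest grid, so it retains the fine detail needed for small objects, while P7 summarizes the image on a very coarse grid.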
YOLOv8-seg
YOLOv8-seg is a model based on YOLACT and incorporates the same principles regarding prototypes. It also has two heads:
- Detection head. Used to predict bounding boxes and classes.
- Segmentation head. Used to generate masks and combine them.
The key difference is that YOLOv8-seg uses a YOLO backbone architecture instead of the ResNet backbone and FPN used in YOLACT. This makes YOLOv8-seg lighter and faster during inference.
Both YOLACT and YOLOv8-seg use the default number of prototypes k = 32, which is a tunable hyperparameter. In most scenarios, this provides a good trade-off between speed and segmentation performance.
In both models, for every detected object, a vector of size k = 32 is predicted, representing the weights for the mask prototypes. These weights are then used to linearly combine the prototypes to produce the final mask for the object.
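The linear combination described above can be sketched in a few lines. This is a minimal illustration rather than the actual YOLACT or YOLOv8-seg code: the function name `assemble_mask`, the sigmoid followed by a 0.5 threshold, and the toy values with k = 2 prototypes are all assumptions made to keep the example small. Plain Python lists are used so the sketch has no dependencies; in practice this would be a single matrix product on tensors.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def assemble_mask(prototypes, coefficients, threshold=0.5):
    """Linearly combine k prototype maps (each H x W) with k coefficients,
    squash the result with a sigmoid, and threshold it into a binary mask."""
    height, width = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            score = sum(c * p[y][x] for c, p in zip(coefficients, prototypes))
            row.append(1 if sigmoid(score) > threshold else 0)
        mask.append(row)
    return mask


# Toy example with k = 2 prototypes on a 2x3 grid (the models above use k = 32).
prototypes = [
    [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],  # responds to the left part of the image
    [[0.0, 1.0, 1.0], [0.0, 1.0, 1.0]],  # responds to the right part of the image
]
coefficients = [2.0, -3.0]  # hypothetical coefficients predicted for one object
mask = assemble_mask(prototypes, coefficients)
print(mask)
```

A positive coefficient adds a prototype to the object mask and a negative one subtracts it, which is how a handful of shared prototypes can describe many different objects.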
FastSAM architecture
FastSAM’s architecture is based on YOLOv8-seg but also incorporates an FPN, similar to YOLACT. It includes both detection and segmentation heads, with k = 32 prototypes. However, since FastSAM performs segmentation of all possible objects in the image, its workflow differs from that of YOLOv8-seg and YOLACT:
- First, FastSAM performs segmentation by producing k = 32 image masks.
- These masks are then combined to produce the final segmentation mask.
- During post-processing, FastSAM extracts regions, computes bounding boxes, and performs instance segmentation for every object.

Note
Although the paper does not mention details about post-processing, it can be observed that the official FastSAM GitHub repository uses the method cv2.findContours() from OpenCV in the prediction stage.
    # Usage of the cv2.findContours() method during the prediction stage.
    # Source: FastSAM repository (FastSAM / fastsam / prompt.py)
    # Assumes `import cv2` and `import numpy as np` at module level.
    def _get_bbox_from_mask(self, mask):
        mask = mask.astype(np.uint8)
        contours, hierarchy = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        # Start from the bounding box of the first contour.
        x1, y1, w, h = cv2.boundingRect(contours[0])
        x2, y2 = x1 + w, y1 + h
        if len(contours) > 1:
            for b in contours:
                x_t, y_t, w_t, h_t = cv2.boundingRect(b)
                # Merge multiple bounding boxes into one.
                x1 = min(x1, x_t)
                y1 = min(y1, y_t)
                x2 = max(x2, x_t + w_t)
                y2 = max(y2, y_t + h_t)
            h = y2 - y1
            w = x2 - x1
        return [x1, y1, x2, y2]
In practice, there are several methods to extract instance masks from the final segmentation mask. Some examples include contour detection (used in FastSAM) and connected component analysis (cv2.connectedComponents()).
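To illustrate the connected component alternative, below is a dependency-free sketch that labels each separate region of a binary mask and computes its bounding box. It is a stand-in written for this article, not the OpenCV implementation: the helper names are hypothetical, and it uses 4-connectivity, whereas cv2.connectedComponents() uses 8-connectivity by default.

```python
from collections import deque


def connected_components(mask):
    """Label 4-connected foreground regions of a binary (0/1) mask.
    Returns a label map (0 = background, 1..n = instances) and the count n."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for start_y in range(height):
        for start_x in range(width):
            if mask[start_y][start_x] == 0 or labels[start_y][start_x] != 0:
                continue
            # A new, unvisited foreground pixel starts a new instance.
            count += 1
            labels[start_y][start_x] = count
            queue = deque([(start_y, start_x)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


def bounding_box(labels, label):
    """Return [x1, y1, x2, y2] for one instance, with x2 and y2 exclusive."""
    coords = [(x, y) for y, row in enumerate(labels)
              for x, value in enumerate(row) if value == label]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return [min(xs), min(ys), max(xs) + 1, max(ys) + 1]


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
labels, count = connected_components(mask)
boxes = [bounding_box(labels, i) for i in range(1, count + 1)]
print(count, boxes)
```

The exclusive x2 and y2 follow the same convention as x + w and y + h in the repository snippet above.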
Training
FastSAM researchers used the same SA-1B dataset as the SAM developers but trained the CNN detector on only 2% of the data. Despite this, the CNN detector achieves performance comparable to the original SAM, while requiring significantly fewer resources for segmentation. As a result, inference in FastSAM is up to 50 times faster!
For reference, SA-1B consists of 11 million diverse images and 1.1 billion high-quality segmentation masks.
What makes FastSAM faster than SAM? SAM uses the Vision Transformer (ViT) architecture, which is known for its heavy computational requirements. In contrast, FastSAM performs segmentation using CNNs, which are much lighter.
Prompt-guided selection
The “segment anything task” involves producing a segmentation mask for a given prompt, which can be represented in different forms.

Point prompt
After obtaining multiple prototypes for an image, a point prompt can be used to indicate that the object of interest is located (or not) in a specific area of the image. As a result, the specified point influences the coefficients for the prototype masks.
Similar to SAM, FastSAM allows selecting multiple points and specifying whether they belong to the foreground or background. If a foreground point corresponding to the object appears in multiple masks, background points can be used to filter out irrelevant masks.
However, if several masks still satisfy the point prompts after filtering, mask merging is applied to obtain the final mask for the object.
Additionally, the authors apply morphological operators to smooth the final mask shape and remove small artifacts and noise.
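The filtering and merging logic for point prompts can be sketched as follows. This is a simplified illustration and not the code from the FastSAM repository: the function names are hypothetical, the rule "keep masks that cover a foreground point and no background point" is an assumption about how the filtering could work, and the morphological smoothing step is left out.

```python
def covers(mask, point):
    """Check whether a binary mask contains the point given as (x, y)."""
    x, y = point
    return mask[y][x] == 1


def select_by_points(masks, points, point_labels):
    """Keep masks that cover at least one foreground point (label 1) and no
    background point (label 0), then merge them with a pixel-wise union."""
    foreground = [p for p, lab in zip(points, point_labels) if lab == 1]
    background = [p for p, lab in zip(points, point_labels) if lab == 0]
    selected = [
        m for m in masks
        if any(covers(m, p) for p in foreground)
        and not any(covers(m, p) for p in background)
    ]
    if not selected:
        return None
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [1 if any(m[y][x] == 1 for m in selected) else 0 for x in range(width)]
        for y in range(height)
    ]


# Three hypothetical candidate masks on a 2x4 grid.
mask_a = [[1, 1, 0, 0], [1, 1, 0, 0]]
mask_b = [[1, 1, 1, 1], [1, 1, 1, 1]]  # covers everything, including the background point
mask_c = [[0, 1, 1, 0], [0, 0, 0, 0]]

points = [(1, 0), (3, 1)]  # (x, y) coordinates
point_labels = [1, 0]      # first point is foreground, second is background
merged = select_by_points([mask_a, mask_b, mask_c], points, point_labels)
print(merged)
```

Here the background point removes the overly large mask_b, and the two remaining masks are merged into one.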
Box prompt
The box prompt involves selecting the mask whose bounding box has the highest Intersection over Union (IoU) with the bounding box specified in the prompt.
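A minimal sketch of this selection rule is shown below. The function names and the example boxes are hypothetical, and boxes are assumed to be in [x1, y1, x2, y2] format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def select_by_box(mask_boxes, prompt_box):
    """Return the index and IoU of the mask box that best matches the prompt."""
    scores = [iou(box, prompt_box) for box in mask_boxes]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]


# Bounding boxes of three hypothetical candidate masks.
mask_boxes = [[0, 0, 10, 10], [20, 20, 40, 40], [22, 18, 42, 38]]
prompt_box = [21, 21, 41, 41]
best_index, best_iou = select_by_box(mask_boxes, prompt_box)
print(best_index, round(best_iou, 3))
```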
Text prompt
Similarly, for the text prompt, the mask that best corresponds to the text description is selected. To achieve this, the CLIP model is used:
- The embeddings for the text prompt and the k = 32 prototype masks are computed.
- The similarities between the text embedding and the prototypes are then calculated. The prototype with the highest similarity is post-processed and returned.
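The matching step can be sketched with cosine similarity, which is a common choice for comparing CLIP embeddings. The tiny 3-dimensional vectors below are hypothetical placeholders for real CLIP outputs, the function names are made up for this example, and no CLIP model is actually called.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def select_by_text(text_embedding, mask_embeddings):
    """Return the index of the candidate whose embedding is most similar
    to the text embedding, together with all similarity scores."""
    scores = [cosine_similarity(text_embedding, e) for e in mask_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Hypothetical embeddings; real CLIP vectors have hundreds of dimensions.
text_embedding = [0.9, 0.1, 0.0]
mask_embeddings = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]
best_index, scores = select_by_text(text_embedding, mask_embeddings)
print(best_index, [round(s, 3) for s in scores])
```

In a real pipeline, the image embeddings would come from passing each candidate region through the CLIP image encoder, and the text embedding from the CLIP text encoder.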

In general, for most segmentation models, prompting is usually applied at the prototype level.
FastSAM repository
Below is the link to the official repository of FastSAM, which includes a clear README.md file and documentation.
If you plan to use a Raspberry Pi and want to run the FastSAM model on it, be sure to check out the GitHub repository: Hailo-Application-Code-Examples. It contains all the necessary code and scripts to launch FastSAM on edge devices.
In this article, we have looked at FastSAM, an improved version of SAM. Combining the best practices from the YOLACT and YOLOv8-seg models, FastSAM maintains high segmentation quality while achieving a significant boost in prediction speed, accelerating inference by several dozen times compared to the original SAM.
The ability to use prompts with FastSAM provides a flexible way to retrieve segmentation masks for objects of interest. Furthermore, it has been shown that decoupling prompt-guided selection from all-instance segmentation reduces complexity.
Below are some examples of FastSAM usage with different prompts, visually demonstrating that it still retains the high segmentation quality of SAM:


Sources
All images are by the author unless noted otherwise.