SigLIP 2 is a family of new multilingual vision-language encoders that build on SigLIP. The original image-text training objective is extended with several prior, independently developed techniques into a unified recipe: captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. Variants that support multiple resolutions and preserve the input's native aspect ratio are also trained. The models are trained on a more diverse data mixture that includes de-biasing techniques. Model checkpoints are released at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
The original SigLIP training recipe is combined with decoder-based pretraining, plus self-distillation and masked prediction. Pretraining an image encoder with a language decoder for captioning and referring expression comprehension was shown to improve OCR capabilities and localization, while self-distillation and masked prediction lead to better features for dense prediction tasks, zero-shot classification, and retrieval. Rather than combining all of these techniques in a single run, a staged approach is adopted.
In addition to training a set of models and adapting each model individually to different resolutions (distorting the aspect ratio), variants that process images while largely preserving their native aspect ratio (like NaViT) and support different sequence lengths (as in FlexiViT) are also trained. This variant is called NaFlex.
Architecture and Training Data
For the architecture, SigLIP is adopted so that existing users can simply swap out the encoder weights. Specifically, the fixed-resolution variant uses the standard ViT architecture with learned positional embeddings. The same architecture is used for the image and text towers, except for the g-sized vision encoder, which is paired with an So400m-sized text encoder. Vision and text representations are pooled using a MAP head (attention pooling). The text length is set to 64, and the multilingual Gemma tokenizer with a 256k vocabulary is used, lower-casing the text before tokenization.
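As a rough illustration of the attention-pooling step, the sketch below implements a single-head, NumPy-only MAP-style pooling: a learned query attends over all token representations and the attention-weighted sum becomes the pooled embedding. The real MAP head is multi-headed and includes an MLP; all names and shapes here are assumptions.

```python
import numpy as np

def map_pool(tokens, query, w_k, w_v):
    """Single-head attention-pooling sketch (the actual MAP head is multi-head
    and adds an MLP). tokens: (n, d) encoder outputs; query: (d,) learned probe;
    w_k, w_v: (d, d) projection matrices (assumed shapes)."""
    keys = tokens @ w_k                                  # (n, d)
    values = tokens @ w_v                                # (n, d)
    scores = keys @ query / np.sqrt(query.shape[-1])     # (n,) attention logits
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                                   # softmax over tokens
    return attn @ values                                 # (d,) pooled embedding
```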
The WebLI dataset, containing 10 billion images and 12 billion alt-texts covering 109 languages, is used. To strike a good balance between quality on English and multilingual vision-language benchmarks, the mixture is composed such that 90% of the training image-text pairs are sourced from English web pages and the remaining 10% from non-English web pages.
Training with Sigmoid loss and decoder
In the first step of pretraining, SigLIP is combined with LocCa, with the two losses weighted equally. Unlike CLIP, which relies on a contrastive loss, SigLIP forms a binary classification problem for every pairing of an image embedding with a text embedding in the mini-batch, and trains the embeddings to classify matching and non-matching pairs via a sigmoid loss.
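A minimal NumPy sketch of the pairwise sigmoid loss described above: every image embedding is paired with every text embedding in the mini-batch, diagonal pairs are labeled positive and all others negative, and a learned temperature and bias scale the logits. Function and variable names are illustrative.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid loss sketch. img_emb, txt_emb: (n, d) L2-normalized
    embeddings; t, b: learned temperature and bias."""
    n = img_emb.shape[0]
    logits = t * img_emb @ txt_emb.T + b     # (n, n) pairwise similarities
    labels = 2.0 * np.eye(n) - 1.0           # +1 for matching pairs, -1 otherwise
    # Binary classification over all n*n pairs, averaged per image.
    return -np.sum(log_sigmoid(labels * logits)) / n
```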
For LocCa, a standard transformer decoder with cross-attention is attached to the un-pooled vision encoder representation. The decoder follows the shapes of the text encoder, except that cross-attention layers are added and the number of layers is halved. Besides image captioning, LocCa also trains for automatic referring expression prediction and grounded captioning. Referring expression prediction amounts to predicting bounding box coordinates for captions describing specific image regions, while grounded captioning involves predicting region-specific captions given bounding box coordinates.
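To make the decoder targets concrete, here is one plausible, Pix2Seq-style serialization in which box coordinates are quantized into location tokens; the exact token format used by LocCa is not specified here, so treat this as an assumption.

```python
def quantize(v, num_bins=1000):
    """Map a normalized coordinate in [0, 1] to a discrete location bin."""
    return int(round(v * (num_bins - 1)))

def referring_expression_target(caption, box):
    """Assumed serialization: the decoder emits quantized box coordinates for a
    caption describing a specific region. Grounded captioning reverses the
    direction (coordinates in the prompt, caption as the target)."""
    ymin, xmin, ymax, xmax = box   # normalized to [0, 1]
    locs = " ".join(f"<loc{quantize(c)}>" for c in (ymin, xmin, ymax, xmax))
    return f"{caption} {locs}"

# e.g. referring_expression_target("a red mug on the desk", (0.2, 0.55, 0.4, 0.8))
```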
For all model sizes, the vision encoder patch size is set to 16 and the image resolution to 256.
Training with self-distillation and masked prediction
The training setup is augmented with local-to-global correspondence learning via self-distillation and masked prediction losses to improve the local semantics of the (un-pooled) feature representation. This representation is typically used for dense prediction tasks such as segmentation, depth estimation, etc. Concretely, two terms are added to the losses described above.
The first term is the local-to-global consistency loss: the vision encoder becomes the student network, which receives a partial (local) view of the training image and is trained to match the teacher's representation, derived from the full image. This auxiliary matching task is performed in a high-dimensional feature space computed with a separate MLP head. As is common in the literature, the teacher parameters are obtained as an exponential moving average of the student parameters over the previous iterations.
The second term is the masked prediction objective. 50% of the embedded image patches in the student network are replaced with mask tokens, and the student is trained to match the teacher's features at the masked locations. The loss is defined identically to the first term (consistency loss), but applied to per-patch features rather than the pooled, image-level representation.
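A compact sketch of the two auxiliary terms, assuming a DINO-style formulation: a cross-entropy between the teacher's sharpened output distribution and the student's distribution, reused per patch for the masked prediction term, with the teacher tracking the student via an exponential moving average. Temperatures, shapes, and names are assumptions.

```python
import numpy as np

def softmax(x, tau):
    z = (x - x.max(axis=-1, keepdims=True)) / tau
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_ce(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher targets and student predictions
    (DINO-style assumption). Works for pooled image-level logits (consistency
    loss) or a (num_masked_patches, k) array of per-patch logits (masked
    prediction loss)."""
    p_t = softmax(teacher_logits, tau_t)            # teacher targets, no gradient
    log_p_s = np.log(softmax(student_logits, tau_s))
    return -np.mean(np.sum(p_t * log_p_s, axis=-1))

def ema_update(teacher, student, momentum=0.996):
    """Teacher parameters follow an exponential moving average of the student."""
    return {k: momentum * teacher[k] + (1.0 - momentum) * student[k]
            for k in teacher}
```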
These losses are added at 80% of training completion, initializing the teacher with the student parameters and the remaining additional parameters (heads, mask token, and corresponding optimizer parameters) randomly. The weights of the first and second loss terms are set to 1 and 0.25, respectively. Further, to balance model quality on global/semantic and dense tasks, the two loss terms are re-weighted by an additional factor of 0.25, 0.5, 1.0, and 0.5 for the B, L, So400m, and g model sizes, respectively.
Adaptation to different resolutions
Fixed-resolution variant
To obtain fixed-resolution checkpoints at multiple resolutions, training is resumed from the checkpoint at 95% of training: the positional embedding is resized to the target sequence length and training continues at the target resolution with all losses.
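A sketch of what resizing the positional embedding could look like, assuming a square grid of learned position embeddings that is bilinearly interpolated to the new grid size (the use of scipy and all names are assumptions):

```python
import numpy as np
from scipy.ndimage import zoom

def resize_posemb(posemb, old_grid, new_grid):
    """posemb: (old_grid**2, dim) learned embeddings for a square patch grid.
    Returns (new_grid**2, dim) via bilinear interpolation over the grid."""
    dim = posemb.shape[-1]
    grid = posemb.reshape(old_grid, old_grid, dim)
    scale = new_grid / old_grid
    resized = zoom(grid, (scale, scale, 1.0), order=1)  # interpolate spatially only
    return resized.reshape(new_grid * new_grid, dim)

# e.g. going from 256px to 384px at patch size 16: resize_posemb(pe, 16, 24)
```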
Variable aspect ratio and resolution (NaFlex)
NaFlex combines ideas from FlexiViT, i.e. supporting multiple, predefined sequence lengths with a single ViT model, and NaViT, namely processing images at their native aspect ratio.
Given a patch size and a target sequence length, NaFlex preprocesses the data by first resizing the input image such that the height and width after resizing are multiples of the patch size, while keeping the aspect ratio distortion as small as possible and producing a sequence length of at most the target. The resulting distortion in width and height is at most (patch_size - 1)/width and (patch_size - 1)/height, respectively, which tends to be small for common resolutions and aspect ratios.
After resizing, the image is split into a sequence of patches, and patch coordinates as well as a padding mask are added (to handle the case where the actual sequence length is smaller than the target length).
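The sketch below illustrates this preprocessing under stated assumptions (names and the exact rounding rule are not from the paper): pick a resize target whose sides are multiples of the patch size within the sequence-length budget, then patchify, recording coordinates and a padding mask.

```python
import numpy as np

def naflex_target_size(h, w, patch=16, max_seq_len=256):
    """Pick a resize target with sides that are multiples of the patch size,
    roughly preserving the aspect ratio, with at most max_seq_len patches."""
    scale = np.sqrt(max_seq_len * patch * patch / (h * w))
    new_h = max(patch, int(h * scale) // patch * patch)
    new_w = max(patch, int(w * scale) // patch * patch)
    return new_h, new_w

def patchify_with_mask(img, patch=16, max_seq_len=256):
    """Split a resized image into patches; return tokens, (row, col) coordinates,
    and a padding mask for sequences shorter than the target length."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    patches = (img.reshape(gh, patch, gw, patch, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(gh * gw, patch * patch * c))
    coords = np.stack(np.meshgrid(np.arange(gh), np.arange(gw), indexing="ij"),
                      axis=-1).reshape(-1, 2)
    pad = max_seq_len - patches.shape[0]
    tokens = np.concatenate([patches, np.zeros((pad, patches.shape[1]))], axis=0)
    coords = np.concatenate([coords, np.zeros((pad, 2), dtype=int)], axis=0)
    mask = np.arange(max_seq_len) < gh * gw      # True for real patches
    return tokens, coords, mask
```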
Distillation via active data curation
To maximize the performance of the smallest fixed-resolution models (ViT-B/16 and ViT-B/32), knowledge is distilled from a teacher model during a short fine-tuning stage. These models are trained for an additional 4B examples using just the sigmoid image-text loss. During this stage, implicit "distillation via data" using the ACID method is performed: at every training step, the teacher model and the current learner model are used to score examples by their learnability.
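As a rough sketch of the scoring step (the exact learnability criterion used by ACID may differ; the difference-of-losses rule below is an illustrative assumption): examples that the current learner still finds hard but the teacher finds easy receive high scores, and the top-scoring candidates form the batch.

```python
import numpy as np

def learnability(learner_losses, teacher_losses):
    """Assumed scoring rule: high when the learner's loss is large but the
    teacher's loss is small, i.e. the example is still learnable."""
    return np.asarray(learner_losses) - np.asarray(teacher_losses)

def select_examples(learner_losses, teacher_losses, batch_size):
    """Keep the top-scoring examples from a larger candidate pool."""
    scores = learnability(learner_losses, teacher_losses)
    return np.argsort(scores)[::-1][:batch_size]
```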
Zero-shot classification and retrieval
Retrieval results are reported as recall@1.
- SigLIP 2 outperforms SigLIP and other open-weight baselines on common zero-shot classification and retrieval benchmarks. The improvement is particularly noticeable for B-sized models, thanks to distillation.
- SigLIP 2 shows strong multilingual retrieval performance on the Crossmodal-3600 benchmark, significantly exceeding SigLIP's recall and approaching mSigLIP's performance.
- The NaFlex variant of SigLIP 2 generally outperforms the standard variant on retrieval benchmarks, especially at shorter sequence lengths and resolutions where aspect ratio distortion matters more.
- On benchmarks with predominantly natural images, the standard B-sized SigLIP 2 performs better than the NaFlex variant, likely due to distillation, while the two variants perform similarly for the So400m architecture.
SigLIP 2 as a vision encoder for VLMs
The vision encoders are combined with the Gemma 2 2B LLM, which is trained on 50M examples of a multimodal dataset covering various vision-language tasks (captioning, OCR, grounded captioning, visual question answering, detection, and instance segmentation). The vision encoder is kept frozen during training. Experiments are performed with input resolutions of 224/256px and 384px; for the latter, stage 1 training is repeated at 384px. The resulting VLM is then fine-tuned on a range of downstream tasks.
- SigLIP 2 outperforms SigLIP across different resolutions and model sizes.
- SigLIP 2 (L-sized) also outperforms the AIMv2 model.
Dense prediction tasks
The frozen SigLIP 2 representation is probed with a linear layer or a DPT decoder on six benchmarks for semantic segmentation, monocular depth estimation, and surface normal estimation. The output embedding of the MAP head is concatenated to each patch feature vector.
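A one-function sketch of that feature construction (shapes and names assumed): the pooled MAP-head embedding is broadcast and concatenated to every patch feature before the linear probe or DPT decoder.

```python
import numpy as np

def probe_inputs(patch_feats, map_embedding):
    """patch_feats: (num_patches, d); map_embedding: (d,).
    Returns (num_patches, 2 * d) inputs for the probe."""
    tiled = np.broadcast_to(map_embedding, patch_feats.shape)
    return np.concatenate([patch_feats, tiled], axis=-1)
```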
- SigLIP 2 outperforms several previous open, CLIP-style vision encoders, including SigLIP, often by a significant margin.
Open-Vocabulary Segmentation
The Cat-Seg framework is used, training on COCO-Stuff-164k with 172 classes and evaluating on ADE20k (847 or 150 classes), Pascal Context (459 or 59 classes), and Pascal VOC (20 or 21 classes).
- SigLIP 2 at L/16 improves upon SigLIP and surpasses the larger OpenCLIP G/14 model.
Localization tasks
A 6-layer transformer decoder is attached to the frozen vision encoder of SigLIP 2 and trained on a mixture of RefCOCO variants.
Referring Expression Comprehension
- SigLIP 2 significantly outperforms SigLIP, CLIP, and a captioning-based pretraining approach (Cap) across resolutions and model sizes.
- This improvement is attributed to the decoder-based pretraining.
- SigLIP 2 is slightly outperformed by LocCa, likely due to SigLIP 2's multilingual pretraining data compared to LocCa's English-only training data.
Open-vocabulary Detection
- SigLIP 2 achieves better performance than SigLIP on the COCO and LVIS benchmarks.
- The improvement is particularly noticeable for LVIS rare categories.
- SigLIP 2 also outperforms the results reported in OWL-ViT, likely because OWL-ViT used CLIP instead of SigLIP.
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features (arXiv:2502.14786)