Model card for ViT-B-16-SigLIP-i18n-256

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.

This model was converted from the OpenCLIP checkpoint timm/ViT-B-16-SigLIP-i18n-256 to a Hugging Face CLIPVisionModel.

from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

model_id = 'ikala/ViT-B-16-SigLIP-i18n-256-hf'
# Assumption: the preprocessing config is published in the same repository as the weights.
image_processor = CLIPImageProcessor.from_pretrained(model_id)
vision_tower = CLIPVisionModel.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=image, return_tensors="pt")

outputs = vision_tower(**inputs)
image_features = outputs.pooler_output  # pooled image embedding, not an image-text similarity score

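One difference remains: Hugging Face's CLIPVisionModel takes the [CLS] token embedding as its pooled output, whereas SigLIP uses a global attention pooler to produce the final latent feature. The pooler_output above may therefore differ slightly from the embedding produced by the original SigLIP model.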
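
To make the pooling difference concrete, here is a minimal toy sketch in PyTorch that contrasts the two strategies on random tensors. It does not load this model. The tensor sizes, the `probe` query, and the use of `nn.MultiheadAttention` are illustrative assumptions, and the real SigLIP attention pooler may include further layers that are omitted here.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for an encoder's token sequence: (batch, tokens, dim).
# Sizes are illustrative only, not the real model's.
batch, num_tokens, dim = 2, 5, 8
tokens = torch.randn(batch, num_tokens, dim)

# [CLS]-style pooling: use the first token's embedding as the pooled feature.
cls_pooled = tokens[:, 0]

# Attention pooling (simplified): a learned query ("probe", a hypothetical name)
# attends over all tokens and returns their weighted combination.
probe = nn.Parameter(torch.randn(1, 1, dim))
attention = nn.MultiheadAttention(embed_dim=dim, num_heads=2, batch_first=True)
with torch.no_grad():
    attn_pooled, attn_weights = attention(probe.expand(batch, -1, -1), tokens, tokens)
attn_pooled = attn_pooled.squeeze(1)

print(cls_pooled.shape, attn_pooled.shape)
```

Both strategies return one vector per image, but the attention-pooled vector depends on every token through learned weights, so the two generally do not coincide.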
