
Zero-shot object detection with OWL-ViT

September 22nd, 2022 - 2 min -

What if you could detect objects of any type in an image, without having to train a custom ML model? That’s the promise of zero-shot object detection, a computer vision technique that’s gaining ground quickly.

We try to stay on top of the latest and greatest in computer vision, so we built a zero-shot object detection demo based on Google AI’s OWL-ViT paper. You can use this tool to interactively find text queries and thresholds that work well on your images.

You can also leverage zero-shot detection to prelabel your images. This feature is still in beta, so please reach out if you want to try it!

What is zero-shot object detection?

A regular object detection model is trained on a fixed set of categories, for example cats, dogs and birds. If you want to detect a new type of object, like a horse, you have to collect and label lots of images with horses and retrain your model.

A zero-shot object detection model, on the other hand, is a so-called open-vocabulary model: it can detect a huge number of object categories without being retrained. These categories are not predefined: you can provide any free-form text query like “yellow boat” and the model will attempt to detect objects that match that description.

How does it work?

The secret behind zero-shot object detection models like OWL-ViT is that they are trained on massive datasets of image-text pairs, often scraped from the internet. The heavy lifting is done by a CLIP-based image-text model, pretrained on 400 million image-text pairs and then adapted to work as an object detector. Training at this scale is costly: the largest CLIP model reportedly took 18 days to train on 592 V100 GPUs.
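To make the open-vocabulary idea concrete, here is a minimal sketch of how a detected region can be matched to a text query by comparing embeddings. It is a simplification under stated assumptions: the embeddings are made-up toy vectors, `match_queries` is a hypothetical helper, and the real model also applies a learned scaling and a sigmoid to turn similarities into scores.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_queries(
    region_embeddings: Sequence[Sequence[float]],
    query_embeddings: Sequence[Sequence[float]],
    queries: Sequence[str],
) -> List[Tuple[str, float]]:
    """For each region, return the best-matching query and its similarity.

    Hypothetical helper: in the real model, both kinds of embeddings come
    from the image and text encoders, not from hand-written vectors.
    """
    matches = []
    for region in region_embeddings:
        sims = [cosine(region, query) for query in query_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((queries[best], sims[best]))
    return matches
```

Because the categories only enter through the text embeddings, swapping in a new query changes what the detector looks for without touching the model weights.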

Clearly, training such large models is not within everyone’s reach. Luckily, that’s also not necessary: the code and model weights of OWL-ViT are open source, and the entire point of the model is that it works out-of-the-box on a huge number of object categories. No need to retrain or fine-tune it on a custom dataset.

For our demo, we used the Hugging Face implementation of the OWL-ViT model and deployed it to a cloud GPU. Inference takes about 300 ms, which makes interactive exploration possible: users can tweak the text queries and thresholds in a web application to find the ones that work best for their images.
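As a rough illustration of what querying the model with text and a threshold looks like, here is a minimal sketch using the Hugging Face transformers classes for OWL-ViT. It is not the code behind our demo: the `google/owlvit-base-patch32` checkpoint, the default threshold, and the helper names `detect` and `filter_detections` are assumptions chosen for the example, and newer transformers releases may prefer a different post-processing method.

```python
from typing import Dict, List, Sequence


def filter_detections(
    scores: Sequence[float],
    labels: Sequence[int],
    boxes: Sequence[Sequence[float]],
    queries: Sequence[str],
    threshold: float,
) -> List[Dict]:
    """Keep detections scoring at least `threshold`, best first.

    Each label is an index into `queries`, so it is mapped back to its text.
    """
    kept = []
    for score, label, box in zip(scores, labels, boxes):
        if score >= threshold:
            kept.append({
                "query": queries[label],
                "score": float(score),
                "box": [float(v) for v in box],  # x_min, y_min, x_max, y_max
            })
    return sorted(kept, key=lambda det: det["score"], reverse=True)


def detect(image, queries: Sequence[str], threshold: float = 0.1,
           checkpoint: str = "google/owlvit-base-patch32") -> List[Dict]:
    """Run OWL-ViT on one PIL image (sketch; reloads the model on every call)."""
    # Heavy dependencies are imported lazily so the pure-Python helper above
    # can be used without them.
    import torch
    from transformers import OwlViTForObjectDetection, OwlViTProcessor

    processor = OwlViTProcessor.from_pretrained(checkpoint)
    model = OwlViTForObjectDetection.from_pretrained(checkpoint)

    # One list of text queries per image in the batch.
    inputs = processor(text=[list(queries)], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); post-processing expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    result = processor.post_process_object_detection(
        outputs=outputs, threshold=0.0, target_sizes=target_sizes
    )[0]
    return filter_detections(
        result["scores"].tolist(), result["labels"].tolist(),
        result["boxes"].tolist(), queries, threshold,
    )
```

A call might look like `detect(Image.open("harbor.jpg"), ["yellow boat", "person"], threshold=0.1)`, where `harbor.jpg` is a placeholder file name. Keeping the threshold in a separate pure-Python step means you can re-filter cached predictions while tuning it, without running the model again.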

In June 2023, an updated version of the OWL-ViT model, called OWLv2, was released by researchers at Google DeepMind. It has since been included in the Hugging Face transformers library, and the weights are available on Hugging Face as well.

Downsides of zero-shot object detection models

The accuracy of zero-shot object detection models does not yet match that of object detection models trained on a fixed set of categories. Additionally, when you fine-tune a zero-shot model on a specific set of images, the model no longer generalizes as well.

Furthermore, regular object detection models can also be smaller and thus faster and cheaper to run than zero-shot models. This can be especially important for low-latency applications. Some regular object detection models can even perform object detection in real time on a smartphone. In contrast, OWL-ViT and OWLv2 require a powerful GPU for fast inference.

However, even if you want to train a regular object detection model, you can still use a zero-shot model to prelabel your dataset: instead of labeling images from scratch, you simply verify and correct the predictions of the zero-shot model. Then, you use these labels to train a regular object detection model of your choice. We make it very easy to set up such model-assisted labeling workflows, either with zero-shot object detection or with your own models. Sign up for a free trial or contact us if you want to try this yourself.
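To give an idea of the prelabeling step, here is a minimal sketch that turns zero-shot predictions into COCO-style annotations awaiting human review. It does not use our API. The input format (dicts with `query`, `score` and `box` keys, boxes as x_min, y_min, x_max, y_max), the `min_score` cutoff, and the extra `score` and `verified` fields are all assumptions made for this example.

```python
from typing import Dict, List


def to_coco_prelabels(
    detections: List[Dict],
    image_id: int,
    category_ids: Dict[str, int],
    min_score: float = 0.3,
) -> List[Dict]:
    """Convert zero-shot detections into COCO-style prelabels for one image.

    Hypothetical helper: low-confidence detections are dropped so that the
    reviewer starts from a cleaner set of boxes.
    """
    annotations = []
    for det in detections:
        if det["score"] < min_score:
            continue
        x_min, y_min, x_max, y_max = det["box"]
        width, height = x_max - x_min, y_max - y_min
        annotations.append({
            "id": len(annotations) + 1,
            "image_id": image_id,
            "category_id": category_ids[det["query"]],
            "bbox": [x_min, y_min, width, height],  # COCO uses x, y, w, h
            "area": width * height,
            "iscrowd": 0,
            "score": det["score"],   # non-standard: model confidence
            "verified": False,       # non-standard: set once a human has checked it
        })
    return annotations
```

Once a reviewer has confirmed or corrected each box, the verified annotations can serve as training data for a regular detector.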
