Zero-shot object detection with OWL-ViT

By Bert De Brabandere on September 22nd, 2022

What if you could detect objects of any type in an image, without having to train a custom ML model? That’s the promise of zero-shot object detection, a computer vision technique that’s gaining ground quickly.

We try to stay on top of the latest and greatest in computer vision, so we went ahead and built a zero-shot object detection demo based on Google AI’s OWL-ViT paper. You can use this tool to interactively find text queries and thresholds that work well on your images.

You can also leverage zero-shot detection to prelabel images on our platform. This feature is still in beta, so please reach out if you want to try it!

Zero-shot object detection demo

What is zero-shot object detection?

A regular object detection model is trained on a fixed set of categories, for example cats, dogs and birds. If you want to detect a new type of object, like a horse, you have to collect and label lots of images with horses and retrain your model.

A zero-shot object detection model, on the other hand, is a so-called open-vocabulary model: it can detect a huge number of object categories without being retrained. These categories are not predefined: you can provide any free-form text query like “yellow boat” and the model will attempt to detect objects that match that description.

How does it work?

The secret behind zero-shot object detection models like OWL-ViT is that they are trained on massive datasets of image-text pairs, often scraped from the internet. The heavy lifting is done by a CLIP-based image classification network trained on 400 million image-text pairs and then adapted to work as an object detector. The largest CLIP model took 18 days to train on 592 V100 GPUs.
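To make the idea concrete, here is a minimal sketch of how an open-vocabulary detector can score image regions against text queries: each candidate region and each query is mapped to an embedding in a shared space, and a region matches a query when the two embeddings point in a similar direction. The three-dimensional vectors, the `logit_scale` value and the function names below are made up for illustration. The real model learns high-dimensional embeddings, and its scoring head differs in detail.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def match_scores(region_embeddings, query_embeddings, logit_scale=10.0):
    """Return a regions x queries matrix of scores in (0, 1).

    Each score is sigmoid(logit_scale * cosine_similarity). Queries are scored
    independently, so one region can match several queries or none at all.
    """
    regions = [normalize(r) for r in region_embeddings]
    queries = [normalize(q) for q in query_embeddings]
    scores = []
    for r in regions:
        row = []
        for q in queries:
            cosine = sum(a * b for a, b in zip(r, q))
            row.append(1.0 / (1.0 + math.exp(-logit_scale * cosine)))
        scores.append(row)
    return scores


# Toy embeddings (invented numbers): two image regions, two text queries.
region_embeddings = [
    [0.9, 0.1, 0.0],  # stand-in for a region containing a boat
    [0.0, 0.2, 0.9],  # stand-in for a region containing a bird
]
query_embeddings = [
    [1.0, 0.0, 0.1],  # stand-in for the text embedding of "yellow boat"
    [0.1, 0.0, 1.0],  # stand-in for the text embedding of "bird"
]

scores = match_scores(region_embeddings, query_embeddings)
# Index of the best-matching query for every region.
best_query_per_region = [row.index(max(row)) for row in scores]
```

Because the queries are only ever compared in embedding space, adding a new category amounts to embedding a new piece of text, which is why no retraining is needed.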

Clearly, training such large models is not within everyone’s reach. Luckily, it is also not necessary: the code and model weights of OWL-ViT are open source, and the entire point of the model is that it works out of the box on a huge number of object categories. There is no need to retrain or fine-tune it on a custom dataset.

How can I use it?

Check out our zero-shot detection demo to try it on some example images or on your own images.

We used the Hugging Face implementation of the OWL-ViT model and deployed it to a cloud GPU. Inference takes about 300 ms, which makes interactive exploration possible. Make sure to tweak the text queries and thresholds to find the ones that work best for your images!
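If you would rather run the model yourself, the sketch below shows roughly how the Hugging Face `transformers` implementation can be called. It is not the code behind our demo. It assumes that `torch`, `transformers` and `Pillow` are installed, that the public `google/owlvit-base-patch32` checkpoint is used, and that your `transformers` version provides `post_process_object_detection`. The `detect` and `keep_above_thresholds` functions, the per-query thresholds and the sample detections are hypothetical names and numbers chosen for illustration.

```python
def keep_above_thresholds(detections, thresholds, default=0.1):
    """Keep detections whose score meets the threshold set for their query.

    `detections` is a list of dicts with "query", "score" and "box" keys;
    `thresholds` maps a query string to its minimum score.
    """
    return [
        d for d in detections
        if d["score"] >= thresholds.get(d["query"], default)
    ]


def detect(image_path, queries, min_score=0.05):
    """Run OWL-ViT on one image and return plain-Python detections."""
    # Heavy imports live here so the helper above works without them.
    import torch
    from PIL import Image
    from transformers import OwlViTForObjectDetection, OwlViTProcessor

    checkpoint = "google/owlvit-base-patch32"  # assumed checkpoint
    processor = OwlViTProcessor.from_pretrained(checkpoint)
    model = OwlViTForObjectDetection.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    # One list of text queries per image in the batch.
    inputs = processor(text=[queries], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Rescale boxes to pixel coordinates (x_min, y_min, x_max, y_max).
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    result = processor.post_process_object_detection(
        outputs=outputs, threshold=min_score, target_sizes=target_sizes
    )[0]

    return [
        {"query": queries[label], "score": score, "box": box}
        for score, label, box in zip(
            result["scores"].tolist(),
            result["labels"].tolist(),
            result["boxes"].tolist(),
        )
    ]


# Made-up detections, so the filtering step can be tried without the model.
sample = [
    {"query": "yellow boat", "score": 0.42, "box": [34.0, 80.0, 210.0, 190.0]},
    {"query": "yellow boat", "score": 0.12, "box": [300.0, 95.0, 360.0, 140.0]},
    {"query": "seagull", "score": 0.18, "box": [120.0, 10.0, 150.0, 35.0]},
]
kept = keep_above_thresholds(sample, {"yellow boat": 0.3, "seagull": 0.15})
```

With a real image, a call such as `detect("harbor.jpg", ["yellow boat", "seagull"])` (the file name is a placeholder) would produce the detection list, which can then be filtered with a separate threshold per query.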

You may notice that the accuracy of a zero-shot object detection model does not yet match that of regular object detection models trained on a fixed set of categories. However, a perfect use case for zero-shot object detection is prelabeling your own dataset: instead of labeling images from scratch, you simply verify and correct the predictions of the zero-shot model. Our platform makes it very easy to set up such model-assisted labeling workflows, either with zero-shot object detection or with your own models. Sign up for a free trial or contact us if you want to try this yourself.
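As a rough illustration of such a prelabeling step, the sketch below turns raw detections into draft annotations for a human reviewer to verify and correct. The record layout, the field names and the `to_prelabels` function are hypothetical and do not correspond to any particular labeling tool’s format; the detections are made-up numbers.

```python
def to_prelabels(detections, image_width, image_height, min_score=0.1):
    """Convert detections into draft annotations awaiting human review.

    Each detection is a dict with "query", "score" and "box" keys, where the
    box is (x_min, y_min, x_max, y_max) in pixels. Boxes are clipped to the
    image, low-score and empty boxes are dropped, and the rest are sorted so
    the most confident drafts come first.
    """
    prelabels = []
    for det in detections:
        if det["score"] < min_score:
            continue
        x_min, y_min, x_max, y_max = det["box"]
        x_min, x_max = max(0.0, x_min), min(float(image_width), x_max)
        y_min, y_max = max(0.0, y_min), min(float(image_height), y_max)
        if x_max <= x_min or y_max <= y_min:
            continue  # nothing left of the box inside the image
        prelabels.append({
            "category": det["query"],
            "bbox": [x_min, y_min, x_max, y_max],
            "score": det["score"],
            "status": "needs_review",  # a human still has to confirm it
        })
    return sorted(prelabels, key=lambda p: p["score"], reverse=True)


# Made-up detections for the text query "horse" on a 640x480 image.
detections = [
    {"query": "horse", "score": 0.22, "box": [-5.0, 40.0, 120.0, 200.0]},
    {"query": "horse", "score": 0.61, "box": [300.0, 60.0, 480.0, 260.0]},
    {"query": "horse", "score": 0.04, "box": [10.0, 10.0, 20.0, 20.0]},
]
drafts = to_prelabels(detections, image_width=640, image_height=480)
```

Marking every draft as `needs_review` keeps the human in the loop: the model only proposes boxes, and the reviewer decides which ones to accept, adjust or delete.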
