Best practices for ML teams: working with annotation providers and platforms

7 min read - May 6th, 2024

Computer vision technology has revolutionized multiple industries, from autonomous vehicles to advanced robotics. The foundation of these innovations lies in high-quality data annotation to build ground truth data, a critical yet complex process.

This article, written for computer vision engineers and data scientists, discusses best practices for setting up and managing the annotation process.

This article is based on our in-depth webinar with Humans in the Loop. You can listen to the entire webinar here.

The path to automation in data annotation

With generative AI booming across disciplines, think of Microsoft Copilot for coding, expectations of what automation can do are growing. Knowing how to use such tools to your benefit is becoming a prerequisite.

This also applies to data annotation. At ICRA, CVPR, or IROS exhibitions, many papers focus on beating the labeling benchmarks with new AI tools on the most popular open datasets. Big companies like Meta are releasing open-source software such as SAM to speed up image data segmentation.

But today, no off-the-shelf annotation setup exists that can perfectly automate any annotation task. It’s a classic chicken-and-egg problem: models can’t generate accurate pre-labels without having been trained on similarly annotated data. This is especially true for more advanced tasks such as point cloud segmentation.

These are the steps that need to be taken to speed up labeling.

Bar chart showing the evolution of machine learning in labeling tools. The chart starts with ML-powered labeling tools, followed by zero-shot model-assisted labeling, domain-specific model assisted labeling, and peaks at customer-specific model assisted labeling.

Step 1: Manual annotations with ML-powered tooling

The first step is manual annotations, supported by ML-powered annotation tooling. What does that mean?

If you’re an annotator, you use specific tooling to speed up annotation and increase its accuracy. For example, instead of tracing the borders of objects by hand, you get aids that delineate them quickly and accurately.

What distinguishes it from the model-assisted labeling approaches discussed next is that a cue from the labeler is still needed somewhere on the screen: either a hover or one or more clicks.

Some SAM implementations, such as our hover-and-click implementation, fit in this bucket. Other examples are superpixel or autosegment features.

Similar to what is described in more detail in the next part, ML-powered labeling tools can also be zero-shot (the most common), domain-specific (such as our proprietary automotive-specific superpixels), or customer-specific.
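To make the idea of ML-powered tooling concrete, here is a hedged, self-contained sketch of a click-based “autosegment” aid (not any specific product’s implementation): a single click grows a region of similar pixels, so the annotator does not have to trace borders by hand.

```python
from collections import deque
import numpy as np

def autosegment(image, seed, tol=0.1):
    """Flood-fill from `seed`, including 4-connected pixels whose
    intensity is within `tol` of the seed pixel's intensity."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = image[seed]
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w and not mask[nr, nc]
                    and abs(image[nr, nc] - seed_val) <= tol):
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

# Toy image: a bright 12x12 square object on a dark background.
img = np.zeros((32, 32))
img[8:20, 8:20] = 1.0

# One click inside the object selects the whole region.
mask = autosegment(img, (10, 10))
```

Production aids (superpixels, SAM-style models) are far more sophisticated, but the interaction pattern is the same: the labeler provides a cue, and the tooling does the pixel-level work.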

Step 2: Model-assisted labeling: zero-shot, domain-specific or customer-specific

The next step is what we call model-assisted labeling. This means you use a model that predicts what you are trying to annotate, without the cue of a labeler. The taskers then do not annotate from scratch but solely focus on correcting predictions.

Model-assisted labeling is also often called the humans-in-the-loop workflow, where models provide pre-labels. You can leverage the models you’re building for deployment purposes or train models dedicated to pre-labeling efforts. While models for deployment require low latency, a model fit for pre-labeling can be larger, with many more parameters.


There are three types of model-assisted labeling setups:

  • Zero-shot model-assisted labeling

    In this approach, you leverage pre-existing models that are not trained on similar tasks but are generalized, trained on large-scale corpora of data. An example is a zero-shot implementation of SAM. Although zero-shot models can seemingly provide decent results, they generally lack the output quality required to make zero-shot model-assisted labeling workflows successful, as taskers need more time correcting the predictions than labeling from scratch.

  • The domain-specific model-assisted labeling setup

    Instead of leveraging general models, the models infer knowledge from the specific domain or use case. For example, you could leverage off-the-shelf automotive object detection models to pre-label your data, even though your data is taken from different hardware or you have slightly different labeling guidelines.

  • Customer-specific model-assisted labeling setup

    Here, you leverage your own models in the loop trained on your own data set, with your own ontology and labeling guidelines. You run the models on new data and instruct the annotation workforce to correct the pre-labels. The ROI of this approach is the highest as the humans-in-the-loop efforts should be most efficient, although this is typically only obtainable after a broad set of data has been labeled and a well-performing model has been trained.
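As an illustration of the customer-specific setup, a minimal pre-labeling step might look like the sketch below. `model.predict`, the sample dicts, and the label format are assumptions for the example, not a specific platform API.

```python
def make_prelabels(model, samples, min_confidence=0.8):
    """Run your own trained model on new data and keep only confident
    predictions as pre-labels for the workforce to correct."""
    prelabels = {}
    for sample in samples:
        predictions = model.predict(sample["image"])
        prelabels[sample["id"]] = [
            p for p in predictions if p["confidence"] >= min_confidence
        ]
    return prelabels

# Stand-in model for illustration; in practice this is the model
# trained on your own dataset, ontology, and labeling guidelines.
class DummyModel:
    def predict(self, image):
        return [{"label": "car", "confidence": 0.95},
                {"label": "cone", "confidence": 0.40}]

prelabels = make_prelabels(DummyModel(), [{"id": "frame_001", "image": None}])
```

The confidence threshold is the key knob: low-confidence predictions are dropped so taskers spend their time correcting plausible pre-labels rather than deleting noise.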

Choose the right tools and teams

For relatively straightforward tasks, such as classifying cats versus dogs, one could consider a one-stop shop: a single vendor to whom customers can easily send their data and instructions and get it back annotated after a while. There is rarely a discussion of what a cat is and what a dog is.

When working with multi-sensor data, the requirements become complex, and the number of edge cases quickly grows. Additionally, the required tooling is much more advanced.

In the past, computer vision engineering teams only talked with the annotation workforces. Typically, the annotation workforce used open-source tooling for easy use cases or referred to an annotation tooling provider when needed for more advanced use cases. As the engineering team only had one point of contact (and one contract) with the workforce team, one could still consider this a one-stop shop.

Simple flowchart illustrating the workflow between three teams: Computer Vision Team, Workforce Team, and Tooling Team, with arrows indicating interaction and collaboration among them.

Today, as complexity and requirements grow, tooling providers like us are required to be in touch with the computer vision engineering teams directly. Onboarding the data, setting up the instructions and interfaces, or adjusting any specific QA pipelines requires resources and close collaboration. Although there could still be a single contract with a single main contracting party, there are two points of contact: one with the tooling vendor and one with the workforce vendor.


Staffing a continuous workforce team versus having bursty workloads paid per annotation

Avoiding “bursty” workloads where a team needs to spin up for only short periods is often recommended. The annotation provider incurs significant overhead, such as having a large group of people on the bench at all times as a buffer. Simultaneously, it’s not ideal for the customer either:

  • Because of this overhead for the workforce, non-continuous workloads incur a higher price.
  • A different group of annotators is assigned for each burst, who need to be trained in the annotation specs and ramped up. This leads to lower annotation speed and quality.
  • The top annotators naturally move and stick to projects where they are staffed continuously. Our customers who staff workforce teams continuously keep the excellent performers and ask that the bad performers (in terms of speed/quality) be replaced. Those bad performers then go “on the bench” and in the pool reserved for bursty workloads.

This is why almost all of our mature customers are staffing a workforce team on a continuous basis (from 2 to over 100 FTEs, paid monthly for 160 hours of work per FTE). Some of them also have bursty high-priority workloads, but they also have a backlog of lower-priority work to ensure the team never runs out of work.

We highly recommend this setup if you want to scale up your annotation efforts:

  • After a first model is trained, customers typically want to leverage their predictions in the loop. This is called model-assisted labeling. A dedicated team allows benchmarking between different workflows without additional negotiations and setup costs.
  • Customers rarely stick with a single labeling task and pursue multiple different goals. For example, self-driving vehicle companies often require both cuboid labeling for perception tasks and 3D polygon labeling for mapping tasks. Hourly models accommodate this setup well.

How to pick the right annotation provider

  • Experience. Pick a workforce with expertise in your domain. That goes for both the workforce and the tooling.
  • Transparency. Establish clear communication when working together, including discussing pricing structures instead of relying on an average calculation.
  • Values. Define what is important to you as a company: ethical AI, social impact, or communication skills.

Improving quality by optimizing communication flows

Annotation guidance

Annotation guidance is needed to kick off the labeling. It should be exhaustive – the more thorough, the better – because it will give a better understanding of the requirements.

This document needs to be provided and discussed upfront. It’s a living document that you refine together repeatedly.


Balancing speed and efficiency versus quality and accuracy

1. Start with quality

It’s important to define “quality” clearly. Once you’ve set the quality standard, try to achieve that quality level and evaluate your throughput.

Including specific metrics is helpful, but avoid relying solely on an overall accuracy percentage. In the context of street scenery, for example, accuracy cannot be captured by a single percentage: it includes annotation tightness, categorization accuracy, and more.
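A hedged sketch of what tracking those dimensions separately could look like; the box format (x_min, y_min, x_max, y_max) and the dict layout are assumptions for the example.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes (x_min, y_min, x_max, y_max):
    a common proxy for annotation tightness."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def quality_report(annotations, ground_truth):
    """Report tightness and categorization accuracy as separate metrics
    instead of one overall percentage."""
    ious = [box_iou(a["box"], g["box"])
            for a, g in zip(annotations, ground_truth)]
    correct = [a["label"] == g["label"]
               for a, g in zip(annotations, ground_truth)]
    return {
        "mean_iou": sum(ious) / len(ious),              # tightness
        "label_accuracy": sum(correct) / len(correct),  # categorization
    }

report = quality_report(
    [{"box": (0, 0, 10, 10), "label": "car"}],
    [{"box": (0, 0, 10, 10), "label": "car"}],
)
```

An annotation set can score well on one axis and poorly on the other, which is exactly why a single overall percentage hides problems.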

Starting with quality sounds like a no-brainer. However, many companies prioritize price and the number of annotations or frames, placing throughput above quality. Requests for Proposals (RFPs) and Requests for Quotations (RFQs) frequently begin with questions about labeling turnaround time for specific types of data, the cost of annotating a thousand frames, and the cost comparison between annotating a small number of frames versus a much larger quantity – all before even having discussed what needs to be labeled.

2. Improve throughput

To improve your throughput while maintaining the quality, focus on the following:

  • The learning curve: annotators doing the same tasks will become more efficient over time. This only applies when you have a continuous workforce, not when dealing with a constantly changing team paid per annotation.
  • Use the right tools for your specific type of annotation. Utilizing features suited to your data can significantly enhance throughput while maintaining quality.
3. Start with a pilot

A pilot is recommended before moving to production: it lets you focus on the exact data at hand and discuss the actual requirements.

The pilot serves two purposes. First, an analysis identifies which tools, features, and workflows are needed to meet the quality standard. Second, a timing baseline is established.

Pilots can either be free of charge or incur a cost depending on the scope and level of sophistication of the labeling requirements.
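Turning pilot timings into a throughput baseline can be as simple as the sketch below; the numbers are purely illustrative.

```python
# Hypothetical per-frame annotation timings measured during a pilot,
# in seconds. Real pilots would also break this down per task type.
pilot_timings_s = [210, 185, 240, 200, 195]

mean_s = sum(pilot_timings_s) / len(pilot_timings_s)
frames_per_hour = 3600 / mean_s
```

That baseline is what later tooling and workflow changes (e.g. enabling pre-labels) get benchmarked against.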

Check-in meetings and continuous feedback loops

Having recurring check-ins with all parties can make all the difference. They don’t have to be weekly meetings; they can be monthly or optional. You want to discuss the throughput and priorities, as well as the edge cases that the annotators noticed.

Let’s take a recent example of autonomous trucks at a fulfillment center, where trucks need to reposition continuously to optimize fulfillment flow. The customer’s original specs document required that all cones be annotated. The frequent check-ins made it apparent that these cones needed to be classified into subcategories, and the labeling specification guide was updated. With a one-stop shop that only verifies quality after the first delivery, a restart might have been required.

Often overlooked is the usage of features. We are continuously building new features, and customer requests drive our roadmap. Those features are released to the users, but not all annotators on your team might instantly see how a new feature could speed up the labeling or increase the accuracy of the annotations. By discussing this with the team, the overall quality and throughput of the annotations go up.

Ensure the privacy and security of data

Of course, a best practice is to anonymize any personally identifiable and sensitive data. The perception team most often does this, but we can also help with that when needed.

Data storage is best kept on your servers or any other place within your control. The natural reflex is to go with on-premise solutions, often with strict limitations. A cloud solution can be equally secure and offer the advantages of flexibility and scalability.

Flowchart depicting the interaction between different teams and systems in a computer vision project. The diagram shows a 'Computer vision team' linked to a 'customer bucket', which connects to both 'Workforce team' with 'annotator front-end' and 'Tooling team' with 'back-end', all coordinated through a REST API.

Keep your data in the customer’s buckets.

When labeling data on our platform, there are two options: uploading data to our servers, or keeping data in private on-premise, personal cloud, or serviced cloud buckets such as AWS S3, GCP, or Azure.

Mature customers typically follow the latter approach, in which case our platform only stores pointers to the data on the customer’s server. When an annotator launches the platform, the URL (stored in our database) is triggered, and the data is loaded directly from the customer’s server to the annotator’s device.

This setup has a couple of different advantages:

  • Data is only shared outside of your servers when it needs to be annotated. This limits security risks and saves on cloud bandwidth costs.
  • The customer stays in control of who can access the URLs. These can be secured in various ways and made invalid at any time.
  • The engineering teams have direct control over what needs to be labeled and do not need to wait for the data to be uploaded to the tooling servers.
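One common way to secure such URLs and make them invalid at any time is an expiring signature. Below is a minimal, self-contained sketch; the secret, paths, and scheme are illustrative, not our actual API. Serviced clouds such as AWS S3 offer pre-signed URLs with the same semantics.

```python
import hashlib
import hmac

# Secret held by the customer; the annotation platform only stores URLs.
SECRET = b"customer-held-secret"

def sign_url(path, expires_at, secret=SECRET):
    """Append an expiry timestamp and an HMAC signature to a data URL."""
    message = f"{path}?expires={expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{path}?expires={expires_at}&sig={signature}"

def is_valid(url, now, secret=SECRET):
    """The customer's server checks expiry and signature on each request."""
    path, _, query = url.partition("?")
    params = dict(kv.split("=") for kv in query.split("&"))
    if now > int(params["expires"]):
        return False  # link expired; the stored pointer is now useless
    expected = sign_url(path, int(params["expires"]), secret)
    return hmac.compare_digest(url, expected)

url = sign_url("/datasets/frame_001.png", expires_at=1_700_000_000)
```

Because validation happens on the customer's server, access can be revoked at any time by rotating the secret, without involving the tooling vendor.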