By Bert De Brabandere on June 29th, 2023
This blog post is an adaptation of Bert’s talk at the ICRA workshop on Scalable Autonomous Driving.
Segments.ai is a data labeling platform focused specifically on data labeling for robotics and autonomous driving. In this post we peek behind the curtain on how we’re labeling multi-sensor data, and the steps we take to automate it as much as possible. We’ll deep-dive into some of the technology we have today and zoom in on some exciting things we’re currently working on.
Let’s first sketch some context. This is the ML data engine everyone’s familiar with:
You collect and curate data, label it, train a model and deploy it. And you do multiple iterations of this loop to improve your model.
At Segments.ai, we focus on the data labeling part of this loop. More specifically, our expertise is in multi-sensor data labeling for robotics and autonomous vehicles.
What does that mean?
In the video on the left, you see what a typical mobile robotics customer uploads to our platform: raw, unlabeled, multi-sensor data, typically consisting of a lidar point cloud and multiple camera images from sensors mounted on a robot or autonomous vehicle.
And on the right is what they get back: perfectly annotated ground truth data - bounding boxes and segmentation labels in 3D and 2D - with consistent object ids across time and sensors.
For us as a data labeling platform, the question is: how can we support labeling such multi-sensor data as accurately and efficiently as possible? How can we minimize the amount of human work needed? A big part of the answer is that we leverage AI and machine learning ourselves.
Many of our customers have experience developing machine learning models that must run on a vehicle or a robot. But there are a few key differences when developing machine learning models for data labeling specifically.
The first is that machine learning models for data labeling typically have humans in the loop: they’re designed to work in tandem with a human who can verify and correct their predictions. That means that we need to design interfaces where humans and machines can interact with each other. It also means that we prefer models to make mistakes that are easy to correct. For example, in object detection problems, we prefer higher recall over higher precision, as it’s much quicker to delete a false positive detection than manually adding a missing detection.
Another big difference is that machine learning models for data labeling don’t need to be real-time. They don’t need to run at 10fps on a robot, so we can go all-in on accuracy in the speed-vs-accuracy trade-off. It also means that we can use “slow” models like diffusion models and NeRFs that are not yet suitable for real-time inference.
Finally, machine learning models for data labeling are allowed to cheat. For example, when making predictions on sequence data, they can “look ahead” and consider future frames when making predictions for the current frame. All of this makes ML for data labeling a unique and fun challenge.
Let’s now take you through our process for labeling multi-sensor data. At a high level, it’s a process that consists of three steps: auto-labeling, manual verification and correction, and cross-sensor projection.
The first step is auto-labeling and is pretty straightforward: we run an object detection model on the 3D point cloud data to make automatic predictions. We use standard object detection models for this, which we can also finetune on the customers’ data once we have labeled some initial data. We also tweak these models to have high recall (at the cost of lower precision) to ensure we have few missing detections. Better one detection too many than one too few. Sometimes, our customers also bring their own models for this initial step.
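To make that recall-over-precision choice concrete, here’s a deliberately simplified sketch of the kind of post-filtering one could apply to a detector’s output, using a low confidence threshold. The detector output format, the threshold value, and the function names are illustrative assumptions, not our production setup.

```python
import numpy as np

def filter_detections(boxes, scores, score_threshold=0.2):
    """Keep detections above a deliberately low confidence threshold.

    boxes:  (N, 7) cuboids as [x, y, z, dx, dy, dz, yaw]
    scores: (N,)   detector confidence scores

    A low threshold trades precision for recall: a spurious box is quick
    for an annotator to delete, a missed object is slow to add by hand.
    """
    keep = scores >= score_threshold
    return boxes[keep], scores[keep]

# Example with made-up detector output for a single point cloud frame
boxes = np.random.rand(5, 7)
scores = np.array([0.9, 0.6, 0.35, 0.25, 0.1])
kept_boxes, kept_scores = filter_detections(boxes, scores)
print(f"kept {len(kept_boxes)} of {len(boxes)} detections")
```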
The next step is manual verification and correction of the 3D predictions. This step involves humans, who must be equipped with the right tools to efficiently interact with the data and machine-generated predictions. This is really a UX problem, and it’s our bread and butter. Here we quickly highlight a few features that make the annotator’s life much easier when labeling point cloud data.
The first one is the synced camera feature: when labeling a point cloud, it’s often hard to know what you’re looking at, especially if the point cloud is not very dense.
That’s why we’ve developed a synced camera view which always shows the camera image corresponding to where your mouse pointer is in the 3D space. You can also open a specific camera image and see the overlaid point cloud and annotations. This makes it much easier to localize objects of interest and see if a blob of points is, for example, a car or a van.
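Under the hood, features like this boil down to standard lidar-to-camera projection. The snippet below is a minimal numpy sketch of that projection, assuming a pinhole camera with intrinsics K and an extrinsic transform T_cam_from_lidar from the lidar frame to the camera frame (illustrative names; a real pipeline also needs lens distortion handling and ego-motion compensation).

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """Project lidar points into a camera image (pinhole model).

    points_lidar:     (N, 3) points in the lidar frame
    T_cam_from_lidar: (4, 4) homogeneous transform, lidar frame -> camera frame
    K:                (3, 3) camera intrinsic matrix

    Returns pixel coordinates for the points in front of the camera,
    plus the boolean mask selecting those points.
    """
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    in_front = pts_cam[:, 2] > 0.1            # drop points behind the image plane
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]               # perspective divide
    return uv, in_front
```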
Another very useful feature is the batch mode. Here, you can zoom in on a specific object track and quickly adjust its cuboid throughout the sequence.
When the auto-labeling step doesn’t detect an object and it needs to be labeled from scratch, we have an ML-assisted cuboid propagation mode: you label the cuboid in the first frame, and it automatically propagates to the following frames. This batch mode feature is obviously useful when labeling dynamic, moving objects.
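We won’t detail the propagation model itself here, but to give a flavor of the idea, below is a deliberately naive sketch that propagates a single cuboid to the next frame by following the centroid of the points it contains. It ignores yaw, size changes and occlusions; the ML-assisted propagation in the editor is considerably more robust.

```python
import numpy as np

def points_in_box(points, center, size):
    """Axis-aligned membership test (yaw ignored for brevity)."""
    return np.all(np.abs(points - np.asarray(center)) <= np.asarray(size) / 2.0, axis=1)

def propagate_cuboid(center, size, frame_t, frame_t1, search_margin=1.5):
    """Shift a cuboid from frame t to frame t+1 by the centroid motion
    of the points it contains, searching a slightly enlarged region."""
    inside_t = points_in_box(frame_t, center, size)
    inside_t1 = points_in_box(frame_t1, center, np.asarray(size) * search_margin)
    if not inside_t.any() or not inside_t1.any():
        return np.asarray(center)             # nothing to track, keep the old position
    shift = frame_t1[inside_t1].mean(axis=0) - frame_t[inside_t].mean(axis=0)
    return np.asarray(center) + shift
```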
We have another feature called merged point cloud mode for labeling static objects that don’t move throughout the sequence. If you enable it, all point clouds in the sequence are aggregated into a single view.
The merged point cloud mode gives you a much higher-resolution view of static objects, so you can better see what they are and label them more precisely. This feature is possible because in the backend we split up the point cloud into tiles and stream them dynamically as you zoom in and move the camera. Just like how Google Maps works.
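To illustrate the tiling idea (a sketch, not our actual backend), here’s how you could bucket an aggregated point cloud into square XY tiles so a viewer only needs to load the tiles near the current camera:

```python
import numpy as np
from collections import defaultdict

def split_into_tiles(points, tile_size=25.0):
    """Bucket points into square XY tiles, keyed by their tile index.

    points: (N, 3) xyz coordinates in a common world frame
    Returns a dict mapping (tile_x, tile_y) -> (M, 3) array of points,
    so a viewer can request only the tiles near the camera.
    """
    keys = np.floor(points[:, :2] / tile_size).astype(int)
    tiles = defaultdict(list)
    for key, point in zip(map(tuple, keys), points):
        tiles[key].append(point)
    return {k: np.asarray(v) for k, v in tiles.items()}
```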
These are just a few of the tools we have to make labeling more effortless and we’re working on many more, like the ability to highlight potential labeling mistakes automatically.
Most of our customers not only want their point clouds labeled but also their images. And they want the object ids of the annotations to be consistent across these different sensors. We do this by projecting the 3D annotations onto the 2D images, optionally followed by another manual verification step.
For bounding box annotations, this is pretty straightforward. If we have accurate lidar-camera calibration, we project the vector annotations onto the 2D images. You can do this projection with the click of a button on our platform.
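To sketch what happens behind that button (illustrative only; a production version also handles lens distortion and clips cuboids that are only partially in front of the camera), you can project the eight corners of each cuboid and take the tight 2D box around them:

```python
import numpy as np

def cuboid_corners(center, size, yaw):
    """Eight corners of a cuboid given center [x, y, z], size [dx, dy, dz], yaw (rad)."""
    dx, dy, dz = size
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * dx / 2
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * dy / 2
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * dz / 2
    corners = np.stack([x, y, z], axis=1)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])      # rotation around the z-axis
    return corners @ R.T + np.asarray(center)

def cuboid_to_2d_bbox(center, size, yaw, T_cam_from_lidar, K):
    """Project a 3D cuboid into a camera and return the tight 2D box (x1, y1, x2, y2)."""
    corners = cuboid_corners(center, size, yaw)
    pts_h = np.hstack([corners, np.ones((8, 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    if np.all(pts_cam[:, 2] <= 0):
        return None                                        # cuboid is behind this camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()
```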
On Segments.ai, you can also do point cloud segmentation. But how can we transfer those 3D segmentation labels to the images, turning them into segmentation masks? This is not at all obvious, and we’re currently exploring some interesting solutions here.
What happens if you project the segmented point cloud points onto the camera images? You get something like the image below. You get a very sparse segmentation of your image, only at the pixels aligned with a lidar beam. How can we get a full segmentation mask from this? It looks like an inpainting problem: we know the segmentation labels at certain pixels, but we need to impute the missing labels at the other pixels, conditioned on the image. That sounds like a job for a generative model, like a diffusion model.
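Concretely, the sparse label image that kicks off this inpainting problem can be built with the same lidar-to-camera projection as before. A rough sketch, with illustrative names and `ignore_index` marking the pixels that still need to be filled in:

```python
import numpy as np

def sparse_label_image(points_lidar, point_labels, T_cam_from_lidar, K,
                       image_shape, ignore_index=255):
    """Rasterize per-point segmentation labels into a sparse 2D label image.

    Pixels hit by a projected lidar point receive that point's class id;
    every other pixel gets `ignore_index` and must be inpainted.
    """
    h, w = image_shape
    label_img = np.full((h, w), ignore_index, dtype=np.uint8)

    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1
    uv = (K @ pts_cam[in_front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)

    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    label_img[uv[in_image, 1], uv[in_image, 0]] = point_labels[in_front][in_image]
    return label_img
```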
This is something we’re currently exploring. Diffusion models are a class of generative models that iteratively turn noise into a data sample. They’re most popular for generating images conditioned on text. For example, the Stable Diffusion models are well-known for this.
But in our case, we instead want to generate segmentation labels, and we want to do that conditioned on RGB images. Diffusion models are a perfect fit for our inpainting problem: we can keep the “known” pixels clamped and iteratively apply the diffusion process to the other pixels.
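To make the clamping idea concrete, here’s a schematic, RePaint-style inpainting loop in PyTorch. It assumes a noise-predicting `denoiser` conditioned on the RGB image and a continuous encoding of the label map; the adapted Stable Diffusion model we actually use differs in architecture and training, so treat this purely as a sketch of the sampling logic.

```python
import torch

def inpaint_labels(denoiser, rgb, known_labels, known_mask, betas):
    """Schematic DDPM inpainting: denoise the unknown pixels while clamping
    the pixels whose labels we already know from the projected lidar points.

    denoiser(x_t, t, rgb) is assumed to predict the noise in x_t.
    known_labels: (B, C, H, W) continuous encoding of the sparse label map
    known_mask:   (B, 1, H, W) 1 where a label is known, 0 elsewhere
    betas:        (T,) noise schedule
    """
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(known_labels)                    # start from pure noise
    for t in reversed(range(len(betas))):
        a_t, ab_t = alphas[t], alphas_bar[t]

        # Standard reverse diffusion step on the whole image
        eps = denoiser(x, t, rgb)
        mean = (x - (1 - a_t) / torch.sqrt(1 - ab_t) * eps) / torch.sqrt(a_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x_unknown = mean + torch.sqrt(betas[t]) * noise

        # Clamp the known pixels: re-noise them to the current noise level
        # so the generated pixels stay consistent with the lidar labels
        ab_prev = alphas_bar[t - 1] if t > 0 else alphas_bar.new_tensor(1.0)
        x_known = torch.sqrt(ab_prev) * known_labels + \
                  torch.sqrt(1 - ab_prev) * torch.randn_like(known_labels)
        x = known_mask * x_known + (1 - known_mask) * x_unknown

    return x  # decode back to per-pixel class labels afterwards
```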
We finetuned an adapted Stable Diffusion model for this task, and we’re already getting some preliminary but promising results. One of the main challenges we face is making this work at higher resolutions. On the right, you see some first progress in that direction. This type of diffusion model for image segmentation is extremely versatile, and we have a lot of other uses in mind for it.
Why do we label in 3D first and then project to 2D?
You could also imagine a different workflow where you, for example, label the images first and then leverage those 2D annotations to speed up the 3D labeling. The truth is that labeling in 3D space is often much more efficient, even if you’re only interested in 2D labels. A simple example makes this clear.
Let’s imagine your car is driving past a static object, for example, a traffic sign. Your vehicle has multiple cameras, and as you drive past the traffic sign, it is visible in 3 of them for, say, 100 frames. This means that you’ll have to annotate 300 2D bounding boxes.
In 3D space, though, labeling this static, non-moving traffic sign takes just a single cuboid annotation. A 3D cuboid takes 3x as long to annotate as a 2D bounding box, but that’s still 100x more efficient than labeling in the images: one cuboid at 3x the cost of a box replaces 300 boxes, so the gain is 300 / 3 = 100.
That is why at Segments.ai, we always start from the 3D labeling and only then project to 2D.
One last topic to explore is that of voxel grid labels. A voxel grid (often called an occupancy grid) is a 3D volume of voxels, indicating both the occupancy and semantics of each voxel. We see more and more customers experimenting with voxel grid representations and asking if we can provide ground truth labels for them.
Why would you need this type of label? More and more robotics and AV companies are moving from a late fusion approach to an early fusion approach for their ML models. What does this mean?
In a late fusion approach, you run separate ML models for each of your sensors, and you then somehow fuse their outputs to obtain a consistent 3D scene.
In an early fusion approach, you instead throw all your sensor data into a single ML model and try to make predictions directly in 3D space. It’s a more modern approach and something that Tesla, for example, is doing.
Voxel grids are an ideal scene representation for early fusion because they’re so regular. It’s just a tensor, and you can try to predict it end-to-end.
How can we convert 3D segmentation labels into voxel grid labels? That’s relatively straightforward: you convert the point cloud points into voxels. But to get excellent voxel grids, we need dense point clouds. We are exploring a few strategies to make sparse point clouds denser.
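As a sketch of that conversion (the names and the majority-vote rule are illustrative, not a spec of our label format):

```python
import numpy as np

def points_to_voxel_grid(points, point_labels, grid_min, voxel_size, grid_shape,
                         free_class=0):
    """Convert labeled points into a semantic voxel grid by majority vote.

    points:       (N, 3) xyz coordinates
    point_labels: (N,)   per-point class ids
    grid_min:     (3,)   world coordinates of the grid origin
    voxel_size:   voxel edge length in meters
    grid_shape:   (X, Y, Z) number of voxels per axis

    Voxels that contain no points keep `free_class`.
    """
    grid = np.full(grid_shape, free_class, dtype=np.int64)

    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, point_labels = idx[in_bounds], point_labels[in_bounds]

    flat = np.ravel_multi_index(idx.T, grid_shape)
    for voxel in np.unique(flat):
        values, counts = np.unique(point_labels[flat == voxel], return_counts=True)
        grid.flat[voxel] = values[np.argmax(counts)]
    return grid
```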
The first strategy is straightforward: you can aggregate (accumulate) point cloud points across multiple frames in a sequence. Of course, this requires accurate ego poses and only works for static objects.
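A minimal sketch of that accumulation, assuming each frame comes with an ego pose as a 4x4 transform into a shared world frame:

```python
import numpy as np

def aggregate_point_clouds(frames, ego_poses):
    """Accumulate per-frame point clouds into one dense cloud in world coordinates.

    frames:    list of (N_i, 3) point arrays, each in its own sensor/ego frame
    ego_poses: list of (4, 4) transforms, sensor/ego frame -> world frame

    Only meaningful for the static parts of the scene: moving objects
    smear out across the aggregated cloud.
    """
    world_points = []
    for points, pose in zip(frames, ego_poses):
        pts_h = np.hstack([points, np.ones((len(points), 1))])
        world_points.append((pose @ pts_h.T).T[:, :3])
    return np.vstack(world_points)
```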
For dynamic, moving objects, we can do something different and leverage the camera images to obtain denser depth. This problem is known as depth completion or sparse-to-dense depth estimation. It’s again an inpainting problem: you know the depth at specific pixels corresponding to the lidar measurements, and you need to paint the missing depth values at the other pixel locations.
So here, too, we are experimenting with the same kind of diffusion models we discussed before - this time not to generate segmentation masks but to predict dense depth maps.
Going further, some of our robotics customers don’t have a lidar sensor at all, only cameras. We’re looking at 3D reconstruction using neural radiance fields (NeRFs) for this use case. We’re in the very early stages here, and there are lots of challenges in making this work well for large scenes, but we’re pretty excited about NeRFs for data labeling, and we’re keeping a close eye on the latest research.
As you see, there are a lot of exciting challenges in multi-sensor data labeling. Data labeling is often seen as a boring part of the ML pipeline, but we’re pretty excited about it. If you would like to help us work on these problems: we’re hiring, so definitely reach out to us!