Lidar sensors use laser beams to capture the world in 3D. They output 3D point clouds, which are simply collections of points in 3D space. Machine learning models can detect and track objects in these point clouds, or even classify every single point (segmentation). This enables autonomous vehicles to understand their surroundings, and it can also be used to make cities smarter, to create AR/VR applications, and for indoor design and real estate applications.
In this article, we give an overview of 10 public labeled lidar datasets that you can use in your autonomous driving projects. Each dataset contains either 3D bounding box (cuboid) labels or segmentation labels. We’ll also show how you can create your own 3D point cloud dataset, in case the open datasets do not fit your use case or their licenses are too restrictive (only two of the datasets can be used commercially).
3D point cloud driving datasets
KITTI is a dataset of lidar sequences of street scenes in Karlsruhe, Germany. The dataset was launched in 2012 and different labels have been added over the years.
11. Bonus: JackRabbot Dataset and Benchmark (JRDB)
JRDB is a dataset collected by a social robot called JackRabbot. It features sequences from different indoor and outdoor locations on the Stanford University campus. Since the robot’s size is comparable to that of a human, the data has a different perspective than the car-based datasets.
Every machine learning system needs the right data to perform well. Public datasets can help you experiment quickly, but they are often not suited for training your final models. Your ML models might perform worse once deployed, because the data in public datasets was recorded in a different environment (country, weather conditions) or with a different sensor setup than yours. To avoid this performance loss, we’ll show you how you can create your own 3D point cloud dataset.
Creating a dataset requires three steps:
1. Data collection
Data collection involves acquiring the right tool to capture new data, e.g. a vehicle with a lidar sensor, and going out and capturing the data. You should capture data in an environment that matches the real production environment as closely as possible.
2. Data selection/curation
Next, you often have to select which captured data you want to include in the dataset, as it can be infeasible and inefficient to use all captured data. Here, it is important to choose diverse data that covers all the different scenarios you captured. Discarding redundant or uninformative data (for example, long stretches of near-identical frames) can speed up labeling and improve model performance.
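One simple way to pick diverse data is greedy farthest-point selection: repeatedly add the frame that is most different from everything selected so far. The snippet below is a minimal sketch of that idea, not a prescribed curation method. It assumes each frame has already been summarized as a small numeric descriptor (the `frames` values and the `select_diverse` helper are hypothetical, made up for illustration), and it uses only the Python standard library.

```python
import math

def select_diverse(features, k):
    """Greedy farthest-point selection: return indices of k mutually distant items."""
    if k <= 0 or not features:
        return []
    selected = [0]  # start from the first sample (an arbitrary choice)
    # Distance from every sample to its nearest already-selected sample.
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(selected) < min(k, len(features)):
        # Pick the sample that is farthest from everything selected so far.
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
    return selected

# Hypothetical 2D frame descriptors (e.g. produced by an embedding model).
frames = [(2, 0.0), (2, 0.1), (3, 0.2), (15, 8.0), (14, 8.2), (0, 20.0)]
picked = select_diverse(frames, 3)
print(picked)
```

Near-duplicate frames (such as the first three descriptors here) end up represented by a single pick, which is the behavior you want when consecutive lidar frames barely differ.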
3. Data labeling
Finally, you have to label your selected data. For object detection/tracking, this means drawing 3D bounding boxes (cuboids) around the objects you want to detect. For segmentation, you have to annotate the individual points in your point clouds. This can be a tedious and time-consuming process, but with the right tools you can speed up your labeling significantly.
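To make the two label types concrete, here is a rough sketch of how a cuboid label and per-point segmentation labels could be represented, and how one relates to the other. The `Cuboid` fields and the `labels_from_cuboids` helper are assumptions chosen for illustration; they do not correspond to the label format of any particular dataset or tool.

```python
import math
from dataclasses import dataclass

@dataclass
class Cuboid:
    # Hypothetical label format: center, size, and rotation around the z-axis.
    cx: float
    cy: float
    cz: float
    length: float  # extent along the box's local x-axis
    width: float   # extent along the box's local y-axis
    height: float  # extent along the z-axis
    yaw: float     # radians, counter-clockwise around z
    category: str

def contains(box, point):
    """Return True if the (x, y, z) point lies inside the cuboid."""
    dx, dy, dz = point[0] - box.cx, point[1] - box.cy, point[2] - box.cz
    # Rotate the offset by -yaw to express it in the box's local frame.
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    return (abs(local_x) <= box.length / 2
            and abs(local_y) <= box.width / 2
            and abs(dz) <= box.height / 2)

def labels_from_cuboids(points, boxes, background="unlabeled"):
    """Derive one category per point from cuboids (first matching box wins)."""
    return [
        next((b.category for b in boxes if contains(b, p)), background)
        for p in points
    ]

# A car rotated 90 degrees, so its length runs along the world y-axis.
car = Cuboid(10.0, 0.0, 0.8, 4.0, 2.0, 1.6, math.pi / 2, "car")
points = [(10.0, 1.5, 0.8), (11.5, 0.0, 0.8), (10.0, 0.0, 2.0)]
labels = labels_from_cuboids(points, [car])
print(labels)
```

A cuboid is a handful of numbers per object, while a segmentation label is one value per point. That difference in granularity is a large part of why point-level annotation takes longer.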
Segments.ai has dedicated labeling interfaces for 3D point cloud data. If you work with sequential data, you can use our interpolation feature to label faster. To speed up your labeling even further, Segments.ai also allows you to set up model-assisted workflows, where you train an initial model on a small set of labeled data, and then use the model to help label the complete dataset. Finally, you can choose whether to label the data in-house or work with an external workforce.
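To give an intuition for why interpolation saves time on sequences: if an annotator places a cuboid on two keyframes, the boxes for the frames in between can be estimated automatically and only corrected where needed. The sketch below uses plain linear interpolation and is an illustration of the general idea only; it is not the Segments.ai implementation, and the dictionary keys and function names are hypothetical.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between two numbers."""
    return a + (b - a) * t

def lerp_angle(a, b, t):
    """Interpolate between two angles (radians) along the shortest arc."""
    diff = (b - a + math.pi) % (2 * math.pi) - math.pi
    return a + diff * t

def interpolate_cuboid(key_a, key_b, frame):
    """Estimate a cuboid at `frame` from two labeled keyframes.

    Each keyframe is a dict with 'frame', 'center' (x, y, z),
    'size' (length, width, height) and 'yaw' (hypothetical format).
    """
    span = key_b["frame"] - key_a["frame"]
    if span <= 0:
        raise ValueError("key_b must come after key_a")
    t = (frame - key_a["frame"]) / span
    return {
        "frame": frame,
        "center": tuple(lerp(a, b, t) for a, b in zip(key_a["center"], key_b["center"])),
        "size": tuple(lerp(a, b, t) for a, b in zip(key_a["size"], key_b["size"])),
        "yaw": lerp_angle(key_a["yaw"], key_b["yaw"], t),
    }

# Two manually labeled keyframes, ten frames apart.
first = {"frame": 0, "center": (0.0, 0.0, 0.8), "size": (4.0, 2.0, 1.6), "yaw": 0.0}
last = {"frame": 10, "center": (20.0, 5.0, 0.8), "size": (4.0, 2.0, 1.6), "yaw": 0.5}
mid = interpolate_cuboid(first, last, 5)
print(mid)
```

In this example, two manual labels yield estimates for nine intermediate frames. Linear interpolation assumes roughly constant motion between keyframes, so objects that brake or turn sharply need keyframes placed closer together.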
Next, we’ll convert the W&B artifact to a dataset on Segments.ai, our labeling platform. This is easy to do programmatically using the simple Segments.ai Python SDK.
Autonomous vehicles use lidar sensors to see the world around them in 3D. To train models that detect objects and understand the scene, we need labeled 3D point cloud datasets. In this article, we highlighted 10 lidar datasets for autonomous driving. The datasets can be used for tasks such as 3D object detection, 3D multi-object tracking and segmentation (MOTS), and 3D point cloud segmentation.
If you want to create a machine learning model for a different application, if you want to use different categories, or if your data in production differs from the data in the public datasets, you’ll have to create your own dataset. For this, you need to collect, curate, and label data. For lidar data, Segments.ai is the best tool for labeling your data and managing the labeling workforce.
Hope this was useful! If you have any questions or suggestions, feel free to send us an email at email@example.com