Lidar sensors use laser beams to capture the world in 3D. The sensors output 3D point clouds, which are simply collections of points in 3D. Machine learning models can be used to detect and track objects in these point clouds, or even to classify every single point (segmentation). This enables autonomous vehicles to understand their surroundings, and can also be used to make cities smarter, to create AR/VR applications, and for indoor design/real estate applications.
In this article, we give an overview of 10 public labeled lidar datasets that you can use in your autonomous driving projects. The mentioned datasets contain either 3D bounding box (cuboid) labels or segmentation labels. We’ll also show how you can create your own 3D point cloud dataset, in case the open datasets do not fit your use case or if their licenses are too restrictive (only 2 datasets can be used commercially).
KITTI is a dataset of lidar sequences of street scenes in Karlsruhe, Germany. The dataset was launched in 2012 and different labels have been added over the years.
License: Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Lidar sensor: Velodyne HDL-64E
nuScenes is a large-scale autonomous driving dataset consisting of urban street scenes captured in Singapore and Boston, U.S.
Download (account required)
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International, or acquire a commercial license
Waymo Open is a diverse autonomous driving dataset. It includes scenes captured in 6 U.S. areas in a wide variety of environments and weather conditions.
1200 labeled sequences, 4 categories
1150 labeled sequences, 23 categories
A2D2 stands for Audi Autonomous Driving Dataset. The data was captured in 3 German cities.
License: Creative Commons Attribution-NoDerivatives 4.0 International
Lidar sensors: 5x sensor
12,499 labeled frames, 14 categories
41,280 labeled frames, 38 categories
Labels obtained from 2D semantic segmentation on camera images
Argoverse 2 is a collection of open-source autonomous driving data from six U.S. cities.
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Lidar sensors: 2x Velodyne VLP-32C
ApolloScape is an autonomous driving dataset created by Baidu research. The dataset was collected under various lighting conditions and traffic densities in Beijing, China.
License: academic use only
Lidar sensors: 2x Riegl VMX-1HA
PandaSet is a high-quality dataset for autonomous driving created by lidar producer Hesai. Its 100+ scenes are selected from two routes in Silicon Valley.
WADS is a dataset of 20 scenes for autonomous driving collected in severe winter weather in Michigan, U.S.
Seeing Through Fog is a driving dataset that is part of the DENSE project. The data covers weather conditions such as fog, snow, and rain, and was captured in northern Europe.
12,000 labeled frames, 28 categories
Toronto-3D is a detailed dataset of 1km of road in Toronto, Canada.
JRDB is a dataset collected by a social robot called JackRabbot. It features sequences from different indoor and outdoor locations on the Stanford University campus. Since the robot’s size is comparable to a human, the data has a different perspective than the other car-based datasets.
57,600 labeled frames
Every machine learning system needs the right data to perform well. Public datasets can help you experiment quickly, but they often are not suited for training your final models. The ML models might perform worse when you deploy them, because the data in public datasets might be recorded in a different environment (country, weather condition), or because your sensor set-up is different. To avoid this performance loss, we’ll show you how you can create your own 3D point cloud dataset.
Creating a dataset requires three steps: data collection, data curation, and data labeling.
Data collection involves acquiring the right tool to capture new data, e.g. a vehicle with a lidar sensor, and going out and capturing the data. You should capture data in an environment that matches the real production environment as closely as possible.
Next, you often have to select which captured data you want to include in the dataset, as using all captured data is usually infeasible and inefficient. Here, it is important to choose diverse data that covers all the different scenarios you captured. Discarding redundant or uninformative data can speed up labeling and improve model performance.
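To make the idea of choosing diverse data more concrete, here is a minimal sketch of one possible selection strategy: greedy farthest-point selection. It assumes you can describe each frame with a small numeric feature vector. The descriptors used here (object count and mean speed) and the function name `select_diverse` are hypothetical and only serve as an illustration; in practice you might use embeddings from a pretrained model instead.

```python
import math

def select_diverse(features, k):
    """Greedy farthest-point selection: return indices of k mutually distant items."""
    if not features or k <= 0:
        return []
    k = min(k, len(features))
    selected = [0]  # start from the first frame (an arbitrary choice)
    # Distance from every frame to its nearest already-selected frame.
    min_dist = [math.dist(f, features[0]) for f in features]
    min_dist[0] = -1.0  # mark selected frames so they are never picked again
    while len(selected) < k:
        # Pick the frame that is farthest from everything selected so far.
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[nxt]))
        min_dist[nxt] = -1.0
    return selected

# Hypothetical per-frame descriptors: (number of objects, mean speed in m/s).
frames = [(2, 0.0), (2, 0.1), (3, 0.0), (15, 8.0), (14, 8.2), (1, 25.0)]
picked = select_diverse(frames, 3)
print(picked)
```

With these toy descriptors, the three picked frames come from three visibly different situations (a quiet scene, a fast scene, and a crowded scene), while near-duplicates of an already selected frame are skipped.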
Finally, you have to label your selected data. For object detection/tracking, this means drawing 3D bounding boxes (cuboids) around the objects you want to detect. For segmentation, you have to annotate the individual points in your point clouds. This can be a tedious and time-consuming process, but with the right tools you can speed up your labeling significantly.
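The two label types are closely related, which the following sketch tries to illustrate: a cuboid label is a box with a center, a size, and a heading, while a segmentation label assigns a category to every point. The `Cuboid` class and its field names are assumptions made for this example and do not correspond to the label format of any particular dataset or tool. Real segmentation labels are annotated per point and are more precise than what a box can give you.

```python
import math
from dataclasses import dataclass

@dataclass
class Cuboid:
    center: tuple   # (x, y, z) of the box center, in meters
    size: tuple     # (length, width, height), in meters
    yaw: float      # rotation around the vertical (z) axis, in radians
    category: str

def contains(box, point):
    """Return True if the point lies inside the (rotated) cuboid."""
    dx = point[0] - box.center[0]
    dy = point[1] - box.center[1]
    dz = point[2] - box.center[2]
    # Rotate the offset by -yaw to express it in the box's own frame.
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    local_x = c * dx + s * dy
    local_y = -s * dx + c * dy
    length, width, height = box.size
    return (abs(local_x) <= length / 2
            and abs(local_y) <= width / 2
            and abs(dz) <= height / 2)

def labels_from_cuboids(points, boxes, background="unlabeled"):
    """Give each point the category of the first cuboid that contains it."""
    labels = []
    for p in points:
        label = background
        for box in boxes:
            if contains(box, p):
                label = box.category
                break
        labels.append(label)
    return labels

# A car-sized box rotated by 90 degrees, so its long side runs along the y axis.
car = Cuboid(center=(10.0, 0.0, 1.0), size=(4.0, 2.0, 2.0),
             yaw=math.pi / 2, category="car")
points = [(10.0, 1.5, 1.0), (11.5, 0.0, 1.0), (10.0, 0.0, 5.0)]
labels = labels_from_cuboids(points, [car])
print(labels)
```

Only the first point falls inside the rotated box. The second point is too far to the side, and the third is above the box.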
Segments.ai has dedicated labeling interfaces for 3D point cloud data. If you work with sequential data, you can use our interpolation feature to label faster. To speed up your labeling even further, Segments.ai also allows you to set up model-assisted workflows, where you train an initial model on a small set of labeled data, and then use the model to help label the complete dataset. Finally, you can choose whether to label the data in-house or work with an external workforce.
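To give an intuition for why interpolation saves time on sequential data, here is a minimal sketch of linear interpolation between two manually labeled keyframes. This is an illustration under simple assumptions (constant motion between keyframes, a cuboid described only by its center and yaw), not a description of how the Segments.ai feature is implemented. The function name `interpolate_cuboid` and the keyframe tuple layout are hypothetical.

```python
import math

def interpolate_cuboid(key_a, key_b, frame):
    """Linearly interpolate a cuboid pose between two labeled keyframes.

    Each keyframe is a tuple (frame_index, (x, y, z), yaw_in_radians).
    """
    frame_a, center_a, yaw_a = key_a
    frame_b, center_b, yaw_b = key_b
    if frame_a == frame_b:
        raise ValueError("keyframes must have different frame indices")
    t = (frame - frame_a) / (frame_b - frame_a)
    center = tuple(a + t * (b - a) for a, b in zip(center_a, center_b))
    # Use the shortest signed angle difference so the box does not spin
    # the long way around when the yaw wraps past +/- pi.
    delta_yaw = math.atan2(math.sin(yaw_b - yaw_a), math.cos(yaw_b - yaw_a))
    yaw = yaw_a + t * delta_yaw
    return center, yaw

# Label frame 0 and frame 10 by hand, then fill in frame 5 automatically.
first = (0, (0.0, 0.0, 0.0), 0.0)
last = (10, (20.0, 5.0, 0.0), 0.5)
center, yaw = interpolate_cuboid(first, last, 5)
print(center, yaw)
```

In this example, an annotator labels 2 frames instead of 11 and only needs to correct the in-between frames where the object does not move at a constant rate.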
Check out SemanticKITTI on Segments.ai. You can also try the platform for free for 14 days, or book a demo. We’re always happy to see if Segments.ai is the right fit for your use case, so do not hesitate to get in touch.
A showcase of Segments.ai’s lidar interface
Autonomous vehicles use lidar sensors to see the world around them in 3D. To detect objects and understand the scene, we need 3D point cloud datasets. In this article, we highlighted 10 lidar datasets for autonomous driving. The datasets can be used for tasks such as 3D object detection, 3D multi-object tracking and segmentation (MOTS), and 3D point cloud segmentation.
If you want to create a machine learning model for a different application, if you want to use different categories, or if your data in production differs from the data in the public datasets, you’ll have to create your own dataset. For this, you need to collect, curate, and label data. For lidar data, Segments.ai is the best tool for labeling your data and managing the labeling workforce.
Hope this was useful! If you have any questions or suggestions, feel free to send us an email at email@example.com