
7 State-Of-The-Art Point Cloud Models for Autonomous Driving

December 13th, 2023 - 6 min read

Over the past couple of years, most state-of-the-art computer vision (deep learning) models have converged on the transformer architecture. The same trend has emerged in deep learning models that work on point clouds (or on both point clouds and images). However, these models remain hard to generalize to other types of sensors, for example because point density can differ greatly between sensors.

In this blog post, we’ll highlight 7 deep learning models (i.e., BEVFusion, GeomGCNN, EA-LSS, FocalFormer 3D, GLENet, PointMLP, and GDANet) that are state-of-the-art on benchmarks such as NuScenes, KITTI, and PointCloud-C, and that can serve as a great starting point to train your custom 3D model!

Let’s dive in.

Do you just want to pick a good model and start using it for your use case?
Have a look at the overview table

BEVFusion

Bird’s-eye view fusion, or BEVFusion, is a popular model developed in 2023 by researchers from MIT. It performs very well on 3D object detection (it ranks #3 on the NuScenes benchmark measured by mean average precision). It’s released under the Apache 2.0 license, so you can use it for commercial purposes.

Novelty

The novelty of the authors’ approach is twofold.

  • It unifies multi-modal features in a shared bird’s-eye view (BEV) representation space, which preserves both geometric and semantic information
  • The authors optimize the BEV pooling step, which they claim reduces the latency of the view transformation step by 40x (see the sketch below)
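
To make the second point concrete, here is a minimal sketch of what BEV pooling computes: camera features lifted to 3D points are scatter-added into a flat BEV grid. The tensor names and shapes are our own assumptions, and the authors' actual speedup comes from a specialized CUDA kernel with precomputed cell intervals rather than this naive scatter.

```python
import torch

def bev_pool(cam_feats: torch.Tensor, bev_coords: torch.Tensor,
             grid_h: int, grid_w: int) -> torch.Tensor:
    """Sum-pool per-point camera features into a BEV grid.

    cam_feats:  (N, C) features of N points lifted from the cameras.
    bev_coords: (N, 2) integer (row, col) BEV cell of each point.
    Returns:    (C, grid_h, grid_w) BEV feature map.
    """
    n, c = cam_feats.shape
    # Flatten each (row, col) pair into a single cell index.
    cell_idx = bev_coords[:, 0] * grid_w + bev_coords[:, 1]  # (N,)
    bev = torch.zeros(grid_h * grid_w, c, device=cam_feats.device)
    # Scatter-add: all point features landing in the same cell are summed.
    bev.index_add_(0, cell_idx, cam_feats)
    return bev.view(grid_h, grid_w, c).permute(2, 0, 1)

# Example: 10k lifted points with 64-dim features on a 180x180 BEV grid.
feats = torch.randn(10_000, 64)
coords = torch.randint(0, 180, (10_000, 2))
bev_map = bev_pool(feats, coords, 180, 180)  # (64, 180, 180)
```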

Image taken from BEVFusion paper.

Summary

  • Task. Point cloud object detection
  • Modality. Multi-sensor (point cloud and images)
  • License. Apache 2.0 (you can use it commercially)
  • Paper
  • Code
  • Demo page

GeomGCNN

Geometric graph convolutional neural network, or GeomGCNN, is a model from 2021 developed by researchers from, among others, IIT Kanpur and TensorTour. It performs well on point cloud classification and segmentation (it ranks #1 on the ModelNet40 dataset for classification and #1 on the ShapeNet-Part dataset for segmentation).

Novelty

The novelty of the method can be summarized as follows.

  • The vertex representations are augmented with local geometric information, followed by a non-linear projection using a multilayer perceptron (see the sketch below)
  • Existing methods based on k-NN (k nearest neighbors) do not take the geometry of the points into account. The authors sample points in a different way, which helps capture a larger part of the point cloud for locally dense point clouds and improves performance
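
Below is a minimal sketch of the first idea: augment each point with local geometry (its offsets to k nearest neighbors) and project with a shared MLP. The layer sizes, the choice of k-NN, and the max-pooling over neighbors are our own illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn

def knn_indices(points: torch.Tensor, k: int) -> torch.Tensor:
    """points: (N, 3). Returns (N, k) indices of each point's k nearest neighbors."""
    dists = torch.cdist(points, points)                      # (N, N) pairwise distances
    return dists.topk(k + 1, largest=False).indices[:, 1:]   # drop self-match

class LocalGeomEmbed(nn.Module):
    """Augment each point with local geometric information, then apply a shared MLP."""
    def __init__(self, k: int = 16, out_dim: int = 64):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(6, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        idx = knn_indices(points, self.k)              # (N, k)
        neighbors = points[idx]                        # (N, k, 3)
        offsets = neighbors - points.unsqueeze(1)      # relative local geometry
        # Concatenate each point's absolute position with its relative offsets.
        feats = torch.cat([points.unsqueeze(1).expand_as(offsets), offsets], dim=-1)
        return self.mlp(feats).max(dim=1).values       # (N, out_dim), pooled over neighbors

cloud = torch.randn(1024, 3)
embed = LocalGeomEmbed()(cloud)  # (1024, 64)
```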

Summary

  • Task. Point cloud segmentation and classification
  • Modality. Point cloud
  • License. Not open source
  • Paper
  • Demo page

EA-LSS

Edge-aware lift-splat-shot, or EA-LSS, is a model from August 2023 by researchers from, among others, Zhejiang Leapmotor Technology and Oppo Research Institute. At the time of writing, it achieves state-of-the-art results on NuScenes 3D object detection (it ranks #1 measured by normalized detection score and mean average precision).

Lift-splat-shoot is a method from NVIDIA in which the depth of each 2D (image) pixel is predicted as a probability distribution over discrete depth bins. When you combine the camera parameters (intrinsics and extrinsics) with these depth estimates, you get a 3D probability distribution of points that you can use, for example, to navigate a car.
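Here is a minimal sketch of the "lift" step under those assumptions: each pixel predicts a categorical distribution over D depth bins, and the outer product with the image features yields a frustum of depth-weighted features. The shapes and names are our own.

```python
import torch

def lift(img_feats: torch.Tensor, depth_logits: torch.Tensor) -> torch.Tensor:
    """Lift 2D image features into a 3D frustum of features.

    img_feats:    (C, H, W) per-pixel image features.
    depth_logits: (D, H, W) unnormalized scores over D discrete depth bins.
    Returns:      (D, C, H, W) features weighted by the depth distribution.
    """
    depth_prob = depth_logits.softmax(dim=0)   # categorical distribution per pixel
    # Outer product: each pixel's feature is spread over the depth bins
    # according to its predicted depth probability.
    return depth_prob.unsqueeze(1) * img_feats.unsqueeze(0)

feats = torch.randn(64, 32, 88)    # C=64 features on a 32x88 feature map
logits = torch.randn(60, 32, 88)   # D=60 depth bins
frustum = lift(feats, logits)      # (60, 64, 32, 88)
```

Given the camera parameters, every (depth bin, pixel) cell corresponds to a 3D location, so this frustum can then be "splatted" onto a BEV grid, much like the BEV pooling sketch shown earlier for BEVFusion.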

Novelty

The authors of EA-LSS note that the ground-truth depth information from point clouds can be used more effectively to train the depth prediction model. They propose two methods that boost the depth model’s performance during training.

  • An edge-aware depth fusion module is introduced that helps the network learn better depth predictions in regions close to object edges (other methods struggle there because the depth varies a lot around edges)
  • They use a fine-grained depth module to upsample the point cloud. This helps because it creates a finer depth ground truth to train the image depth network on (see the projection sketch below)
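
The usual starting point for such depth supervision is projecting the lidar points into the image to obtain a sparse ground-truth depth map, which the fine-grained module then refines further. Below is a minimal NumPy sketch of that projection step, with names of our own choosing.

```python
import numpy as np

def sparse_depth_map(points_cam: np.ndarray, K: np.ndarray,
                     h: int, w: int) -> np.ndarray:
    """Project lidar points (already in the camera frame) onto the image
    plane to produce a sparse per-pixel depth ground truth.

    points_cam: (N, 3) points in camera coordinates (z = forward depth).
    K:          (3, 3) camera intrinsics.
    """
    z = points_cam[:, 2]
    valid = z > 0                                   # keep points in front of the camera
    pts = points_cam[valid]
    uv = (K @ pts.T).T                              # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.zeros((h, w), dtype=np.float32)      # 0 = no lidar return
    # If several points hit the same pixel, keep the nearest one
    # (write far-to-near, so the nearest point is written last).
    order = np.argsort(-pts[inside][:, 2])
    depth[v[inside][order], u[inside][order]] = pts[inside][:, 2][order]
    return depth

# Hypothetical example: random points in the camera frame and a 3x3 intrinsics K.
K = np.array([[800, 0, 320], [0, 800, 180], [0, 0, 1]], dtype=float)
depth_gt = sparse_depth_map(np.random.rand(5000, 3) * [20, 10, 60], K, 360, 640)
```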

Visualization of the depth jump problem (the depth around the edges of point cloud objects is not accurate). Image taken from EA-LSS paper.

Summary

  • Task. Object detection
  • Modality. Multi-sensor (point cloud and images)
  • License. Apache 2.0 (you can use this model commercially)
  • Paper
  • Code

FocalFormer 3D

FocalFormer 3D is a model from August 2023, developed by researchers from NVIDIA, Caltech, and The Chinese University of Hong Kong. Their largest model (FocalFormer 3D-F) performs well on object detection on the NuScenes dataset (#8 measured by normalized detection score).

Novelty

This model architecture focuses on mitigating false negatives (for example, failing to detect a pedestrian, car, or cyclist where one should be detected). The authors present a pipeline they coin hard instance probing, which focuses on detecting hard cases over multiple stages (sketched below).
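The following toy sketch illustrates the spirit of hard instance probing, not the authors' implementation: each stage picks the top-scoring BEV cells, then masks them out so the next stage must probe for instances the earlier stages missed. All names and parameters here are our own.

```python
import torch

def hard_instance_probing(heatmap: torch.Tensor, stages: int = 3,
                          top_k: int = 200, radius: int = 2):
    """Multi-stage proposal selection: each stage takes the top-scoring cells
    and suppresses their neighborhoods, forcing later stages to focus on
    harder, previously missed regions.

    heatmap: (H, W) class-agnostic objectness scores on the BEV grid.
    Returns a list of (stage, row, col) proposals.
    """
    h, w = heatmap.shape
    scores = heatmap.clone()
    proposals = []
    for stage in range(stages):
        idx = scores.flatten().topk(top_k).indices
        rows = torch.div(idx, w, rounding_mode="floor")
        cols = idx % w
        for r, c in zip(rows.tolist(), cols.tolist()):
            proposals.append((stage, r, c))
            # Mask a small neighborhood so later stages ignore this region.
            r0, r1 = max(r - radius, 0), min(r + radius + 1, h)
            c0, c1 = max(c - radius, 0), min(c + radius + 1, w)
            scores[r0:r1, c0:c1] = float("-inf")
    return proposals

props = hard_instance_probing(torch.rand(180, 180))
```

In the paper, this masking happens on BEV heatmaps across stages, and the accumulated proposals are then refined by a transformer decoder.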

Summary

  • Task. Point cloud object detection
  • Modality. Point cloud
  • License. Nvidia Source Code License-NC (not for commercial use)
  • Paper
  • Code

GLENet

GLENet is a paper published in June 2023. It’s a method to take ground-truth label uncertainty into account when training a model. The researchers of this paper are from, among others, The Chinese University of Hong Kong. They use the method to train popular 3D models and achieve state-of-the-art performance (#1 measured by mean average precision) on point cloud object detection on the KITTI dataset.

Novelty

GLENet is a method (not a complete model) that takes the uncertainty of ground-truth labels into account. Point clouds can be sparse, so a car may be represented by only a couple of points, making it impossible to assign a single deterministic ground-truth bounding box.
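One common way such label variances plug into a probabilistic detector is an uncertainty-aware regression loss: the KL divergence between a Gaussian "soft" label (whose variance encodes annotation ambiguity) and the predicted box distribution. A hedged sketch with our own tensor conventions, not GLENet's exact loss:

```python
import torch

def kl_box_loss(pred_mu, pred_log_var, gt_mu, gt_var):
    """KL(N(gt_mu, gt_var) || N(pred_mu, exp(pred_log_var))) per box dimension.

    All tensors: (..., 7) for a 3D box (x, y, z, w, l, h, yaw).
    The label variance gt_var could come from a method like GLENet.
    """
    pred_var = pred_log_var.exp()
    kl = 0.5 * (pred_log_var - gt_var.log()
                + (gt_var + (gt_mu - pred_mu) ** 2) / pred_var - 1.0)
    return kl.mean()

# A sparse, ambiguous object gets a large label variance, so regression
# errors on it are penalized less than on a clearly visible object.
loss = kl_box_loss(torch.zeros(7), torch.zeros(7),
                   torch.full((7,), 0.1), torch.full((7,), 0.5))
```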

Illustration of multiple potentially plausible bounding boxes from GLENet on the KITTI dataset by sampling latent variables multiple times. The point cloud, annotated ground-truth boxes, and predictions of GLENet are colored in black, red, and green, respectively. Image and caption taken from GLENet paper.

Summary

  • Task. Point cloud object detection
  • Modality. Point cloud
  • License. Apache 2.0 (you can use this model commercially)
  • Paper
  • Code

PointMLP

Point multilayer perceptron, or PointMLP, is a model from 2022 by researchers from Northeastern University and Columbia University. It performs well on point cloud segmentation (it ranks #4 on the PointCloud-C dataset measured by mean corruption error).

Novelty

The model is deliberately designed with a simple architecture (plain multilayer perceptrons), which makes it faster than other, more complex architectures: the authors claim it’s 2x faster to train and 7x faster at inference. Despite the simplicity, it performs comparably to other architectures because, according to the authors, detailed local geometric information is probably not the key to point cloud analysis (the more complex architectures focus on extracting local geometric information to improve performance, at the cost of slower inference). A sketch of the core building block is shown below.
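The core building block really is this simple: a residual stack of shared pointwise MLPs, with no attention and no graph convolutions. A minimal sketch (the paper's geometric affine normalization and set-abstraction stages are omitted here):

```python
import torch
import torch.nn as nn

class ResidualPointBlock(nn.Module):
    """A residual block of shared pointwise MLPs, in the style of PointMLP."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.BatchNorm1d(dim),
        )
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) per-point features; the same MLP is applied to every point.
        return self.act(self.net(x) + x)

feats = torch.randn(1024, 64)        # 1024 points, 64-dim features
out = ResidualPointBlock()(feats)    # (1024, 64)
```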

Overview of the (simple) PointMLP model architecture. Image taken from PointMLP paper.

Summary

  • Task. Point cloud segmentation
  • Modality. Point cloud
  • License. Apache 2.0 (you can use this model commercially)
  • Paper
  • Code

GDANet

Geometry-disentangled attention network, or GDANet, is a model from 2021 by researchers from, among others, The University of Hong Kong and the University of Chinese Academy of Sciences. It achieves state-of-the-art results on point cloud segmentation (#1 on the PointCloud-C benchmark measured by mean corruption error).

Novelty

The authors take inspiration from computer vision work on 2D images, where one strategy is to decompose an image into low- and high-frequency parts (e.g., using Fourier or cosine transforms). These frequency components contain useful information that can be leveraged to build a good internal representation of the objects in the image. The authors use two modules in their architecture.

  • A geometry-disentangle module that dynamically disentangles point clouds into a contour part (the sharp-variation component) and a flat part (the gentle-variation component) of 3D objects (see the toy sketch after this list)
  • A sharp-gentle complementary attention module that treats the features of the sharp- and gentle-variation components as two holistic representations, and pays different attention to each while fusing them with the original point cloud features
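
As referenced in the first bullet, here is a toy version of the disentangling step. It uses a simple high-frequency proxy (the distance of each point from its local neighborhood centroid) in place of the authors' graph-frequency formulation, which operates on learned features rather than raw coordinates.

```python
import torch

def disentangle_sharp_gentle(points: torch.Tensor, k: int = 16,
                             sharp_ratio: float = 0.25):
    """Split a point cloud into 'sharp' (contour-like) and 'gentle' (flat)
    subsets using a simple high-frequency proxy.

    points: (N, 3). Returns (sharp_points, gentle_points).
    """
    dists = torch.cdist(points, points)
    idx = dists.topk(k + 1, largest=False).indices[:, 1:]   # (N, k), drop self
    centroids = points[idx].mean(dim=1)                     # local neighborhood mean
    score = (points - centroids).norm(dim=1)                # high score = sharp variation
    n_sharp = int(len(points) * sharp_ratio)
    order = score.argsort(descending=True)
    return points[order[:n_sharp]], points[order[n_sharp:]]

sharp, gentle = disentangle_sharp_gentle(torch.randn(2048, 3))
```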

Visualization of a point cloud split into low and high frequency parts. Image taken from GDANet readme on GitHub.

Summary

  • Task. Point cloud segmentation
  • Modality. Point cloud
  • License. MIT (you can use it commercially)
  • Paper
  • Code

(Warning) Using pre-trained 3D models in production

Point cloud models have seen a rise in applications in 3D computer vision. But even though a model may work exceptionally well on a certain benchmark, there is no guarantee it will perform equally well when applied to another dataset. Several factors, inherent to the nature of three-dimensional data and model construction, account for this inconsistency.

  • Variations in data quality. Different datasets are assembled using varied methods, equipment, and degrees of precision. Disparities ranging from point density to measurement noise mean that a model trained on one dataset can encounter difficulties when applied to a second. High-resolution datasets might exhibit details that a model initially trained on low-resolution data may misinterpret, negatively affecting its performance
  • Diversity in feature representation. Depending on the construction and purpose of a dataset, the features represented in one dataset may differ significantly from another’s. Distinctive elements, such as color, shape, or texture, may not be expressed identically in each dataset, limiting a model’s ability to transfer its knowledge
  • Underlying distribution. When shifting from one dataset to another, the underlying data distribution can change. This phenomenon, known as distribution shift, leads to a discrepancy between the statistical properties of the datasets. A model trained on data with a particular distribution may fail to generalize its learned patterns to a different distribution (a quick comparison of simple statistics, like the sketch after this list, can reveal such shifts)
  • Contextual and semantic differences. A point cloud model that performs well on one dataset might be closely tied to the specific context or semantics of that data (for example, a self-driving car dataset recorded in a city versus one recorded on rough terrain). Applying the same model to a dataset with a different context or semantics will likely degrade performance
  • Scale and orientation differences. 3D point cloud data can vary in terms of scale and orientation. A model that works well on a dataset of a particular scale and orientation may not perform accurately on another dataset with different characteristics
  • Annotation inconsistencies. One dataset may use different conventions or standards for annotations compared to another. These inconsistencies can limit the effectiveness of a model as it struggles to interpret unfamiliar annotation schemas
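
As mentioned above, comparing simple statistics between a model's training data and your own data can surface several of these issues before you invest in fine-tuning. A minimal NumPy sketch (the random arrays are stand-ins for your actual loaded point clouds):

```python
import numpy as np

def cloud_stats(clouds):
    """Per-frame point counts and sensor ranges for a list of (N_i, 3) arrays."""
    counts = np.array([len(c) for c in clouds])
    ranges = np.concatenate([np.linalg.norm(c, axis=1) for c in clouds])
    return counts, ranges

# Stand-ins: replace with point clouds loaded from the benchmark dataset
# (source) and from your own lidar (target).
source = [np.random.randn(30_000, 3) * 20 for _ in range(10)]
target = [np.random.randn(100_000, 3) * 40 for _ in range(10)]

for name, clouds in [("source", source), ("target", target)]:
    counts, ranges = cloud_stats(clouds)
    print(f"{name}: {counts.mean():.0f} pts/frame, "
          f"median range {np.median(ranges):.1f} m")
```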

In conclusion, while 3D point cloud models provide powerful tools for interpreting complex spatial data, their effectiveness is fundamentally tied to the quality and characteristics of the dataset on which they are trained. These challenges highlight the importance of training models on data closely related to your use case, and why you should consider creating your own labeled dataset if you want a well-performing 3D model. We at Segments.ai can help you with this. Feel free to reach out at hello@segments.ai to discuss your use case.

Overview of state-of-the-art point cloud models

Model           Task                             Modality      License                     Paper and code
BEVFusion       Object detection                 Multi-sensor  Apache 2.0                  Link
GeomGCNN        Classification and segmentation  Point cloud   Not open source             Link
EA-LSS          Object detection                 Multi-sensor  Apache 2.0                  Link
FocalFormer 3D  Object detection                 Point cloud   NVIDIA Source Code License  Link
GLENet          Object detection                 Point cloud   Apache 2.0                  Link
PointMLP        Segmentation                     Point cloud   Apache 2.0                  Link
GDANet          Segmentation                     Point cloud   MIT                         Link

Summary

We’ve covered 7 point cloud deep learning models that achieve state-of-the-art performance on benchmarks.

Some caveats, however, are that:

  • Point clouds can be very different across multiple types of lidars (for example, because the point cloud density is different)
  • These models are each trained on one type of dataset, which does not mean they will perform well on different types of data

To overcome these caveats, you should create a training dataset closely related to your use case. A data labeling platform like Segments.ai (to label your custom point cloud data) can help you do this fast and cheaply. You can then use these labels to fine-tune the above state-of-the-art models so they perform well on your use case.

If you have any questions or comments about the article, feel free to reach out to me at arnaud@segments.ai or www.twitter.com/arnaudhillen.
