Late vs early sensor fusion: a comparison

2 min read · May 22nd, 2024

Sensor fusion is the process of combining data from multiple sensors (e.g. multiple cameras, lidars, and radars) to obtain a more accurate perception of the environment than what could be obtained by any individual sensor alone.

It is a key technology in applications such as autonomous driving and robotics, which require accurate scene understanding to navigate safely.

Multimodal sensor fusion approaches for deep learning

Multimodal fusion combines different types of data from various sources to improve prediction accuracy. Essentially, each data source provides unique and helpful information that complements the others.

Handling multimodal data is crucial in robotics and autonomous driving, where reliably distinguishing a pedestrian from other objects in a street scene is essential for safe operation.

There are two main approaches to multimodal fusion:

  1. Late fusion (also called high-level fusion): the data of each sensor is processed independently to make a local prediction. These individual results are then combined at a higher level to make the final fused prediction.
  2. Early fusion (also called low-level fusion): the raw data from different sensors is combined before any high-level processing or decision-making. The fused data is then used as input to a machine-learning model.

Late sensor fusion

Late fusion processes each sensor independently and then combines the individual perception results. Here’s an example of a typical pipeline to detect objects in a sensor setup with a single lidar and multiple cameras:

  • Run 3D object detection on the lidar point cloud
  • Run 2D object detection on the camera images
  • Project the 2D detections into the 3D space, fusing them with the 3D detections

Typically, the projected 2D detections are used to determine the category of each 3D detection, since the object class is often hard to predict from the point cloud alone, for example distinguishing a car from a van.
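
To make that last step concrete, here is a minimal Python sketch of one way to do it: 3D lidar detections (already projected into the image plane) are matched to 2D camera detections by overlap, and the camera's class label is transferred to the matched 3D box. The `Detection2D`/`Detection3D` classes, box format, and IoU threshold are simplifying assumptions for illustration, not a reference implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection2D:
    box: tuple            # (x1, y1, x2, y2) in image pixels
    label: str            # e.g. "car", "van", "pedestrian"

@dataclass
class Detection3D:
    box_image: tuple      # the 3D box projected into the same image, as (x1, y1, x2, y2)
    label: str = "unknown"

def iou(a, b):
    """Intersection-over-union of two axis-aligned 2D boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def late_fuse(dets_3d, dets_2d, iou_threshold=0.5):
    """Assign each 3D lidar detection the label of its best-overlapping 2D camera detection."""
    for det3d in dets_3d:
        best = max(dets_2d, key=lambda d: iou(det3d.box_image, d.box), default=None)
        if best is not None and iou(det3d.box_image, best.box) >= iou_threshold:
            det3d.label = best.label
    return dets_3d
```

In a real system the matching is done per camera and usually with a more careful association step (e.g. Hungarian matching) plus proper calibration handling, but the structure stays the same: independent detectors first, a lightweight combination step afterwards.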

: "Diagram illustrating late multimodal sensor fusion technique. The process includes separate radar, video, and image data analysis, which are then merged using an ad-hoc fusion algorithm to produce a comprehensive visualization of traffic, including cars and pedestrians, on urban streets.

What are the (dis)advantages of late fusion?

Late fusion is modular and more fault-tolerant than early fusion since each sensor operates independently: if one sensor fails, the system can keep operating. It’s usually also less computationally intensive than early fusion.

The main disadvantage is that the perception models only see data from one sensor at a time, so they can’t leverage any cross-sensor interactions.

Early sensor fusion

Early fusion combines the raw data from multiple sensors before running a perception algorithm. There is growing interest in end-to-end early fusion approaches that directly map the raw sensor inputs to object detections using a single deep neural network without needing hand-crafted fusion algorithms.
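
As a simplified sketch of what "combining the raw data" can look like: lidar points, assumed to be already calibrated into the camera frame, are rasterized into a sparse depth channel and stacked with the RGB image, so a single network receives both modalities in one pixel-aligned tensor. The `detection_net` at the end is a hypothetical placeholder for whatever multimodal model consumes the fused input.

```python
import numpy as np

def rasterize_depth(points_cam, intrinsics, height, width):
    """Project lidar points (N, 3), given in the camera frame, to a sparse depth map."""
    depth = np.zeros((height, width), dtype=np.float32)
    z = points_cam[:, 2]
    in_front = z > 0                                  # keep only points in front of the camera
    uvw = points_cam[in_front] @ intrinsics.T         # pinhole projection (homogeneous coordinates)
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth[v[inside], u[inside]] = z[in_front][inside]
    return depth

def early_fuse(rgb_image, points_cam, intrinsics):
    """Stack the RGB image and the rasterized lidar depth into one (H, W, 4) array."""
    h, w, _ = rgb_image.shape
    depth = rasterize_depth(points_cam, intrinsics, h, w)
    return np.concatenate([rgb_image.astype(np.float32), depth[..., None]], axis=-1)

# Hypothetical usage:
# fused = early_fuse(rgb, lidar_points_in_camera_frame, camera_intrinsics)
# detections = detection_net(fused)   # a single multimodal detection network
```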

Tesla is a pioneer of the end-to-end early fusion approach, having presented its network architecture at Tesla AI Day 2021. Tesla is increasingly moving towards a fully end-to-end architecture, in which a single deep neural network predicts not just object detections but actual driving commands (steering wheel angle and acceleration) directly from the raw sensor inputs, without any hand-crafted fusion or planning algorithms in between.

Figure: early multimodal sensor fusion. Radar and video data are integrated up front and processed together by a single algorithm, leading to a detailed visualization of traffic dynamics, including vehicles and pedestrians, on a city street.

What are the (dis)advantages of early fusion?

Early fusion concatenates raw sensor data, which allows the neural network to exploit correlations between low-level features of the different sensors. This provides a more information-rich input to the learning models, but it also increases the dimensionality of the feature space. This can make learning more difficult, especially with limited training data.

The approach is also less modular and tightly couples the sensors, making it more sensitive to noise and sensor failures.

Future direction

As machine learning techniques advance, we can expect to see more robust and adaptable sensor fusion systems that can reliably interpret complex environments. Some key open challenges include:

  • Enhancing the robustness of early fusion approaches to sensor noise and failures
  • Increasing real-time inference efficiency on resource-constrained platforms
  • Adapting fusion models to new environments via transfer learning
  • Moving towards fully end-to-end models that directly predict steering commands from raw sensor data

Sensor fusion plays a pivotal role in the robotics and automotive sectors, improving perception by integrating data from multiple sensors. Continued research into robust fusion models and architectures will further extend the capabilities of autonomous systems.

If you’re looking to get your multi-sensor data labeled with consistent object IDs across time and sensors, contact us!