Researchers at the MIT-IBM Watson AI Lab have unveiled an advancement in computer-vision technology that could transform the way autonomous vehicles perceive and interact with their surroundings.
The technology reduces the computational complexity of semantic segmentation, aiming to enhance the safety and efficiency of self-driving vehicles, according to the researchers.
Semantic segmentation is a process in which a deep-learning algorithm assigns a label or category to every pixel in an image. For autonomous vehicles, this means accurately identifying and categorising objects encountered on the road in real time.
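In code terms, the output of semantic segmentation is simply a class label for every pixel, with the same spatial shape as the input image. A minimal illustration (the class names and the hand-written label map are hypothetical; a real model would predict the map from camera input):

```python
import numpy as np

# Hypothetical classes for a driving scene: 0 = road, 1 = car, 2 = pedestrian.
CLASSES = {0: "road", 1: "car", 2: "pedestrian"}

# A tiny 4x4 "image" after segmentation: one class ID per pixel.
seg_map = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 2],
    [0, 0, 0, 2],
])

# Every pixel carries exactly one label, same spatial shape as the image.
assert seg_map.shape == (4, 4)
print({CLASSES[c]: int((seg_map == c).sum()) for c in CLASSES})
```

The per-pixel granularity is what makes the task useful for driving: the model does not just say "there is a pedestrian somewhere", it marks exactly which pixels the pedestrian occupies.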
The challenge with existing semantic segmentation models lies in their computational intensity, which grows sharply as image resolution increases. These models learn the interactions between every pair of pixels in an image, so the amount of computation grows quadratically with the number of pixels.
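A back-of-envelope calculation makes the scaling concrete (the resolutions are illustrative, not from the paper): because every pixel interacts with every other pixel, doubling each side of the image quadruples the pixel count and multiplies the pairwise interactions by sixteen.

```python
def pairwise_interactions(height: int, width: int) -> int:
    """Count pixel-pair interactions for an all-pairs (quadratic) model."""
    n = height * width  # one token per pixel
    return n * n        # every pixel attends to every other pixel

small = pairwise_interactions(256, 256)  # 65,536 pixels
large = pairwise_interactions(512, 512)  # 262,144 pixels
print(large // small)  # 16x more work for 2x the resolution
```

This is why all-pairs models that are perfectly usable on low-resolution inputs become impractical on the high-resolution frames a vehicle's cameras produce.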
While accurate, these models struggle to process high-resolution images in real time on edge devices such as sensors or mobile phones.
To address this issue, MIT researchers introduced a new building block for semantic segmentation models, designed to deliver equivalent performance but with only linear computational complexity.
Known as EfficientViT, the technology leverages a vision transformer, an adaptation of the transformer architecture originally developed for natural language processing, which encodes image patches into tokens and generates an attention map to capture context.
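To sketch the general idea behind linear-complexity attention (a minimal NumPy illustration of ReLU-style linear attention, not the exact EfficientViT module): standard attention materialises an N x N score matrix over the N tokens, whereas the linear variant regroups the same matrix products so that the N x N matrix is never formed and cost grows linearly with N.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the N x N score matrix makes the cost
    # quadratic in the number of tokens N.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # ReLU-style linear attention: associativity lets us compute
    # K^T V (a d x d matrix) first, so cost grows linearly with N.
    Qp, Kp = np.maximum(Q, 0), np.maximum(K, 0)
    kv = Kp.T @ V                    # d x d, independent of N
    z = Qp @ Kp.sum(axis=0) + eps    # per-token normaliser
    return (Qp @ kv) / z[:, None]

rng = np.random.default_rng(0)
N, d = 16, 8  # 16 tokens (image patches), 8-dimensional features
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # same output shape as softmax attention
```

The two functions produce different weightings, but the output shape and the role of attention are the same; the payoff is that the linear variant's cost scales with the token count N rather than N squared.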
When deployed on a mobile device, it performs up to nine times faster than previous models while maintaining or surpassing their accuracy.
The resulting hardware-friendly architecture makes EfficientViT suitable for a wide range of devices, including virtual reality headsets and autonomous vehicle edge computers.
In tests using datasets for semantic segmentation, EfficientViT outperformed other popular vision transformer models on Nvidia GPUs, achieving high speedups while maintaining or improving accuracy.
According to the researchers, this has implications not only for autonomous vehicles but also for various computer-vision tasks such as image classification.
“Our work shows that it is possible to drastically reduce the computation so this real-time image segmentation can happen locally on a device,” said Song Han, an associate professor at the Department of Electrical Engineering and Computer Science (EECS) and a member of the MIT-IBM Watson AI Lab.