Can large language models like ChatGPT solve autonomous driving?
As AI capabilities advance rapidly, it's tempting to ask whether cutting-edge models like GPT-4 or computer vision models like SAM and DINOv2 could be directly applied to autonomous driving (AD). While these models demonstrate impressive capabilities, there remains a significant gap between what they can do and what's actually needed to solve AD. This article explores why human feedback remains essential for developing safe, reliable autonomous systems.
Put differently: "Can't we just put GPT-4 into a car and call our self-driving vehicle done? Looking at this, it seems to understand traffic well!"
Prompt to GPT-4:
Can you provide a detailed description of the image that could be used by an autonomous vehicle to safely navigate that road scenario? Please focus on the safety-relevant aspects.
Response: [GPT-4's answer, shown as an image]
Large language models excel with text data, and recent models can process images. However, autonomous driving demands something more complex: real-time fusion of multi-modal sensor data.
AD systems rely on multiple cameras capturing overlapping views, combined with lidar and often radar sensors. Understanding the vehicle's surroundings requires processing hundreds of images per second alongside 3D point cloud data—all with precise temporal and spatial alignment. A typical 5-second sliding window with 10 frames per second from 4 cameras yields 200 images that must be interpreted together, considering exact timestamps and sensor positions.
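To make the arithmetic concrete, here is a back-of-the-envelope sketch; the rig layout matches the numbers above, and the CameraFrame structure is an illustrative assumption, not a specific API:

```python
from dataclasses import dataclass

@dataclass
class CameraFrame:
    """One image in the sliding window (payload and calibration omitted)."""
    camera_id: str     # which of the cameras captured this frame
    timestamp_ns: int  # capture time, needed for temporal alignment

# Sensor rig from the example above: 4 cameras at 10 frames per second.
NUM_CAMERAS = 4
FPS = 10
WINDOW_SECONDS = 5

frames_per_window = NUM_CAMERAS * FPS * WINDOW_SECONDS
print(frames_per_window)  # 200 images that must be interpreted together
```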
Cameras alone cannot provide the depth accuracy needed for safe AD. Lidar's 3D point clouds measure distance with an accuracy that 2D camera images cannot match. The true power emerges from sensor fusion: combining the pixel-level granularity of cameras with the depth information of lidar.
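As a minimal sketch of one standard fusion step, the snippet below projects lidar points into a camera image so each pixel can be paired with a measured depth. The calibration matrices (intrinsics K, extrinsics R and t) are assumed known from sensor calibration:

```python
import numpy as np

def project_lidar_to_image(points_lidar, R, t, K):
    """Project 3D lidar points (N, 3) into camera pixel coordinates.

    R, t: extrinsic rotation (3, 3) and translation (3,) from the lidar
          frame to the camera frame; K: camera intrinsics (3, 3).
    Returns (M, 2) pixel coordinates and (M,) depths for points in view.
    """
    points_cam = points_lidar @ R.T + t      # lidar frame -> camera frame
    depths = points_cam[:, 2]                # distance along the optical axis
    in_front = depths > 0                    # keep points in front of camera
    pixels = points_cam[in_front] @ K.T      # apply intrinsics
    pixels = pixels[:, :2] / pixels[:, 2:3]  # perspective division
    return pixels, depths[in_front]

# Example with a hypothetical 640x480 pinhole camera:
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[0.0, 0.0, 10.0]])           # one point 10 m straight ahead
print(project_lidar_to_image(pts, R, t, K))  # -> pixel (320, 240), depth 10
```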
While research explores applying ML models to lidar point clouds (including LidarCLIP, PointCLIP, and recent work at Kognic), significant work remains to achieve the accuracy needed for safe autonomous operation.
Sensor fusion introduces additional complexity. Models must handle different data types (images and point clouds) while accounting for temporal offsets—lidar points are collected sequentially via scanning patterns, meaning each point has a different timestamp that rarely matches camera image timestamps. At highway speeds, millisecond-level precision in trajectory interpretation can mean the difference between safe navigation and collision.
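A minimal sketch of how those per-point timestamps are typically handled is ego-motion compensation: each lidar point is moved to where it would have been observed at the camera's timestamp. The pose_at helper below is a hypothetical odometry lookup, not a specific library call:

```python
import numpy as np

def motion_compensate(points, point_ts, pose_at, target_ts):
    """Move each lidar point to where it would appear at target_ts.

    points:    (N, 3) points in the ego frame at their capture times
    point_ts:  (N,) per-point timestamps from the scanning lidar
    pose_at:   callable ts -> 4x4 ego pose in a fixed world frame
               (e.g. interpolated from an IMU/odometry track)
    target_ts: the camera timestamp all points should be aligned to
    """
    target_pose_inv = np.linalg.inv(pose_at(target_ts))
    out = np.empty_like(points)
    for i, (p, ts) in enumerate(zip(points, point_ts)):
        world = pose_at(ts) @ np.append(p, 1.0)  # ego-at-ts -> world
        out[i] = (target_pose_inv @ world)[:3]   # world -> ego-at-target
    return out
```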
This is where human feedback becomes essential. Training ML models to handle this complexity requires high-quality annotated data that captures the nuances of multi-modal sensor fusion. Human judgment guides models in learning what matters and how to interpret ambiguous situations correctly.
Traffic situations evolve in fractions of a second. Consider the EuroNCAP scenario where a child runs into the street from behind parked cars.
In this scenario, approximately 1.5 seconds pass from when the pedestrian clears the parked car until potential collision. Most of that time must be devoted to braking or steering—detection and decision-making must happen within fractions of a second. Every time.
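A rough time budget makes this concrete. All numbers below are illustrative assumptions (urban approach speed, hard braking on dry asphalt), not EuroNCAP figures:

```python
# Illustrative time budget for the child-behind-parked-cars scenario.
speed_kmh = 30                 # assumed urban approach speed
speed_ms = speed_kmh / 3.6     # ~8.3 m/s
decel = 8.0                    # m/s^2, assumed hard braking on dry asphalt

time_to_collision = 1.5                    # s, from the scenario above
braking_time = speed_ms / decel            # ~1.04 s just to stop
budget = time_to_collision - braking_time  # what remains for the software
print(f"Detection + decision budget: {budget:.2f} s")  # ~0.46 s
```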
While ML models can process individual images relatively quickly, the margins disappear when processing sequences, running sensor fusion, and comparing potential trajectories. Systems need blazingly fast performance, not just relatively fast processing.
Meeting these performance requirements demands not just fast models, but also efficiently annotated training data. This is why productive annotation platforms matter—they enable rapid iteration on model development, helping teams identify edge cases and validate performance faster.
Unlike language models where errors typically don't cause physical harm, mistakes in AD systems can result in injury or death. This demands a fundamentally different standard of reliability.
AD systems cannot afford missed detections or false positives. Consider the contrast with medical ML applications like X-ray screening: if an ML model narrows 1,000 X-rays down to the top 100 most critical cases, it reduces manual review by 90% while still catching important cases. This represents significant value even with imperfect accuracy.
In AD, the equation differs. Historically, driver assistance systems prioritized avoiding false interventions (like unnecessary braking) over catching every scenario, since human drivers remained responsible. With full autonomy, systems must handle both—neither missed detections nor false interventions are acceptable.
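To see why handling both is hard, consider a toy detector that scores candidate objects, with synthetic (assumed) score distributions. Lowering the decision threshold trades missed detections for false interventions, and no single threshold eliminates both:

```python
import numpy as np

# Synthetic detector scores for true objects (1) and clutter (0).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.7, 0.15, 1000),   # true objects
                         rng.normal(0.3, 0.15, 1000)])  # clutter
labels = np.concatenate([np.ones(1000), np.zeros(1000)])

for thr in (0.3, 0.5, 0.7):
    pred = scores >= thr
    miss_rate = np.mean(~pred[labels == 1])  # missed detections
    fp_rate = np.mean(pred[labels == 0])     # false interventions
    print(f"thr={thr}: miss={miss_rate:.1%}, false-positive={fp_rate:.1%}")
```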
Achieving uniform performance across all weather conditions, traffic situations, and edge cases remains a fundamental challenge. This is precisely why human feedback is invaluable: it helps identify where models struggle, validates model behavior in critical scenarios, and ensures systems align with human expectations of safe behavior.
Recent ML advances, while not yet a complete solution for AD, point toward an important insight: large models demonstrate impressive reasoning capabilities that could prove valuable for handling unexpected situations in autonomous systems.
One of AD's main challenges is the long tail of unexpected events—scenarios that fall outside the intended operational domain. As models become more capable through self-supervision and simulation, the bottleneck shifts from raw data volume to something more nuanced: human judgment.
Physics doesn't determine when to yield. Simulations don't decide what counts as "safe enough." These require human judgment—understanding social negotiations in traffic, evaluating trajectory safety, and aligning system behavior with human expectations. This judgment becomes training data that machines can learn from, and evidence that regulators and customers can trust.
This is why efficient human feedback matters more than ever. As annotation tasks evolve from drawing bounding boxes to evaluating behavior, ranking trajectories, and validating model decisions, the productivity of annotation platforms directly impacts how quickly autonomous systems can learn and improve. Getting the most annotated data for your budget isn't just about cost efficiency—it's about accelerating the entire development cycle.
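As a hedged illustration of how such a ranking judgment becomes a training signal, here is a Bradley-Terry-style pairwise loss of the kind used in preference learning; the scalar trajectory scores are hypothetical model outputs:

```python
import numpy as np

def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry-style loss for learning from human rankings.

    Given a model's scalar safety scores for two candidate trajectories,
    where a human annotator judged the first one safer, the loss pushes
    the model to score the preferred trajectory higher.
    """
    return -np.log(1.0 / (1.0 + np.exp(-(score_preferred - score_rejected))))

# A human ranked trajectory A over trajectory B; the model disagrees.
print(pairwise_preference_loss(0.2, 0.9))  # high loss -> strong correction
print(pairwise_preference_loss(0.9, 0.2))  # low loss  -> model agrees
```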
At Kognic, we're focused on making human feedback as productive as possible for autonomy development. We recognize that machines learn faster with human feedback, and our platform is designed to maximize the value of every human judgment—whether that's annotating sensor-fusion data, curating critical scenarios, or validating model performance. As the field evolves toward new tasks like trajectory evaluation and behavioral assessment, productive human feedback will remain essential for building autonomous systems that are not just technically capable, but safe, trusted, and aligned with human intent.