Why Autonomous Driving Needs More Than Advanced ML Models

Can large language models like ChatGPT solve autonomous driving?

As AI capabilities advance rapidly, it's tempting to ask whether cutting-edge models like GPT-4 or computer vision models like SAM and DINOv2 could be directly applied to autonomous driving (AD). While these models demonstrate impressive capabilities, there remains a significant gap between what they can do and what's actually needed to solve AD. This article explores why human feedback remains essential for developing safe, reliable autonomous systems.

Put differently: “Can’t we just put GPT-4 into a car and call our self-driving vehicle done? Looking at this, it seems to understand traffic well!”

Prompt to GPT-4:

Can you provide a detailed description of the image that could be used by an autonomous vehicle to safely navigate that road scenario? Please, focus on the safety relevant aspects.

 

[Image: camera view of an urban intersection shown to GPT-4 (ML perspective)]

Response:
  • The scene depicts an urban intersection with multiple lanes in each direction. Proper lane adherence will be crucial for safety.
  • The traffic lights at the nearest intersection are red, indicating a stop requirement for vehicles approaching from this viewpoint. An autonomous vehicle should come to a complete stop upon detecting this.
  • Overhead, there are tram or trolleybus power lines. Any vehicle system should be aware of these infrastructures, especially if the vehicle is tall or carrying oversized loads.
  • On the left, next to the road, there's a barrier or guardrail, suggesting that there might be ongoing construction or a specific area pedestrians/vehicles should stay clear of. Notably, there are portable toilets positioned near the guardrail. This can be an indicator of prolonged construction work in the area.
  • A pedestrian is visible on the left sidewalk, walking parallel to the road. As they are near the crosswalk, the vehicle should monitor their movement for any potential sudden crossings.
  • There are several pedestrians on the right, waiting at the crosswalk. The vehicle should remain stopped until the lights change and ensure the path is clear of any crossing pedestrians before proceeding.
  • A vehicle is seen in the bottom left corner, suggesting that there might be adjacent lanes the autonomous vehicle should be aware of. Monitoring the behavior of adjacent vehicles is vital, especially in scenarios where sudden lane changes might occur.
  • Multiple high-rise buildings surround the intersection. While these don't directly impact navigation, they could affect GPS signals or other sensors that rely on open skies. The system should be calibrated to handle such urban canyon scenarios.
  • In the background, there's a mix of vehicular traffic, including buses. An autonomous vehicle should be prepared for the unpredictable movement of larger vehicles, especially buses that may stop frequently to pick up or drop off passengers.
  • There are various street signs, including one indicating a pedestrian crossing. Recognizing and adhering to these signs is essential for safety.
  • Lastly, the road appears dry, indicating regular traction conditions, but the system should always be prepared to assess road conditions in real-time.

Multi-Modal Sensor Data Requires Specialized Processing

Large language models excel with text data, and recent models can process images. However, autonomous driving demands something more complex: real-time fusion of multi-modal sensor data.

AD systems rely on multiple cameras capturing overlapping views, combined with lidar and often radar sensors. Understanding the vehicle's surroundings requires processing hundreds of images per second alongside 3D point cloud data—all with precise temporal and spatial alignment. A typical 5-second sliding window with 10 frames per second from 4 cameras yields 200 images that must be interpreted together, considering exact timestamps and sensor positions.
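To make that bookkeeping concrete, here is a minimal Python sketch of the metadata a single synchronized window implies. The class and field names are illustrative assumptions, not any particular stack's data model:

```python
from dataclasses import dataclass

@dataclass
class CameraFrame:
    camera_id: str
    timestamp_ns: int        # capture time in nanoseconds
    image_path: str          # reference to the raw image
    extrinsics: list         # 4x4 sensor-to-vehicle transform, row-major

@dataclass
class LidarSweep:
    lidar_id: str
    start_timestamp_ns: int  # first point in the sweep
    end_timestamp_ns: int    # last point in the sweep
    points_path: str         # reference to the point cloud file

def images_in_window(window_s: float, fps: int, num_cameras: int) -> int:
    """Number of images that must be interpreted together."""
    return int(window_s * fps * num_cameras)

# 5 s window, 10 frames per second, 4 cameras -> 200 images, each with its
# own timestamp and mounting pose, interleaved with lidar sweeps.
assert images_in_window(5, 10, 4) == 200
```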

Cameras alone cannot provide the depth accuracy needed for safe AD. Lidar's 3D point clouds offer unmatched distance measurement capabilities compared to 2D camera images. The true power emerges from sensor fusion—combining camera pixel granularity with lidar depth information.
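As an illustration of that fusion step, the sketch below applies the standard pinhole projection to place lidar points into a camera image, assuming a known lidar-to-camera transform and intrinsic matrix. It is a simplified example, not any vendor's pipeline:

```python
import numpy as np

def project_lidar_to_image(points_xyz, T_cam_from_lidar, K):
    """Project lidar points (N, 3) into pixel coordinates.

    points_xyz:       points in the lidar frame
    T_cam_from_lidar: 4x4 rigid transform from lidar frame to camera frame
    K:                3x3 camera intrinsic matrix
    Returns (M, 2) pixel coordinates and (M,) depths for the points that
    lie in front of the camera.
    """
    n = points_xyz.shape[0]
    homogeneous = np.hstack([points_xyz, np.ones((n, 1))])       # (N, 4)
    points_cam = (T_cam_from_lidar @ homogeneous.T).T[:, :3]     # (N, 3)
    in_front = points_cam[:, 2] > 0.0                            # drop points behind the camera
    points_cam = points_cam[in_front]
    projected = (K @ points_cam.T).T                             # (M, 3)
    pixels = projected[:, :2] / projected[:, 2:3]                # perspective divide
    return pixels, points_cam[:, 2]
```

Each projected pixel now carries a measured depth, which is exactly the per-pixel distance information cameras alone cannot provide.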

While research explores applying ML models to lidar point clouds (including LidarCLIP, PointCLIP, and recent work at Kognic), significant work remains to achieve the accuracy needed for safe autonomous operation.

Sensor fusion introduces additional complexity. Models must handle different data types (images and point clouds) while accounting for temporal offsets—lidar points are collected sequentially via scanning patterns, meaning each point has a different timestamp that rarely matches camera image timestamps. At highway speeds, millisecond-level precision in trajectory interpretation can mean the difference between safe navigation and collision.
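A simplified sketch of why those per-point timestamps matter: the function below shifts each lidar point to a common reference time, assuming constant ego velocity and no rotation (real systems use full ego-motion, and often object-motion, models). At 30 m/s, ignoring a 50 ms offset already displaces a point by 1.5 m:

```python
import numpy as np

def motion_compensate(points_xyz, point_timestamps_ns, ref_timestamp_ns, ego_velocity):
    """Shift each lidar point to where it would appear at ref_timestamp_ns.

    points_xyz:          (N, 3) points in the vehicle frame at capture time
    point_timestamps_ns: (N,) per-point capture times in nanoseconds
    ref_timestamp_ns:    timestamp to align to (e.g. a camera exposure)
    ego_velocity:        (3,) ego velocity in the vehicle frame, m/s
    """
    dt_s = (ref_timestamp_ns - point_timestamps_ns)[:, None] * 1e-9
    # A static world point drifts backwards in the vehicle frame while the
    # ego vehicle moves forward, so subtract the ego displacement.
    return points_xyz - ego_velocity[None, :] * dt_s
```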

This is where human feedback becomes essential. Training ML models to handle this complexity requires high-quality annotated data that captures the nuances of multi-modal sensor fusion. Human judgment guides models in learning what matters and how to interpret ambiguous situations correctly.

Real-Time Performance Is Non-Negotiable

Traffic situations evolve in fractions of a second. Consider the EuroNCAP scenario where a child runs into the street from behind parked cars.

In this scenario, approximately 1.5 seconds pass from when the pedestrian clears the parked car until potential collision. Most of that time must be devoted to braking or steering—detection and decision-making must happen within fractions of a second. Every time.
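A back-of-the-envelope budget makes the point. The speed and deceleration below are illustrative assumptions, not Euro NCAP test parameters:

```python
# Rough timing budget for a child-behind-parked-cars scenario.
speed_kmh = 30.0                      # assumed urban speed
speed_ms = speed_kmh / 3.6            # ~8.3 m/s
time_to_impact_s = 1.5                # pedestrian becomes visible -> potential impact
deceleration_ms2 = 8.0                # hard braking on dry asphalt

braking_time_s = speed_ms / deceleration_ms2               # ~1.0 s just to stop
perception_budget_s = time_to_impact_s - braking_time_s    # what is left for everything else

print(f"Braking alone takes ~{braking_time_s:.2f} s; "
      f"~{perception_budget_s:.2f} s remain for detection, fusion and decision.")
```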

While ML models can process individual images relatively quickly, the margins disappear once the system must handle image sequences, run sensor fusion, and compare candidate trajectories. What is required is not processing that is fast on average, but processing that meets a hard deadline on every single cycle.

Meeting these performance requirements demands not just fast models, but also efficiently annotated training data. This is why productive annotation platforms matter—they enable rapid iteration on model development, helping teams identify edge cases and validate performance faster.

Safety Requires Consistent, Reliable Performance

Unlike language models where errors typically don't cause physical harm, mistakes in AD systems can result in injury or death. This demands a fundamentally different standard of reliability.

AD systems cannot afford missed detections or false positives. Consider the contrast with medical ML applications like X-ray screening: if an ML model narrows 1,000 X-rays down to the top 100 most critical cases, it reduces manual review by 90% while still catching important cases. This represents significant value even with imperfect accuracy.

In AD, the equation differs. Historically, driver assistance systems prioritized avoiding false interventions (like unnecessary braking) over catching every scenario, since human drivers remained responsible. With full autonomy, systems must handle both—neither missed detections nor false interventions are acceptable.
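A rough illustration of why "almost always right" is not enough at fleet scale; every number below is an assumption chosen only to show the orders of magnitude:

```python
# Assumed figures for illustration only, not measurements from any real system.
relevant_objects_per_hour = 1_000       # safety-relevant detections per driving hour
fleet_driving_hours_per_day = 100_000   # a moderately sized fleet

for miss_rate in (1e-2, 1e-4, 1e-6):
    missed_per_day = relevant_objects_per_hour * fleet_driving_hours_per_day * miss_rate
    print(f"miss rate {miss_rate:.0e} -> ~{missed_per_day:,.0f} missed detections per day")
```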

Achieving uniform performance across all weather conditions, traffic situations, and edge cases remains a fundamental challenge. This is precisely why human feedback is invaluable: it helps identify where models struggle, validates model behavior in critical scenarios, and ensures systems align with human expectations of safe behavior.

Human Feedback Accelerates Machine Learning

Recent ML advances, while not yet a complete solution for AD, point toward an important insight: large models demonstrate impressive reasoning capabilities that could prove valuable for handling unexpected situations in autonomous systems.

One of AD's main challenges is the long tail of unexpected events—scenarios that fall outside the intended operational domain. As models become more capable through self-supervision and simulation, the bottleneck shifts from raw data volume to something more nuanced: human judgment.

Physics doesn't determine when to yield. Simulations don't decide what counts as "safe enough." These require human judgment—understanding social negotiations in traffic, evaluating trajectory safety, and aligning system behavior with human expectations. This judgment becomes training data that machines can learn from, and evidence that regulators and customers can trust.

This is why efficient human feedback matters more than ever. As annotation tasks evolve from drawing bounding boxes to evaluating behavior, ranking trajectories, and validating model decisions, the productivity of annotation platforms directly impacts how quickly autonomous systems can learn and improve. Getting the most annotated data for your budget isn't just about cost efficiency—it's about accelerating the entire development cycle.
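To make that concrete, here is a minimal sketch of how a single human trajectory ranking could become a training signal, in the spirit of preference-based learning. The record layout and loss are illustrative assumptions, not a description of the Kognic platform:

```python
from dataclasses import dataclass
import math

@dataclass
class TrajectoryPreference:
    scenario_id: str
    preferred_trajectory_id: str   # judged safer / more natural by the annotator
    rejected_trajectory_id: str
    annotator_id: str

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: rewards the model for scoring the
    human-preferred trajectory higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(score_rejected - score_preferred)))

# The loss shrinks as the model's margin in favour of the preferred
# trajectory grows, and grows when the model disagrees with the human.
print(preference_loss(2.0, 0.5))   # ~0.2: model agrees with the human judgment
print(preference_loss(0.5, 2.0))   # ~1.7: model disagrees
```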

At Kognic, we're focused on making human feedback as productive as possible for autonomy development. We recognize that machines learn faster with human feedback, and our platform is designed to maximize the value of every human judgment—whether that's annotating sensor-fusion data, curating critical scenarios, or validating model performance. As the field evolves toward new tasks like trajectory evaluation and behavioral assessment, productive human feedback will remain essential for building autonomous systems that are not just technically capable, but safe, trusted, and aligned with human intent.