Common Sense: A Pragmatic Approach to Unit Testing Perception Systems

In today's rapidly evolving autonomous vehicle landscape, validating perception system safety has become a defining challenge. At Kognic, we believe machines learn faster with human feedback — and that this feedback must be cost-efficient, reliable, and scalable. While various metrics exist for evaluating perception systems, most lack clear passing criteria or leave significant room for interpretation regarding deployment readiness. This ambiguity creates substantial challenges in confidently validating whether an autonomous perception system is truly road-ready.

In this article, we'll illustrate the limitations of traditional metrics with real-world examples and propose a straightforward yet powerful unit testing approach that establishes clear thresholds for when autonomous vehicles should not be deployed. This approach demonstrates how targeted human feedback can accelerate machine learning while ensuring safety-critical validation.

The Problem with Traditional Metrics

Picture an autonomous vehicle approaching an intersection as shown above. The vehicle uses a camera-based semantic segmentation network that identifies elements such as roads, vehicles, pedestrians, cyclists, and signage. Below, we'll examine annotated ground-truth images from the Cityscapes dataset.

Mean Average Precision (mAP) is a standard metric for evaluating semantic segmentation. It computes precision and recall across confidence thresholds for each class, summarizes each class as an average precision (AP), and averages those per-class values into a single score. The fundamental limitation of mAP for deployment decisions is the absence of a definitive threshold indicating when a system is safe enough for real-world operation. Without clear passing criteria, teams struggle to determine when their models have learned enough from human feedback to be deployment-ready.
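
To make this concrete, here is a minimal sketch of how a per-class AP and the resulting mAP could be computed, assuming detections have already been matched against ground truth. Real benchmarks such as KITTI or COCO use interpolated variants of this calculation, so the sketch is purely illustrative. Note that nothing in the computation itself tells you what score is "good enough".

```python
# Minimal sketch: average precision (AP) for one class and mAP over classes.
# Assumes each detection is already matched to ground truth and represented
# as (confidence, is_true_positive). Benchmarks typically use interpolated
# AP, so treat this as illustrative only.
from typing import List, Tuple


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """Area under the precision-recall curve for a single class."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = prev_recall = 0.0
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle added under the PR curve
        prev_recall = recall
    return ap


def mean_average_precision(per_class_ap: List[float]) -> float:
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)
```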

For instance, a perception system might excel at detecting vehicles and pedestrians but struggle with animals, significantly reducing its overall mAP score. Conversely, a system with high precision for identifying roads, vegetation, and vehicles but poor cyclist detection would still achieve an impressive mAP score — potentially masking critical safety gaps that human reviewers should flag.

Beyond the metric itself, annotation quality inconsistencies further complicate evaluation. In the scenes below, tram line annotations appear in only three of five frames, with one annotation being particularly coarse. This illustrates why consistent, high-quality human feedback is essential — inconsistent ground truth leads to unreliable validation, regardless of the metric used.

Let's examine another case, focusing now on the metric rather than the input data. Below are the current benchmarking results for object detection on the KITTI dataset. The precision-recall curves reveal that even for "easy" pedestrian cases, maximum recall remains below 80%. Vehicle detection fares better, with average precision ranging from 93.35% (difficult cases) to 96.64% (easy cases). But the critical question remains: is this performance sufficient for safe deployment? The challenge lies in establishing precise performance thresholds that human judgment can validate before real-world deployment.
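
To illustrate the gap, here is a sketch of what a naive per-class deployment gate on those AP numbers might look like. The class names and threshold values are placeholders we invented for illustration, and choosing defensible values for them is precisely the unresolved problem.

```python
# Sketch of a naive per-class deployment gate on AP scores. The thresholds
# below are invented placeholders, not values endorsed by KITTI or anyone
# else; picking them in a defensible way is the open problem.
DEPLOYMENT_THRESHOLDS = {"car": 0.90, "pedestrian": 0.85, "cyclist": 0.85}


def ready_for_deployment(per_class_ap: dict) -> bool:
    """True only if every safety-relevant class clears its threshold."""
    return all(
        per_class_ap.get(cls, 0.0) >= threshold
        for cls, threshold in DEPLOYMENT_THRESHOLDS.items()
    )


# Example: the car class clears its placeholder threshold, but a single weak
# class (pedestrian) fails the whole gate.
print(ready_for_deployment({"car": 0.9335, "pedestrian": 0.78, "cyclist": 0.88}))  # False
```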

Our Proposal: 'Common Sense'

We believe that before any autonomous vehicle enters public roads, it must demonstrate competence in handling fundamental driving scenarios through rigorous unit testing. Our approach uses human feedback to define what is truly critical in a driving scenario: coarse scene annotations with clearly defined polygons that the autonomous system must recognize.

This "common sense" approach demonstrates a key principle: human feedback is most valuable when focused on safety-critical elements. Rather than annotating every pixel, we direct human attention to the scenarios and objects that matter most for safe operation. Breaking validation into discrete tests enables clear performance expectations and makes the best use of scarce human judgment.

Consider our intersection example: it's absolutely essential that the system recognizes both the pedestrian on the left and vehicles crossing the intersection. These represent critical objects requiring detection. Human annotators identify these safety-critical elements, enabling machines to learn what truly matters for autonomous operation.

Equally important is the clear path ahead. The system must correctly identify this area as object-free. False detections in this zone could trigger unnecessary emergency braking, creating both passenger discomfort and potential safety hazards. This demonstrates why testing for both object presence and absence is crucial — and why human feedback must validate both what the model sees and what it shouldn't see.
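
A minimal, self-contained sketch of such presence and absence checks might use coarse polygons and point-in-polygon tests, for example with shapely. The coordinates and detections below are toy values; a real test would load the polygons from the annotated scene and the detections from the model output.

```python
# Self-contained sketch of presence/absence checks against coarse scene
# polygons, using shapely for point-in-polygon tests. The polygons and the
# detections are toy values chosen for illustration.
from shapely.geometry import Point, Polygon

# Coarse annotations: a zone where a pedestrian MUST be found,
# and a clear-path zone where NOTHING may be reported.
pedestrian_zone = Polygon([(2, 5), (4, 5), (4, 9), (2, 9)])
clear_path_zone = Polygon([(5, 0), (8, 0), (8, 12), (5, 12)])

# Hypothetical perception output: (class_label, x, y) in the same ground frame.
detections = [("pedestrian", 3.1, 6.4), ("car", 12.0, 7.5)]


def zone_contains_class(zone: Polygon, label: str) -> bool:
    return any(cls == label and zone.contains(Point(x, y))
               for cls, x, y in detections)


def zone_is_empty(zone: Polygon) -> bool:
    return not any(zone.contains(Point(x, y)) for _, x, y in detections)


# Presence test: the pedestrian on the left must be detected inside its zone.
assert zone_contains_class(pedestrian_zone, "pedestrian")
# Absence test: the clear path ahead must contain no detections at all.
assert zone_is_empty(clear_path_zone)
```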