Common Sense: A Pragmatic Approach to Unit Testing Perception Systems

In today's rapidly evolving autonomous vehicle landscape, validating perception system safety has become the defining challenge of our generation. At Kognic, we specialize in enabling the human feedback necessary for safe and reliable autonomy. While various metrics exist for evaluating perception systems, most lack clear passing criteria or leave significant room for interpretation regarding deployment readiness. This ambiguity creates substantial challenges in confidently validating whether an autonomous perception system is truly road-ready.

In this article, we'll illustrate the limitations of traditional metrics with real-world examples and propose a straightforward yet powerful unit testing approach that establishes clear thresholds for when autonomous vehicles should not be deployed.

The Problem with Traditional Metrics

Picture an autonomous vehicle approaching an intersection as shown above. The vehicle uses a camera-based semantic segmentation network that identifies elements such as roads, vehicles, pedestrians, cyclists, and signage. Below, we'll examine annotated ground truth images from the Cityscapes dataset.

Mean Average Precision (mAP) is a standard metric for evaluating semantic segmentation and object detection models. For each class, it computes average precision across a range of confidence thresholds, then averages these per-class scores into a single number. The fundamental limitation of mAP for deployment decisions is the absence of a definitive threshold that indicates when a system is safe enough for real-world operation.

For instance, a perception system might excel at detecting vehicles and pedestrians but struggle with animals, significantly reducing its overall mAP score. Conversely, a system with high precision for identifying roads, vegetation, and vehicles but poor cyclist detection would still achieve an impressive mAP score, potentially masking critical safety gaps.
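To make this concrete, here is a minimal sketch of how averaging per-class scores can hide exactly this kind of weakness. The per-class AP values are made up for illustration; they are not measurements from any real system.

```python
# Hypothetical per-class average precision values, for illustration only.
per_class_ap = {
    "road": 0.98,
    "vegetation": 0.95,
    "vehicle": 0.93,
    "pedestrian": 0.90,
    "cyclist": 0.35,  # cyclists are barely detected
}

# mAP is simply the mean over classes, so four strong classes
# drown out the one that matters most in this scene.
mAP = sum(per_class_ap.values()) / len(per_class_ap)
print(f"mAP = {mAP:.2f}")  # 0.82, which looks respectable on paper
```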

Beyond the metric itself, annotation quality inconsistencies further complicate evaluation. In the scenes below, tram line annotations appear in only three of five frames, with one annotation being particularly coarse. While this class may have limited safety impact, it still influences the overall metric score.

Let's examine another case, focusing now on the metric itself rather than the input data. Below are the current benchmarking results for object detection on the KITTI dataset. The precision-recall curves reveal that even for "easy" pedestrian cases, maximum recall remains below 80%. Vehicle detection fares better, with average precision ranging from 93.35% (difficult cases) to 96.64% (easy cases). But the critical question remains: is this performance sufficient for safe deployment? The challenge lies in establishing precise performance thresholds for real-world deployment.
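For readers who want the mechanics, the sketch below shows one simplified way to compute average precision as the area under a precision-recall curve. It is not the KITTI evaluation code, which uses its own interpolation scheme and difficulty buckets; the inputs `scores`, `is_true_positive`, and `num_ground_truth` are assumed to come from a prior detection-matching step.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve (simplified step integration).

    scores: confidence of each detection.
    is_true_positive: 1 if the detection matched a ground-truth object, else 0.
    num_ground_truth: total annotated objects, including ones never detected.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    matches = np.asarray(is_true_positive, dtype=float)[order]
    tp = np.cumsum(matches)
    fp = np.cumsum(1.0 - matches)
    precision = tp / (tp + fp)
    recall = tp / num_ground_truth
    # Sum precision weighted by the recall gained at each detection.
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

# Example: three of five annotated pedestrians found -> recall tops out at 0.6.
print(average_precision([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 1], num_ground_truth=5))
```

Whatever the exact interpolation, the resulting single number says nothing about whether a recall ceiling of 60% or 80% is acceptable for deployment.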

Our Proposal: 'Common Sense'

We believe that before any autonomous vehicle enters public roads, it must demonstrate competence in handling fundamental driving scenarios through rigorous unit testing. Our approach uses coarse scene annotations with clearly defined polygons that autonomous systems must recognize. This "common sense" approach leverages human judgment to identify what's truly important in driving scenarios.

Breaking validation into discrete tests enables clear performance expectations. By focusing on critical elements within specific scenes, we can reasonably expect flawless performance in these limited contexts. If a system falls short on these fundamental tests, it's definitively not ready for deployment.
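As a sketch of what such a unit test might look like, consider a semantic segmentation output and two coarse, hand-drawn critical polygons. The class indices, the `coverage` helper, and the 90% threshold are illustrative assumptions, not a description of Kognic's actual tooling.

```python
import numpy as np

# Hypothetical class indices for this sketch.
PEDESTRIAN, VEHICLE = 11, 13

def coverage(pred_mask, polygon_mask, class_id):
    """Fraction of the critical polygon predicted as the expected class."""
    region = polygon_mask.astype(bool)
    return float((pred_mask[region] == class_id).mean())

def test_critical_objects_detected(pred_mask, pedestrian_poly, vehicle_poly):
    # Within each coarse, hand-drawn polygon we expect essentially every
    # pixel to carry the right class; falling short fails the scene outright.
    assert coverage(pred_mask, pedestrian_poly, PEDESTRIAN) > 0.9
    assert coverage(pred_mask, vehicle_poly, VEHICLE) > 0.9
```

Each scene becomes a pass/fail test with an unambiguous threshold, rather than a contribution to an aggregate score.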

Consider our intersection example: it's absolutely essential that the system recognizes both the pedestrian on the left and vehicles crossing the intersection. These represent critical objects requiring detection.

Equally important is the clear path ahead. The system must correctly identify this area as object-free. False detections in this zone could trigger unnecessary emergency braking, creating both passenger discomfort and potential safety hazards. This demonstrates why testing for both object presence and absence is crucial.
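The companion test for the clear path is just as simple to express. Continuing the sketch above with the same hypothetical class ids, it asserts that no obstacle class occupies more than a small fraction of the drivable corridor; the 1% tolerance is an illustrative assumption.

```python
import numpy as np

# Same hypothetical class ids as the earlier sketch.
PEDESTRIAN, VEHICLE = 11, 13
OBSTACLE_CLASSES = {PEDESTRIAN, VEHICLE}  # extend with cyclists, animals, ...

def test_clear_path_is_object_free(pred_mask, clear_path_poly):
    region = clear_path_poly.astype(bool)
    for cls in OBSTACLE_CLASSES:
        false_fraction = float((pred_mask[region] == cls).mean())
        # A few noisy pixels are tolerable; a blob large enough to trigger
        # phantom emergency braking is a hard failure.
        assert false_fraction < 0.01, f"phantom class {cls} in the clear path"
```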