For the past decade, the autonomous vehicle industry has been laser-focused on one metric: annotation throughput. How many bounding boxes per hour? How fast can we label a million frames? The implicit assumption was that more labelled data would solve our problems.
But that assumption is breaking down.
Today's challenge isn't labelling faster—it's finding the right data to label in the first place.
As fleets grow and sensor data accumulates at petabyte scale, autonomy teams face a new reality: the vast majority of collected data is routine and redundant. Highway cruising in perfect weather doesn't teach your model anything new after the ten-thousandth example.
What matters are the edge cases buried in that mountain of data: the construction worker in a high-vis vest stepping between cones, the wheelchair user crossing at dusk, the delivery truck with non-standard markings. These rare scenarios—often representing less than 0.1% of collected data—are where models actually learn and improve.
The bottleneck has shifted from annotation capacity to data curation: the ability to rapidly identify, prioritise, and surface the specific scenarios that will move your model's performance forward.
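To make "identify and prioritise" concrete, here is a minimal sketch of one common curation signal: scoring unlabelled frames by how sparse their neighbourhood is in embedding space, so the long tail surfaces first. The embedding source, the choice of k, and the 0.1% budget are illustrative assumptions, not a prescribed stack.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rarity_scores(embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Score each frame by the mean distance to its k nearest neighbours.

    Frames in sparse regions of embedding space score high; those are
    the candidate edge cases. An illustrative heuristic only; real
    pipelines typically combine several ranking signals.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    distances, _ = nn.kneighbors(embeddings)
    return distances[:, 1:].mean(axis=1)  # drop column 0: distance to self

# Hypothetical usage: embed frames with any visual backbone, then
# surface only the sparsest ~0.1% for review instead of labelling all.
embeddings = np.random.rand(10_000, 512)   # stand-in for real frame embeddings
scores = rarity_scores(embeddings)
budget = max(1, int(len(scores) * 0.001))  # the rare tail: ~0.1% of the data
candidates = np.argsort(scores)[-budget:]  # highest-rarity frames
```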
Many teams today rely on automatic ranking algorithms to filter data before annotation. These algorithms are useful but noisy, as they mix genuinely valuable edge cases with false positives and redundant samples. When teams send top-ranked data directly to annotation without validation, they end up paying full annotation prices for low-value data.
The result? Wasted budget, slower iteration cycles, and models that don't improve where it matters most.
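The economics are easy to sketch. Assuming, purely for illustration, that only 30% of top-ranked samples are genuinely valuable, that full annotation costs $1.00 per frame, and that a quick accept/reject triage judgment costs $0.05, the back-of-the-envelope below shows why skipping validation is expensive.

```python
# Back-of-the-envelope economics of annotating ranked data directly
# versus inserting a cheap triage pass first. All numbers are
# illustrative assumptions, not industry figures.

CANDIDATES = 1_000_000      # top-ranked frames from the automatic ranker
RANKER_PRECISION = 0.30     # fraction that are genuinely valuable (assumed)
ANNOTATION_COST = 1.00      # full annotation, per frame (assumed, USD)
TRIAGE_COST = 0.05          # quick accept/reject verdict, per frame (assumed)

# Option A: send everything straight to annotation.
direct_cost = CANDIDATES * ANNOTATION_COST
direct_waste = CANDIDATES * (1 - RANKER_PRECISION) * ANNOTATION_COST

# Option B: triage first, then annotate only confirmed samples
# (assumes triage verdicts are reliable).
triaged_cost = (CANDIDATES * TRIAGE_COST
                + CANDIDATES * RANKER_PRECISION * ANNOTATION_COST)

print(f"direct:  ${direct_cost:,.0f} (${direct_waste:,.0f} wasted)")
print(f"triaged: ${triaged_cost:,.0f}")
# direct:  $1,000,000 ($700,000 wasted)
# triaged: $350,000
```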
The solution isn't better ranking algorithms alone. It's treating curation itself as a scalable annotation problem: just as we built industrial-scale workflows to label millions of bounding boxes, we now need workflows to validate and triage millions of candidate scenarios.
This means making curation measurable, repeatable, and cost-efficient, so that annotation resources are spent only on confirmed high-value samples. The payoff is a dramatic improvement in both efficiency and model relevance.
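What "measurable and repeatable" might look like in practice, as a hedged sketch: cheap triage verdicts double as an audit of the ranker itself. The `Candidate` structure and field names below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    frame_id: str
    rank_score: float                 # from the automatic ranking algorithm
    verdict: Optional[str] = None     # "valuable" or "redundant", set by triage

def curation_report(batch: list[Candidate]) -> dict:
    """Turn cheap triage verdicts into the two numbers that matter:
    how precise the ranker is, and which frames earn full annotation."""
    judged = [c for c in batch if c.verdict is not None]
    confirmed = [c for c in judged if c.verdict == "valuable"]
    return {
        "ranker_precision": len(confirmed) / max(len(judged), 1),
        "send_to_annotation": [c.frame_id for c in confirmed],
    }

# Hypothetical batch: three quick verdicts measure the ranker and gate spend.
batch = [
    Candidate("f001", 0.97, "valuable"),
    Candidate("f002", 0.95, "redundant"),
    Candidate("f003", 0.94, "valuable"),
]
report = curation_report(batch)  # precision ~0.67; two frames proceed
```

Tracking ranker precision batch over batch turns curation from a one-off filter into a feedback loop: when precision drops, you fix the ranker before spending on annotation.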
This transition from volume-first to value-first data operations marks a significant milestone in the industry's maturity. Teams that master data curation will stretch their annotation budgets further, iterate faster, and improve their models where it matters most.
As foundation models and end-to-end learning mature, the role of human-in-the-loop will continue to evolve—from drawing boxes to teaching judgment, from labelling everything to curating what matters.
The question isn't whether your team can annotate faster. It's whether you can find the right data to annotate in the first place.
That's where tomorrow's competitive advantage lies.