End-to-End Autonomy and the Human Feedback Problem

Written by Björn Ingmansson | Jul 1, 2026 6:30:00 AM

Every team building autonomous driving systems is watching the same architectural shift. End-to-end models, which map directly from sensor inputs to driving outputs, are moving into serious production programs across the industry.¹

The assumption most teams make is this: if the model learns everything from sensor input to driving output, the data problem simplifies. You no longer need to label each stage of a perception pipeline separately. Collect more raw driving data, and the model figures out the rest.

That assumption is true in theory. In practice, the cost of compensating for the absence of structured annotation — in raw data volume and compute — is prohibitive for most programs. E2E architectures still benefit from annotation. What changes is what gets added on top. The annotation stack for E2E is a superset of what modular systems require, not a replacement for it.

What End-to-End Actually Means

"End-to-end" is used loosely right now, and the imprecision matters for data strategy decisions.

The real definition is architectural: an end-to-end driving system is trained as a single differentiable system, rather than as a composition of separately optimized modules. You train the whole thing together, backpropagating through the entire network from output back to sensor input.

This is different from the classical modular pipeline, where each stage has an explicit representation and its own training objective:

Perception: detect and classify objects, lanes, drivable area
Prediction: estimate future states of dynamic agents
Planning: generate a safe trajectory given the scene state
Control: translate the trajectory into actuator commands

What "end-to-end differentiable" does not mean is that all intermediate representations disappear. Most production E2E systems have detection heads — specialized output branches for objects, road markings, and traffic signs — even though the entire system trains as one differentiable network. Think of it as: the model outputs steering directives and trajectory, and also outputs vehicles, signs, and lane markings as artifacts of the same forward pass.
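One way to picture a single system with detection heads is as a multi-task objective: the driving output and each auxiliary head contribute a loss term, and their weighted sum is what gets backpropagated through the shared network. The sketch below is illustrative only. The head names, loss values, and weights are assumptions, not taken from any specific production system.

```python
def joint_loss(planning_loss, aux_losses, aux_weights):
    """Combine the planning loss with weighted auxiliary head losses.

    aux_losses and aux_weights are dicts keyed by head name
    (the head names used below are hypothetical).
    """
    missing = set(aux_losses) - set(aux_weights)
    if missing:
        raise KeyError(f"no weight given for heads: {sorted(missing)}")
    return planning_loss + sum(aux_weights[h] * aux_losses[h] for h in aux_losses)


# Hypothetical per-batch loss values from one forward pass.
aux = {"objects": 0.8, "lanes": 0.4, "signs": 0.2}
weights = {"objects": 0.5, "lanes": 0.5, "signs": 0.25}

total = joint_loss(planning_loss=1.0, aux_losses=aux, aux_weights=weights)
print(f"joint loss: {total:.2f}")
```

In this sketch, setting every auxiliary weight to zero leaves only the planning term, which corresponds to the strict case where nothing but the driving output shapes the learned representation.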

Some advocate for strict end-to-end: a pure pixels-to-torque model with no explicit detection heads at all, on the premise that the network will learn whatever internal representations it needs from data alone. The safety argument against strict E2E is practical. If a system cannot produce a 2D bounding box around a stop sign, how do you verify it has the internal representation needed to drive safely past one? Keeping intermediate signals, and the human feedback that shapes them, is how teams maintain inspectability as the architecture becomes more integrated.

Why E2E Makes the Data Problem Harder

The key tension in E2E data strategy is a cost trade-off, not a simplification.

Option A: collect far more raw driving data and spend far more compute, and train a model with no intermediate representation. Technically possible, but budget-prohibitive for all but a handful of organizations.

Option B: inject inductive bias through annotation to accelerate convergence and reduce the compute bill.

A useful analogy: imagine a library where the shelves are completely unordered. To find a specific author, you scan every book. Now imagine the books are sorted alphabetically. The same library, but you go straight to the right shelf. Inductive bias is that ordering — it tells the model where to look, rather than making it discover the structure from scratch.

This is why the E2E data problem is harder than it first appears, not easier. You cannot pull out just the planning module and train it in isolation on a curated dataset. Training data has to cover the full path from sensor input to driving output, and failures span the whole network. And because learned representations are distributed throughout the system rather than localized to a module, inspection is more expensive.

The numbers make this concrete. Across leading AV programs, only 1–7% of raw collected driving data survives filtering to become usable training scenes — the rest is discarded as duplicate scenarios, misaligned sensor data, or insufficient coverage.³ At that rate, trying to compensate for the absence of annotation purely through data volume is a losing equation for most programs.
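To see what a 1–7% yield means for collection budgets, the arithmetic can be sketched as below. The 1,000-hour target and the use of hours as the unit are assumptions made purely for illustration; only the yield range comes from the figures cited above.

```python
def raw_hours_needed(usable_hours, yield_fraction):
    """Raw collection hours required to retain `usable_hours` after filtering."""
    if not 0 < yield_fraction <= 1:
        raise ValueError("yield_fraction must be in (0, 1]")
    return usable_hours / yield_fraction


# Hypothetical target: 1,000 hours of usable training scenes.
target_hours = 1_000
for yield_fraction in (0.01, 0.03, 0.07):
    needed = raw_hours_needed(target_hours, yield_fraction)
    print(f"yield {yield_fraction:.0%}: {needed:,.0f} raw hours")
```

At a 1% yield, the hypothetical 1,000 usable hours require 100,000 raw hours; even at 7%, the requirement is still roughly 14,300 raw hours.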

Three specific challenges follow.

Failure attribution is harder.

A vehicle that brakes unexpectedly on a clear road may be responding to a pattern in the scene that no explicit module ever computed. Diagnosing whether the problem is in the scene representation, the association between features and decisions, or the quality of the training signal requires more investigation than checking module outputs.

Coverage must include behavior, not just perception.

In a modular system, edge case diversity is primarily perceptual: unusual geometries, rare lighting, adversarial sensor configurations. In an E2E system, behavioral diversity matters as much. The model must have seen enough examples of the right decision in a given behavioral context. A scenario involving an ambiguous pedestrian gesture at an intersection requires behavioral coverage, not just visual coverage.

Quality of the training signal matters more than volume.

More data is not the primary lever. The data must convey not just what happened in a scene, but what the correct response should have been. Volume without signal quality produces a noisier model at a higher training cost, not a better one.

What Human Feedback Looks Like in E2E

Human feedback does not disappear in end-to-end architectures. The annotation stack expands. Different data types get different emphasis at different stages: perception labels come first, trajectory and language signals follow as the stack matures.

Traditional perception labels — 3D and 2D bounding boxes, lane geometry, traffic sign classification — remain valuable as inductive bias. They accelerate convergence and preserve inspectability. Teams that cut perception annotation when moving to E2E typically find that training time increases and model behavior becomes harder to verify. The economics favor keeping them.

On top of the perception foundation, E2E architectures introduce two additional annotation types.

Trajectory preference ranking.

This maps directly onto what the planning module handled in a modular stack. Given a scenario, annotators either rank candidate trajectories by preference, or redraw the driven path when it was suboptimal — for example, shifting a trajectory slightly left when the vehicle clipped a curb. The annotator is expressing a behavioral judgment, not a geometric one. Signal quality matters more than scale here: inconsistent or low-confidence rankings produce a noisy model that propagates uncertainty through the entire training loop.
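One simple way to check ranking consistency is to expand each annotator's ranking into pairwise preferences and measure how often two annotators agree. This is a minimal sketch, not a description of any particular tool: the trajectory IDs are hypothetical, and pairwise agreement is just one of several possible consistency measures.

```python
from itertools import combinations


def pairwise_preferences(ranking):
    """Expand a best-first ranking into the (preferred, rejected) pairs it implies."""
    return set(combinations(ranking, 2))


def pairwise_agreement(ranking_a, ranking_b):
    """Fraction of trajectory pairs that two annotators ordered the same way."""
    if set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must cover the same trajectories")
    prefs_a = pairwise_preferences(ranking_a)
    if not prefs_a:
        raise ValueError("need at least two trajectories to compare")
    return len(prefs_a & pairwise_preferences(ranking_b)) / len(prefs_a)


# Hypothetical candidate trajectories for one scenario, ranked best-first.
annotator_1 = ["yield_then_go", "creep_forward", "proceed_now"]
annotator_2 = ["creep_forward", "yield_then_go", "proceed_now"]

agreement = pairwise_agreement(annotator_1, annotator_2)
print(f"pairwise agreement: {agreement:.2f}")
```

Here the two annotators disagree on one of three pairs, so agreement is two thirds. Scenarios with low agreement could be routed to review before they enter training, rather than being averaged into the signal.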

Language reasoning.

This is the fastest-growing area of demand in E2E annotation right now. The approach borrows the generalizability of large language models — which already reason competently about traffic scenes — and fine-tunes that reasoning for hard driving cases. Examples include legitimately breaking a rule (crossing a solid line to pass a vehicle broken down in the lane, or navigating around construction), and driver-system dialogue ("find the closest parking spot," "drive more carefully through here," "stop at the toll booth — I'm paying cash"). The model learns from these labeled examples how to handle the edge cases that rules alone cannot cover.

A challenge that applies to both preference ranking and language annotation is hindsight bias. If an annotator sees the outcome of a situation before judging the decision made at the time, their annotation reflects the outcome rather than the information actually available at the decision point. Task design must control for this: show annotators only the scene up to the decision point before recording the judgment.
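The control described above amounts to cutting the scene at the decision point before it reaches the annotator. A minimal sketch, assuming each frame carries a timestamp and that the decision time is known for the scenario (the `Frame` type and its field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    timestamp_s: float  # seconds since the start of the scene
    sensor_ref: str     # hypothetical pointer to the stored sensor data


def frames_for_annotation(frames, decision_time_s):
    """Return only frames at or before the decision point, in time order."""
    visible = [f for f in frames if f.timestamp_s <= decision_time_s]
    return sorted(visible, key=lambda f: f.timestamp_s)


scene = [Frame(t, f"frame_{i}") for i, t in enumerate([0.0, 1.0, 2.0, 3.0])]
shown = frames_for_annotation(scene, decision_time_s=1.5)
print([f.sensor_ref for f in shown])
```

Frames after the decision time never reach the task interface in this sketch, so the annotator cannot see how the situation resolved.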

How This Works in Practice

Market demand right now is concentrated on language grounding as a feedback source in E2E stacks. Nvidia's Alpamayo paper, published in November 2025, accelerated this shift and demonstrated the practical case for combining traditional annotation with language-based feedback signals.² Simpler "reason about driving" tasks are also emerging alongside the more complex language grounding work.

Honest take: despite a year of VLA hype, no one is running language-reasoning E2E in production yet. It is still a full research project. That is what makes it interesting — teams that secure the annotation capability now are positioned for the transition when it comes.

The formalization of E2E annotation is also becoming visible at a policy level. In 2026, Korea's Ministry of Science and ICT published the country's first national E2E data guidelines for autonomous driving, establishing labeling specifications for perception labels, trajectory, and driving intent as an official government standard.³ When national regulators begin codifying annotation requirements for a technology, the transition from research to production is no longer a question of if.

For teams starting this work, the approach that is working is: collect raw driving data and run self-supervised learning on it as a base, then layer inductive bias incrementally. Start with 3D boxes and lane geometry, add trajectory preference ranking as the planning signal, then build language reasoning capability as task complexity grows.

Start with approximately 10,000 annotations, look for measurable system improvement, then scale up while monitoring the scaling curve. Knowing where the curve flattens tells you whether to invest in more data or shift to a different annotation type.
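Monitoring the scaling curve can be as simple as tracking the metric gain per 1,000 added annotations between checkpoints and noting where it drops below a threshold. The sketch below assumes a single scalar evaluation metric where higher is better. The checkpoint values and the threshold are made up for illustration.

```python
def marginal_gains(checkpoints):
    """Metric gain per 1,000 added annotations between consecutive checkpoints.

    checkpoints: list of (annotation_count, metric), sorted by annotation_count.
    Returns a list of (annotation_count_at_end_of_step, gain_per_1k).
    """
    gains = []
    for (n0, m0), (n1, m1) in zip(checkpoints, checkpoints[1:]):
        if n1 <= n0:
            raise ValueError("annotation counts must be strictly increasing")
        gains.append((n1, (m1 - m0) / (n1 - n0) * 1000))
    return gains


def first_plateau(checkpoints, min_gain_per_1k):
    """Annotation count at which the gain first falls below the threshold, else None."""
    for count, gain in marginal_gains(checkpoints):
        if gain < min_gain_per_1k:
            return count
    return None


# Hypothetical evaluation results at increasing annotation counts.
history = [(10_000, 0.62), (20_000, 0.70), (40_000, 0.74), (80_000, 0.75)]
plateau_at = first_plateau(history, min_gain_per_1k=0.001)
print(f"curve flattens by: {plateau_at}")
```

With these made-up numbers the gain per 1,000 annotations falls below the threshold on the step to 80,000, which is the kind of signal that would prompt shifting to a different annotation type rather than buying more of the same.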

What This Means for Your Data Strategy

If you are moving toward end-to-end or already running E2E and modular systems in parallel, the practical implications for data operations are significant.

Annotation tooling must expand, not replace.

Tools designed for bounding boxes and semantic segmentation are not built for trajectory preference annotation or language reasoning tasks. The interface, task design, and quality control workflows are different. But this is a layer added to your existing stack, not a replacement for it. Tooling that cannot rank candidate driving decisions, or capture reasoning behind a behavioral judgment, cannot support E2E training at scale.

Domain expertise in annotation matters more, not less.

A geometric annotation can be verified visually. A reasoning annotation that is internally coherent but wrong about driving intent is much harder to catch in QA. The quality bar shifts from geometric accuracy to behavioral understanding. Annotators need genuine domain knowledge, not general-purpose labeling experience.

Plan for parallel operation.

Most production programs do not switch architectures overnight. E2E systems require reasoning and preference data; modular systems still require perception labels. The annotation operation must support both simultaneously, and the datasets are not interchangeable.

Coverage strategy must account for behavioral diversity.

Rare behaviors matter more than rare geometries. A dataset may have thousands of night scenes in fog and very few examples of correct behavior at an ambiguous unsignalized crossing. Closing behavioral coverage gaps requires a different collection and prioritization methodology than closing perceptual coverage gaps.
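A behavioral coverage check can start as a simple count of scenes per behavior tag against a minimum. This is a sketch under stated assumptions: the tag names, the idea that each scene carries a set of tags, and the threshold are all hypothetical.

```python
from collections import Counter


def coverage_gaps(scene_tags, required_behaviors, min_examples):
    """Return required behaviors with fewer than `min_examples` tagged scenes.

    scene_tags: iterable of tag sets, one per scene.
    """
    counts = Counter(tag for tags in scene_tags for tag in tags)
    return {b: counts[b] for b in required_behaviors if counts[b] < min_examples}


# Hypothetical dataset: plenty of perceptual variety, little behavioral variety.
scenes = [
    {"night", "fog"},
    {"night", "fog", "unsignalized_crossing"},
    {"construction_bypass"},
    {"night"},
]
required = ["unsignalized_crossing", "construction_bypass", "pedestrian_gesture"]

gaps = coverage_gaps(scenes, required, min_examples=2)
print(gaps)
```

The output lists each under-covered behavior with its current count, which gives collection and prioritization a concrete target list.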

The Shift That Matters

End-to-end models do not reduce the role of human judgment in the training loop. They expand it.

The shift from modular to end-to-end is not a shift away from traditional annotation. It is a shift to doing more. Teams that treat E2E as a reason to cut annotation investment will find their models inheriting inconsistency from a thinner training signal. Teams that extend their annotation stack — keeping perception labels as inductive bias, adding trajectory preference ranking, building language reasoning capability — are building the data foundation that end-to-end models actually require.

Kognic's platform supports this expanded stack today. Language Grounding — Kognic's capability for adding structured natural language reasoning to driving scenarios — is the annotation type seeing the highest demand as teams prepare for language-reasoning E2E. It runs alongside the full perception and trajectory stack: 3D and 2D bounding boxes, lane geometry, traffic sign classification, and trajectory preference ranking. Teams can build the superset dataset E2E requires without switching annotation platforms as requirements expand from perception to behavior to reasoning.

The data problem in end-to-end autonomy is more interesting than it first appears. Getting it right is where model performance differences compound over time.

If you are evaluating your data strategy for an E2E transition, talk to the Kognic team.

Frequently Asked Questions

What is end-to-end autonomy?

End-to-end autonomy refers to a driving system where the entire network, from raw sensor inputs to driving outputs, is trained as a single differentiable system, rather than as separate modules for perception, prediction, planning, and control. Most production E2E systems retain intermediate output heads for objects, lanes, and signs while training the full system end-to-end.

How does end-to-end differ from a modular autonomous driving pipeline?

In a modular pipeline, each stage has its own explicit representation and training objective, and failures can be isolated to a specific module. End-to-end systems collapse these stages into a single trained network. The advantage is that the whole system optimizes together; the challenge is that failures span the full network and are harder to attribute and debug.

Does end-to-end autonomy eliminate the need for human annotation?

No. The annotation stack expands rather than shrinks. Traditional perception labels (bounding boxes, lanes, signs) remain valuable as inductive bias that accelerates convergence. E2E architectures add new types on top: trajectory preference ranking and language reasoning. Teams that cut perception annotation when moving to E2E typically find training becomes slower and model behavior harder to verify.

What is trajectory preference ranking, and why does it matter for E2E?

Preference ranking asks annotators to compare candidate driving trajectories for the same scenario and indicate which is better, or to redraw a trajectory when the recorded one was suboptimal. This maps onto what the planning module handled in a modular stack, and shapes how the E2E model learns to make driving decisions. Signal quality matters more than volume: inconsistent rankings produce a noisy model.

What role does language reasoning play in E2E training data?

Language reasoning annotation teaches models to handle edge cases that rules alone cannot cover, such as legitimately breaking a traffic rule to navigate around a hazard, or responding to driver instructions in natural language. It borrows the generalizability of large language models and fine-tunes that reasoning for real driving scenarios. This is currently the fastest-growing area of demand in E2E annotation, accelerated by research such as Nvidia's Alpamayo paper.

¹ End-to-end neural network architectures for autonomous driving are in production at multiple developers. Tesla announced FSD v12 as a fully end-to-end neural network in October 2023, replacing over 300,000 lines of C++ control code with a single neural network pipeline. Via Think Autonomous. Wayve published GAIA-1, a generative world model for autonomous driving, in September 2023. Via arXiv:2309.17080. Note: language-reasoning E2E systems remain a research project as of 2026; no production deployments confirmed.
² Nvidia's Alpamayo R1 paper, "Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail," demonstrated combining traditional perception annotation with chain-of-causation language reasoning in E2E training stacks. Published November 2025. Via arXiv:2511.00088.
³ Korea Ministry of Science and ICT, "Autonomous Driving E2E Data Construction Guidelines and Specification Definition Document," v1.3, 2026. Prepared by ETRI under the Autonomous Driving Technology Development Innovation Program (RS-2024-00341055). Data efficiency figures (Motional 2–5%, Waymo 1–3%, Cruise 3–7%) cited therein. Original announcement via Daum.
