
Why End-to-End Learning Needs Human Expertise More Than Ever

Written by Daniel Langkilde, CEO and Co-founder (daniel@kognic.com) | Feb 10, 2026

The autonomous driving industry spent a decade teaching machines to see. The next chapter is teaching them to reason. And that changes everything about how we train them.

I’ve been thinking about this for months, and I keep arriving at the same conclusion: we must move beyond the data we have produced for the last decade. We have been annotating the “What”, but the frontier is now the “Why”.

For years, autonomous driving followed a modular approach. Separate systems for perception, prediction, and planning. Each module needed its own data — bounding boxes, segmentation masks, trajectory labels. The annotation task was clear: label what is in the scene.

That made sense. It was the right approach for our level of maturity. Without the “What”, the “Why” doesn’t make sense.

End-to-end pushes a lot of the perception work onto the neural networks. The goal is to let the networks figure out which representation of the world lets them drive: find whatever representation makes it possible to imitate the expert trajectories.

The reality is that this approach still benefits from supervision to accelerate learning.

But let's be precise about what "end-to-end" means in practice. The goal is to let neural networks find representations that let them imitate expert drivers. That's what makes driving smoother and better — the network discovers features that work, not features that a human engineer assumed would work.

Supervision accelerates this. You're not replacing the network's learning — you're helping it quickly grasp critical concepts it would otherwise need millions of miles to figure out on its own. Think of it as the difference between learning to drive by trial and error versus learning to drive with an instructor who can explain why you yield at this specific intersection.
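
To make that concrete, here is a minimal sketch of what "imitation plus supervision" can look like in training code. The model, tensor shapes, and loss weighting are hypothetical placeholders, not any production stack; the point is only that an auxiliary supervised term sits alongside the imitation objective instead of replacing it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical end-to-end planner: fused sensor features in, future trajectory out,
# plus an auxiliary head for a supervised concept (e.g. "must yield here").
class TinyPlanner(nn.Module):
    def __init__(self, feat_dim=128, horizon=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.traj_head = nn.Linear(256, horizon * 2)   # (x, y) waypoints
        self.yield_head = nn.Linear(256, 2)            # yield / no-yield logits

    def forward(self, feats):
        h = self.backbone(feats)
        return self.traj_head(h), self.yield_head(h)

model = TinyPlanner()
feats = torch.randn(32, 128)               # stand-in for fused sensor features
expert_traj = torch.randn(32, 20)          # expert waypoints to imitate
yield_label = torch.randint(0, 2, (32,))   # supervised concept label

pred_traj, pred_yield = model(feats)
imitation_loss = F.mse_loss(pred_traj, expert_traj)       # learn from the expert
concept_loss = F.cross_entropy(pred_yield, yield_label)   # supervision as a shortcut

# Supervision accelerates learning; it does not replace imitation.
loss = imitation_loss + 0.1 * concept_loss
loss.backward()
```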

And realistically? Production stacks will be an ensemble of techniques for the next several years. Pure end-to-end, pure modular — those are poles on a spectrum, not a binary choice. Most teams shipping real products are somewhere in between, and that's fine. The data challenge exists regardless of where you sit on that spectrum.

The "why" problem

Here's what I find genuinely interesting about this shift. It doesn't eliminate the need for human expertise. It makes it more critical.

In the modular world, annotation was mechanical. What is that object? Where is it? How fast is it moving? Important questions — but fundamentally about describing scenes. You could train people to do this relatively quickly.

Reasoning annotation is different. Consider a pedestrian stepping off the curb at an intersection. A perception system needs to detect them and estimate trajectory. Fine. An end-to-end model needs to understand why the vehicle should yield — the pedestrian has right of way, they're committed to crossing, maintaining speed would create an unsafe situation.

That chain — observation to judgment to decision — is what these models need to learn. And it can only come from people who understand driving behaviour deeply.

Most "reasoning" datasets aren't

Open a typical driving dataset that includes textual descriptions and you'll find things like:

👉 "The ego vehicle should be cautious and watch out for pedestrians"

👉 "Sunny weather and wide roads" — cited as the reason for a driving decision

A model trained on "be cautious" learns nothing actionable. It's the annotation equivalent of telling a student driver "just be careful out there."

There's a deeper problem here, and it's architectural. Transformers — the models at the heart of every serious autonomous driving stack right now — are fundamentally correlation engines. They're extraordinary at finding patterns in data. But correlation is not causation.

Without structured causal data, these models memorize. They learn that this specific scenario leads to this specific action. That works until the scenario changes. A pedestrian approaches from an angle the model hasn't seen. A construction zone creates a lane configuration that doesn't match the training distribution. The model that memorized can't generalize. The model that learned causal reasoning can.

Causality is the big unlock. And it won't emerge from more data, bigger models, or longer training runs. It has to be taught explicitly.

What you actually need: structured causal reasoning. A chain that connects what was observable → why it matters → what to do → how it's done. Every claim grounded in evidence. Every decision from a closed taxonomy — no vague language allowed.
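
One way to picture that structure is as a small schema. A minimal sketch, with field names of my own choosing rather than any particular dataset's format:

```python
from dataclasses import dataclass

@dataclass
class CausalChain:
    """One structured reasoning trace: observable evidence through to action."""
    observation: str        # what was observable before the decision point
    causal_factor: str      # why it matters for the ego vehicle
    decision: str           # drawn from a closed taxonomy, never free text
    action: str             # how it's done, verifiable against the actual trajectory
    evidence: list[str]     # references grounding every claim above
```

The field names don't matter; the constraints do. The decision slot only accepts values from a closed list, and every claim has to point back at evidence.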

NVIDIA's Alpamayo R1 research showed what happens when you get this right: 12% improvement in planning accuracy on challenging scenarios, 35% reduction in close encounters, 45% improvement in reasoning quality.

The gap between good and bad reasoning data isn't incremental. It's the difference between a model that can handle edge cases and one that can't.

What rigorous reasoning annotation actually looks like

This is where language becomes the bridge. By adding structured language — not free-form descriptions, but grounded causal reasoning — you get a way to explain causality to the models. Language is the medium through which human causal understanding becomes machine-learnable signal.

But it's not as simple as asking annotators to "describe the scene." (I wish it were.) It requires methodology — structure that prevents hindsight bias, enforces precision, and produces reasoning traces models can actually learn from.

Alpamayo R1 laid out one of the most rigorous approaches to date — a five-step Chain of Causation framework. It's worth understanding in detail, because it illustrates the gap between what most people call reasoning data and what actually works.

Clip selection — not all driving footage is worth annotating. This is a point that sounds obvious but has massive cost implications. The methodology focuses on moments with explicit driving decisions: a yield, a lane change, a stop. Reactive (the lead vehicle brakes suddenly) or proactive (preparing for a construction zone ahead). Empty highway cruising? Teaches a reasoning model nothing. Skip it.
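
A sketch of how that selection rule might be encoded, with a hypothetical decision_events field standing in for whatever metadata a real pipeline exposes:

```python
# Keep only clips that contain an explicit driving decision, reactive or proactive.
# Empty cruising carries no reasoning signal worth paying annotators for.
def worth_annotating(clip: dict) -> bool:
    return len(clip.get("decision_events", [])) > 0

clips = [
    {"id": "a", "decision_events": ["yield"]},        # pedestrian at crosswalk
    {"id": "b", "decision_events": []},               # empty highway cruising
    {"id": "c", "decision_events": ["lane_change"]},  # preparing for construction
]
selected = [c for c in clips if worth_annotating(c)]  # keeps "a" and "c" only
```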

Keyframe identification — pinpointing the exact moment of decision within each clip. Everything before the keyframe is history (what the driver could observe). Everything after is future (what actually happened). This boundary matters enormously.

Critical components — from history only. What objects, traffic controls, road conditions were observable before the decision point? This prevents a pernicious problem: citing future events as reasons for past decisions. You can't say "the driver braked because the pedestrian crossed" if the pedestrian hadn't started crossing yet at the decision point. Sounds obvious. Almost nobody enforces it.
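
A sketch of how that rule can be enforced mechanically, assuming each cited observation carries a timestamp and each clip has a keyframe time (the names are illustrative):

```python
# Reject any trace that cites evidence from at or after the decision point:
# everything from the keyframe onward is "future" and cannot justify the decision.
def evidence_is_valid(evidence_timestamps: list[float], keyframe_t: float) -> bool:
    return all(t < keyframe_t for t in evidence_timestamps)

# The keyframe (decision point) is at t=12.0 s. The pedestrian started crossing
# at t=12.4 s, so citing the crossing as the reason for braking is hindsight bias.
assert not evidence_is_valid([12.4], keyframe_t=12.0)
assert evidence_is_valid([11.2, 11.8], keyframe_t=12.0)
```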

Driving decision — closed taxonomy. Instead of free text (which produces the vague language problem), decisions are selected from specific categories. Longitudinal: yield, follow lead vehicle, stop for constraint. Lateral: lane keeping, lane change, nudge. Every decision maps to a verifiable behaviour. No more "be cautious."
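
A sketch of what a closed taxonomy looks like in code. The category names follow the examples above; splitting them into two enums is my own choice:

```python
from enum import Enum

class LongitudinalDecision(Enum):
    YIELD = "yield"
    FOLLOW_LEAD_VEHICLE = "follow_lead_vehicle"
    STOP_FOR_CONSTRAINT = "stop_for_constraint"

class LateralDecision(Enum):
    LANE_KEEPING = "lane_keeping"
    LANE_CHANGE = "lane_change"
    NUDGE = "nudge"

# Annotators pick from these members; "be cautious" simply isn't a valid value.
```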

Finally, everything assembles into a Chain of Causation — observation → causal factor → decision → action.

Here's what one actually looks like:
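
For the pedestrian scenario from earlier, an illustrative trace in this format might read as follows. The wording, field names, and numbers are my own construction for illustration, not actual Alpamayo R1 output:

```python
example_trace = {
    "observation": "Pedestrian at the curb of the marked crosswalk, ~15 m ahead, "
                   "oriented toward the road and stepping off as the ego approaches.",
    "causal_factor": "The pedestrian has right of way and is committed to crossing; "
                     "maintaining current speed would create an unsafe situation.",
    "decision": "yield",  # from the closed longitudinal taxonomy, not free text
    "action": "Release throttle and brake smoothly to a stop before the crosswalk.",
    "evidence": ["pedestrian_track_17", "crosswalk_marking_3", "ego_speed_history"],
}
```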


That's not a description. It's a reasoning trace. Every claim grounded in what was observable, every decision traceable to specific causal factors, every action verifiable against the actual vehicle trajectory.

This isn't RLHF for cars

People keep comparing this to RLHF for ChatGPT. It's not. Not even close.

RLHF for text: humans rank which response sounds better. The expertise required is general literacy and common sense. You can scale this with relatively general annotators.

Reasoning annotation for driving: domain experts who understand traffic dynamics, vehicle physics, regulatory context, and the subtle judgment calls experienced drivers make instinctively. Plus methodology that prevents hindsight bias — annotators who already know what happened next can't be allowed to contaminate their assessment of what should happen.

You cannot crowdsource this. The domain expertise is non-negotiable.

Where we are — honestly

I'll be direct about where Kognic fits. We've spent years building expertise in the what — the most productive annotation platform for autonomous driving. We know perception annotation deeply.

Now we're applying that expertise to the why. We've taken the principles behind research like Alpamayo — structured causation, temporal discipline, closed taxonomies — and we're building them into a practical workflow for production scale. Language as the bridge between human causal understanding and model learning. More on that soon.

Our platform already supports the full spectrum of reasoning annotation: creating causal descriptions, refining model-generated proposals, ranking between alternatives, capturing behaviour trajectories. Teams across the autonomous driving industry are making this shift, and the solutions start with better data.

I won't pretend there's an established standard for what "good" driving reasoning data looks like. There isn't. Not yet. The teams that develop rigorous methodologies now — structured causal traces, temporal bias prevention, consistency verification — will define how the next generation of autonomous vehicles learn to drive.

I'm genuinely not sure most teams have started thinking about this seriously enough. But the ones who have? They're going to move very fast.