
How We Achieved 7.4x Productivity Gains in ML Data Labeling: A Case Study in Systems Thinking

Written by Olof Wahlström | Jun 4, 2025

Over the last few months, we've increased our ML data labelling productivity by 7.4x — without sacrificing quality or team morale. That's not a typo.

This post outlines the experiments we ran, what worked, what didn't, and how we applied systems thinking, behaviour design, and guided workflows to unlock the latent potential of both people and automation.

 

The Problem: Expert Productivity vs. Team Reality

When we looked at internal performance data, we noticed a massive gap: our most experienced annotators were labelling at 3.5x the speed of the broader team.

Naturally, we asked: Can this gap be closed?

But even more importantly, can it be closed at scale?

 

Experiment 1: Training Like a High-Performance Sport

We took inspiration from the book Peak by Anders Ericsson. The core idea: experts aren't born, they're trained through deliberate practice, fast feedback loops, and constant challenge.

We applied this rigour to our annotation workforce:

  • Expert annotators became coaches.

  • Feedback loops were shortened.

  • Teams were pushed just outside their comfort zones.

Result: Within 3 months, the team not only caught up to expert-level performance — they surpassed it.

But while this worked, it didn't scale: having experts coach individuals one-on-one isn't feasible for large, dynamic teams. So we moved on to systemic solutions.

 

Experiment 2: Embedding Feedback in the Workflow

We reworked our KPI system to make it:

  • Visible (via the homepage dashboard)

  • Socially contextualised (via peer performance benchmarks)

  • Instantly responsive (live updates based on task flow)

This created a behavioural feedback loop that nudged annotators toward improvement without constant human intervention.
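
To make this concrete, here is a rough sketch of the kind of logic behind such a KPI. This is an illustration only, not Kognic's actual implementation; the names, the one-hour window, and the data model are all assumptions.

    # Illustrative sketch only, not Kognic's implementation: a live "shapes per
    # hour" KPI over a sliding window, plus a peer percentile so the dashboard
    # can show where an annotator currently stands.
    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    @dataclass
    class TaskEvent:
        annotator_id: str
        shapes_completed: int
        finished_at: datetime

    def shapes_per_hour(events, annotator_id, window=timedelta(hours=1)):
        """Throughput over the last `window`, recomputed whenever a task completes."""
        cutoff = datetime.now(timezone.utc) - window
        recent = [e for e in events
                  if e.annotator_id == annotator_id and e.finished_at >= cutoff]
        return sum(e.shapes_completed for e in recent) / (window.total_seconds() / 3600)

    def peer_percentile(my_rate, peer_rates):
        """Share of peers currently slower than me (the social-context part)."""
        if not peer_rates:
            return None
        return sum(rate < my_rate for rate in peer_rates) / len(peer_rates)

The mechanics are deliberately simple; what matters is where the numbers live: on the homepage, updated as tasks flow, next to a peer benchmark that makes "good" concrete.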

Result: Productivity gains across the board, with increased job satisfaction and no increase in stress.

Yes — productivity and morale both improved. When people know what "good" looks like and are supported in reaching it, they rise to the occasion.

 

Experiment 3: Making Automation Work

Let's be honest: much of the ML automation in labelling tools feels like a half-finished promise.

We built a "stationary object" automation tool years ago. Technically, it worked, but it wasn't being used effectively. Annotators would default to frame-by-frame adjustments. We had given them a Photoshop toolbox and assumed they'd find the optimal workflow independently.

They didn't.

We implemented guided workflows — UI nudges and context-aware suggestions — to steer users toward the most efficient annotation path, using automation where it made sense.
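
To illustrate what such a nudge can look like in practice, here is a minimal sketch; the rule, the threshold, and the names are assumptions for illustration, not our production logic.

    # Hypothetical nudge rule; the threshold and names are assumptions, not our
    # production logic. If an object's centre has barely moved across the frames
    # annotated so far, suggest the stationary-object tool instead of
    # frame-by-frame adjustments.
    import math

    def looks_stationary(centres, max_drift_m=0.2, min_frames=3):
        """centres: list of (x, y) object positions per frame, in metres."""
        if len(centres) < min_frames:
            return False  # not enough evidence yet
        x0, y0 = centres[0]
        drift = max(math.hypot(x - x0, y - y0) for x, y in centres[1:])
        return drift <= max_drift_m

    # If this returns True, the UI can surface a prompt such as
    # "This object looks stationary - apply its box across all frames?"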

Result: Labelling stationary objects went from 207 seconds down to 24 seconds. That's nearly a 9x improvement from a tool that had sat underutilised for years.

 

Rethinking Auto Labels

Auto labels were supposed to save time. In practice, they didn't. Why?

Because checking and correcting auto labels often takes as much time as manual labelling, especially when the system's accuracy is only 70–90%.
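
A back-of-the-envelope model makes the problem concrete. All timings below are illustrative assumptions, not our measurements: if every auto label has to be inspected from multiple views, and a meaningful share has to be deleted and redrawn, the expected time per object creeps back towards the fully manual time.

    # Illustrative arithmetic only; all timings are assumptions, not measured data.
    def expected_seconds_per_object(accuracy, t_check, t_fix):
        """Every auto label is checked; the wrong ones are deleted and redrawn."""
        return t_check + (1 - accuracy) * t_fix

    t_manual = 30.0   # assumed: labelling one object from scratch
    t_check = 18.0    # assumed: verifying a box from several views is not free
    t_fix = 40.0      # assumed: delete, redraw, re-verify

    print(expected_seconds_per_object(0.9, t_check, t_fix))  # 22.0 s - a modest saving
    print(expected_seconds_per_object(0.7, t_check, t_fix))  # 30.0 s - no saving vs. t_manual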

But what if we treated auto labels not as final output, but as guidance?

We built an interaction model where:

  • Auto labels suggest candidate objects.

  • Users confirm or correct them with smart defaults.

  • The system guides users to optimal views for measurement.

  • Automation fills in the rest via interpolation (sketched below).

Result: When appropriately applied, this workflow led to a 3x productivity gain — even when the base auto labels were "bad."
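
For the interpolation step, here is a minimal sketch, simplified to linear interpolation of box centres. Real tooling also handles size and heading, and none of these names reflect Kognic's actual API: the annotator confirms an object on a few key frames, and automation fills in the frames between them.

    # Minimal sketch: linearly interpolate a confirmed box between two key frames.
    # Names and the Box shape are illustrative, not Kognic's data model.
    from dataclasses import dataclass

    @dataclass
    class Box:
        x: float
        y: float
        z: float

    def interpolate(box_a, box_b, frame_a, frame_b, frame):
        """Estimated box position at `frame`, given confirmed boxes at frame_a and frame_b."""
        t = (frame - frame_a) / (frame_b - frame_a)
        return Box(
            x=box_a.x + t * (box_b.x - box_a.x),
            y=box_a.y + t * (box_b.y - box_a.y),
            z=box_a.z + t * (box_b.z - box_a.z),
        )

    # The annotator confirms frames 0 and 10; frames 1-9 are filled automatically.
    print(interpolate(Box(1.0, 2.0, 0.0), Box(6.0, 4.0, 0.0), 0, 10, 5))
    # Box(x=3.5, y=3.0, z=0.0)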

 

So What Moved the Needle?

It wasn't just automation. It wasn't just training. It wasn't one silver bullet.

It was:

  • Treating performance like a system to be designed

  • Training like a sports team, not a support team

  • Embedding fast feedback loops into the workflow

  • Using guided workflows to unlock automation value

  • Redesigning interaction models for how humans + machines collaborate

These strategies led to a 7.4x increase in shapes annotated per Euro. Real productivity, in production environments.

 

What's Next: QA as the New Bottleneck

As annotation gets faster, QA becomes the next bottleneck. The good news? The same principles we applied here — coaching, behaviour design, guided workflows — are also proving effective in QA.

We're already applying them.

 

Final Thought

If you're building ML labelling pipelines, here's the lesson: it's not just about better models or automation.

It's about how people use them — and whether your systems help them improve over time.

We've found that expert performance is trainable, scalable, and improvable — if you design for it.

 

Want to dive deeper into the workflows or data behind this? I'd be happy to chat. You can find me on LinkedIn or at olof.wahlstrom@kognic.com.