Deploying Interactive Machine Learning at Scale

At Kognic, we leverage machine learning to deliver faster, higher-quality annotations for autonomous vehicle perception systems. Deploying ML tools in production—especially for interactive, on-demand use—presents unique challenges. Here's how we built fast, scalable ML services that support thousands of users while maintaining the responsiveness critical to effective human-AI collaboration.

The Challenge: Annotating Autonomous Vehicle Data

Kognic's platform enables precise annotation of sensor data for self-driving cars, including 2D images and 3D LiDAR point clouds. LiDAR (Light Detection and Ranging) uses pulsed lasers to measure distances, creating point clouds where each point represents a real-world surface detection.

Views of a LiDAR point cloud scene: overview, side view, close-up, and the raw point cloud

Manual annotation requires drawing 3D bounding boxes around objects like vehicles and pedestrians, adjusting dimensions frame-by-frame, and maintaining consistent object IDs throughout video sequences. This process is extremely time-consuming and resource-intensive.

Machine learning transforms this workflow. Our ML-powered applications suggest bounding boxes automatically, shifting the annotator's role from creation to validation and refinement. This human-in-the-loop approach reduces annotation time while maintaining the quality standards autonomous systems demand.

"Our goal is to augment human expertise with ML that suggests bounding boxes, enabling annotators to focus on quality validation rather than manual creation."


Design Constraints: The 400ms Challenge

We designed an interactive solution where annotators indicate a region of interest and a neural network predicts the bounding box. To feel truly responsive, the system has to answer within the Doherty threshold: response times under 400ms. Our initial design used WebSocket connections to maintain per-user state and pre-download point cloud data, so that each prediction request paid only for inference time.
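As a rough sketch of that first design (the endpoint, message format, and helper functions below are illustrative assumptions, not our actual protocol), a stateful WebSocket handler in aiohttp might look like this:

```python
# Hypothetical sketch of the initial stateful design: one WebSocket per
# annotator, with the scene's point cloud pre-downloaded at connect time
# so that each prediction pays only for inference.
import json
from aiohttp import web, WSMsgType

async def download_point_cloud(scene_id: str) -> bytes:
    ...  # placeholder: fetch the full point cloud from blob storage

def predict_bounding_box(point_cloud: bytes, roi: dict) -> dict:
    ...  # placeholder: run the neural network on the region of interest

async def annotate(request: web.Request) -> web.WebSocketResponse:
    ws = web.WebSocketResponse()
    await ws.prepare(request)

    # Per-connection state: the whole point cloud is downloaded up front
    # and held for the lifetime of the session.
    point_cloud = await download_point_cloud(request.query["scene_id"])

    async for msg in ws:
        if msg.type == WSMsgType.TEXT:
            roi = json.loads(msg.data)["region_of_interest"]
            await ws.send_json({"box": predict_bounding_box(point_cloud, roi)})
    return ws
```

That per-connection download is exactly what broke under load: many simultaneous connections meant many full point clouds being fetched and held in memory at once.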

Initial architecture: user and ML app interaction

The solution worked well for small user groups, but would it scale? Load testing revealed critical limitations. When multiple clients connected simultaneously, the application would begin downloading all their point clouds at once, causing timeouts and reliability issues. We lacked graceful connection management and efficient load distribution across instances.

Sketch of user and ML app interaction

"That's cool, but is it scalable?" — and so our architecture journey began.


The Solution: Stateless Architecture

Rather than managing persistent connections with hard constraints on instance sizing, we reimagined the architecture. A stateless service doesn't track which point clouds users will access or have accessed; it simply processes each prediction request independently. Each request includes the point cloud file path; the service downloads the necessary data, runs the prediction, and moves on. This eliminated connection management complexity, though it required on-demand file downloads for every request.
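A minimal sketch of that contract, assuming illustrative route and field names (the helpers are placeholders, not our production code):

```python
# Minimal sketch of a stateless prediction endpoint: every request is
# self-contained, so any instance behind the load balancer can serve it.
from aiohttp import web

async def fetch_points(path: str, roi: dict) -> bytes:
    ...  # placeholder: download the point cloud data for this request

def run_inference(points: bytes, roi: dict) -> dict:
    ...  # placeholder: predict a 3D bounding box for the region

async def predict(request: web.Request) -> web.Response:
    body = await request.json()
    roi = body["region_of_interest"]
    points = await fetch_points(body["point_cloud_path"], roi)
    return web.json_response({"bounding_box": run_inference(points, roi)})

app = web.Application()
app.add_routes([web.post("/predict", predict)])

if __name__ == "__main__":
    web.run_app(app)  # no sticky sessions or connection routing required
```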

Optimizing download speed required leveraging the point cloud file structure itself. Our point cloud representation uses an octree structure, dividing space into hierarchical octants for progressive rendering. By identifying exactly which files contain the region of interest, we download only the necessary subset rather than the entire point cloud. This optimization brought download times to 300-400ms—within striking distance of our 400ms total response budget.
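As a sketch of that selection step, assume a Potree-style layout where each node covers a cube, its children split the cube into eight octants, and each file name encodes the path of octant indices from the root (the naming scheme and data layout here are assumptions for illustration):

```python
# Sketch: walk the octree and collect only the tiles whose octant
# intersects the region of interest, instead of fetching the whole cloud.
from dataclasses import dataclass
from typing import Iterator, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    name: str      # e.g. "r042": path of octant indices from the root
    origin: Vec3   # minimum corner of the cube this node covers
    size: float    # edge length of the cube

def overlaps(node: Node, roi_min: Vec3, roi_max: Vec3) -> bool:
    # Axis-aligned box intersection between the node's cube and the ROI.
    return all(
        node.origin[i] < roi_max[i] and roi_min[i] < node.origin[i] + node.size
        for i in range(3)
    )

def tiles_for_roi(root: Node, roi_min: Vec3, roi_max: Vec3,
                  max_depth: int) -> Iterator[str]:
    # Depth-first walk yielding the names of files worth downloading.
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if not overlaps(node, roi_min, roi_max):
            continue
        yield node.name              # this tile contains part of the ROI
        if depth == max_depth:
            continue
        half = node.size / 2
        for i in range(8):           # descend into the eight child octants
            x, y, z = node.origin
            stack.append((Node(
                name=node.name + str(i),
                origin=(x + half * (i & 1),
                        y + half * ((i >> 1) & 1),
                        z + half * ((i >> 2) & 1)),
                size=half,
            ), depth + 1))
```

Only the yielded tiles are fetched, which is what shrinks the download to a small subset of the full point cloud.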

"We deploy ML algorithms on CPU instances to serve them cost-effectively at scale."

We further reduced latency by serving the model with aiohttp, an asynchronous web framework that lets the service keep handling requests while downloads and predictions are in progress, rather than waiting idle. Consistent with our infrastructure philosophy, we deploy ML algorithms on CPU instances, balancing performance with cost efficiency.
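Async I/O keeps the event loop free while a download is in flight, but the model call itself is CPU-bound and would block the loop; a common pattern, sketched here with placeholder helpers, is to hand inference to an executor:

```python
# Sketch: await the download natively, then run the CPU-bound model in
# a thread pool so the event loop can keep serving other requests.
import asyncio
import aiohttp
from aiohttp import web

def run_inference(raw: bytes) -> dict:
    ...  # placeholder: CPU-bound neural network call

async def predict(request: web.Request) -> web.Response:
    url = (await request.json())["point_cloud_url"]
    # A production service would reuse one ClientSession per app rather
    # than creating one per request; inlined here to stay self-contained.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            raw = await resp.read()              # non-blocking download

    loop = asyncio.get_running_loop()
    box = await loop.run_in_executor(None, run_inference, raw)
    return web.json_response({"bounding_box": box})
```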

Stateless architecture: updated user and ML app interaction

Updated sketch of user and ML app interaction

Production-Ready Infrastructure

With core performance solved, we focused on production reliability. We designed comprehensive error handling, implemented monitoring for critical metrics (inference times, download times), and configured alerting to catch performance degradation early.
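As an example of that instrumentation (a sketch using the prometheus_client library; the metric names are illustrative and the post doesn't prescribe a particular monitoring stack):

```python
# Sketch: latency histograms for the two metrics we watch most closely.
# Alert thresholds on these catch performance degradation early.
from prometheus_client import Histogram, start_http_server

DOWNLOAD_SECONDS = Histogram(
    "pointcloud_download_seconds", "Time spent fetching point cloud tiles")
INFERENCE_SECONDS = Histogram(
    "inference_seconds", "Time spent predicting a bounding box")

def run_inference(points):
    ...  # placeholder model call

def timed_inference(points):
    with INFERENCE_SECONDS.time():  # records wall-clock duration on exit
        return run_inference(points)

start_http_server(9090)  # expose /metrics for the alerting stack to scrape
```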

The scaling challenge that initiated this redesign became trivial. Load testing determined CPU and memory requirements, enabling us to implement horizontal scaling policies triggered by CPU utilization thresholds.


Scalable ML solution architecture

Key Learnings

The stateless approach delivers significant advantages: easy horizontal scaling, simplified infrastructure, and no need for complex connection routing. There is always a tradeoff between vertical and horizontal scaling: with unlimited resources, vertical scaling would be the simpler option, but at scale it becomes impractical, which is what pushed us to rethink the architecture around stateless, horizontally scalable services.

That said, pure statelessness has tradeoffs. Since point clouds typically contain multiple objects requiring annotation, our current architecture downloads portions of the same point cloud repeatedly. Future iterations may introduce limited state management to optimize this, carefully balancing the complexity of state against the simplicity of horizontal scaling.

Interactive ML applications with humans in the loop demand exceptional responsiveness and user experience. As we expand to more annotators, their feedback continuously shapes our development priorities around latency and usability.

At Kognic, we continuously integrate state-of-the-art ML techniques into our annotation platform to accelerate delivery without compromising quality. Key lessons for future ML deployments:

  • Identify early whether ML services require interactive use, as this fundamentally shapes architecture decisions
  • Proactively identify and address performance bottlenecks before scaling
  • Maximize the value of human-in-the-loop collaboration—instant feedback on predictions, region-of-interest guidance, and temporal context in video sequences provide invaluable ground truth for model improvement

Building scalable ML infrastructure is essential to delivering the quality and efficiency autonomous vehicle development demands.