What is semantic segmentation for autonomous vehicles?

Semantic segmentation for autonomous vehicles is the task of assigning a class label to every pixel in a camera image, or to every point in a LiDAR point cloud. The output is a dense map where each position is tagged as road, vehicle, pedestrian, building, sky, or another class defined in the model's ontology. AV perception stacks use it for drivable-area estimation, lane geometry, and scene understanding.

What is the difference between semantic segmentation and object detection?

Object detection draws a bounding box around each instance of a class and returns class plus location. Semantic segmentation classifies every pixel and returns a dense map with no separation between instances of the same class. Two pedestrians side by side are one contiguous region in semantic segmentation; they are two distinct boxes in detection. Instance and panoptic segmentation combine both.

What is the difference between 2D and 3D semantic segmentation?

2D semantic segmentation classifies camera image pixels and benefits from dense spatial resolution and mature deep learning architectures (DeepLab, Mask2Former, SegFormer), but provides no depth and degrades in low light or bad weather. 3D semantic segmentation classifies LiDAR point cloud points and provides metric 3D class labels usable directly by the planner, but point clouds are sparser and small objects can be represented by very few points. Production AV stacks use both and fuse them.

What classes does semantic segmentation use for autonomous driving?

Most AV teams start from Cityscapes (19 evaluation classes across flat, construction, object, nature, sky, human, and vehicle categories) or Mapillary Vistas (over 60 classes). Production ontologies extend further with lane-marking subclasses, pedestrian types, emergency vehicles, construction equipment, traffic cones, debris, and weather-related conditions like wet, snowy, or icy road surface.

How is semantic segmentation annotation quality measured?

Four metrics matter: pixel-level accuracy (mean intersection-over-union, mIoU, against reference labels, typically targeting above 0.9 for common classes), boundary precision (boundary IoU or trimap metrics that weight class-edge pixels), class consistency across frames (a pedestrian in frame N must remain a pedestrian in frame N+1), and inter-annotator agreement (mIoU between two independent annotators on the same frame).

24/06 2026

Semantic Segmentation for Autonomous Vehicles: Technical Guide

Semantic segmentation for autonomous vehicles is the task of assigning a class label to every pixel in a camera image, or to every point in a LiDAR point cloud. The output is a dense map where each position is tagged with what it is: road, vehicle, pedestrian, building, sky. AV perception stacks use it to estimate drivable area, separate lane geometry, and reason about background context that detection models miss.

Bounding boxes tell a vehicle what is in a scene. Semantic segmentation tells it what every pixel belongs to. That distinction matters when a car needs to decide if a patch of grey ahead is asphalt, a pothole, a puddle, or a piece of fallen cargo. A detector trained on boxes will not tell you where the drivable surface ends. A segmentation model will.

This guide covers semantic segmentation for autonomous vehicles end to end: what it is, why AV teams rely on it, the classes that matter, 2D versus 3D, annotation techniques, quality bars, common failure modes, and the practices that separate training data that helps from training data that misleads.

What Is Semantic Segmentation

Semantic segmentation is the task of assigning a class label to every pixel in an image, or to every point in a point cloud. The output is a dense map where each position is tagged with what it is: road, vehicle, pedestrian, building, sky.

Unlike object detection, which draws a box around each instance, semantic segmentation does not separate individual objects of the same class. Two pedestrians standing together are one contiguous "pedestrian" region. When teams need to distinguish instances, they use instance segmentation or panoptic segmentation, which combines semantic and instance labels in one output.

For autonomous vehicles, the value of semantic segmentation is spatial completeness. Every pixel receives a label, so no region is left unclassified. The perception stack has a class hypothesis for every part of the image plane or the 3D environment around the vehicle.

Why Semantic Segmentation Matters for AVs

Modern AV stacks use semantic segmentation as the foundation for several safety-critical tasks.

Drivable area and free-space estimation

The core question for a planner is: where can the vehicle go? Semantic segmentation produces a pixel-level mask of road surface versus everything else. Combined with depth from stereo, LiDAR, or a monocular depth model, this yields a free-space map that tells the planner which parts of the scene are safe to occupy.
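As a rough illustration of the first step, the sketch below scans a label map column by column from the bottom of the image and records how far the contiguous road region extends upward. The class id `ROAD` and the function name are hypothetical, and the label map is a plain nested list so the example runs without dependencies. A real stack would combine this image-space boundary with depth to get a metric free-space map.

```python
ROAD = 0  # hypothetical class id; real ids come from the ontology


def free_space_boundary(labels, drivable_ids=frozenset({ROAD})):
    """Return, per image column, the topmost row of the drivable run that
    starts at the bottom edge (None if the bottom pixel is not drivable)."""
    height, width = len(labels), len(labels[0])
    boundary = []
    for col in range(width):
        top = None
        # Walk upward from the bottom row until the first non-drivable pixel.
        for row in range(height - 1, -1, -1):
            if labels[row][col] not in drivable_ids:
                break
            top = row
        boundary.append(top)
    return boundary


label_map = [
    [2, 2, 2],
    [0, 1, 0],
    [0, 0, 1],
]
print(free_space_boundary(label_map))  # [1, 2, None]
```

Scanning from the bottom edge reflects the assumption that the road directly in front of a forward-facing camera appears at the bottom of the image.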

Lane-level understanding

Lane markings, lane boundaries, road edges, and traffic islands all benefit from pixel-level labels. A segmentation model that separates "drivable lane," "adjacent lane," "shoulder," and "off-road surface" supports accurate lateral control and lane change decisions.

Background and scene context

Detection models focus on salient objects. Segmentation covers the rest: buildings, vegetation, sky, fences, walls. This context improves depth estimation, helps distinguish permanent structure from dynamic obstacles, and supports scene understanding tasks that depend on the full visual field.

Cross-modal consistency

When 2D image segmentation is fused with 3D LiDAR segmentation, the system gets a consistent class label for every observable part of the environment, across sensors. That consistency is what allows the rest of the stack to reason about occlusion, tracking, and prediction reliably.

In short, semantic segmentation is what lets an AV say "I understand the whole scene" rather than "I found the objects I was trained to find."

Classes and Ontology for AV Applications

A segmentation ontology defines the classes the model must distinguish. Cityscapes and Mapillary Vistas are the two academic baselines that most AV teams start from, but production ontologies always extend beyond them.

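Cityscapes defines 30 classes in eight groups, of which 19 are used for evaluation. Leaving aside the void group, the main classes are listed below. Classes such as parking, rail track, guard rail, bridge, tunnel, caravan, and trailer are annotated but excluded from the 19-class evaluation set.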

Flat: road, sidewalk, parking, rail track
Construction: building, wall, fence, guard rail, bridge, tunnel
Object: pole, traffic sign, traffic light
Nature: vegetation, terrain
Sky: sky
Human: person, rider
Vehicle: car, truck, bus, motorcycle, bicycle, caravan, trailer, train

Mapillary Vistas extends this to over 60 classes, separating different types of traffic signs, road markings, and barriers.

Production AV teams extend further depending on the operational design domain:

Road surface subclasses: lane markings by type, crosswalks, stop lines, painted arrows, speed bumps
Dynamic actors: pedestrians split into adults, children, construction workers, emergency personnel
Vehicle subclasses: emergency vehicles, construction equipment, two-wheelers with and without riders, trailers
Small objects: traffic cones, barriers, debris, spilled cargo
Conditions: wet road, snow-covered road, ice patches for regions where weather matters

Ontology design is iterative. Start with a baseline, run models, find where class confusion causes planning errors, and split or merge classes accordingly. Too granular creates annotation cost and class imbalance. Too coarse loses information the planner needs.
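One way to find where class confusion concentrates is to count which class pairs are mixed up between reference labels and model predictions. The sketch below is a minimal version of that idea. The class names and the function name are illustrative, and deciding which confusions actually cause planning errors still requires review against planner behavior.

```python
from collections import Counter


def confusion_pairs(reference, predicted):
    """Count (reference_class, predicted_class) pairs where the two maps disagree."""
    pairs = Counter()
    for ref_row, pred_row in zip(reference, predicted):
        for ref, pred in zip(ref_row, pred_row):
            if ref != pred:
                pairs[(ref, pred)] += 1
    return pairs


reference = [["road", "road", "shoulder"],
             ["road", "shoulder", "shoulder"]]
predicted = [["road", "shoulder", "shoulder"],
             ["road", "road", "shoulder"]]

# List the most frequent confusions first.
for (ref, pred), count in confusion_pairs(reference, predicted).most_common():
    print(f"{ref} predicted as {pred}: {count} px")
```

Pairs that dominate this list are candidates for clearer guidelines, a merge, or a split, depending on whether the planner needs the distinction.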

2D vs 3D Semantic Segmentation

Semantic segmentation applies to both camera images and 3D point clouds, but the constraints are different.

2D semantic segmentation operates on image pixels. It benefits from dense spatial resolution, mature deep learning architectures (U-Net, DeepLab, Mask2Former, SegFormer), and large public datasets. The main limitations: 2D segmentation does not give you depth, and performance degrades under motion blur, glare, weather, and low light.

3D semantic segmentation operates on LiDAR point clouds or voxelized volumes. It gives you class labels in metric 3D space, which the planner can use directly without depth inference. The main limitations: point clouds are sparse compared to images, annotations are harder to produce, and small objects can be represented by just a few points.

Production AV stacks typically use both and fuse them. Camera segmentation contributes dense spatial detail. LiDAR segmentation contributes reliable geometry and works in the dark. For a deeper treatment of the 3D side, see understanding 3D semantic segmentation.

The labeling implications: a single scene may need two segmentation passes, with a consistent ontology between them, and annotators who can reason about both modalities. Multi-sensor fusion support is a core requirement of any annotation platform for AV.
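A common building block for keeping the two passes consistent is to project LiDAR points into the camera image and compare classes. The sketch below assumes an ideal pinhole camera with no lens distortion and points already transformed into the camera frame (x right, y down, z forward). The intrinsics `fx`, `fy`, `cx`, `cy` and the function name are placeholders, not values from any particular sensor or platform.

```python
import math


def label_points_from_image(points_cam, label_map, fx, fy, cx, cy):
    """Assign each 3D point the class of the pixel it projects to.

    Returns None for points behind the camera or outside the image.
    """
    height, width = len(label_map), len(label_map[0])
    labels = []
    for x, y, z in points_cam:
        label = None
        if z > 0:  # only points in front of the camera can project
            u = math.floor(fx * x / z + cx)  # image column
            v = math.floor(fy * y / z + cy)  # image row
            if 0 <= u < width and 0 <= v < height:
                label = label_map[v][u]
        labels.append(label)
    return labels


label_map = [["sky"] * 4, ["sky"] * 4, ["road"] * 4, ["road"] * 4]
points = [(0.0, 1.0, 2.0), (0.0, -1.0, 2.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
print(label_points_from_image(points, label_map, fx=2.0, fy=2.0, cx=2.0, cy=2.0))
# ['road', 'sky', None, None]
```

Comparing these projected image labels with the labels assigned directly in the point cloud gives a simple cross-sensor consistency signal, provided both passes share one ontology.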

Annotation Techniques and Tools

Producing high-quality semantic segmentation labels at scale is the hardest operational problem in this domain. The techniques in use today span five generations of tooling.

Polygon-based annotation is the original approach. Annotators draw closed polygons around each region and assign a class. It is precise but slow for vegetation, fences, and other shapes with complex boundaries. Polygon tools remain useful for hard edges and small objects.

Brush and bucket tools let annotators paint pixels directly with a class brush, or fill connected regions with a class bucket. They are fast for broad areas like road, sky, and building facades, but require careful boundary work to match object edges.

Superpixel-assisted annotation pre-segments the image into small regions of similar color and texture, then lets annotators assign classes to groups of superpixels. This accelerates large-area labeling while keeping boundaries reasonably accurate.

Model-assisted pre-labeling runs a segmentation model on the data first and presents the predicted mask to the annotator. The annotator corrects errors rather than labeling from scratch. For Kognic customers, pre-labeling with human-in-the-loop review produces time savings of up to 68% on semantic segmentation annotation.

Foundation model assistance is the current frontier. Models such as Segment Anything (SAM) produce class-agnostic masks that follow object boundaries without being trained on your specific ontology. Annotators click a region, the model proposes a mask, the annotator accepts or corrects it, then assigns a class. SAM handles the geometry; the annotator, optionally assisted by a task-specific model, supplies the class.

The right combination depends on the class, the scene, and the quality bar. High-frequency classes with messy boundaries suit brush plus superpixels. Rare and small classes suit polygons with foundation-model assistance. Everything benefits from pre-labeling when the model is good enough.

Quality Requirements

Semantic segmentation annotation has four quality dimensions that matter for training.

Pixel-level accuracy

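The label for each pixel must match its true class. The usual metric is mean intersection-over-union (mIoU) between the annotation and a reference label set: for each class, the number of pixels assigned to that class in both maps divided by the number assigned to it in either map, averaged over classes. For AV training data, teams typically target mIoU above 0.9 against reference labels on common classes.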
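A minimal sketch of the mIoU computation follows, using nested lists of integer class ids so it runs without dependencies. The function names are illustrative. Classes that appear in neither map are skipped rather than counted as zero, which is one common convention but not the only one.

```python
def per_class_iou(annotation, reference, class_ids):
    """Return {class_id: IoU}, skipping classes absent from both maps."""
    intersection = {c: 0 for c in class_ids}
    union = {c: 0 for c in class_ids}
    for ann_row, ref_row in zip(annotation, reference):
        for ann, ref in zip(ann_row, ref_row):
            if ann == ref:
                if ann in union:
                    intersection[ann] += 1
                    union[ann] += 1
            else:
                # A disagreeing pixel enlarges the union of both classes involved.
                if ann in union:
                    union[ann] += 1
                if ref in union:
                    union[ref] += 1
    return {c: intersection[c] / union[c] for c in class_ids if union[c] > 0}


def mean_iou(annotation, reference, class_ids):
    ious = per_class_iou(annotation, reference, class_ids)
    if not ious:
        raise ValueError("none of the requested classes appear in either map")
    return sum(ious.values()) / len(ious)


annotation = [[0, 0, 1],
              [0, 1, 1]]
reference = [[0, 0, 1],
             [0, 0, 1]]
print(per_class_iou(annotation, reference, class_ids=[0, 1]))
print(round(mean_iou(annotation, reference, class_ids=[0, 1]), 3))  # 0.708
```

Here a single mislabeled pixel out of six pulls mIoU down to about 0.71, which shows how sensitive the metric is on small regions.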

Boundary precision

Errors cluster at class boundaries: the edge of a car against road, the edge of a pedestrian against building. Because boundaries are where planners make decisions, boundary errors carry more cost than interior errors. Quality metrics that weight boundary pixels (for example, boundary IoU or trimap-based metrics) often reveal issues that overall mIoU masks.
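To make the idea concrete, the sketch below computes accuracy only inside a narrow band around the class edges of the reference map. It is a simplified trimap-style measure, not the published Boundary IoU definition, and the `radius` parameter and function names are assumptions chosen for illustration.

```python
def boundary_band(reference, radius=1):
    """Pixels within `radius` (Chebyshev distance) of a class edge in the reference."""
    height, width = len(reference), len(reference[0])
    edge = set()
    for r in range(height):
        for c in range(width):
            # Compare with the right and lower neighbours to find class changes.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < height and cc < width and reference[rr][cc] != reference[r][c]:
                    edge.update({(r, c), (rr, cc)})
    band = set()
    for r, c in edge:
        for rr in range(max(0, r - radius), min(height, r + radius + 1)):
            for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                band.add((rr, cc))
    return band


def band_accuracy(annotation, reference, radius=1):
    """Fraction of band pixels labeled correctly; None if the frame has no edges."""
    band = boundary_band(reference, radius)
    if not band:
        return None
    correct = sum(annotation[r][c] == reference[r][c] for r, c in band)
    return correct / len(band)


reference = [[0] * 5 + [1] * 5]   # edge between columns 4 and 5
annotation = [[0] * 4 + [1] * 6]  # edge drawn one pixel too early
print(band_accuracy(annotation, reference, radius=1))  # 0.75
```

In this toy row the annotation is 90% correct overall but only 75% correct inside the band, which is the kind of gap an aggregate score can hide.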

Class consistency across frames

A pedestrian labeled in frame N must keep the same label in frame N+1 for as long as they remain visible. A road surface tagged as "drivable" must not flicker to "shoulder" between consecutive frames. Temporal consistency is especially important for models that train on video or that depend on stable segmentation for tracking.
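A crude flicker check can be sketched as the fraction of pixels whose class changes between consecutive frames. The version below compares pixels at the same image position, so it assumes a near-static view or frames that have already been compensated for ego motion. The threshold is an arbitrary illustrative value, not a recommended one.

```python
def flicker_rate(prev_labels, next_labels):
    """Fraction of pixels whose class differs between two consecutive frames."""
    changed = total = 0
    for prev_row, next_row in zip(prev_labels, next_labels):
        for prev, nxt in zip(prev_row, next_row):
            total += 1
            changed += prev != nxt
    return changed / total


def flag_flicker(sequence, threshold):
    """Return (frame_index, rate) for each transition whose rate exceeds threshold."""
    flagged = []
    for i in range(len(sequence) - 1):
        rate = flicker_rate(sequence[i], sequence[i + 1])
        if rate > threshold:
            flagged.append((i, rate))
    return flagged


frame_a = [["drivable", "drivable", "shoulder", "shoulder"]]
frame_b = [["drivable", "shoulder", "shoulder", "shoulder"]]
sequence = [frame_a, frame_b, frame_b]
print(flag_flicker(sequence, threshold=0.1))  # [(0, 0.25)]
```

Flagged transitions are candidates for human review rather than automatic rejection, since genuine scene changes also alter labels.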

Inter-annotator agreement

When two annotators label the same frame, how well do their masks agree? Agreement below a threshold (for example, mIoU below 0.85) signals ontology ambiguity, poorly written guidelines, or insufficient training. Measuring agreement on a recurring sample of frames is a reliable way to catch drift before it reaches the training set.

Kognic's platform runs over 90 automated quality checkers built for AV data. For semantic segmentation these include boundary coherence, class distribution checks, temporal consistency checks across sequence frames, and cross-sensor consistency checks when image and point cloud labels are produced together.

Common Challenges

Six challenges show up in every production semantic segmentation program.

Class imbalance

Road and sky dominate the pixel count. Small and rare classes such as traffic cones or strollers appear in a small fraction of frames and occupy a tiny area when they do. Training without class balancing leaves models that perform well on common classes and fail on the classes that matter most for safety.
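One simple form of class balancing is to weight the training loss per class inversely to how often the class appears. The sketch below uses a simplified variant of median-frequency balancing based on global pixel counts. The class names and the function name are illustrative, and the weighting scheme itself is one option among several.

```python
from collections import Counter
from statistics import median


def median_frequency_weights(label_maps):
    """Weight each class by median_frequency / class_frequency over all pixels."""
    counts = Counter(label for frame in label_maps for row in frame for label in row)
    total = sum(counts.values())
    freqs = {cls: n / total for cls, n in counts.items()}
    med = median(freqs.values())
    return {cls: med / freq for cls, freq in freqs.items()}


frames = [[
    ["road", "road", "road", "sky", "sky"],
    ["road", "road", "road", "sky", "cone"],
]]
weights = median_frequency_weights(frames)
print({cls: round(w, 2) for cls, w in weights.items()})
```

With 6 road pixels, 3 sky pixels, and 1 cone pixel, the cone class ends up weighted six times as heavily as road.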

Ambiguous boundaries

Where exactly does vegetation end and sky begin behind a tree line? What about the transition between road and curb? Guideline documents need pixel-level examples for these cases, not only text descriptions.

Occlusion

A pedestrian partially hidden by a parked vehicle needs to be labeled as "pedestrian" on the visible pixels, not extended into the occluding object. Handling partial occlusion consistently is a training item for annotators and a frequent source of inter-annotator disagreement.

Small objects

A traffic cone 80 meters ahead occupies a handful of pixels. Missing it in the annotation is easy. Over-extending the mask is just as easy. Small-object performance is often the gap between a model that works at highway speeds and one that does not.
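The "handful of pixels" follows from pinhole geometry: projected size scales with focal length times object size divided by distance. The sketch below works through the arithmetic. The 0.7 m cone height and 1000-pixel focal length are assumed values for illustration only, not figures from the article or from any specific camera.

```python
def projected_height_px(object_height_m, distance_m, focal_length_px):
    """Pinhole approximation of an upright object's image height in pixels."""
    return focal_length_px * object_height_m / distance_m


CONE_HEIGHT_M = 0.7       # assumed cone height
FOCAL_LENGTH_PX = 1000.0  # assumed focal length expressed in pixels

for distance in (20.0, 40.0, 80.0):
    height = projected_height_px(CONE_HEIGHT_M, distance, FOCAL_LENGTH_PX)
    print(f"{distance:.0f} m: about {height:.2f} px tall")
```

Under these assumptions the cone is roughly 9 pixels tall at 80 meters, so a one-pixel boundary error on each side is already a large fraction of the object.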

Temporal consistency

Annotating each frame independently produces flicker between frames. Sequence-level tools that propagate labels across consecutive frames reduce flicker but introduce new failure modes when the propagation model drifts. Either way, temporal QC is a separate pass from per-frame QC.

Adverse weather and lighting

Rain, fog, snow, and night conditions degrade camera input and change the appearance of every class. Models trained only on clear-weather labels generalize poorly. Edge case coverage, targeted annotation, and weather-specific quality checks are all needed.

Best Practices

Five practices separate programs that produce useful training data from those that produce labeled noise.

Iterative ontology design

Treat the ontology as a living specification. Start simple, run the model, review errors, and refine. Expect three to five rounds of ontology adjustment in the first year of a program.

Pre-labeling with human-in-the-loop review

Pre-labeling accelerates throughput and standardizes the baseline labels. Human review catches model errors and supplies the training signal that improves the pre-labeling model for the next iteration. This feedback loop is the economic engine of any modern segmentation program, and it is where human-in-the-loop machine learning pays for itself.

Automated quality checks at ingestion

Every annotation should pass a battery of automated checks before it reaches the training set. Boundary coherence, class coverage, temporal consistency, and ontology conformance are table stakes. Catching errors in the annotation platform is an order of magnitude cheaper than catching them through model failures downstream.
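Two of the simpler checks, ontology conformance and class coverage, can be sketched in a few lines. This is an illustrative example only and does not represent the checkers of any particular platform. The function name, class names, and message formats are made up for the example.

```python
def check_annotation(label_map, allowed_classes, required_classes=()):
    """Return a list of problems; an empty list means the frame passes."""
    problems = []
    if len({len(row) for row in label_map}) > 1:
        problems.append("rows have inconsistent widths")
    present = {label for row in label_map for label in row}
    # Ontology conformance: every label must be a known class.
    unknown = present - set(allowed_classes)
    if unknown:
        problems.append(f"labels outside ontology: {sorted(unknown)}")
    # Class coverage: classes expected in every frame must appear.
    missing = set(required_classes) - present
    if missing:
        problems.append(f"required classes missing: {sorted(missing)}")
    return problems


ontology = {"road", "sky", "vehicle", "pedestrian"}
good = [["sky", "sky"], ["road", "vehicle"]]
bad = [["sky", "tree"], ["sky", "sky"]]

print(check_annotation(good, ontology, required_classes=["road"]))  # []
print(check_annotation(bad, ontology, required_classes=["road"]))
```

Returning a list of messages rather than a single pass/fail flag makes it easier to route each failure to the right fix.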

Edge case oversampling

Systematic scenario coverage (night, rain, construction, rare classes) should be tracked explicitly. If the dataset has 100,000 sunny daytime frames and 400 foggy frames, the model will reflect that imbalance. Over-weighting rare conditions in both annotation volume and training sampling is a deliberate choice, and it requires coverage tracking per scenario and per ontology class.
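On the training-sampling side, one option is to give each frame a weight inversely proportional to how common its condition is. The sketch below uses a scaled-down stand-in for the sunny versus foggy example with the same 250:1 ratio. The function name is illustrative, and fully equalizing conditions is an aggressive choice that a real program would likely soften.

```python
import random
from collections import Counter


def balanced_sampling_weights(conditions):
    """Per-frame weights so that each condition has equal total probability."""
    counts = Counter(conditions)
    return [1.0 / (len(counts) * counts[c]) for c in conditions]


# Scaled-down stand-in for 100,000 sunny vs 400 foggy frames (same 250:1 ratio).
conditions = ["sunny"] * 1000 + ["foggy"] * 4
weights = balanced_sampling_weights(conditions)

rng = random.Random(0)  # fixed seed so the draw is repeatable
batch = rng.choices(conditions, weights=weights, k=2000)
print(Counter(batch))
```

With these weights roughly half of each sampled batch is foggy even though foggy frames are well under 1% of the dataset. The trade-off is that the same few foggy frames are repeated often, which is why annotation volume for rare conditions matters as well.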

Clear guidelines with pixel-level examples

Written guidelines are necessary but not sufficient. Annotators need images showing the exact pixel treatment for boundaries, occlusions, small objects, and ambiguous cases. When a new case appears, it gets added to the guidelines with an example image, and every annotator is re-briefed.

Where Kognic Fits

Kognic's annotation platform is built for semantic segmentation at AV production scale, across 2D images and 3D point clouds, with multi-sensor fusion support as a default rather than an add-on.

Over 4,000 trained annotators with AV domain expertise handle the work, supported by pre-labeling with human-in-the-loop review, over 90 automated quality checkers, and ontology tooling that evolves with the program. More than 100 million annotations have been delivered to OEMs and Tier 1 suppliers including Qualcomm, Continental, and Zenseact.

For teams building perception stacks that need pixel-level understanding of the driving environment, that track record matters more than any single technique. Semantic segmentation is not a one-shot labeling job. It is a continuous program with an ontology that evolves, a model that improves, and a quality bar that must hold across every batch.

If you are scoping a segmentation program or evaluating options for production-scale annotation, see our autonomous driving annotation guide or contact us to talk through the specifics.

Written by

Björn Ingmansson

Marketing Director

bjorn.ingmansson@kognic.com