
Kognic: Why more training data cannot make up for poor annotations

Written by Björn Ingmansson | Jun 22, 2021

A common misconception in the automotive industry is that simply adding more training data can compensate for poor annotation quality. To challenge this assumption, Kognic conducted an experiment using deliberate variations in 2D object detection annotation quality. Our findings revealed that low-quality annotations introduce systematic rather than random errors, and these errors don't simply "disappear" with more data. Instead, they become learned patterns that state-of-the-art object detectors interpret as intentional guidance.

This article explores our findings in detail, including how smaller objects were systematically overlooked in lower-quality annotations, creating a cascade effect where object detectors learned this behavior and similarly missed small objects in their predictions. For teams developing safety-critical autonomous systems, our recommendation is clear: invest in high-quality annotations from the start rather than attempting to compensate for poor data with increased volume—a strategy that our research proves ineffective.

We frequently encounter the belief that annotation errors become insignificant with sufficient training data volume. While research has supported this for certain deep learning applications (notably Rolnick et al. 2018 for classification and Chadwick and Newman 2019 for object detection), these studies typically assume randomly distributed errors. However, when annotation errors follow systematic patterns, this assumption breaks down dramatically. Consider an extreme example: if every pedestrian is consistently mislabeled as an animal, no amount of additional data will correct this fundamental error. Here's a visual representation:

The dashed blue line represents our target model—the reality we aim to capture precisely. The orange line shows what's actually learned based on available training data (black dots). Left panel: With minimal random errors (statistically unbiased), accurate learning occurs even with limited data. Middle panel: With larger but still random errors, the correct model remains learnable, though more data is required for precision. Right panel: With systematic errors (statistically biased), the correct model becomes fundamentally unlearnable regardless of data quantity. Our experiment investigated a critical question in real annotation workflows: are errors predominantly random (solvable with more data) or systematic (creating persistent biases regardless of volume)?
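To make the distinction concrete, the toy sketch below (not part of the actual experiment) fits a straight line to noisy observations of a known target. With zero-mean random noise the estimate approaches the true slope as the sample grows; with a systematic bias it converges to the wrong answer no matter how much data is added.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_slope(n, noise):
    """Fit y = w * x by least squares to n noisy samples of the target y = 2x."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + noise(x)                 # true slope is 2
    return np.sum(x * y) / np.sum(x * x)   # closed-form least-squares estimate

small_random = lambda x: rng.normal(0.0, 0.05, x.shape)   # left panel: small, unbiased errors
large_random = lambda x: rng.normal(0.0, 0.50, x.shape)   # middle panel: large, unbiased errors
systematic   = lambda x: -0.5 * x                         # right panel: biased errors

for n in (30, 30_000):
    print(n, [round(fit_slope(n, f), 3) for f in (small_random, large_random, systematic)])
# As n grows, the first two estimates approach 2.0, while the biased
# case settles near 1.5 regardless of how much data is added.
```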

Experimental setup

We utilized client-provided imagery that was annotated through Kognic's production pipeline following our highest quality standards—incorporating well-defined guidelines, professionally trained annotators, and rigorous statistical quality assurance. This dataset contained 3,000 front-facing camera images capturing diverse driving conditions from highways to urban environments. All vehicles and pedestrians were precisely annotated with bounding boxes according to our established protocols. We designated this as Dataset A.

For comparative analysis, we annotated the identical image set using a different professional workforce, creating Dataset B. The key difference: Dataset B's annotators lacked specific training in both 2D bounding box annotation techniques and our particular annotation guidelines.

To evaluate how annotation quality differences affect machine learning outcomes, we trained Facebook's Detectron2 object detector on both datasets separately. The model trained on Dataset A is referred to as OD A, while the one trained on Dataset B is OD B.
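The post does not state the exact Detectron2 configuration, so the following is only a representative sketch: it assumes the two annotation sets have been exported to COCO format under illustrative paths and uses a standard Faster R-CNN baseline from the model zoo.

```python
# Representative Detectron2 training setup; the exact model and hyperparameters
# used in the experiment are not stated in the post, and all paths are illustrative.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Assumed COCO-format exports of the two annotation sets over the same images.
register_coco_instances("dataset_a_train", {}, "annotations_a.json", "images/")
register_coco_instances("dataset_b_train", {}, "annotations_b.json", "images/")

def train(dataset_name, output_dir):
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
    cfg.DATASETS.TRAIN = (dataset_name,)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2   # vehicles and pedestrians
    cfg.OUTPUT_DIR = output_dir
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()
    return cfg

cfg_a = train("dataset_a_train", "./od_a")   # OD A
cfg_b = train("dataset_b_train", "./od_b")   # OD B
```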

Comparing the performance of OD A & B

To objectively evaluate both models, we introduced Dataset C—a separate collection of similar imagery, also annotated to our highest quality standards. When measuring average precision, OD A achieved 40.0 on Dataset C predictions, while OD B scored notably lower at 37.9.
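A COCO-style evaluation of both checkpoints on Dataset C might look like the sketch below; the dataset names, paths, and evaluator settings are illustrative rather than the exact protocol behind these numbers.

```python
# Illustrative COCO-style AP evaluation of both checkpoints on the held-out Dataset C.
from detectron2.data import build_detection_test_loader
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultPredictor
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

register_coco_instances("dataset_c", {}, "annotations_c.json", "images_c/")

def evaluate(cfg, weights_path):
    cfg.MODEL.WEIGHTS = weights_path                     # trained OD A or OD B checkpoint
    predictor = DefaultPredictor(cfg)
    evaluator = COCOEvaluator("dataset_c", output_dir="./eval")
    loader = build_detection_test_loader(cfg, "dataset_c")
    return inference_on_dataset(predictor.model, loader, evaluator)

# evaluate(cfg_a, "./od_a/model_final.pth")   # 40.0 AP in the experiment
# evaluate(cfg_b, "./od_b/model_final.pth")   # 37.9 AP in the experiment
```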

While this performance gap confirms Dataset B's inferior quality, these metrics alone don't identify the nature of the errors. The critical question remained: were Dataset B's flaws random (potentially mitigable with more data) or systematic (creating persistent biases in the object detector's predictions)?

To investigate further, we implemented k-fold cross-validation. By dividing Datasets A and B into 5 segments each, using 4 for training and 1 for validation, then repeating this process across all possible combinations, we could estimate how each model would perform on similarly annotated new data. If Dataset B's errors were predominantly random, OD B should have shown markedly lower performance in this analysis. Surprisingly, both models performed nearly identically—OD A with 39.4 average precision versus OD B's 39.3.

This remarkably similar performance suggests that differences between the datasets stem primarily from systematic annotation patterns rather than random errors. Both annotation approaches were equally predictable, just in different ways. To better understand these systematic differences, we conducted deeper analysis of the data patterns.
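For reference, the 5-fold split over the 3,000 images can be set up roughly as follows; train_and_evaluate is a hypothetical helper standing in for the training and evaluation steps sketched earlier.

```python
# Sketch of the 5-fold cross-validation split over image IDs.
# train_and_evaluate is a hypothetical helper standing in for the
# Detectron2 training and evaluation steps sketched above.
import numpy as np
from sklearn.model_selection import KFold

image_ids = np.arange(3000)                 # one entry per image in the dataset
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(image_ids):
    ap = train_and_evaluate(image_ids[train_idx], image_ids[val_idx])
    fold_scores.append(ap)

print("mean AP over folds:", np.mean(fold_scores))
# Run once per dataset: the experiment reports 39.4 (Dataset A) vs 39.3 (Dataset B).
```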

Annotation characteristics inherited by the predictions

To visualize the systematic differences between Datasets A and B, and how these patterns carried over into the predictions of OD A and OD B, we compared several statistics: overall and per-frame object counts, bounding box sizes, boxes present in one dataset but not the other, and the placement of box edges. The annotations represent the training data, while the predictions show how each model performed on Dataset C's images.

First, we examined the raw object count across datasets:

  • In the annotations
    • Dataset A contained 55,930 bounding boxes
    • Dataset B contained 44,311 bounding boxes (about 20% fewer)
  • In the predictions
    • OD A predicted 39,544 bounding boxes
    • OD B predicted 34,028 bounding boxes (about 14% fewer)

Even at this high level, we observed a clear pattern: the training data characteristics directly influenced prediction behavior. OD B, trained on data with fewer annotated objects, consistently detected fewer objects in new imagery—though the effect was slightly less pronounced than in the training data difference.

Since Datasets A and B contain identical images with different annotations, we compared object counts frame-by-frame, alongside the predictions from OD A and OD B on Dataset C. The x-axis shows the difference in object count between equivalent images (positive values indicate Dataset A/OD A found more objects than Dataset B/OD B). In a perfect scenario with identical annotations, we would see only a single bar of height 3,000 at zero.
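A per-frame comparison like this can be produced with a few lines; boxes_a and boxes_b are assumed to map each image ID to its list of bounding boxes.

```python
# Per-frame difference in object counts between two sets of boxes.
# boxes_a and boxes_b are assumed to map image_id -> list of bounding boxes
# (annotations of A and B, or predictions of OD A and OD B on Dataset C).
import matplotlib.pyplot as plt

diffs = [len(boxes_a[img]) - len(boxes_b[img]) for img in boxes_a]

plt.hist(diffs, bins=range(min(diffs), max(diffs) + 2))
plt.xlabel("objects in A minus objects in B (per frame)")
plt.ylabel("number of frames")
plt.show()
# A single bar at zero would mean the two sets agree on every frame.
```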

The annotation comparison reveals that while many frames showed identical object counts (the peak at zero), Dataset A typically identified more objects per frame than Dataset B. One extreme case—a highway scene with adjacent parking lot—showed 91 more objects in Dataset A, as the parking lot was meticulously annotated in A but completely overlooked in B. This pattern reappears in the predictions, where both object detectors learned their respective dataset characteristics, though with less extreme differences.

We also analyzed bounding box size distributions. The following graphs show overlaid (not stacked) histograms of bounding box sizes (measured by pixel area) in both datasets and their respective model predictions.
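The histograms can be reproduced along these lines, assuming each box is stored as an (x1, y1, x2, y2) tuple in pixel coordinates and all_boxes_a / all_boxes_b are flat lists of boxes (names are illustrative).

```python
# Overlaid histograms of bounding box areas; each box is assumed to be an
# (x1, y1, x2, y2) tuple in pixel coordinates, collected in flat lists.
import numpy as np
import matplotlib.pyplot as plt

def areas(boxes):
    b = np.asarray(boxes, dtype=float)
    return (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])

bins = np.logspace(1, 6, 60)                 # log-spaced bins from 10 to 1e6 px^2
plt.hist(areas(all_boxes_a), bins=bins, alpha=0.5, label="Dataset A")
plt.hist(areas(all_boxes_b), bins=bins, alpha=0.5, label="Dataset B")
plt.xscale("log")
plt.xlabel("bounding box area (pixels)")
plt.ylabel("count")
plt.legend()
plt.show()
```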

The key insight from the upper plot: Dataset A contains proportionally more small boxes than Dataset B, suggesting that Dataset B annotators were less attentive to smaller objects. This pattern transfers directly to the predictions, where OD A identifies more small objects than OD B. Again, the object detectors have internalized the annotation patterns they were trained on.

For deeper analysis, we compared individual bounding boxes between the datasets, considering two boxes to match when their Intersection over Union (IoU) exceeded 0.7. Boxes left without a match pinpoint specific differences between the datasets, and we overlaid them on sample images. Interestingly, while Dataset B contained fewer boxes overall, it did include some objects not present in Dataset A. Let's examine these first:

For completeness, we also compared boxes present in Dataset A but missing from Dataset B, along with the corresponding prediction patterns:

Our first observation: annotation differences primarily affect small, distant objects. Human annotators rarely disagree about prominent objects directly in front of the vehicle—a reassuring finding, but one that highlights the systematic nature of annotation errors.

A second critical insight: there's remarkable visual similarity between annotation differences and their corresponding prediction differences. The patterns aren't randomly distributed across location or size. "Found by B not A" boxes appear smaller and confined to a narrower horizontal band than "found by A not B" boxes—both in training data and predictions. Once again, annotation characteristics propagate directly into object detector behavior.
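The IoU matching behind these comparisons can be sketched as follows; the greedy strategy and the (x1, y1, x2, y2) box format are our assumptions for illustration, not a description of the exact tooling used.

```python
# Greedy IoU matching between two sets of boxes from the same image.
# Boxes are assumed to be (x1, y1, x2, y2) in pixel coordinates.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_boxes(boxes_a, boxes_b, threshold=0.7):
    """Return matched (a, b) index pairs plus the boxes left unmatched on each side."""
    unmatched_b = list(range(len(boxes_b)))
    pairs, unmatched_a = [], []
    for i, a in enumerate(boxes_a):
        candidates = [(iou(a, boxes_b[j]), j) for j in unmatched_b]
        best = max(candidates, default=(0.0, None))
        if best[0] > threshold:
            pairs.append((i, best[1]))
            unmatched_b.remove(best[1])
        else:
            unmatched_a.append(i)
    return pairs, unmatched_a, unmatched_b
```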

Finally, we examined differences in how bounding boxes were drawn around the same objects. Even when both datasets annotated the same object (IoU > 0.7), the precise box dimensions often varied. We compared the positional differences (left, right, bottom, top edges) between matching boxes:
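Given matched pairs like those produced by the sketch above (aggregated over all frames), the per-edge differences and the fitted Gaussian reference curve can be collected along these lines:

```python
# Per-edge position differences (left, top, right, bottom) for matched box pairs,
# with a Gaussian fitted to the same differences as a reference curve.
# pairs, boxes_a and boxes_b come from the matching sketch above, aggregated over frames.
import numpy as np
import matplotlib.pyplot as plt

deltas = np.array([np.subtract(boxes_a[i], boxes_b[j]) for i, j in pairs])  # shape (n, 4)

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, name, values in zip(axes, ["left", "top", "right", "bottom"], deltas.T):
    ax.hist(values, bins=80, density=True, alpha=0.6)
    mu, sigma = values.mean(), values.std()
    grid = np.linspace(values.min(), values.max(), 200)
    ax.plot(grid, np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))
    ax.set_title(f"{name} edge difference (px)")
plt.show()
```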

Looking first at the annotation comparisons (upper graphs), we found surprisingly asymmetric distributions with a distinctive "spike and slab" pattern—a narrow peak (the "spike") alongside a spread-out component (the "slab") creating significant distribution tails. For reference, a fitted Gaussian distribution is shown, highlighting how poorly a normal distribution models these differences. This finding indicates that realistic simulation of annotation errors requires more sophisticated models than simple Gaussian noise.

Further analysis (not detailed here) revealed that the "slab" component primarily related to partially visible objects. The distribution tails represent disagreements in how annotators extrapolated incomplete information, with asymmetry stemming from different approaches between Dataset A and B annotators.

Most significantly, these same patterns reappear in the model predictions. The object detectors don't just learn what objects to annotate—they internalize specific annotation styles and box-drawing strategies from their respective training datasets!

Conclusions

Our experiment produced two annotation sets (Datasets A and B) from identical images, with Dataset B deliberately created to lower quality standards. While clear differences emerged between the datasets, our cross-validation analysis revealed something unexpected: both datasets were equally predictable by object detectors. This suggests that the quality difference wasn't primarily due to random errors but rather systematic annotation patterns that the models could effectively learn.

Our first key conclusion challenges conventional wisdom: the difference between carefully and less carefully annotated data isn't well characterized by random error injection. This experiment demonstrates that most low-quality annotation artifacts are systematic rather than random. Consequently, studies that randomly remove annotations to simulate quality issues provide little practical insight for real-world annotation quality decisions.

Our second conclusion is equally significant: object detectors faithfully learn and reproduce systematic annotation errors in their predictions. We found no evidence that OD B could have "generalized away" Dataset B's errors with additional low-quality training data.

For teams developing safety-critical perception systems, our recommendation is unambiguous: invest in high-quality training data from the beginning. Our research demonstrates that deliberately reducing annotation quality introduces few random errors but significant systematic ones. Any low-quality dataset likely contains subtle but consistent errors that become learned behaviors in your machine learning models—errors that directly propagate into predictions. There's simply no evidence that these systematic errors can be mitigated by adding more low-quality data that likely contains the same underlying biases.