Annotation errors pose an ever-present headache to machine learning engineers. To what extent is it worth eliminating them? We have indeed seen indications that modern machine learning might be able to learn well despite the presence of annotation errors, but a recent research article points out that annotation errors inevitably lead to incorrectly predicted classification uncertainties, even if countermeasures are taken. In this article, our Machine Learning Research Engineer Andreas Lindholm discusses the challenge of annotation errors when training machine learning models, how they affect your predictions and whether there are any alternatives to solve this.
The perception system is one of the vital components of an autonomous vehicle. If it fails to deliver an accurate description of the surroundings, other functionality such as driver planning will also fail. In this sense, a perception system has to reliably detect objects (such as other vehicles and pedestrians) and segments (such as the road surface) in a real-time stream of data from multiple sensors (such as images and Lidar point clouds).
Object detection and segmentation in video are complicated tasks. An arguably simpler task is classification of single images (e.g. “what type of vehicle is depicted in this image?”). Indeed, classification can be thought of as one of the building blocks of a full perception system. Classification is also the favorite problem of theoretical machine learning researchers. It is not always obvious how to extend their results to the full problem of object detection and segmentation in sequences, but the lessons and the big picture remain relevant, so it is useful to understand the classification problem in some detail. In classification terminology there is an input x (an image or similar) and an output y (a label or annotation, representing a class such as “van” or “truck”), and the core problem is to train a model, a so-called classifier, to predict the output (class) from the input.
A fascinating property of most machine learning methods for classification is that a classifier learns to predict the probability of different classes, even though the training data only contains class labels with no probabilities. Most machine learning practitioners are fully aware of this, since the response to a prediction is a vector of non-negative numbers that sum to 1 (i.e., a probability vector), where each number is the predicted probability of the corresponding class. Usually, the class with the highest number (highest predicted probability) is taken as the prediction presented to the end user.
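As a minimal sketch of the probability vector described above, the snippet below turns raw classifier scores into probabilities with a softmax and picks the most probable class. The class names and the scores are made up for illustration, and softmax is assumed here only because it is the most common choice of output layer.

```python
import math

def softmax(logits):
    """Turn raw classifier scores into a probability vector."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical classes and raw scores, not taken from any real model
classes = ["car", "van", "truck"]
logits = [1.2, 0.3, 2.5]

probs = softmax(logits)                       # positive numbers summing to 1
predicted = classes[probs.index(max(probs))]  # class with the highest probability
print(probs, predicted)
```

The full vector `probs`, not only `predicted`, is what carries the uncertainty information discussed in the rest of the article.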
The predicted probabilities are sometimes also called “confidence”, or similar, to indicate that one has to be careful about their interpretation. Mathematically, machine learning methods model the world as follows: for each input x and each class k, there is a probability g_k(x) that x belongs to class k, that is, p(y=k|x)=g_k(x), and g(x) denotes the vector of these probabilities. In other words, there is a certain ambiguity in the world (described by g(x)), and the black-and-white training data annotations (where each object is assigned one and only one class) are assumed to be outcomes (like dice rolls) drawn from g(x). Most machine learning methods aim to train a model which mimics this underlying probability function g(x). If the machine learning model accurately learns the properties of g(x), the predicted probabilities represent the uncertainty in a prediction.
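The dice-roll view of annotations can be sketched as follows. The probability vector `g` is a made-up example for a single input x, and `sample_labels` is a hypothetical helper: each annotation is one hard label drawn from g(x), and only over many draws do the label frequencies reveal the underlying probabilities.

```python
import random
from collections import Counter

def sample_labels(g, n, seed=0):
    """Draw n hard class labels from the probability vector g."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    classes = list(range(len(g)))
    return rng.choices(classes, weights=g, k=n)

# Hypothetical g(x) for one input x with three classes
g = [0.6, 0.3, 0.1]

labels = sample_labels(g, n=20000)
counts = Counter(labels)
freqs = [counts[k] / len(labels) for k in range(len(g))]
print(freqs)  # empirical label frequencies, close to g for large n
```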
For a perception system, such uncertainty information is critical. Consider, for example, the question of whether a quickly appearing vehicle on a crossing street is a responding fire truck or just a red-painted truck in front of a neon sign: even if the prediction says that it is most likely an ordinary truck, it certainly makes a difference whether the responding fire truck option was given a negligible 0.01% or a whopping 49% probability. The former case could probably be safely discarded, whereas the latter case should be given quite some attention in the subsequent driver planning. But, of course, this hinges upon those predicted probabilities being reliable.
If a so-called strictly proper loss function is used when training the machine learning model, the model will come as close as possible to the real-world probabilities g(x), provided it is given enough training data. That is, the predicted probabilities will be reliable. A strictly proper loss function is a loss function whose expected value has g(x) as its sole minimizer, meaning it is minimized only when the predicted probabilities correspond to g(x). The cross-entropy (CrossEntropyLoss in PyTorch) is a strictly proper loss function.
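To make the notion of a strictly proper loss concrete, here is a small numerical sketch. The vector `g` and the candidate predictions are made-up values. It computes the expected cross-entropy when labels are drawn from g and compares a few candidate predictions; the prediction equal to g gets the lowest expected loss.

```python
import math

def expected_cross_entropy(g, q):
    """Expected cross-entropy -sum_k g_k * log(q_k) when labels follow g and q is predicted."""
    return -sum(gk * math.log(qk) for gk, qk in zip(g, q) if gk > 0)

# Hypothetical true class probabilities g(x) for one input x
g = [0.7, 0.2, 0.1]

# Hypothetical candidate predictions
candidates = {
    "equal to g": [0.7, 0.2, 0.1],
    "overconfident": [0.98, 0.01, 0.01],
    "underconfident": [0.4, 0.3, 0.3],
}

losses = {name: expected_cross_entropy(g, q) for name, q in candidates.items()}
best = min(losses, key=losses.get)
print(losses, best)
```

Three candidates are of course no proof, but the same holds for any prediction q: the expected cross-entropy exceeds its value at q = g unless q equals g (Gibbs' inequality).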
In all practical machine learning, erroneous annotations (or label noise) are an acknowledged problem. How these errors behave is often hard to tell, but many different theoretical assumptions have been proposed. Some of these assumptions are almost naive (e.g. that a certain fraction of class labels is flipped completely at random, no matter what the input or class is) or require knowledge that we cannot possibly have (for example, that the exact error rate for each class is known beforehand). We also know that in reality the randomness is limited; errors in practice tend to be rather systematic. But randomness has, indeed, nicer theoretical properties, and some amount of randomness is certainly present. A simplified, yet not completely unrealistic, error assumption is the “simple non-uniform label noise”.
The simplest way to handle annotation errors is to ignore them, and train a model with a strictly proper loss function as you would have done if there were no errors. With little data and many errors, the errors will inevitably mislead the classifier. However, as the amount of (error-prone) data increases, the classifier will eventually learn to predict the correct class (that is, learn the correct classification boundaries) as long as the errors are random. If only this aspect is considered when evaluating the classifier, it might appear good. The accuracy, however, says nothing about how reliable the predicted uncertainties are.
Unfortunately, the annotation errors “add on” to the probabilities g(x). Since the classifier learns blindly from the data without knowing what is an annotation error and what is not, it will not be able to learn the probabilities correctly, no matter how much error-prone data it has access to. In fact, a recent article by Olmin and Lindsten shows that the classifier will predict probabilities with higher entropy than g(x). Loosely speaking, this means that it will be too uncertain in its predictions.
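The effect can be illustrated under the simplest possible noise assumption, which is an assumption of this sketch and not the general setting analyzed by Olmin and Lindsten: every label is flipped with the same probability `flip_rate` to one of the other classes, chosen uniformly. The values of `g` and `flip_rate` are made up. A classifier trained with a strictly proper loss on such data approaches the noisy label distribution rather than g(x).

```python
import math

def noisy_label_distribution(g, flip_rate):
    """Distribution of observed labels when each true label is flipped with
    probability flip_rate to a uniformly chosen other class (assumed noise model)."""
    k = len(g)
    return [(1 - flip_rate) * gk + flip_rate * (1 - gk) / (k - 1) for gk in g]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# Hypothetical true probabilities and noise level
g = [0.90, 0.08, 0.02]
g_noisy = noisy_label_distribution(g, flip_rate=0.2)

same_top_class = g.index(max(g)) == g_noisy.index(max(g_noisy))
print(g_noisy, same_top_class)
print(entropy(g), entropy(g_noisy))
```

In this example the most probable class is unchanged, so the accuracy is unaffected, while the entropy of the distribution the classifier learns is clearly higher than that of g(x).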
Learning g(x) perfectly is actually a rather complicated task, in fact unnecessarily hard for practical purposes. A weaker, yet useful, concept is that of calibration, meaning that a classifier should predict probabilities that are consistent with the observed outcomes. But, unfortunately, Olmin and Lindsten also show that using a strictly proper loss function together with error-prone annotations will not even lead to a calibrated classifier.
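One common way to check calibration in practice is the expected calibration error (ECE), sketched below. It is used here as an illustration and is not necessarily the exact notion of calibration used by Olmin and Lindsten; the function name, the binning scheme and the toy data are assumptions of this sketch. Predictions are grouped by confidence, and within each group the average confidence is compared to the fraction of correct predictions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

# Toy data: ten predictions, all made with 80% confidence
confidences = [0.8] * 10
calibrated = [1] * 8 + [0] * 2     # 8 of 10 correct: consistent with 80%
overconfident = [1] * 5 + [0] * 5  # only 5 of 10 correct

ece_calibrated = expected_calibration_error(confidences, calibrated)
ece_overconfident = expected_calibration_error(confidences, overconfident)
print(ece_calibrated, ece_overconfident)
```

A classifier that says “80%” should be right about 8 times out of 10; the second toy case is right only half of the time and therefore gets a large calibration error.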
Another option that has gained attention recently is to use a so-called robust loss function, such as the mean absolute error, instead of cross-entropy. In PyTorch, that roughly amounts to applying L1Loss to the softmax output and one-hot encoded labels instead of using CrossEntropyLoss. With a robust loss function and random annotation errors, the classifier will learn to predict the correct class as the amount of data increases. There are experiments where this happens faster (that is, with less data) for robust loss functions than for strictly proper loss functions, although the mechanisms are not yet fully understood theoretically. Thus, in terms of classification accuracy, a robust loss function can, to some extent, counteract the presence of annotation errors.
However, the same article by Olmin and Lindsten shows that robust loss functions are not strictly proper. Hence, the predicted probabilities will not be g(x) when using a robust loss function (regardless of whether the data contain annotation errors or not). In fact, they even show that the class of symmetric loss functions (so far the only practically useful class of robust loss functions that has been proposed) will not even give calibrated predictions. In other words, the predicted uncertainties will not be reliable when using a robust loss function.
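Both properties can be checked numerically for the absolute error. In the sketch below the loss is summed over classes (averaging, as the PyTorch L1Loss default does, only rescales it), and `g` and the candidate predictions are made-up values. The first check shows the symmetry: the losses summed over all possible labels give the same constant for any prediction. The second shows that the loss is not strictly proper: a fully confident one-hot prediction has a lower expected loss than predicting g itself.

```python
def absolute_error(q, label):
    """Absolute error between prediction q and the one-hot vector of label, summed over classes."""
    return sum(abs(qk - (1.0 if k == label else 0.0)) for k, qk in enumerate(q))

def expected_loss(g, q):
    """Expected absolute error when labels follow g and q is predicted."""
    return sum(gk * absolute_error(q, k) for k, gk in enumerate(g))

# Hypothetical true class probabilities for one input
g = [0.6, 0.3, 0.1]

# Symmetry: the sum over all labels is the same constant for any prediction
sum_a = sum(absolute_error([0.2, 0.5, 0.3], k) for k in range(3))
sum_b = sum(absolute_error([0.9, 0.05, 0.05], k) for k in range(3))

# Not strictly proper: a one-hot prediction beats predicting g itself
loss_at_g = expected_loss(g, g)
loss_one_hot = expected_loss(g, [1.0, 0.0, 0.0])
print(sum_a, sum_b, loss_at_g, loss_one_hot)
```

Since the expected absolute error equals 2(1 - sum_k g_k q_k), it is minimized by placing all probability on the most likely class. The top class is right, but the predicted uncertainty is lost.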
In conclusion, it seems that, as of today, no theoretically supported method exists by which you can get reliable uncertainty predictions from a classifier if you have annotation errors in your training data. Indeed, many of today's performance metrics, such as precision and recall, focus mostly on binary "hit or miss" outcomes and do not seriously take predicted probabilities into consideration.
We are convinced, however, that as the technology progresses and matures, calibrated predicted probabilities will eventually become bread and butter for machine learning engineers working with AD perception. Therefore, if you care about achieving good predictions, not only in terms of accuracy but also reliable uncertainties, you are best off eliminating your annotation errors. 🤗