Complete Data Coverage with the Kognic Platform

At Kognic we are dedicated to making safe self-driving perception a reality. We believe we can greatly shorten the development time of self-driving cars by improving the quality of datasets and speeding up the discovery of interesting scenarios. This is why we offer a search-and-select solution that helps companies build better datasets.

Why is data selection important, what challenges need to be overcome, and how does Kognic contribute to solving this problem? Read all about it in this article! And if you want to learn more, don’t hesitate to reach out to us.

The importance of data selection

Situations and objects in the world are not evenly distributed. When you are driving through a city there are cars everywhere, all the time. Often there are some pedestrians on the sidewalk, and depending on where you are in the world there can be few or many bicycles. However, self-driving cars need to handle more than that. Some examples of ‘edge cases’ are crossing animals, children on bikes, or (for us in Sweden) the occasional crossing reindeer. The big challenge in autonomous driving is handling this long tail of rare but safety-critical events.

A reindeer walking on a snow-covered road. Although it’s a rare occasion, self-driving cars must be able to deal with such situations.

When training a machine learning algorithm it is best if the data is evenly distributed across classes. Ideally, the dataset should therefore contain proportionally more bikes than actually appear in the world. It is also important to keep in mind that neural networks learn best by seeing examples multiple times; otherwise they quickly forget rare examples and effectively ignore them. For the crossing reindeer, this means that enough data has to be collected for the neural network to learn what a reindeer looks like, just to handle that one scenario.
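One common way to approximate an even class distribution without changing the raw data is to sample rare classes more often during training. The sketch below is only an illustration of that idea, not a description of how the Kognic platform works: the function name `balanced_sampling_weights` and the label counts are made up for the example.

```python
import random
from collections import Counter


def balanced_sampling_weights(labels):
    """Return one weight per sample so that every class gets the same total weight."""
    counts = Counter(labels)
    n_classes = len(counts)
    # Each class receives total weight 1 / n_classes, split evenly over its samples.
    return [1.0 / (n_classes * counts[label]) for label in labels]


# Hypothetical, heavily imbalanced labels (not real dataset statistics).
labels = ["car"] * 90 + ["bicycle"] * 9 + ["reindeer"] * 1
weights = balanced_sampling_weights(labels)

# Draw a training sample in which the three classes appear roughly equally often.
rng = random.Random(0)
resampled = rng.choices(labels, weights=weights, k=3000)
print(Counter(resampled))
```

Note that the single reindeer sample is simply repeated many times here. Reweighting changes how often an example is seen, but it adds no new information, which is why collecting more examples of rare classes still matters.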

The distribution of LiDAR points per class in the NuScenes dataset. One can see that there are many more points on cars and trucks than on animals and ambulances.

Data selection for validation and labeling safaris

That brings us to the validation aspect of data selection. Autonomous vehicles drive in many different scenarios and have to prove that they can handle all of these circumstances. The set of conditions a vehicle is designed to operate in is often called the Operational Design Domain (ODD). The NHTSA has a recommended way of defining an ODD, including many suggestions for circumstances to take into account, described in their document “A Framework for Automated Driving System Testable Cases and Scenarios”. Having an ODD means it is important to have sufficient test coverage for every part of it, across the behavioral competencies the NHTSA recommends.
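To make the idea of coverage concrete, one can count how many recorded frames fall into each combination of ODD conditions and flag the combinations that are under-represented. The following is a minimal sketch under simplifying assumptions: the two dimensions, the metadata keys, the frame counts, and the `coverage_gaps` helper are all hypothetical and far smaller than a real ODD definition.

```python
from collections import Counter
from itertools import product

# Hypothetical ODD dimensions; a real ODD definition would have many more.
ODD_DIMENSIONS = {
    "weather": ["clear", "rain", "snow"],
    "lighting": ["day", "night"],
}


def coverage_gaps(frames, dimensions, min_frames):
    """Return the ODD cells (condition combinations) with fewer than min_frames frames."""
    names = list(dimensions)
    counts = Counter(tuple(frame[name] for name in names) for frame in frames)
    gaps = {}
    for cell in product(*(dimensions[name] for name in names)):
        if counts[cell] < min_frames:
            gaps[cell] = counts[cell]
    return gaps


# Made-up frame metadata for illustration only.
frames = (
    [{"weather": "clear", "lighting": "day"}] * 50
    + [{"weather": "rain", "lighting": "day"}] * 12
    + [{"weather": "clear", "lighting": "night"}] * 10
    + [{"weather": "snow", "lighting": "day"}] * 2
)

gaps = coverage_gaps(frames, ODD_DIMENSIONS, min_frames=10)
for cell, count in gaps.items():
    print(cell, count)
```

In this toy data the snow and rain-at-night cells come out as gaps, which is exactly the kind of data one would then go looking for.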

A small subset of behaviours self-driving cars need to manage as recommended by the NHTSA (taken from the Waymo safety report)

The current state of the art is that people go through hours and hours of video recorded by a vehicle test fleet. An alternative is to send drivers on a labeling safari: they actively try to end up in the scenarios the car has a problem with. Although we are sure that driving through Sweden in search of reindeer on the road is a fun job, it is also very time-consuming and expensive. We are dedicated to making labeling safaris a thing of the past by enabling developers to quickly search the previously collected but not yet annotated data of their fleet.

With our data coverage platform we speed up finding this data, and allow you to add it to the specific datasets you want to train or evaluate your algorithms on. It’s also possible to request labeling of that data with a single click. Whenever developers spot an interesting scenario in the data, they can simply add it to one or more data lists, which they can later use to re-train or re-evaluate their algorithms.

Active learning with your own model

Unfortunately it can be hard to reason about what a machine learns and what it is confused by. This is why we allow you to upload your model’s predictions for frames, as well as an uncertainty per object and per frame. You can filter your unannotated data on this information, which gives tremendous insight into what your model is biased towards, which objects it is most likely missing, and what data you have to send to labeling.

By incorporating your model predictions into the data search it’s easy to ‘bootstrap’ its performance by always selecting instances where your model is making many mistakes. In particular, uncertainty per object makes it possible to find frames with objects that are right on the ‘decision boundary’ of your current detector, which would help your model tremendously once annotated.
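As an illustration of selecting frames near the decision boundary, the sketch below ranks frames by how close their most ambiguous detection is to a detection threshold. It assumes each object comes with a single confidence score in [0, 1] used as a stand-in for uncertainty; the data layout, the threshold of 0.5, and the function names are assumptions made for this example, not the platform’s API.

```python
def boundary_distance(confidence, threshold=0.5):
    """Distance of a detection's confidence from the detector's decision threshold."""
    return abs(confidence - threshold)


def rank_frames_for_labeling(predictions, threshold=0.5, top_k=2):
    """Return the top_k frame ids whose most ambiguous object is closest to the threshold.

    predictions maps a frame id to a list of per-object confidence scores.
    """
    scored = []
    for frame_id, confidences in predictions.items():
        if not confidences:
            # No detections: nothing to rank on in this simple sketch.
            continue
        closest = min(boundary_distance(c, threshold) for c in confidences)
        scored.append((closest, frame_id))
    scored.sort()
    return [frame_id for _, frame_id in scored[:top_k]]


# Made-up per-object confidences for four unannotated frames.
predictions = {
    "frame_001": [0.98, 0.95],  # confident detections only
    "frame_002": [0.97, 0.52],  # one object right at the boundary
    "frame_003": [0.10, 0.45],  # one object just below the boundary
    "frame_004": [],            # no detections
}

selected = rank_frames_for_labeling(predictions)
print(selected)
```

A ranking like this only surfaces objects the model detected with low certainty. Objects the model misses entirely produce no score at all, so frames without detections would need a separate check.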

The last challenge we want to address is that when a company has a sample of a car in a difficult situation, it is only a single example of that situation. Unfortunately there is no easy way to find similar situations without spending a lot of time combing through your data or setting up a labeling safari. This is why our data selection platform supports two interesting features:

  • Finding similar images in your dataset
  • Searching your dataset using natural language

 

When searching data ourselves, we get the best results by combining these two features: first find interesting scenarios using natural language, then refine the search using the similar-image feature.

A clear example of the power of our engine can be seen in the following images. We first search for ‘a caravan’ in our unannotated data, as this is something our car is struggling with. We find not only caravans and trucks, but also a recycling station. These are quite common in Munich, and indeed look like caravans. If we then search for similar-looking images we find more of these recycling stations, which we can send to labeling so that our model learns that these are false positives.

The results in our data for a natural language search of ‘a caravan’
The results in our data for a search for objects similar to the recycling station. You can see that we find more recycling stations as well as other interesting objects for labeling.
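The two-step search described above can be sketched with embeddings. This article does not say how the platform implements search, so the example simply assumes that text queries and images can be mapped into one shared vector space and compared with cosine similarity. All vectors, image names, and helper functions below are hand-written for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(query, embeddings, top_k, exclude=()):
    """Return the names of the top_k embeddings most similar to the query vector."""
    candidates = [
        (cosine_similarity(query, vector), name)
        for name, vector in embeddings.items()
        if name not in exclude
    ]
    candidates.sort(reverse=True)
    return [name for _, name in candidates[:top_k]]


# Toy 3-dimensional image embeddings; a real model would produce these.
image_embeddings = {
    "caravan_01": [0.9, 0.1, 0.0],
    "recycling_station_01": [0.7, 0.6, 0.1],
    "recycling_station_02": [0.6, 0.7, 0.1],
    "pedestrian_01": [0.0, 0.1, 0.9],
}

# Stand-in for the embedding of the text query 'a caravan'.
text_query = [1.0, 0.2, 0.0]

# Step 1: natural language search over the unannotated images.
first_pass = nearest(text_query, image_embeddings, top_k=2)

# Step 2: refine by searching for images similar to one interesting hit.
seed = "recycling_station_01"
refined = nearest(image_embeddings[seed], image_embeddings, top_k=1, exclude={seed})

print(first_pass)
print(refined)
```

In this toy setup the text query returns the caravan and one recycling station, and the similar-image step then surfaces the second recycling station, mirroring the workflow in the example above.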

Kickstart your data engine

Overall the Kognic platform is geared towards what we call your ‘data engine’. The process consists of incrementally finding data your model has a problem with, annotating this data, and evaluating your model. This enables you to continuously improve your model and to achieve safe perception sooner. We are excited about drastically reducing the cost of obtaining the coverage necessary both to achieve good perception performance and to prove that the data you use for legislative approval actually covers the relevant situations.

Please reach out to us if you have any questions, or if you are interested in trying it for yourself with your own data.