Blog post: Data suitability

News | Posted on Wednesday 3 February 2021

Data that reflects the intended functionality

The performance and robustness of Machine Learning (ML) approaches, such as Deep Neural Networks (DNNs), rely heavily on data. Furthermore, the training data encodes the desired functionality. However, collecting (or generating) suitable data is challenging. And what does data suitability actually mean?

We use the term data suitability when the data is free from:

under-sampling of relevant content (e.g. data features)
unintended correlations

In this case, the data reflects the intended functionality.

Let’s have a look at video-based object detection, which is used for perception in automated driving systems. DNNs are widely used for object detection in this application. For example, pedestrian detection (PDET) classifies pedestrians in images and localises them with a bounding box. Here, greater variety and a broader distribution in the training dataset can improve DNN performance and robustness.

However, data quantity and variety alone are not sufficient to guarantee safe behaviour. To collect data effectively, we must also know which characteristics of the data are under- or over-sampled.

On the one hand, certain data content (e.g. pedestrians wearing shorts) might be relevant to the task to be learned by a DNN. If this content is under-sampled, the chance that the DNN will not perform as intended under this condition is higher; it may produce unsatisfactory results or even unsafe behaviour.
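One simple way to look for under-sampled content is to count how often each value of an annotated attribute occurs. The following is a minimal sketch, not a method from this post: it assumes each pedestrian annotation carries a hypothetical `clothing` attribute, and the 5% threshold is an arbitrary illustrative choice.

```python
from collections import Counter

def find_undersampled(annotations, attribute, min_share=0.05):
    """Return attribute values whose share of the dataset is below min_share."""
    counts = Counter(a[attribute] for a in annotations)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items() if n / total < min_share}

# Hypothetical per-pedestrian annotations (attribute name and values are assumptions)
annotations = (
    [{"clothing": "long trousers"}] * 90
    + [{"clothing": "skirt"}] * 8
    + [{"clothing": "shorts"}] * 2
)

undersampled = find_undersampled(annotations, "clothing")
print(undersampled)  # {'shorts': 0.02}
```

Such a count only covers attributes that were annotated in the first place; it cannot reveal relevant content that nobody thought to label.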

On the other hand, data might include content that the DNN recognises as a pattern and learns to correlate (e.g. identifying dustbins as pedestrians). If these patterns are not relevant to the task, they too might cause unsafe behaviour.
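A rough first check for such unintended correlations is to compare how often a piece of context appears in images with and without pedestrians. The sketch below assumes hypothetical image-level metadata (`has_pedestrian`, and a `context` set of tags such as `dustbin`); a large gap is only a hint that the DNN could pick up the pattern, not proof that it has.

```python
def cooccurrence_gap(samples, tag):
    """Difference between P(tag | pedestrian) and P(tag | no pedestrian)."""
    with_ped = [s for s in samples if s["has_pedestrian"]]
    without_ped = [s for s in samples if not s["has_pedestrian"]]
    if not with_ped or not without_ped:
        raise ValueError("need samples both with and without pedestrians")
    p_with = sum(tag in s["context"] for s in with_ped) / len(with_ped)
    p_without = sum(tag in s["context"] for s in without_ped) / len(without_ped)
    return p_with - p_without

# Hypothetical image-level metadata (field names and tags are assumptions)
samples = (
    [{"has_pedestrian": True, "context": {"dustbin"}}] * 40
    + [{"has_pedestrian": True, "context": set()}] * 10
    + [{"has_pedestrian": False, "context": {"dustbin"}}] * 5
    + [{"has_pedestrian": False, "context": set()}] * 45
)

gap = cooccurrence_gap(samples, "dustbin")
print(round(gap, 2))  # 0.7
```

In this made-up dataset, dustbins appear in 80% of pedestrian images but only 10% of the others, so adding pedestrian-free images with dustbins would be a natural next step in data collection.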

Balancing the data for suitability (no under-sampling of relevant content and freedom from unintended correlations) is neither easy nor quick to achieve.

Imagine we discover an inadequate learnt pattern for a PDET: our DNN often detects trees as pedestrians. So we enrich our data with more variants of trees to enable the DNN to learn that these are not pedestrians. After re-training, we discover another inadequate pattern and improve again, and so on.

To find an end to our data collection and PDET testing, we have to ask ourselves what behaviour is acceptable and what behaviour leads to a failure. The answers may depend not only on the detection component itself, but also on the post-processing components. Maybe a subsequent tracking component can check the plausibility of the PDET output. Maybe a subsequent fusion component can reduce the localisation error of the PDET output.

All in all, to reach data suitability we must thoroughly understand the intended functionality of the component as it is embedded in the overall system.

Lydia Gauerhof
Research Engineer
Corporate Research, Robert Bosch GmbH
