Feature Story




CLIENT: MINDTECH GLOBAL

Mar. 23, 2020: Connect-World

Key Benefits of Synthetic Data and How They Improve Neural Network Accuracy Through Bias Reduction

by Chris Longstaff, Vice President-Product Development, Mindtech Global

Chris is currently VP Product Management at Mindtech Global, a UK-based start-up focused on AI and visual solutions, where he is responsible for product planning and marketing. Prior to his current role, Chris gained many years of experience in the semiconductor industry with companies such as LSI Logic, ATI and, most recently, Imagination Technologies. Throughout his career Chris has focused on visual technologies such as video, ISP, GPU and AI. At Imagination, Chris held senior roles with responsibility for business development, product marketing and product management.


Introduction

Artificial intelligence, machine learning, deep learning, neural networks... by whatever name you know the technology, it has revolutionized the field of image understanding and recognition. From the victory of the AlexNet neural network in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) to the 2019 introduction into production vehicles of Tesla's FSD SoC, with 72 TOPS of AI compute performance, these deep learning solutions have proved their value. Neural networks consist of two key parts: the network itself, a number of different interconnected layers performing various functions; and the data used to train that network. The two are inseparable; without both elements, the network will not function. The large number of frameworks and libraries available (TensorFlow, PyTorch, Chainer, Keras, ...) and the development of hardware designed to accelerate both the training and inferencing of neural networks have ensured the rapid development and deployment of networks over the last few years.

The other side of neural networks, the provision of training data, has also advanced, though it still faces serious issues. Typically, the training data will consist of an "input" (an image, in the case of visual neural networks) and a ground-truth "output". This ground truth consists of some form of annotation indicating whatever the network is intended to predict; it may be a simple single label such as "Car" or "Pedestrian", or something more sophisticated such as the identification of every pixel that is part of an object ("semantic labelling"). In conjunction with the network design, the quantity and quality of the data used to train the network determine the accuracy of the results. The focus of this article is the annotated data that is available, and how we can use synthetic data to overcome the shortcomings of "real" data and avoid the unintended bias that real data may introduce.
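The input/ground-truth pairing described above can be sketched in code. This is a minimal illustration, not any particular framework's API; the `LabelledSample` type and its field names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class LabelledSample:
    """One training example: an input image plus its ground-truth annotation."""
    image: list                       # pixel data (placeholder for a real image array)
    label: str                        # simple classification label, e.g. "Car"
    pixel_mask: list = field(default_factory=list)  # semantic-labelling mask, one class id per pixel

# A classification-style sample: the annotation is a single label.
car = LabelledSample(image=[[0, 1], [1, 0]], label="Car")

# A semantic-labelling sample: every pixel carries a class id.
seg = LabelledSample(image=[[0, 1], [1, 0]], label="Car",
                     pixel_mask=[[1, 0], [0, 1]])

print(car.label)  # -> Car
```

A real pipeline would store images as tensors and masks as integer arrays, but the essential structure, one input paired with one annotation, is the same.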

Obtaining training data

So how do we go about getting the data to train our networks? There are a few public datasets available for common use cases, such as automotive or simple object detection, but very few for more diverse applications.

Large retail, search, social media and automotive companies may have access to significant quantities of data, but their data will typically be proprietary (not available to external parties to use) and will also lack annotations. Annotating this data is typically a time-consuming manual process, often outsourced to low-wage economies. Many tools exist to help with this labelling, and indeed many of those tools utilize AI techniques themselves to speed up the process, but there is no shortcut that makes the task fully automated. Nor will these annotations be 100% accurate. It is also unlikely that these companies will share their datasets openly, especially annotated data.

Data is the new oil

Companies may decide to gather and annotate their own footage for training, but this raises several issues:

Bias - Real-world data may introduce bias if the collection fails to obtain enough coverage of the intended subjects; see the section below for more details.

Privacy - there are limits on where companies may freely film. Even if the collected data is used to generate non-identifiable training data, if it is stored on a server, any data breach could have serious consequences.

Suitability - Finding the right content to source and annotate may be difficult; for example, a company based in the UK wanting footage of typical cars and pedestrians in an Asian city will find it much harder to obtain. The wrong data can lead to bias.

Annotation - Capturing footage is only one side of the problem. The captured data must also be correctly annotated for machine learning systems, and pixel-perfect annotation is not feasible with captured real-world data.

Corner Case - Modelling corner cases can be inherently difficult and potentially dangerous.

Inference System Modelling - Data captured for training does not typically represent the system that will be used for inferencing. The capture system will introduce distortions of its own, such as lens distortion, colour shifts, lens shading, sensor noise profiles and so on. If these are ignored, the training will be less accurate.
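One common response to the inference-system-modelling point is to degrade clean training images so they resemble the target camera's output. The sketch below is a deliberately simplified toy model (radial lens shading plus Gaussian sensor noise on a nested-list "image"); a real pipeline would also model lens geometry, colour response and so on, and the function name and parameters here are invented for illustration.

```python
import random

def apply_sensor_model(pixels, noise_sigma=4.0, shading_strength=0.3, seed=0):
    """Roughly mimic a target camera on a clean image:
    radial lens shading (corners darken) plus Gaussian sensor noise."""
    rng = random.Random(seed)                # fixed seed for a repeatable sketch
    h, w = len(pixels), len(pixels[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r2 = cy * cy + cx * cx or 1.0        # normalise radius: corners get r2 == 1
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            r2 = ((y - cy) ** 2 + (x - cx) ** 2) / max_r2
            shaded = pixels[y][x] * (1.0 - shading_strength * r2)  # vignetting
            noisy = shaded + rng.gauss(0.0, noise_sigma)           # sensor noise
            row.append(min(255, max(0, round(noisy))))             # clamp to 8-bit
        out.append(row)
    return out

flat = [[128] * 4 for _ in range(4)]       # a uniform grey test image
degraded = apply_sensor_model(flat)
print(len(degraded), len(degraded[0]))     # -> 4 4
```

Applying this kind of transform during training means the network sees images statistically closer to what the deployed camera will actually produce.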

Synthetic data to the rescue

These issues with real-world data can be overcome by supplementing it with synthetic data. Synthetic data are computer-generated images and annotations designed to represent the real world, typically using realistic 3D graphics models, allowing the construction of an unlimited virtual world from which we can create our synthetic dataset. Because we generate the images ourselves, we can automatically create pixel-perfect annotation, and that annotation can include advanced features such as 3D bounding boxes and velocity vectors.
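The reason annotation comes "for free" is that the renderer already knows which object produced every pixel. A toy 2D sketch makes the point, assuming nothing about any real rendering engine; objects here are just axis-aligned rectangles placed by the generator itself.

```python
def render_scene(width, height, objects):
    """Return (image, mask): image holds object colours, mask holds class labels.
    Because we place each object ourselves, the per-pixel mask is exact."""
    image = [[0] * width for _ in range(height)]
    mask = [["background"] * width for _ in range(height)]
    for obj in objects:
        x0, y0, x1, y1 = obj["box"]          # rectangle in pixel coordinates
        for y in range(y0, y1):
            for x in range(x0, x1):
                image[y][x] = obj["colour"]
                mask[y][x] = obj["cls"]      # pixel-perfect label, no manual work
    return image, mask

img, mask = render_scene(8, 8, [
    {"box": (1, 1, 4, 4), "colour": 200, "cls": "car"},
    {"box": (5, 2, 7, 6), "colour": 90,  "cls": "pedestrian"},
])
print(mask[2][2])  # -> car
```

A real generator rasterizes 3D models instead of rectangles, but the principle is identical: the ground truth is a by-product of rendering, not a separate labelling step.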

The bias problem
One of the key issues with neural networks is bias. Bias takes several forms, e.g. dataset bias, model bias, learner bias, system bias and human-in-the-loop bias. The referenced paper describes these in detail; a full treatment is beyond the scope of this article. Any form of unintentional bias can lead to unwanted consequences in the predictions of the deep learning system. We can help address bias in the dataset by employing synthetic data.

Bias reduction
Dataset bias is created when the training dataset does not contain adequate representations of the results that the end system is attempting to identify. Examples of such bias include too limited a range of vehicle angles, makes and models, background styles, lighting conditions and so forth. Great care needs to be taken here. For example, a dataset may contain an even spread of ethnicities in its samples of pedestrians; but if the majority of images of one ethnicity are taken front on while the other is taken side on, you may simply train a neural network to recognise the difference between front-on and side-on pedestrians, which is unlikely to be the intended functionality. The buildings, vehicles and other objects within an image are typical examples that will lead to geographical bias within a dataset.
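Confounds like the front-on/side-on one above can often be spotted by tabulating capture conditions per group before training. A minimal sketch, with invented group and angle labels:

```python
from collections import Counter

def viewpoint_skew(samples):
    """Per-group distribution of capture angles. A lopsided table here is one
    way the front-on vs side-on confound described above shows up."""
    by_group = {}
    for group, angle in samples:
        by_group.setdefault(group, Counter())[angle] += 1
    return by_group

# Hypothetical metadata: (group, capture angle) for each image in a dataset.
samples = ([("group_a", "front")] * 9 + [("group_a", "side")] * 1 +
           [("group_b", "front")] * 1 + [("group_b", "side")] * 9)

skew = viewpoint_skew(samples)
print(skew["group_a"]["front"])  # -> 9
```

Even though both groups have ten samples each, the angle distributions are mirror images, exactly the situation in which a network can learn viewpoint instead of the intended attribute.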

If the bias in the dataset can be identified, then we can utilize synthetic data to reduce that bias. In the examples above, synthetic data may be generated to create pedestrians of all ethnicities, vehicles from any country and building architectures from any geography, viewed from any angle. The advantage of the synthetic data generator here is the ability to script the simulator and produce a wide range of training data "sweeps", varying time of day, angle of view, pedestrian sex, age, ethnicity and so on. Recent academic interest in using synthetic data to reduce bias has led to papers such as "Can Synthetic Faces Undo the Damage of Dataset Bias to Face Recognition and Facial Landmark Detection?". This paper shows that using synthetic data to prime the training of a neural network can reduce dataset bias and actively improve the results of face detection and facial landmark detection. The results also show that a significant reduction in the size of the real-world dataset is possible whilst maintaining similar accuracy to a full real-world dataset.
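A scripted "sweep" of the kind described can be as simple as enumerating the Cartesian product of the scene parameters. The parameter names and values below are invented placeholders; a real simulator's controls will differ.

```python
from itertools import product

# Hypothetical sweep axes for a synthetic-data generator.
times_of_day = ["dawn", "noon", "dusk", "night"]
view_angles  = [0, 45, 90, 135, 180]       # degrees around the subject
ethnicities  = ["A", "B", "C", "D"]        # placeholder group labels

def sweep():
    """Yield one scene description per combination, so every subject group
    appears under every lighting condition and viewing angle."""
    for tod, angle, eth in product(times_of_day, view_angles, ethnicities):
        yield {"time_of_day": tod, "angle": angle, "ethnicity": eth}

scenes = list(sweep())
print(len(scenes))  # -> 80  (4 * 5 * 4)
```

Because every group sees the identical spread of conditions, the front-on/side-on style of confound described earlier cannot arise in the generated portion of the dataset.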

Conclusion

The data and annotations used for training are key to the success of any AI implementation using neural networks. Whilst real-world data will typically form the basis of the dataset, synthetic data can be used to reduce bias within that dataset. This reduction in bias can be particularly useful where geographically diverse datasets are required, without the cost of obtaining that data in the field.
