An Introduction to Synthetic Data

According to Gartner, “by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated.” I don’t know about you, but that statistic surprised me, and it shows that synthetic data is a clearly rising trend.

In this blog post I will be exploring synthetic data to find out:

  • What is synthetic data?
  • What’s so great about it?
  • What’s not-so-great about it?
  • What techniques are used to generate synthetic data?

What is synthetic data?

Synthetic data is any data generated by a computer simulation rather than collected from real-world events. Typically, an algorithm is trained on real data and learns to reproduce the statistical properties of that dataset in a new, synthetic sample.
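
To make that concrete, here is a minimal sketch of the idea using NumPy. A toy Gaussian dataset stands in for real data; the code estimates its mean and covariance, then samples brand-new synthetic rows with the same statistical properties:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for a real dataset: 500 rows, 3 correlated numeric columns.
real = rng.multivariate_normal(
    mean=[170.0, 70.0, 40.0],
    cov=[[40.0, 30.0, 5.0],
         [30.0, 60.0, 8.0],
         [5.0, 8.0, 90.0]],
    size=500,
)

# Learn the statistical properties of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...then generate a brand-new synthetic sample that shares them.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print(np.round(synthetic.mean(axis=0), 1))  # close to the real means
```

Real generators are far more sophisticated than this, but the fit-then-sample pattern is the same.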

A classic example of synthetic data can be found in flight simulators, which mimic real-life flight conditions within a controlled environment.

Importantly, synthetic data can enable new advances in AI and enhance business decision-making.

What are the benefits of synthetic data?

Developing cutting-edge, successful AI and ML models requires ever-larger volumes of high-quality data. Synthetic data is used across many different industries and can be particularly useful in these cases:

  • Preserving privacy by generating a synthetic dataset without any sensitive information. This is particularly useful in the healthcare and financial sectors.
  • Building complex models that require large amounts of data that would be too expensive or too time-consuming to collect (as with self-driving cars and other computer vision applications).
  • Allowing researchers to explore and test new algorithms under controlled conditions.

What are the challenges with synthetic data?

Naturally, a number of challenges come with generating and using synthetic data. As more research is conducted in this area, these challenges should become easier to overcome.

  • Synthetic data is only as good as the underlying model used to generate it: a classic case of garbage in, garbage out. Any biases in the original data will carry through into the synthetic data.
  • Generating synthetic data is a complex, time-consuming process that requires highly skilled individuals to build the generation algorithms.
  • Depending on the application, there could be serious ramifications if models built on synthetic data go wrong.
  • Business users and researchers may not yet trust synthetic data, since it is still a very new field.
  • If data is generated using machine learning models, overfitting can lead to synthetic data that does not generalise well to real-world scenarios.

Techniques for generating synthetic data

  • Random sample generators: scikit-learn’s datasets module includes functions for generating datasets of various sizes and complexities (see the first sketch after this list).
  • Fitting real data to a known distribution and sampling from it, as in the Monte Carlo method (see the second sketch after this list).
  • Decision tree machine learning models.
  • Generative adversarial networks (GANs): see this case study about how American Express used GANs to generate synthetic data that helped make their fraud detection models more accurate (a toy GAN sketch follows this list).
  • Domain randomisation: altering images in ways that improve neural network models, such as changing the size, lighting, or colours in an image (see the final sketch after this list).
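
First, a sketch of scikit-learn’s random sample generators. make_classification is one of several sklearn.datasets functions that produce synthetic datasets on demand:

```python
from sklearn.datasets import make_classification

# 1,000 synthetic rows with 20 features (5 of them informative)
# across 2 classes; handy for benchmarking classifiers.
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,
    n_classes=2,
    random_state=0,
)
print(X.shape, y.shape)  # (1000, 20) (1000,)
```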
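
Second, fitting to a known distribution: fit a distribution’s parameters to observed data with SciPy, then draw as many Monte Carlo samples from the fitted model as you need. (The “observed” data here is simulated just to keep the example self-contained.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
observed = rng.normal(loc=5.0, scale=2.0, size=1000)  # stand-in for real data

# Fit a normal distribution to the observed data...
mu, sigma = stats.norm.fit(observed)

# ...then sample as many synthetic points as we like from the fitted model.
synthetic = stats.norm.rvs(loc=mu, scale=sigma, size=10_000, random_state=1)
print(f"fitted mu={mu:.2f}, sigma={sigma:.2f}")
```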
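
Third, a toy GAN. This is nothing like American Express’s production system; it is a minimal sketch (assuming PyTorch is installed) in which a generator learns to mimic a 1-D Gaussian dataset:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def real_batch(n):
    """'Real' data: samples from a Gaussian with mean 4 and std 1.25."""
    return torch.randn(n, 1) * 1.25 + 4.0

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    # Train the discriminator to tell real samples from generated ones.
    real, fake = real_batch(64), G(torch.randn(64, 8)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator to fool the discriminator.
    g_loss = bce(D(G(torch.randn(64, 8))), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

samples = G(torch.randn(1000, 8)).detach()
print(f"synthetic mean={samples.mean():.2f}, std={samples.std():.2f}")
```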
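
Finally, a simple domain-randomisation sketch using only NumPy: randomly perturb the brightness and colour balance of an image so a vision model trains on many varied versions of the same scene. (Real pipelines also randomise pose, textures, object placement, and lighting direction.)

```python
import numpy as np

rng = np.random.default_rng(7)

def randomise(image: np.ndarray) -> np.ndarray:
    """Apply a random brightness shift and per-channel colour jitter."""
    brightness = rng.uniform(0.6, 1.4)      # global brightness factor
    colour = rng.uniform(0.8, 1.2, size=3)  # per-channel (R, G, B) gain
    out = image.astype(np.float32) * brightness * colour
    return np.clip(out, 0, 255).astype(np.uint8)

image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # dummy image
variants = [randomise(image) for _ in range(10)]  # 10 randomised variants
```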

Synthetic data generation is being offered by a growing number of companies. Most notable is MIT’s Synthetic Data Vault, an open-source software ecosystem for generating synthetic data; a minimal usage sketch follows.
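
Here is what that looks like in practice for a single table. (This follows the SDV 1.x API as I understand it; class names may differ between versions, so treat it as illustrative rather than definitive.)

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Any real tabular dataset works here; this tiny frame is just a placeholder.
data = pd.DataFrame({
    "age": [34, 45, 23, 52, 41],
    "income": [48_000, 61_000, 32_000, 75_000, 58_000],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)  # infer column types from the data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic = synthesizer.sample(num_rows=100)  # 100 brand-new synthetic rows
print(synthetic.head())
```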

Further reading & references

  • A comprehensive survey of synthetic data and its applications can be found in this paper.
  • The ultimate guide to synthetic data by AI Multiple
  • An overview of synthetic data by NVIDIA