Protecting users’ privacy has moved from being an ethical obligation to a legal requirement, with regulations such as the General Data Protection Regulation (GDPR), in force since May 2018, and the California Consumer Privacy Act (CCPA), in effect since January 2020. Most countries have now adopted some form of data protection regulation. With these regulations in effect, large tech firms including giants such as Facebook and Google have already faced hefty fines for violations. The bad news is that such heavy penalties can bankrupt a business. The good news is that you can comply with these regulations without compromising the needs of your business, including analytics projects with third parties and monetising your data.
Figure: Data protection and privacy legislation worldwide (source: UNCTAD 2020)
While there are techniques to overcome a lack of data, machine learning is generally data hungry: most models must be trained on large amounts of annotated data before they can learn the patterns in it and perform well on unseen cases. Although artificial intelligence is transforming entire industries, user privacy remains an unsolved challenge. So how can businesses make use of the huge amounts of data they have collected, while protecting their users’ privacy? The answers that have emerged come under the umbrella of Privacy-Preserving Machine Learning.
Privacy-Preserving Machine Learning
Privacy-preserving machine learning is a subset of machine learning techniques that employ various mechanisms to enforce different conceptions of privacy. Its four pillars are training data privacy, input privacy, output privacy and model privacy, where each pillar protects private data at a different point in the data creation and model creation stages. Although hard to achieve, ensuring privacy across all four pillars would result in perfectly privacy-preserving machine learning.
Some technologies used in this domain are:
- Differential Privacy
- Federated Learning
- Encrypted Deep Learning
- Secure Multi-Party Computation
- Homomorphic Encryption
Whilst all of them have excellent use cases in industry, this post focuses on Differential Privacy.
Differential privacy mathematically guarantees that anyone seeing the result of a differentially private analysis will make essentially the same inference about any individual’s private information, whether or not that individual’s data is included in the input to the analysis. According to Cynthia Dwork, one of the inventors of differential privacy, this is a mathematically provable guarantee of privacy protection against a wide range of privacy attacks (including differencing attacks, linkage attacks, and reconstruction attacks). In the following, we describe how we have incorporated differential privacy in our Synthetic Data Engine, SyDE.
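To make the guarantee concrete, here is a minimal sketch of the Laplace mechanism, a textbook way to answer a counting query with epsilon-differential privacy. Adding or removing one person changes a count by at most 1 (its sensitivity), so adding Laplace noise with scale 1/epsilon means the probability of any output changes by at most a factor of e^epsilon between the two cases. The function names, the toy records and the choice of epsilon are illustrative assumptions, not part of SyDE, and naive floating-point noise like this is not hardened for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records matching `predicate`.

    A counting query has sensitivity 1, so Laplace noise with
    scale 1/epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical toy data: how many users are 40 or older?
    users = [{"age": a} for a in (25, 31, 47, 52, 68)]
    rng = random.Random()
    print(private_count(users, lambda u: u["age"] >= 40, epsilon=1.0, rng=rng))
```

A smaller epsilon means more noise and stronger privacy, and a larger epsilon means a more accurate but less private answer.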
Our Approach: Go Synthetic!
At Trabeya, we believe synthetic data is the way ahead. We believe that when synthesized the right way, such data can preserve the right amount of statistical properties for business needs, including for model building, without compromising the privacy of the users.
Generative Adversarial Networks (GANs) are a type of neural network that can generate new data from scratch. They belong, together with autoencoders and flow-based models, to a class of models called generative models. This means they can sample ‘new’ datapoints by learning the distribution of the data. A GAN is composed of two deep neural networks: a generator network, which generates synthetic data from random noise, and a discriminator network, which helps the generator minimize the differences between the real data and the synthetic data by catching them out. However, the discriminator network has access to the real data (whereas the generator only knows whether the discriminator caught it out or not), which may lead to ‘leaks’ of facts about the original dataset that we would rather avoid disclosing. This is where differential privacy comes in: it ensures that what is learned has utility when it comes to the overall distribution of the data (at different granularities) but does not contain any private or sensitive information from the real data that could be ‘memorised’ by either network.
During training, this kind of GAN learns the important statistical properties of the real data. At inference time, it can produce synthetic data whenever required simply by being fed noise.
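One widely used way to add differential privacy to this kind of training is DP-SGD applied to the discriminator, since that is the only network that sees real data. The sketch below shows the core step under that assumption: clip each example's gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping bound, and average. It is an illustration of the general technique rather than a description of how SyDE is implemented. The function names and parameters (`max_norm`, `noise_multiplier`) are hypothetical, gradients are plain lists of floats for readability, and the privacy accounting that turns the noise level into an epsilon is omitted.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale `grad` down so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip per-example gradients, add Gaussian noise, then average.

    Clipping bounds how much any single record can move the update;
    the noise (std = noise_multiplier * max_norm) masks what remains.
    """
    if not per_example_grads:
        raise ValueError("need at least one per-example gradient")
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


if __name__ == "__main__":
    # Hypothetical discriminator gradients for a batch of three real records.
    batch = [[3.0, 4.0], [0.0, 0.5], [-1.0, 2.0]]
    update = private_mean_gradient(batch, max_norm=1.0,
                                   noise_multiplier=1.1, rng=random.Random())
    print(update)
```

Because the generator only ever receives feedback through the discriminator, privatising the discriminator's updates in this way also limits what the generator can memorise about individual records.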
This process is not quite as simple as we have made it sound. Sound privacy-preserving data generation necessarily trades away some of the utility of the synthetic data in exchange for the privacy gained. Utility is generally measured by how similar the synthetic and the real datasets are. Some questions we ask ourselves about the synthetic dataset include: Can it reveal important business insights almost as well as the real data? How close is almost? And how well does it perform on the many machine learning tasks that, if privacy were not a constraint, would have been performed on the real data?
On the other hand, the synthetic data must protect all the private and sensitive personally identifiable information present in the real data. The challenge is therefore to find the optimal balance between utility and privacy.
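The utility side of this balance can be made measurable in many ways. As one simple illustration, the sketch below compares how a single categorical column is distributed in the real and synthetic datasets using total variation distance, where 0 means identical distributions and 1 means no overlap at all. The choice of metric, the function name and the toy column values are assumptions made for the example. A fuller evaluation would also look at relationships between columns and at the performance of models trained on the synthetic data.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Distance between the empirical distributions of two categorical columns.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 when the two columns share no categories.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both columns must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


if __name__ == "__main__":
    # Hypothetical "plan" column from a real and a synthetic customer table.
    real_plan = ["basic", "basic", "premium", "premium"]
    synthetic_plan = ["basic", "premium", "premium", "premium"]
    print(total_variation_distance(real_plan, synthetic_plan))  # 0.25
```

Tracking a score like this while varying the privacy budget shows how much utility is given up for each increment of privacy.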
How the Future Looks
A great deal of groundbreaking work has been done over the past two years at the nexus of private computing and machine learning, and it continues today. Privacy-focused open-source communities and projects have emerged, including OpenMined, TensorFlow Privacy (from Google) and CrypTen (from Facebook, built on PyTorch), and their participants are actively building and refining frameworks that support privacy by design.
It is crucial that these advancements in research are transformed into business solutions with only the minimal waiting time necessary to confirm their robustness, and in such a way that they naturally fit into business workflows. We at Trabeya specialise in exactly this and we are committed to protecting user privacy and securing data whilst unlocking its true power.