10 Options for Sharing Protected Data to Train Machine Learning Models

by Reesha Dedhia

Machine learning is a powerful tool for driving high performance across a broad range of business applications. It surfaces insights by finding patterns and correlations in data that routine analysis might miss. What drives improved machine learning (ML) performance is data, and lots of it.

Data is the fuel that feeds ML, and ML algorithms have a voracious appetite. Data is necessary for training ML algorithms in whatever niche they are applied, and without new, rich, and diverse sources of data, performance will plateau. The challenge for data scientists tasked with training and using ML tools has always been accessing those new sources of data. It’s not that the sources don’t exist; it’s that the organizations that own them are often prohibited by law from sharing the data.

Government agencies have huge stores of data about citizens; healthcare organizations have huge stores of data about patients; financial services firms and retailers have huge stores of information about their customers. But regulations like the Privacy Act, the Health Insurance Portability and Accountability Act (HIPAA), the Gramm-Leach-Bliley Act (GLBA), Europe’s General Data Protection Regulation (GDPR), and a host of other state, federal, and international laws exist to protect the sensitive data of individuals from being misused.

There are options for data scientists to tap into these rich sources of data to train their algorithms, but there are performance and risk considerations that must be taken into account. Here are the most common ways organizations share and run models on sensitive or regulated data.

The Options: Risk and Reward

Do Nothing

An organization can choose not to access valuable datasets for training machine learning algorithms and rely solely on internal, unclassified data. This may be because the organization has not considered the potential value of introducing rich, diverse data into training models, or because, given the regulatory environment, it simply doesn’t know that accessing protected data is possible.

For data providers, the potential for sharing or monetizing data may be hindered because the organization doesn’t have its data in a central catalog or repository, or because the data has not been cleaned, organized, and prepared appropriately.

Illegal Sharing

Sharing confidential or regulated data in defiance of laws and contracts should not be an option. It is unethical and extremely risky. Authorities often take circumstances into consideration when assessing penalties for a data breach. Inadvertent breaches may be given light fines, but egregious or intentional violations can bring the full force of the law into effect.

Contractual Agreements

Sharing and securing sensitive data based strictly on contractual obligations is a high-risk approach and one that may not be legally acceptable in certain situations, such as [international data sharing](https://www.cnbc.com/2021/04/19/privacy-shield-eu-officials-pushing-hard-for-us-data-sharing-pact.html) where regulations prohibit cross-border data sharing. Differing legal approaches to privacy and data protection in the U.S., UK, EU, and the Pacific Rim countries complicate compliance.

Anonymization/De-identification

The use of anonymized or de-identified data is cost- and labor-intensive. It requires a team of data scientists (the size of which depends on the size of the dataset) to manually clean, mask, and hash every attribute, a process that can take weeks or months, after which time-sensitive data may be obsolete. Anonymized data may also be subject to de-anonymization, for example when the attributes that remain are cross-referenced with other datasets.
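As a rough illustration of what masking and hashing an attribute can look like, here is a minimal Python sketch. The record fields (`patient_id`, `zip`, `age`, `diagnosis`), the key, and the helper names are all hypothetical and chosen only for this example; a real de-identification process would follow the applicable regulation, not this sketch.

```python
import hashlib
import hmac

# Hypothetical secret key; it would need to be stored separately from the data.
SECRET_KEY = b"example-key-not-for-production"

def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_zip(zip_code):
    """Keep only the first three digits of a ZIP code."""
    return zip_code[:3] + "**"

def deidentify(record):
    """Hash the direct identifier and coarsen the quasi-identifiers."""
    return {
        "patient_id": pseudonymize(record["patient_id"]),
        "zip": mask_zip(record["zip"]),
        "age_band": f"{(record['age'] // 10) * 10}s",  # e.g. 47 -> "40s"
        "diagnosis": record["diagnosis"],
    }

# Hypothetical input record.
record = {"patient_id": "P-1001", "zip": "94107", "age": 47, "diagnosis": "J45"}
clean = deidentify(record)
```

Even after these steps, the coarsened ZIP code, age band, and diagnosis remain in the record, and it is this kind of leftover combination that can make de-anonymization possible.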

Data Decryption

Like anonymization, data decryption is expensive and time-consuming, and it defeats the purpose of data protection. Once an organization decrypts confidential or regulated data, it assumes the risks associated with violating contractual or regulatory obligations.

Synthetic Datasets

Training ML algorithms using synthetic data can be useful in certain applications. For training with large-scale datasets, however, synthetic data risks introducing bias and error, which defeats the purpose of improving models with rich data because you cannot be sure the results are trustworthy.

Data Clean Rooms

Data clean rooms are physical facilities set up for the purpose of allowing customers and partners of specific data aggregator organizations to bring their models in for testing against the aggregator’s data. The data owner runs the models and shares the results with the customer or partner without ever giving them access to the data. Data clean rooms are expensive to build and run and have limited application for innovative testing and modeling.

Federated Learning

With federated learning, models are trained locally in the environment where the data resides, and only the resulting models or model updates are shared with other organizations, who benefit from the model but never see the data used for training. The data is protected because it never leaves its owner’s control. For this reason, federated learning is extremely useful in specific applications. However, it requires every participant to maintain a consistent format for the federated model to work, and it is not flexible enough to support innovative modeling based on an individual organization’s unique needs.
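The idea can be sketched with federated averaging, one common way to combine locally trained models. The example below is a toy under stated assumptions: a one-parameter model `y = w * x`, two hypothetical organizations, and illustrative function names and hyperparameters. It does not represent any specific product or framework.

```python
def local_update(w, data, lr=0.01, epochs=20):
    """Run gradient descent for y = w * x on one organization's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(global_w, clients, rounds=30):
    """Each round: clients train locally, the server averages their weights."""
    for _ in range(rounds):
        # Only the trained weight and the sample count leave each client.
        updates = [(local_update(global_w, data), len(data)) for data in clients]
        total = sum(n for _, n in updates)
        global_w = sum(w * n for w, n in updates) / total
    return global_w

# Two hypothetical organizations; both private datasets follow y = 3x.
client_a = [(1.0, 3.0), (2.0, 6.0)]
client_b = [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)]

w = federated_average(0.0, [client_a, client_b])
print(round(w, 4))
```

Note that both clients must use the same model structure (here, a single weight `w`) for the averaging step to make sense, which is the consistency requirement described above.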

Homomorphic Encryption

Homomorphic encryption allows computation on data while it remains encrypted; in a multi-party setting, the parties share and encrypt data under one key. That means privacy and confidentiality are maintained in theory, but security relies on trust: trust that parties with the decryption key are not collaborating with unauthorized parties or otherwise violating the trust of those that have shared data. Furthermore, because of the formatting involved, the size of a dataset can expand as changes are made to make models compatible, putting a burden on computing resources. As a result, homomorphic encryption tends to be slow, expensive, and impractical for full-scale production with large datasets. If different datasets are not formatted in exactly the same way, the results can be unreliable.
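To make the single-key trust issue concrete, here is a toy Python sketch of the Paillier cryptosystem, an additively homomorphic scheme picked here purely for illustration (the article does not name a specific scheme). The primes are tiny and the function names are hypothetical; this is not secure and is not a production implementation.

```python
import math
import secrets

# Toy primes for illustration only; real keys use far larger primes.
P, Q = 1789, 1861
N = P * Q                      # public modulus
N_SQ = N * N
G = N + 1                      # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(p-1, q-1)
MU = pow(LAM, -1, N)           # private: inverse of LAM mod N (Python 3.8+)

def encrypt(m):
    """Encrypt an integer 0 <= m < N under the public key (N, G)."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ

def decrypt(c):
    """Decrypt with the private values LAM and MU."""
    u = pow(c, LAM, N_SQ)
    return ((u - 1) // N * MU) % N

def add_encrypted(c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod N)."""
    return (c1 * c2) % N_SQ

# Two parties encrypt under the same public key; a third adds the ciphertexts.
c_sum = add_encrypted(encrypt(1200), encrypt(345))
print(decrypt(c_sum))
```

Whoever holds `LAM` and `MU` can decrypt every ciphertext, which is why the approach depends on trusting the key holder. Each ciphertext also lives modulo `N_SQ`, so it is roughly twice the bit length of `N` regardless of how small the plaintext is.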

A New Way Emerges

Encrypted Learning

Encrypted learning is a new approach for training machine learning algorithms on protected data without decrypting it. Its combination of advanced cryptography and machine learning enables multiple parties to collaborate on data without ever seeing one another’s data.

Encrypted learning uses secret sharing and secure multi-party computation techniques to make encrypted data available to data scientists for training. Privacy is protected by default. There is no key sharing, and the data remains encrypted throughout the training process. Encrypted learning does not require that data owners and users share the same model. Owners have assurance their data is safe, and users can maintain the confidentiality of their models.
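As a minimal sketch of the secret-sharing building block, the Python example below uses additive secret sharing over a prime field. The modulus, the three-party setup, and the function names are assumptions made for illustration; this is not Cape’s actual protocol, and a full training workflow would involve considerably more than addition.

```python
import secrets

PRIME = 2**61 - 1  # field modulus; an illustrative choice

def share(secret, n_parties=3):
    """Split a value into random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine all shares; any smaller subset reveals nothing useful."""
    return sum(shares) % PRIME

def add_shares(a, b):
    """Each compute party adds the two shares it holds, locally."""
    return [(x + y) % PRIME for x, y in zip(a, b)]

# Two hypothetical data owners split private values among three compute parties.
alice = share(52_000)
bob = share(61_500)

# The parties compute on shares; only the final sum is ever reconstructed.
total = reconstruct(add_shares(alice, bob))
print(total)
```

No key is exchanged at any point: each party sees only random-looking shares, and the inputs stay hidden unless all parties pool their shares.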

A growing number of data science teams are turning to encrypted learning to improve their machine learning models, both between departments and across organizations. Especially in financial services, government, and healthcare, where there are rich datasets not publicly available otherwise, encrypted learning is enabling trusted collaboration, unlocking hidden value, and providing a significant competitive edge.

If you’d like to learn more about encrypted learning and how your data science teams can access new and rich data sets safely and securely, or if you are exploring ways to monetize your own data stores without fear of violating contractual or regulatory restrictions, Cape can help. To book a meeting, [click here](https://capeprivacy.com/contact).
