MIT scientists have developed a revolutionary method to safeguard personal data while still ensuring the accuracy of machine learning models. Consider a model that detects cancer by analyzing images of patients' lungs: the challenge lies in sharing that model with hospitals worldwide without compromising the privacy of the sensitive information contained in its training data.
To train such a model, scientists expose it to millions of real lung scan images, and the trained model can retain traces of that data, leaving it vulnerable to attacks that try to extract it. To counter this risk, researchers add noise to the model, aiming for the least possible amount that still protects the data without affecting accuracy. Noise in this context is akin to adding static to a television channel.
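As a rough illustration of what "adding noise" to a model can mean (a minimal Python sketch, not MIT's actual implementation; the layer names and noise scale are made up), one can perturb a trained model's parameters with Gaussian static:

```python
import numpy as np

def add_gaussian_noise(weights, noise_scale, rng=None):
    """Blur a trained model's parameters with Gaussian 'static' so they
    reveal less about any individual training example."""
    rng = rng or np.random.default_rng()
    return {name: w + rng.normal(0.0, noise_scale, size=w.shape)
            for name, w in weights.items()}

# Hypothetical example: weights of a small, already-trained model.
weights = {"layer1": np.random.randn(64, 16), "layer2": np.random.randn(16)}
noisy_weights = add_gaussian_noise(weights, noise_scale=0.05)
```

The more noise is added, the harder it becomes to recover the training data, but the blurrier (and less accurate) the model becomes, which is why finding the minimum necessary amount matters.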
MIT scientists have introduced a privacy metric known as Probably Approximately Correct (PAC) Privacy, which determines the optimal level of noise necessary to maintain data privacy. The beauty of this system is that it can be applied to various models and applications without requiring in-depth knowledge of their inner workings or training mechanisms; it treats the training pipeline as a black box.
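A loose sketch of that black-box idea follows, assuming only that we can rerun an arbitrary training routine on subsampled data and observe its outputs; the scaling rule and parameter names are illustrative, not the published PAC Privacy calibration:

```python
import numpy as np

def estimate_noise_scale(train_fn, data_subsets, privacy_budget=0.05):
    """Treat `train_fn` as a black box: rerun it on several subsampled
    datasets and measure how much its output shifts from run to run.
    An output that swings a lot with the data reveals more about that
    data and needs more masking noise; a stable output needs less.
    (Simplified sketch; `privacy_budget` is a hypothetical knob.)"""
    outputs = np.stack([np.asarray(train_fn(subset)) for subset in data_subsets])
    spread = outputs.std(axis=0).mean()   # how data-dependent the output is
    return spread / privacy_budget        # illustrative rule: more spread => more noise

# Toy usage with a synthetic "training" routine (just an average of records).
rng = np.random.default_rng(0)
records = rng.normal(size=(1000, 8))
subsets = [records[rng.choice(1000, 200, replace=False)] for _ in range(30)]
print(estimate_noise_scale(lambda d: d.mean(axis=0), subsets))
```

Because the procedure only reruns the pipeline and looks at its outputs, nothing about the model architecture or training algorithm needs to be known in advance.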
The implementation of PAC Privacy has shown that significantly less noise is required to protect sensitive data compared to other existing methods. This breakthrough has the potential to revolutionize the development of machine learning models that can effectively safeguard the data they are trained on while maintaining accuracy.
“It uses the uncertainty or randomness of the sensitive data in a clever way, and this lets us add, in many cases, a lot less noise. This system lets us understand the characteristics of any data processing and make it private automatically without unnecessary changes,” explained Srini Devadas, an MIT professor who co-authored a paper on PAC Privacy.
One notable aspect of PAC Privacy is that users can specify their desired level of confidence in the safety of their data right from the beginning. For instance, a user can require that a would-be attacker have no more than a 1% chance of reconstructing the sensitive data to within 5% of its original value. The PAC Privacy system then tells the user precisely how much noise is required to meet that guarantee.
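Hypothetically, turning such a stated goal into a noise amount might look like the toy rule below; the Gaussian-tail formula and parameter names are assumptions made for illustration, not PAC Privacy's actual interface:

```python
import numpy as np

def noise_for_goal(output_spread, tolerance=0.05, attacker_success=0.01):
    """Translate a stated goal -- 'at most a 1% chance of reconstructing
    the data to within 5% of its original value' -- plus the measured
    run-to-run spread of the training pipeline into a noise scale.
    The Gaussian-tail rule below is illustrative only."""
    # Stricter goals (smaller tolerance, smaller allowed success chance)
    # and a more data-dependent output both push the required noise up.
    return output_spread * np.sqrt(2.0 * np.log(1.0 / attacker_success)) / tolerance

print(noise_for_goal(output_spread=0.1))   # noise needed for the example goal above
```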
However, one limitation of PAC Privacy is that it does not indicate how much the model’s accuracy will suffer once noise is added. Additionally, the technique relies on repeatedly training a machine learning model on different subsamples of the data, which can be computationally demanding.
Future improvements may focus on making the machine learning training process more stable, so that its outputs vary less from run to run. This would decrease the number of times the PAC Privacy system needs to run in order to identify the optimal level of noise, and would also mean less noise has to be added overall.
Ultimately, MIT’s groundbreaking research could pave the way for more accurate machine learning models that effectively protect sensitive data, representing a significant win-win situation for technology and privacy.