MIT researchers have made notable progress on the challenge of protecting the privacy of data used to train machine learning models. In their study, the team developed a model that accurately predicts whether a patient has cancer from lung scan images. Sharing this model with hospitals worldwide, however, carries the risk that malicious agents could extract the sensitive training data from it.
To tackle this issue, the researchers introduced a novel privacy metric called Probably Approximately Correct (PAC) Privacy, along with a framework that determines the minimum amount of noise that must be added to protect sensitive data. Conventional approaches such as differential privacy add enough noise to prevent an adversary from telling whether a specific data point was used, which can substantially reduce a model's accuracy. PAC Privacy instead measures how hard it is for an adversary to reconstruct any part of the sensitive data even after noise has been added.
For example, while differential privacy would prevent an adversary from determining whether a particular individual's face was in the dataset, PAC Privacy asks whether an adversary could extract an approximate silhouette that someone could still recognize as that individual's face.
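The accuracy cost of heavy noise is easy to see in a toy experiment. The Python sketch below is an illustration, not code from the study: an ordinary logistic-regression model is trained normally, and Gaussian noise of increasing scale is added to its weights before evaluation. The dataset, model, and noise scales are all stand-ins; the point is only that accuracy degrades as the added noise grows, which is why pinning down the minimum necessary noise matters.

```python
# Illustrative sketch of the noise/accuracy trade-off (not the researchers' pipeline).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
w, b = model.coef_.copy(), model.intercept_.copy()

rng = np.random.default_rng(0)
for sigma in [0.0, 0.1, 0.5, 2.0]:
    # Release noisy weights instead of the exact ones, then measure accuracy.
    model.coef_ = w + rng.normal(0.0, sigma, size=w.shape)
    model.intercept_ = b + rng.normal(0.0, sigma, size=b.shape)
    print(f"noise sigma={sigma:4.1f}  test accuracy={model.score(X_te, y_te):.3f}")
```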
To implement PAC Privacy, the researchers developed an algorithm that calculates how much noise must be added to a model to guarantee privacy even against an adversary with infinite computing power. The algorithm relies on the uncertainty, or entropy, of the original data from the adversary's point of view. It subsamples the data, runs the machine learning training algorithm on each subsample, and compares the variance across the resulting outputs to determine how much noise is needed: the smaller the variance, the less noise is required.
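A minimal Python sketch of that subsample-and-compare loop is shown below. It assumes a generic train() routine that maps a dataset to a vector of released parameters; the helper names (estimate_noise_scale, release) and the way the measured spread is turned into a noise scale are illustrative stand-ins, not the exact estimator from the paper.

```python
import numpy as np

def estimate_noise_scale(data, train, n_trials=50, subsample_frac=0.5, seed=0):
    """Re-run training on random subsamples and measure how much the output moves.

    The per-coordinate spread of the outputs serves as a proxy for how strongly
    the released parameters depend on the underlying data: a smaller spread
    suggests that less noise is needed to hide any particular subsample.
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    outputs = []
    for _ in range(n_trials):
        idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
        outputs.append(train(data[idx]))      # retrain on a random subsample
    return np.stack(outputs).std(axis=0)      # spread of each released coordinate

def release(data, train, noise_multiplier=1.0, seed=1):
    """Train on the full data, then add Gaussian noise scaled to the measured spread."""
    rng = np.random.default_rng(seed)
    scale = estimate_noise_scale(data, train)
    params = np.asarray(train(data), dtype=float)
    return params + rng.normal(0.0, noise_multiplier * scale, size=params.shape)
```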
One key advantage of the PAC Privacy algorithm is that it does not need any knowledge of the model's inner workings or of how it was trained. Users specify how confident they want to be that an adversary cannot reconstruct the sensitive data, and the algorithm returns the amount of noise needed to meet that target. It does not, however, estimate how much accuracy the model loses once that noise is added. Running it can also be computationally expensive, since it requires repeatedly training machine learning models on many subsampled datasets.
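Continuing the sketch above, a black-box caller would supply only the data, a training routine, and a noise level standing in for the value the PAC Privacy analysis would derive from the chosen confidence level. Any accuracy loss then has to be measured separately, and every call to estimate_noise_scale retrains the model dozens of times, which is where the computational cost comes from.

```python
import numpy as np

# Toy "training" routine: the released parameters are just the column means.
rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 5))
train = lambda d: d.mean(axis=0)

# Hypothetical interface, reusing release() from the sketch above. The
# noise_multiplier stands in for the value a PAC Privacy analysis would
# derive from the user's chosen confidence level.
noisy_params = release(data, train, noise_multiplier=2.0)

# The framework does not predict the utility loss, so check it empirically.
exact_params = train(data)
print("utility gap:", np.linalg.norm(noisy_params - exact_params))
```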
To make PAC Privacy more practical, the researchers suggest modifying the machine learning training process so that it is more stable, meaning its outputs vary less across subsamples. Greater stability would both lessen the algorithm's computational burden and shrink the amount of noise that has to be added. As a further benefit, more stable models often have lower generalization error, so they make more accurate predictions on new data.
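As a rough illustration of the stability point, the snippet below (again reusing the hypothetical estimate_noise_scale() helper from the earlier sketch) compares an unregularized linear regression with a strongly regularized ridge regression as the "training" routine: the ridge coefficients move less across subsamples, so the measured spread, and with it the noise the framework would add, shrinks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression data: the last column is the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)
data = np.hstack([X, y[:, None]])

def make_trainer(model_factory):
    """Wrap an sklearn regressor so it fits the train(data) -> parameters interface."""
    return lambda d: model_factory().fit(d[:, :-1], d[:, -1]).coef_

for name, factory in [("unregularized", LinearRegression),
                      ("strong ridge", lambda: Ridge(alpha=500.0))]:
    spread = estimate_noise_scale(data, make_trainer(factory))
    print(f"{name:14s} average spread across subsamples: {spread.mean():.4f}")
```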
While the researchers acknowledge the need for further exploration of the relationship between stability, privacy, and generalization error, their work represents a promising step forward in protecting sensitive data in machine learning models. By leveraging PAC Privacy, engineers can develop models that secure training data while maintaining accuracy in real-world applications. The potential to significantly reduce the amount of noise required opens up new possibilities for secure data sharing in the healthcare domain and beyond.