A healthcare AI system uses patient data to predict disease risk. To comply with HIPAA and reduce the risk of re-identification, which technique should be applied to the training data before model development?
Trap 1: Pseudonymisation by replacing patient names with random IDs
Replacing names with random IDs leaves other identifiers and quasi-identifiers (dates, ZIP codes, rare diagnoses) in place, and the mapping can be reversed if the key is retained; on its own it does not satisfy HIPAA's de-identification requirements.
Trap 2: Data augmentation to create synthetic samples
Augmented or synthetic samples are derived from the real patient records, so they can still leak information about them, and augmentation provides no formal privacy guarantee.
Trap 3: Data minimisation by removing all features except age and gender
Keeping only age and gender discards most of the signal needed to predict disease risk, and feature removal alone provides no formal privacy guarantee.
- A
Pseudonymisation by replacing patient names with random IDs
Why wrong: Replacing names with random IDs leaves other identifiers and quasi-identifiers (dates, ZIP codes, rare diagnoses) in place, and the mapping can be reversed if the key is retained; on its own it does not satisfy HIPAA's de-identification requirements.
- B
Data augmentation to create synthetic samples
Why wrong: Augmented or synthetic samples are derived from the real patient records, so they can still leak information about them, and augmentation provides no formal privacy guarantee.
- C
Differential privacy with a carefully chosen epsilon
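Differential privacy adds calibrated noise so that the result changes very little whether or not any single patient's record is included, giving a formal, quantifiable privacy bound controlled by epsilon. HIPAA does not name differential privacy explicitly, but it can support de-identification under the Expert Determination method.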
- D
Data minimisation by removing all features except age and gender
Why wrong: Keeping only age and gender discards most of the signal needed to predict disease risk, and feature removal alone provides no formal privacy guarantee.
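
To illustrate the "controlled noise" idea behind option C, here is a minimal sketch of the Laplace mechanism applied to a simple count query. It is not a HIPAA-compliant pipeline and not how a model would be trained privately (that is usually done with techniques such as DP-SGD, not shown here). The names `dp_count` and `laplace_noise`, the toy `patients` records, and the epsilon value are all hypothetical, chosen only for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching `predicate` (Laplace mechanism)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one patient changes a count by at most 1.
    sensitivity = 1.0
    # Smaller epsilon means a larger noise scale and stronger privacy.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical toy records, for illustration only.
patients = [
    {"age": 70, "diabetic": True},
    {"age": 45, "diabetic": False},
    {"age": 63, "diabetic": True},
    {"age": 38, "diabetic": False},
]

noisy_diabetic_count = dp_count(
    patients, lambda p: p["diabetic"], epsilon=0.5, rng=random.Random(0)
)
print(noisy_diabetic_count)
```

The noise scale is sensitivity / epsilon, so choosing epsilon is a trade-off: a smaller value hides each individual's contribution more strongly but makes the released statistic less accurate.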