De-identification Strategies
Output privacy techniques protect data when it is shared or released to external parties. They conceal sensitive information while still providing useful insights and analysis to authorized users.
Statistical disclosure controls such as generalization, pseudonymization, differential privacy, and synthetic data generation are effective techniques for ensuring output privacy.
These methods provide reliable de-identification strategies for sharing data from structured databases while safeguarding sensitive information.
Generalization is the process of replacing specific data values with broader categories to protect individual identities while enabling analysis.
For example, we can replace exact ages with age groups.
| Original Data | Generalized Data |
| --- | --- |
| 🧑‍💼 Age: 28 | 👥 Age Group: 20-30 |
| 👩 Age: 42 | 👥 Age Group: 40-50 |
| 👨‍🎓 Age: 19 | 👥 Age Group: 10-20 |
| 👵 Age: 68 | 👥 Age Group: 60-70 |
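A minimal Python sketch of this bucketing, where the bucket width of 10 years is an illustrative choice:

```python
def generalize_age(age: int, bucket_size: int = 10) -> str:
    """Map an exact age to a coarser age-group label."""
    lower = (age // bucket_size) * bucket_size
    return f"{lower}-{lower + bucket_size}"

ages = [28, 42, 19, 68]
print([generalize_age(a) for a in ages])
# ['20-30', '40-50', '10-20', '60-70']
```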
Pseudonymization involves replacing identifying information with pseudonyms or aliases, allowing data linking for legitimate purposes while safeguarding identities.
For example, we can replace names with unique identifiers.
| Original Data | Pseudonymized Data |
| --- | --- |
| 👤 Name: John Smith | 🆔 ID: ABC123 |
| 👤 Name: Jane Doe | 🆔 ID: XYZ456 |
| 👤 Name: Bob Johnson | 🆔 ID: PQR789 |
| 👤 Name: Mary Brown | 🆔 ID: DEF321 |
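A minimal Python sketch of this idea, using random hex tokens as aliases (the ABC123-style IDs above are illustrative). The name-to-alias mapping is what allows authorized re-linking, so it must be stored separately under strict access control:

```python
import secrets

def pseudonymize(names: list[str]) -> dict[str, str]:
    """Replace each name with a random alias, keeping the mapping
    so records can be re-linked for legitimate purposes."""
    return {name: secrets.token_hex(4).upper() for name in names}

lookup = pseudonymize(["John Smith", "Jane Doe", "Bob Johnson", "Mary Brown"])
released = list(lookup.values())  # only the aliases get shared
print(released)  # e.g. ['9F2A61C0', '4B7D08E3', ...]
```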
What is perturbation?
Perturbation involves adding random noise to the data attributes of individuals, making it difficult for intruders to understand the true values.
Adding Noise to Data Attributes:
In perturbation, we introduce random variations to individual data points, preserving the overall statistical properties while obscuring precise information.
Example - Healthcare Records:
Let's continue with a healthcare example. In a traditional healthcare dataset, medical conditions may be recorded as specific numerical codes, making records potentially identifiable. With perturbation, we add random noise to those codes.
Before Perturbation:

| SoldierID | Medical Condition |
| --- | --- |
| 👤 001 | 2 |
| 👤 002 | 1 |
| 👤 003 | 3 |

After Perturbation (Noisy Dataset):

| SoldierID | Medical Condition |
| --- | --- |
| 👤 001 | 3 |
| 👤 002 | 2 |
| 👤 003 | 4 |
By adding random noise to the medical conditions, it becomes challenging to infer the true values, providing an additional layer of privacy protection.
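A minimal Python sketch of this idea, assuming small uniform integer noise (the noise range is illustrative; real deployments calibrate it to the desired privacy/utility trade-off):

```python
import random

conditions = {"001": 2, "002": 1, "003": 3}

# Add small integer noise to each coded condition; the noise range
# [-1, 1] here is an illustrative choice.
noisy = {sid: value + random.choice([-1, 0, 1])
         for sid, value in conditions.items()}
print(noisy)  # e.g. {'001': 3, '002': 2, '003': 4}
```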
Differential Privacy adds carefully calibrated random noise to query results, preserving privacy while maintaining data utility.
For example, we can add noise to aggregated age statistics.
| Original Aggregated Age | Noisy Aggregated Age |
| --- | --- |
| 👥 Age Group: 20-30 | 👥 Age Group: 19-31 |
| 👥 Age Group: 40-50 | 👥 Age Group: 39-51 |
| 👥 Age Group: 10-20 | 👥 Age Group: 9-21 |
| 👥 Age Group: 60-70 | 👥 Age Group: 59-71 |
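The table above is a simplified illustration. More formally, the Laplace mechanism adds noise scaled to a query's sensitivity divided by the privacy budget ε. A minimal Python sketch, where the ε value is an illustrative choice:

```python
import math
import random

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Return the query answer plus Laplace(sensitivity / epsilon) noise."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5                      # uniform on [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_answer + noise

ages = [28, 42, 19, 68]
# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1. epsilon = 0.5 is an illustrative budget.
noisy_count = laplace_mechanism(len(ages), sensitivity=1.0, epsilon=0.5)
print(round(noisy_count, 1))  # varies per run, e.g. 4.7
```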
Synthetic Data involves generating artificial data that resembles real data, protecting privacy during research and testing.
For example, we can create synthetic customer profiles.
| Real Data | Synthetic Data |
| --- | --- |
| 👩 Name: Emily, Age: 32 | 👤 Name: Sarah, Age: 35 |
| 👨‍💼 Occupation: Engineer | 👨‍💼 Occupation: Consultant |
| 📞 Phone: 555-1234 | 📞 Phone: 555-5678 |
| 📧 Email: emily@example.com | 📧 Email: sarah@example.com |
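A minimal Python sketch, assuming hypothetical value pools (the names and occupations below are made up); a production system would instead fit a statistical or machine-learning model to the real data and sample new records from it:

```python
import random

# Hypothetical value pools; a real generator would learn these
# distributions from the original data.
NAMES = ["Sarah", "Alex", "Priya", "Tom"]
OCCUPATIONS = ["Consultant", "Teacher", "Analyst", "Nurse"]

def synthetic_profile() -> dict:
    """Sample one artificial customer profile."""
    return {
        "name": random.choice(NAMES),
        "age": random.randint(25, 45),
        "occupation": random.choice(OCCUPATIONS),
        "phone": f"555-{random.randint(1000, 9999)}",
        "email": f"user{random.randint(100, 999)}@example.com",
    }

print(synthetic_profile())
# e.g. {'name': 'Sarah', 'age': 35, 'occupation': 'Consultant', ...}
```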
Grouping is a strategy that involves aggregating related data attributes of individuals together to obscure individual information.
In grouping, we combine data attributes of individuals who share common characteristics, creating collective profiles that hide specific details while preserving trends.
Let's consider an example in an educational dataset. In a traditional dataset, educational records may contain student names, test scores, and subjects studied. However, with grouping, we aggregate data attributes based on subjects studied.
| Student Name | Math Score | Science Score | English Score |
| --- | --- | --- | --- |
| 👨‍💼 John | 85 | 90 | 78 |
| 👩 Jane | 95 | 92 | 88 |
| 👨‍💼 Mike | 78 | 85 | 80 |
| Subject | Average Score |
| --- | --- |
| Math | 👥 86 |
| Science | 👥 89 |
| English | 👥 82 |
By grouping data based on subjects studied, individual test scores are hidden, and only collective average scores are presented, safeguarding student privacy.
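A minimal Python sketch of this aggregation; it reproduces the average scores shown above:

```python
from statistics import mean

records = [
    {"name": "John", "Math": 85, "Science": 90, "English": 78},
    {"name": "Jane", "Math": 95, "Science": 92, "English": 88},
    {"name": "Mike", "Math": 78, "Science": 85, "English": 80},
]

# Only per-subject averages are released; names and individual
# scores never leave this step.
averages = {subject: round(mean(r[subject] for r in records))
            for subject in ("Math", "Science", "English")}
print(averages)  # {'Math': 86, 'Science': 89, 'English': 82}
```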
What is mixing? 🎰
Mixing is a technique that shuffles or rearranges the data attributes of individuals, making it hard to find meaningful patterns.
In mixing, we rearrange the data attributes within a dataset so that sensitive information no longer corresponds to the original individual it belonged to.
This obscures the relationships between attributes, enhancing data privacy.
Let's consider a financial example. In a traditional dataset, records may contain attributes like income, age, and account balance, which could potentially identify individuals.
However, with mixing, we shuffle the order of these attributes within the dataset.
Before Mixing:

| SoldierID | Income | Age | Account Balance |
| --- | --- | --- | --- |
| 👤 001 | $50,000 | 30 | $10,000 |
| 👤 002 | $40,000 | 25 | $8,000 |
| 👤 003 | $60,000 | 28 | $12,000 |
After Mixing (Shuffled Dataset):

| SoldierID | Age | Account Balance | Income |
| --- | --- | --- | --- |
| 👤 001 | 25 | $10,000 | $40,000 |
| 👤 002 | 28 | $8,000 | $50,000 |
| 👤 003 | 30 | $12,000 | $60,000 |
By shuffling the attribute values, it becomes challenging to link specific attributes back to a particular individual, preserving their privacy.
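A minimal Python sketch of this idea, assuming each attribute column is shuffled independently (as in the tables above):

```python
import random

ids = ["001", "002", "003"]
incomes = [50_000, 40_000, 60_000]
ages = [30, 25, 28]
balances = [10_000, 8_000, 12_000]

# Shuffle each attribute column independently, so the values in any
# given row no longer all belong to the same individual.
for column in (incomes, ages, balances):
    random.shuffle(column)

for row in zip(ids, ages, balances, incomes):
    print(row)  # e.g. ('001', 25, 10000, 40000)
```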