De-identification Strategies
Output privacy techniques protect data when it is shared or released to external parties. They conceal sensitive information while still providing useful insights and analysis to authorized users.
Statistical disclosure controls such as generalization, pseudonymization, differential privacy, and synthetic data generation are effective techniques for ensuring output privacy.
These methods provide reliable de-identification strategies for sharing data from structured databases while safeguarding sensitive information.
Generalization is the process of replacing specific data values with broader categories to protect individual identities while enabling analysis.
For example, we can replace exact ages with age groups.
| Original Data | Generalized Data |
| --- | --- |
| 🧑‍💼 Age: 28 | 👥 Age Group: 20-30 |
| 👩 Age: 42 | 👥 Age Group: 40-50 |
| 👨‍🎓 Age: 19 | 👥 Age Group: 10-20 |
| 👵 Age: 68 | 👥 Age Group: 60-70 |
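A minimal Python sketch of this bucketing, where the bucket width of 10 years is an illustrative choice:

```python
def generalize_age(age: int, bucket_size: int = 10) -> str:
    """Map an exact age to a coarser age-group label."""
    lower = (age // bucket_size) * bucket_size
    return f"{lower}-{lower + bucket_size}"

ages = [28, 42, 19, 68]
print([generalize_age(a) for a in ages])
# ['20-30', '40-50', '10-20', '60-70']
```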
Pseudonymization involves replacing identifying information with pseudonyms or aliases, allowing data linking for legitimate purposes while safeguarding identities.
For example, we can replace names with unique identifiers.
| Original Data | Pseudonymized Data |
| --- | --- |
| 👤 Name: John Smith | 🆔 ID: ABC123 |
| 👤 Name: Jane Doe | 🆔 ID: XYZ456 |
| 👤 Name: Bob Johnson | 🆔 ID: PQR789 |
| 👤 Name: Mary Brown | 🆔 ID: DEF321 |
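A minimal Python sketch of this idea, using random hex tokens as aliases (the ABC123-style IDs above are illustrative). The name-to-alias mapping is what allows authorized re-linking, so it must be stored separately under strict access control:

```python
import secrets

def pseudonymize(names: list[str]) -> dict[str, str]:
    """Replace each name with a random alias, keeping the mapping
    so records can be re-linked for legitimate purposes."""
    return {name: secrets.token_hex(4).upper() for name in names}

lookup = pseudonymize(["John Smith", "Jane Doe", "Bob Johnson", "Mary Brown"])
released = list(lookup.values())  # only the aliases get shared
print(released)  # e.g. ['9F2A61C0', '4B7D08E3', ...]
```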
What is perturbation?
Perturbation involves adding random noise to the data attributes of individuals, making it difficult for intruders to understand the true values.
Adding Noise to Data Attributes:
In perturbation, we introduce random variations to individual data points, preserving the overall statistical properties while obscuring precise information.
Example - Healthcare Records:
Let's continue with a healthcare example. In a traditional healthcare dataset, medical conditions may be recorded as specific numerical codes, making records potentially identifiable. With perturbation, we add random noise to those codes.
Before Perturbation:

| SoldierID | Medical Condition |
| --- | --- |
| 👤 001 | 2 |
| 👤 002 | 1 |
| 👤 003 | 3 |

After Perturbation (Noisy Dataset):

| SoldierID | Medical Condition |
| --- | --- |
| 👤 001 | 3 |
| 👤 002 | 2 |
| 👤 003 | 4 |
By adding random noise to the medical conditions, it becomes challenging to infer the true values, providing an additional layer of privacy protection.
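A minimal Python sketch of this idea, assuming small uniform integer noise (the noise range is illustrative; real deployments calibrate it to the desired privacy/utility trade-off):

```python
import random

conditions = {"001": 2, "002": 1, "003": 3}

# Add small integer noise to each coded condition; the noise range
# [-1, 1] here is an illustrative choice.
noisy = {sid: value + random.choice([-1, 0, 1])
         for sid, value in conditions.items()}
print(noisy)  # e.g. {'001': 3, '002': 2, '003': 4}
```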
Differential Privacy adds carefully calibrated random noise to query results, preserving privacy while maintaining data utility.
For example, we can add noise to aggregated age statistics.
| Original Aggregated Age | Noisy Aggregated Age |
| --- | --- |
| 👥 Age Group: 20-30 | 👥 Age Group: 19-31 |
| 👥 Age Group: 40-50 | 👥 Age Group: 39-51 |
| 👥 Age Group: 10-20 | 👥 Age Group: 9-21 |
| 👥 Age Group: 60-70 | 👥 Age Group: 59-71 |
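The table above is a simplified illustration. More formally, the Laplace mechanism adds noise scaled to a query's sensitivity divided by the privacy budget ε. A minimal Python sketch, where the ε value is an illustrative choice:

```python
import math
import random

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Return the query answer plus Laplace(sensitivity / epsilon) noise."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5                      # uniform on [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_answer + noise

ages = [28, 42, 19, 68]
# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1. epsilon = 0.5 is an illustrative budget.
noisy_count = laplace_mechanism(len(ages), sensitivity=1.0, epsilon=0.5)
print(round(noisy_count, 1))  # varies per run, e.g. 4.7
```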
Synthetic Data involves generating artificial data that resembles real data, protecting privacy during research and testing.
For example, we can create synthetic customer profiles.
| Real Data | Synthetic Data |
| --- | --- |
| 👩 Name: Emily, Age: 32 | 👤 Name: Sarah, Age: 35 |
| 👨‍💼 Occupation: Engineer | 👨‍💼 Occupation: Consultant |
| 📞 Phone: 555-1234 | 📞 Phone: 555-5678 |
| 📧 Email: emily@example.com | 📧 Email: sarah@example.com |
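A minimal Python sketch, assuming hypothetical value pools (the names and occupations below are made up); a production system would instead fit a statistical or machine-learning model to the real data and sample new records from it:

```python
import random

# Hypothetical value pools; a real generator would learn these
# distributions from the original data.
NAMES = ["Sarah", "Alex", "Priya", "Tom"]
OCCUPATIONS = ["Consultant", "Teacher", "Analyst", "Nurse"]

def synthetic_profile() -> dict:
    """Sample one artificial customer profile."""
    return {
        "name": random.choice(NAMES),
        "age": random.randint(25, 45),
        "occupation": random.choice(OCCUPATIONS),
        "phone": f"555-{random.randint(1000, 9999)}",
        "email": f"user{random.randint(100, 999)}@example.com",
    }

print(synthetic_profile())
# e.g. {'name': 'Sarah', 'age': 35, 'occupation': 'Consultant', ...}
```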
Grouping is a strategy that involves aggregating related data attributes of individuals together to obscure individual information.
In grouping, we combine data attributes of individuals who share common characteristics, creating collective profiles that hide specific details while preserving trends.
Let's consider an example in an educational dataset. In a traditional dataset, educational records may contain student names, test scores, and subjects studied. However, with grouping, we aggregate data attributes based on subjects studied.
| Student Name | Math Score | Science Score | English Score |
| --- | --- | --- | --- |
| 👨‍💼 John | 85 | 90 | 78 |
| 👩 Jane | 95 | 92 | 88 |
| 👨‍💼 Mike | 78 | 85 | 80 |
| Subject | Average Score |
| --- | --- |
| Math | 👥 86 |
| Science | 👥 89 |
| English | 👥 82 |
By grouping data based on subjects studied, individual test scores are hidden, and only collective average scores are presented, safeguarding student privacy.
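A minimal Python sketch of this aggregation; it reproduces the average scores shown above:

```python
from statistics import mean

records = [
    {"name": "John", "Math": 85, "Science": 90, "English": 78},
    {"name": "Jane", "Math": 95, "Science": 92, "English": 88},
    {"name": "Mike", "Math": 78, "Science": 85, "English": 80},
]

# Only per-subject averages are released; names and individual
# scores never leave this step.
averages = {subject: round(mean(r[subject] for r in records))
            for subject in ("Math", "Science", "English")}
print(averages)  # {'Math': 86, 'Science': 89, 'English': 82}
```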
What is mixing? 🎰
Mixing is a technique that shuffles or rearranges the data attributes of individuals, making it hard to find meaningful patterns.
In mixing, we rearrange the data attributes within a dataset so that sensitive information no longer corresponds to the original individual it belonged to.
This obscures the relationships between attributes, enhancing data privacy.
Let's consider a financial example. In a traditional dataset, records may contain attributes like income, age, and account balance, which could potentially identify individuals.
However, with mixing, we shuffle the order of these attributes within the dataset.
Before Mixing:

| SoldierID | Income | Age | Account Balance |
| --- | --- | --- | --- |
| 👤 001 | $50,000 | 30 | $10,000 |
| 👤 002 | $40,000 | 25 | $8,000 |
| 👤 003 | $60,000 | 28 | $12,000 |
After Mixing (Shuffled Dataset):

| SoldierID | Age | Account Balance | Income |
| --- | --- | --- | --- |
| 👤 001 | 25 | $10,000 | $40,000 |
| 👤 002 | 28 | $8,000 | $50,000 |
| 👤 003 | 30 | $12,000 | $60,000 |
By shuffling the attribute values, it becomes challenging to link specific attributes back to a particular individual, preserving their privacy.
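A minimal Python sketch of this idea, assuming each attribute column is shuffled independently (as in the tables above):

```python
import random

ids = ["001", "002", "003"]
incomes = [50_000, 40_000, 60_000]
ages = [30, 25, 28]
balances = [10_000, 8_000, 12_000]

# Shuffle each attribute column independently, so the values in any
# given row no longer all belong to the same individual.
for column in (incomes, ages, balances):
    random.shuffle(column)

for row in zip(ids, ages, balances, incomes):
    print(row)  # e.g. ('001', 25, 10000, 40000)
```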