ACT 300: Functions and Techniques for Handling Sensitive Data with Data Distiller
Powering Enterprise Use Cases While Keeping Sensitive Data in Safe Mode
Download the file:

Ingest the data as a `healthcare_customers` dataset using this:
In this tutorial, we'll demonstrate how to handle sensitive healthcare data by applying various anonymization and pseudonymization techniques, ensuring compliance with data privacy regulations like GDPR. We'll use SQL-based techniques to randomize, mask, and inject noise into the data, using the following strategies:
The dataset contains sensitive columns:

- `customer_id`: Unique identifier for each customer.
- `customer_name`: Name of the customer.
- `phone_number`: Customer's phone number.
- `email`: Customer's email address.
- `address`: Physical address.
- `age`: Customer's age.
- `total_spent`: Total healthcare spending.
- `subscription_status`: Whether the customer has a subscription plan.
- `gender`: Gender of the customer.
- `country`: Country of the customer.
- `diagnosis_code`: Medical condition code (e.g., ICD-10).
- `prescription`: Prescription given to the customer.
Randomization replaces sensitive data with random values. In this case, customer names will be randomized.
This query replaces customer names with random identifiers like `User1234`, ensuring names are obfuscated.
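The randomization query isn't shown above; a minimal sketch in Spark-style SQL (the dialect Data Distiller supports), using the `healthcare_customers` columns from this tutorial, might look like the following. The `RAND`/`FLOOR`/`CONCAT` combination is an assumption about how the random suffix is built:

```sql
-- Replace each customer_name with a random identifier like User1234.
SELECT
  customer_id,
  CONCAT('User', CAST(FLOOR(RAND() * 10000) AS STRING)) AS customer_name
FROM healthcare_customers;
```

Note that `RAND()` is evaluated per row, so the same customer can receive a different identifier on each run; if you need a stable mapping, pseudonymization (below) is the better fit.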
Partial masking hides sensitive information while retaining some of the original content, making it difficult to re-identify individuals. This technique was used in the tutorial here.
This query partially masks the phone number by displaying only the last 4 digits and obscures part of the email address while keeping the domain intact.
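A sketch of such a partial-masking query, assuming Spark-style string functions (`SUBSTR`, `INSTR`, `CONCAT`); the masking characters and digit counts are illustrative:

```sql
-- Keep only the last 4 digits of the phone number and the email domain.
SELECT
  customer_id,
  CONCAT('***-***-', SUBSTR(phone_number, -4))      AS phone_number,
  CONCAT('*****', SUBSTR(email, INSTR(email, '@'))) AS email
FROM healthcare_customers;
```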
Pseudonymization is a data protection technique that replaces identifiable information in a dataset with artificial identifiers or "pseudonyms," ensuring that the data can no longer be attributed to a specific individual without additional information. The pseudonymized data can still be analyzed and processed, but the link between the data and the original identity is severed unless the key to reverse the process (often called a "re-identification key") is available.
Pseudonymization transforms sensitive data into a hashed format using a cryptographic hash function, making it irreversible.
When you apply `SHA2(email, 256)`, the email address is transformed into a unique, fixed-length string of characters using the SHA-256 cryptographic hash function. This process is one-way, meaning once the email is hashed, it's virtually impossible to reverse the process and recover the original email. The output will always be 64 characters long, no matter the size of the input. This is like turning the email into a "digital fingerprint": each email will have a distinct hash, but similar-looking emails (like `customer1@domain.com` and `customer2@domain.com`) will have completely different outputs. Hashing is used to protect sensitive information because it hides the original data while still allowing comparisons between hashed values.
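The `SHA2` function is applied directly in the query; a minimal sketch against the `healthcare_customers` dataset:

```sql
-- One-way pseudonymization: hash the email with SHA-256.
-- Identical emails always produce the same 64-character hash,
-- so joins and comparisons on the pseudonym still work.
SELECT
  customer_id,
  SHA2(email, 256) AS email_pseudonym
FROM healthcare_customers;
```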
SHA-256 would still work efficiently even if every person on Earth (say 10 billion people) had 1,000 email addresses, resulting in 10 trillion emails. This is because SHA-256 generates a fixed-length 64-character (256-bit) hash for any input, regardless of how many emails exist. The key strength of SHA-256 is that it provides an enormous number of possible hash values (2^256, or approximately 10^77), far more than the total number of possible email addresses. This vast range minimizes the chance of collisions (two emails producing the same hash), making it highly reliable even for massive datasets like this.
In K-anonymity, a privacy protection technique, the goal is to ensure that no individual is isolated in a group or bucket. Each record in the dataset must be indistinguishable from at least K-1 other individuals, based on a combination of generalized attributes (such as age, region, etc.). 2-anonymity means that the dataset we create should contain at least 2 individuals who are identical with respect to the attributes being considered. 3-anonymity means each bucket should contain at least 3 such individuals.
Let us explore a few dimensions and see if our strategy for grouping satisfies 2-anonymity:
The query generalizes sensitive healthcare data to ensure privacy by grouping records based on broad categories. First, it generalizes `age` into 20-year ranges (e.g., `0-19`, `20-39`), and `country` is grouped into broad regions (e.g., `Asia`, `Europe`). The `prescription` field is also generalized into broader categories like Painkillers, Antibiotics, and Blood Pressure Meds, with any unlisted medications categorized as Other Medications. The dataset is then grouped by these generalized dimensions, including `diagnosis_code`. Our hope is that the `HAVING COUNT(*) = 1` clause will return no results, since no bucket formed by these grouping dimensions should contain just a single individual.
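A sketch of this 2-anonymity check; the specific age boundaries, country-to-region mappings, and drug names below are illustrative assumptions, not values from the dataset:

```sql
-- Any row returned is a bucket of size 1, i.e. a 2-anonymity violation.
SELECT
  CASE WHEN age < 20 THEN '0-19'
       WHEN age < 40 THEN '20-39'
       WHEN age < 60 THEN '40-59'
       ELSE '60+'
  END AS age_range,
  CASE WHEN country IN ('India', 'China', 'Japan')   THEN 'Asia'
       WHEN country IN ('France', 'Germany', 'UK')   THEN 'Europe'
       ELSE 'Other Region'
  END AS region,
  CASE WHEN prescription IN ('Ibuprofen', 'Aspirin')      THEN 'Painkillers'
       WHEN prescription IN ('Amoxicillin', 'Penicillin') THEN 'Antibiotics'
       WHEN prescription IN ('Lisinopril', 'Amlodipine')  THEN 'Blood Pressure Meds'
       ELSE 'Other Medications'
  END AS prescription_group,
  diagnosis_code,
  COUNT(*) AS bucket_size
FROM healthcare_customers
GROUP BY 1, 2, 3, 4
HAVING COUNT(*) = 1;
```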
The execution will show the following:
The current generalization of the `prescription` dimension hasn't provided sufficient anonymity. Since we have already bucketed all the other dimensions, `diagnosis_code` remains the only ungrouped attribute, and we may decide not to group it. If so, we may need to further generalize the existing dimensions (such as `age` and `country`) to capture larger groups. This highlights an important tradeoff: you'll need to determine which dimension is least critical to the use case and can be generalized further, allowing it to include more individuals while still maintaining a balance between utility and privacy.
Let us try this:
This returns:
Exercise: What other techniques from the previous sections could you apply to solve this? Remember, you can often achieve better results by combining multiple techniques.
Noise injection adds random noise to numeric data, such as `age` and `total_spent`, to obscure exact values while retaining overall trends. Please explore the techniques for differential privacy in the tutorial here.
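A minimal sketch of noise injection on the two numeric columns; the noise ranges (roughly ±2 years on `age`, ±50 on `total_spent`) are arbitrary choices for illustration:

```sql
-- Perturb numeric values with small uniform random noise.
SELECT
  customer_id,
  age + CAST(FLOOR(RAND() * 5) - 2 AS INT)     AS age,
  ROUND(total_spent + (RAND() - 0.5) * 100, 2) AS total_spent
FROM healthcare_customers;
```

Uniform noise like this obscures exact values but, unlike calibrated differential-privacy mechanisms, offers no formal privacy guarantee.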
Substitution replaces specific sensitive values with predefined values consistently across the entire dataset.
The result will be the following:
This query consistently replaces each unique `diagnosis_code` in the `healthcare_customers` table with a randomly generated three-letter code. It uses a Common Table Expression (CTE), `RandomizedDiagnosis`, to generate a unique mapping for each distinct diagnosis code by creating a random three-letter string (using ASCII values for letters `A-Z`). The `DISTINCT` clause ensures that each diagnosis code gets only one random substitute. In the main query, the original table is joined with the CTE on `diagnosis_code`, ensuring that every instance of the same diagnosis code across the dataset is consistently replaced with the same random string. This approach provides a secure and consistent substitution of sensitive diagnosis codes, preserving privacy while maintaining consistency for analysis.
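The query described above can be sketched as follows; note that the sketch applies `DISTINCT` in a subquery *before* generating the random string, so each distinct code gets exactly one substitute (the `CHR`/ASCII construction and aliases are assumptions):

```sql
WITH RandomizedDiagnosis AS (
  -- One random three-letter code per distinct diagnosis_code.
  -- ASCII 65-90 covers the uppercase letters A-Z.
  SELECT
    diagnosis_code,
    CONCAT(CHR(CAST(FLOOR(RAND() * 26) + 65 AS INT)),
           CHR(CAST(FLOOR(RAND() * 26) + 65 AS INT)),
           CHR(CAST(FLOOR(RAND() * 26) + 65 AS INT))) AS random_code
  FROM (SELECT DISTINCT diagnosis_code FROM healthcare_customers)
)
SELECT
  h.customer_id,
  r.random_code AS diagnosis_code
FROM healthcare_customers h
JOIN RandomizedDiagnosis r
  ON h.diagnosis_code = r.diagnosis_code;
```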
In some cases, you may want to fully mask sensitive data fields to ensure privacy.
This query masks the `address` and `prescription` fields entirely by replacing each character with an asterisk (`*`), making the data unreadable.
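A sketch of full masking with standard string functions; `REPEAT`/`LENGTH` preserve each field's original length, which is itself a (mild) information leak you may or may not care about:

```sql
-- Replace every character with '*', preserving field length.
SELECT
  customer_id,
  REPEAT('*', LENGTH(address))      AS address,
  REPEAT('*', LENGTH(prescription)) AS prescription
FROM healthcare_customers;
```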
Shuffling data between records makes it harder to link records to specific individuals while maintaining the overall data distribution.
This query shuffles the `total_spent` values in the `healthcare_customers` dataset while clearly tracking how the values have been shuffled. It uses a Common Table Expression (CTE) to assign two row numbers: one based on the original order of the customers (`original_row`) and another based on a random order (`shuffled_row`).

By joining the CTE on these row numbers, the query reassigns the `total_spent` values according to the shuffled row while preserving the original values.
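A sketch of this shuffling query, assuming `customer_id` defines the original order; the column distribution is unchanged because every value is kept, just reassigned to a random row:

```sql
WITH numbered AS (
  SELECT
    customer_id,
    total_spent,
    ROW_NUMBER() OVER (ORDER BY customer_id) AS original_row,
    ROW_NUMBER() OVER (ORDER BY RAND())      AS shuffled_row
  FROM healthcare_customers
)
SELECT
  o.customer_id,
  o.total_spent AS original_total_spent,
  s.total_spent AS total_spent  -- value taken from a randomly chosen row
FROM numbered o
JOIN numbered s
  ON o.original_row = s.shuffled_row;
```

In practice you would drop `original_total_spent` from the published output; it is kept here only to show how the values moved.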
You can use encryption to obfuscate sensitive data (e.g., customer names, email addresses) to make it unreadable to unauthorized users. This helps protect data from being exposed in logs, intermediate processing, or unauthorized queries.
The `aes_encrypt` and `aes_decrypt` functions in Data Distiller are used for encryption and decryption of data using the Advanced Encryption Standard (AES). These functions can certainly be useful in obfuscation use cases, but they serve a broader purpose beyond just obfuscation. They are particularly useful for ensuring compliance with data security regulations such as GDPR or HIPAA, where data (e.g., PII, financial data, or medical records) needs to be encrypted when stored at rest or in transit.
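A minimal sketch of encrypting a column; the key string below is a placeholder (AES keys must be 16, 24, or 32 bytes and should come from a secure secret store, never be hard-coded), and `BASE64` is used only because `aes_encrypt` returns binary:

```sql
-- Encrypt customer_name; BASE64 makes the binary ciphertext printable.
SELECT
  customer_id,
  BASE64(aes_encrypt(customer_name, '0123456789abcdef')) AS customer_name_enc
FROM healthcare_customers;

-- To recover the plaintext with the same key:
--   CAST(aes_decrypt(UNBASE64(customer_name_enc), '0123456789abcdef') AS STRING)
```

Unlike hashing, this is reversible by anyone holding the key, so key management is the whole security story here.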
You can read about this in the tutorial here.