ACT 300: Functions and Techniques for Handling Sensitive Data with Data Distiller
Powering Enterprise Use Cases While Keeping Sensitive Data in Safe Mode
Prerequisites
ACT 100: Dataset Activation with Data DistillerDownload the file:
Ingest the data as healthcare_customers
dataset using this:
Dataset Overview:
In this tutorial, we'll demonstrate how to handle sensitive healthcare data by applying various anonymization and pseudonymization techniques, ensuring compliance with data privacy regulations like GDPR. We'll use SQL-based techniques to randomize, mask, and inject noise into the data, using the following strategies:
The dataset contains sensitive columns:
customer_id
: Unique identifier for each customer.customer_name
: Name of the customer.phone_number
: Customer’s phone number.email
: Customer’s email address.address
: Physical address.age
: Customer’s age.total_spent
: Total healthcare spending.subscription_status
: Whether the customer has a subscription plan.gender
: Gender of the customer.country
: Country of the customer.diagnosis_code
: Medical condition code (e.g., ICD-10).prescription
: Prescription given to the customer.
Randomization of Names
Randomization replaces sensitive data with random values. In this case, customer names will be randomized.
SELECT
customer_id,
CONCAT('User', CAST(FLOOR(RAND() * 10000) AS STRING)) AS randomized_name
FROM
healthcare_customers;
This query replaces customer names with random identifiers like User1234
, ensuring names are obfuscated.

Partial Masking of Phone Numbers and Emails
Partial masking hides sensitive information while retaining some of the original content, making it difficult to re-identify individuals. This technique was used in the tutorial here.
SELECT
customer_id,
CONCAT('XXX-XXX-', SUBSTRING(phone_number, -4)) AS masked_phone,
CONCAT(SUBSTRING(email, 1, 4), '****@domain.com') AS masked_email
FROM
healthcare_customers;
This query partially masks the phone number by displaying only the last 4 digits and obscures part of the email address while keeping the domain intact.

Pseudonymization (Hashing) of Email Addresses
Pseudonymization is a data protection technique that replaces identifiable information in a dataset with artificial identifiers or "pseudonyms," ensuring that the data can no longer be attributed to a specific individual without additional information. The pseudonymized data can still be analyzed and processed, but the link between the data and the original identity is severed unless the key to reverse the process (often called a "re-identification key") is available.
Pseudonymization transforms sensitive data into a hashed format using a cryptographic hash function, making it irreversible.
SELECT
customer_id,
SHA2(email, 256) AS hashed_email
FROM
healthcare_customers;
When you apply SHA2(email, 256), the email address is transformed into a unique, fixed-length string of characters using the SHA-256 cryptographic hash function. This process is one-way, meaning once the email is hashed, it’s virtually impossible to reverse the process and recover the original email. The output will always be 64 characters long, no matter the size of the input. This is like turning the email into a "digital fingerprint"—each email will have a distinct hash, but similar-looking emails (like [email protected]
and [email protected]
) will have completely different outputs. Hashing is used to protect sensitive information because it hides the original data while still allowing comparisons between hashed values.
SHA-256 would still work efficiently even if every person on Earth (say 10 billion people) had 1,000 email addresses, resulting in 10 trillion emails. This is because SHA-256 generates a fixed-length 64-character (256-bit) hash for any input, regardless of how many emails exist. The key strength of SHA-256 is that it provides an enormous number of possible hash values (about 22562^{256}2256 or approximately 107710^{77}1077)—far more than the total number of possible email addresses. This vast range minimizes the chance of collisions (two emails producing the same hash), making it highly reliable for even massive datasets like this,
K-Anonymity
In K-anonymity, a privacy protection technique, the goal is to ensure that no individual is isolated in a group or bucket. Each record in the dataset must be indistinguishable from at least K-1 other individuals, based on a combination of generalized attributes (such as age, region, etc.). 2-anonymity means that the dataset we create should have at least 2 individuals who are identical are identical with respect to the attributes being considered. 3-anonymity means that the dataset we create should have at least 3 individuals who are identical are identical with respect to the attributes being considered i.e. each of the buckets we have should contain at least 3 individuals.
Let us explore a few dimensions and see if our strategy for grouping satisfies 2-anonymity:
WITH GeneralizedHealthcare AS (
SELECT
customer_id,
-- Generalize age into broader age ranges (20-year groups)
CASE
WHEN age BETWEEN 0 AND 19 THEN '0-19'
WHEN age BETWEEN 20 AND 39 THEN '20-39'
WHEN age BETWEEN 40 AND 59 THEN '40-59'
WHEN age BETWEEN 60 AND 79 THEN '60-79'
ELSE '80+'
END AS generalized_age,
-- Generalize country into fewer, broader regions
CASE
WHEN country IN ('Japan', 'China', 'Korea', 'India') THEN 'Asia'
WHEN country IN ('Australia', 'New Zealand') THEN 'Oceania'
WHEN country IN ('France', 'Germany', 'Italy', 'UK') THEN 'Europe'
ELSE 'Other'
END AS region,
diagnosis_code,
-- Generalize prescription into broader categories (example categories)
CASE
WHEN prescription IN ('Aspirin', 'Ibuprofen') THEN 'Painkillers'
WHEN prescription IN ('Amoxicillin', 'Azithromycin') THEN 'Antibiotics'
WHEN prescription IN ('Lisinopril', 'Amlodipine') THEN 'Blood Pressure Meds'
ELSE 'Other Medications'
END AS generalized_prescription
FROM healthcare_customers
)
SELECT
generalized_age,
region,
diagnosis_code,
generalized_prescription
FROM
GeneralizedHealthcare
GROUP BY
generalized_age, region, generalized_prescription, diagnosis_code
HAVING COUNT(*) == 1;;
The query generalizes sensitive healthcare data to ensure privacy by grouping records based on broad categories. First, it generalizes the age
into 20-year ranges (e.g., 0-19
, 20-39
), and the country
is grouped into broad regions (e.g., Asia
, Europe
). The prescription
field is also generalized into broader categories like Painkillers, Antibiotics, and Blood Pressure Meds, with any unlisted medications categorized as Other Medications. The dataset is then grouped by these generalized dimensions, including diagnosis_code
. Our hope is that the HAVING COUNT(*) == 1
clause will return no results as no bucket of these grouping dimensions should have an individual in it.
The execution will show the following:

The current generalization of the prescription
dimension hasn’t provided sufficient anonymity..Since we have already bucketed all the other dimensions, the diagnosis_code
remains as the only ungrouped attribute and we may decide not to group it. If so, we may need to further generalize the existing dimensions (such as age, country
) to better capture larger groups. This highlights an important tradeoff: you'll need to determine which dimension is least critical to the use case and can be generalized further, allowing it to include more individuals while still maintaining a balance between utility and privacy.
Let us try this:
WITH GeneralizedHealthcare AS (
SELECT
customer_id,
-- Generalize age into broader age ranges (three age buckets)
CASE
WHEN age BETWEEN 0 AND 29 THEN '0-29'
WHEN age BETWEEN 30 AND 59 THEN '30-59'
ELSE '60+'
END AS generalized_age,
-- Generalize country into larger, broader regions
CASE
WHEN country IN ('Japan', 'China', 'Korea', 'India', 'Australia', 'New Zealand') THEN 'Asia-Pacific'
WHEN country IN ('France', 'Germany', 'Italy', 'UK', 'Spain') THEN 'Europe'
WHEN country IN ('USA', 'Canada', 'Brazil') THEN 'Americas'
ELSE 'Other Regions'
END AS region,
diagnosis_code,
-- Generalize prescription into broader categories (example categories)
CASE
WHEN prescription IN ('Aspirin', 'Ibuprofen') THEN 'Painkillers'
WHEN prescription IN ('Amoxicillin', 'Azithromycin') THEN 'Antibiotics'
WHEN prescription IN ('Lisinopril', 'Amlodipine') THEN 'Blood Pressure Meds'
ELSE 'Other Medications'
END AS generalized_prescription
FROM healthcare_customers
)
SELECT
generalized_age,
region,
diagnosis_code,
generalized_prescription
FROM
GeneralizedHealthcare
GROUP BY
generalized_age, region, diagnosis_code, generalized_prescription
HAVING COUNT(*) == 1;
This returns:

Exercise: What other techniques from the previous sections could you apply to solve this? Remember, you can often achieve better results by combining multiple techniques.
Noise Injection for age
and total_spent
age
and total_spent
Noise injection adds random noise to numeric data to obscure exact values while retaining overall trends. Please explore the techniques for differential privacy in the tutorial here.
Substitution of Sensitive Data
Substitution replaces specific sensitive values with predefined values consistently across the entire dataset.
WITH RandomizedDiagnosis AS (
SELECT
diagnosis_code,
CONCAT(
CHAR(FLOOR(RAND() * 26) + 65), -- First random letter (A-Z)
CHAR(FLOOR(RAND() * 26) + 65), -- Second random letter (A-Z)
CHAR(FLOOR(RAND() * 26) + 65) -- Third random letter (A-Z)
) AS random_code
FROM
(SELECT DISTINCT diagnosis_code FROM healthcare_customers) AS distinct_codes
)
SELECT
hc.customer_id,
rd.random_code AS substituted_diagnosis_code
FROM
healthcare_customers hc
JOIN
RandomizedDiagnosis rd
ON
hc.diagnosis_code = rd.diagnosis_code;
The result will be the following:

This query consistently replaces each unique diagnosis_code
in the healthcare_customers
table with a randomly generated three-letter code. It uses a Common Table Expression (CTE), RandomizedDiagnosis
, to generate a unique mapping for each distinct diagnosis code by creating a random three-letter string (using ASCII values for letters A-Z
). The DISTINCT
clause ensures that each diagnosis code only gets one random substitute. In the main query, the original table is joined with the CTE on the diagnosis_code
, ensuring that every instance of the same diagnosis code across the dataset is consistently replaced with the same random string. This approach provides a secure and consistent substitution of sensitive diagnosis codes, allowing for privacy while maintaining consistency for analysis.
Full Masking of Address and Prescription Data
In some cases, you may want to fully mask sensitive data fields to ensure privacy.
SELECT
customer_id,
REPEAT('*', LENGTH(address)) AS masked_address,
REPEAT('*', LENGTH(prescription)) AS masked_prescription
FROM
healthcare_customers;
This query masks the address
and prescription
fields entirely by replacing each character with an asterisk (*), making the data unreadable.

Shuffling
Shuffling data between records makes it harder to link records to specific individuals while maintaining the overall data distribution.
WITH ShuffledData AS (
SELECT
customer_id,
total_spent,
ROW_NUMBER() OVER (ORDER BY customer_id) AS original_row, --orig row order
ROW_NUMBER() OVER (ORDER BY RAND()) AS shuffled_row -- shuffled row order
FROM
healthcare_customers
)
SELECT
original.customer_id,
original.total_spent AS original_total_spent,
shuffled.total_spent AS shuffled_total_spent
FROM
ShuffledData original
JOIN
ShuffledData shuffled
ON
original.shuffled_row = shuffled.original_row; -- Match original row to shuffled row
This query performs shuffling of the total_spent
values in the healthcare_customers
dataset while maintaining a clear tracking of how the values have been shuffled. It uses a Common Table Expression (CTE) to assign two row numbers: one based on the original order of the customers (original_row
) and another based on a random order (shuffled_row
).
By joining the CTE on these row numbers, the query reassigns the total_spent
values according to the shuffled row while preserving the original values.
Data Obfuscation
You can use encryption to obfuscate sensitive data (e.g., customer names, email addresses) to make it unreadable to unauthorized users. This helps protect data from being exposed in logs, intermediate processing, or unauthorized queries.
The aes_encrypt
and aes_descrypt
functions in Data Distiller are used for encryption and decryption of data using the Advanced Encryption Standard (AES). These functions can certainly be useful in obfuscation use cases, but they serve a broader purpose beyond just obfuscation. These functions are particularly useful for ensuring compliance with data security regulations such as GDPR or HIPAA, where data (e.g., PII, financial data, or medical records) needs to be encrypted when stored at rest or in transit.
You can read about this in the tutorial here.
Last updated