ACT 300: Functions and Techniques for Handling Sensitive Data with Data Distiller

Powering Enterprise Use Cases While Keeping Sensitive Data in Safe Mode

Prerequisites

ACT 100: Dataset Activation with Data Distiller

Download the file:

1MB

healthcare_customers.csv

Open

Ingest the data as healthcare_customers dataset using this:

PREP 500: Ingesting CSV Data into Adobe Experience Platform

Dataset Overview:

In this tutorial, we'll demonstrate how to handle sensitive healthcare data by applying various anonymization and pseudonymization techniques, ensuring compliance with data privacy regulations like GDPR. We'll use SQL-based techniques to randomize, mask, and inject noise into the data, using the following strategies:

The dataset contains sensitive columns:

customer_id: Unique identifier for each customer.
customer_name: Name of the customer.
phone_number: Customer’s phone number.
email: Customer’s email address.
address: Physical address.
age: Customer’s age.
total_spent: Total healthcare spending.
subscription_status: Whether the customer has a subscription plan.
gender: Gender of the customer.
country: Country of the customer.
diagnosis_code: Medical condition code (e.g., ICD-10).
prescription: Prescription given to the customer.

Randomization of Names

Randomization replaces sensitive data with random values. In this case, customer names will be randomized.

SELECT
    customer_id,
    CONCAT('User', CAST(FLOOR(RAND() * 10000) AS STRING)) AS randomized_name
FROM
    healthcare_customers;

This query replaces customer names with random identifiers like User1234, ensuring names are obfuscated.

Partial Masking of Phone Numbers and Emails

Partial masking hides sensitive information while retaining some of the original content, making it difficult to re-identify individuals. This technique was used in the tutorial here.

SELECT
    customer_id,
    CONCAT('XXX-XXX-', SUBSTRING(phone_number, -4)) AS masked_phone,
    CONCAT(SUBSTRING(email, 1, 4), '****@domain.com') AS masked_email
FROM
    healthcare_customers;

This query partially masks the phone number by displaying only the last 4 digits and obscures part of the email address while keeping the domain intact.

Pseudonymization (Hashing) of Email Addresses

Pseudonymization is a data protection technique that replaces identifiable information in a dataset with artificial identifiers or "pseudonyms," ensuring that the data can no longer be attributed to a specific individual without additional information. The pseudonymized data can still be analyzed and processed, but the link between the data and the original identity is severed unless the key to reverse the process (often called a "re-identification key") is available.

Pseudonymization transforms sensitive data into a hashed format using a cryptographic hash function, making it irreversible.

SELECT   
    customer_id,
    SHA2(email, 256) AS hashed_email
FROM
    healthcare_customers;

When you apply SHA2(email, 256), the email address is transformed into a unique, fixed-length string of characters using the SHA-256 cryptographic hash function. This process is one-way, meaning once the email is hashed, it’s virtually impossible to reverse the process and recover the original email. The output will always be 64 characters long, no matter the size of the input. This is like turning the email into a "digital fingerprint"—each email will have a distinct hash, but similar-looking emails (like [email protected] and [email protected]) will have completely different outputs. Hashing is used to protect sensitive information because it hides the original data while still allowing comparisons between hashed values.

SHA-256 would still work efficiently even if every person on Earth (say 10 billion people) had 1,000 email addresses, resulting in 10 trillion emails. This is because SHA-256 generates a fixed-length 64-character (256-bit) hash for any input, regardless of how many emails exist. The key strength of SHA-256 is that it provides an enormous number of possible hash values (about 22562^{256}2256 or approximately 107710^{77}1077)—far more than the total number of possible email addresses. This vast range minimizes the chance of collisions (two emails producing the same hash), making it highly reliable for even massive datasets like this,

Data Distiller also supports the MD5 function, which generates a 128-bit hash. For example:

SELECT customer_id, email, MD5(email) AS hashed_email FROM healthcare_customers; This function is useful for use cases such as data integrity checks, quickly comparing large datasets, anonymizing data for non-security purposes, and creating partitioning or bucketing keys for efficient data distribution.

MD5 offers faster performance compared to stronger hashing algorithms like SHA-256, making it suitable for non-sensitive tasks where speed is a priority. However, MD5 should not be used for cryptographic purposes or to store sensitive data, as it is vulnerable to hash collisions and security attacks. For security-related applications, stronger algorithms such as SHA-256 are recommended.

K-Anonymity

In K-anonymity, a privacy protection technique, the goal is to ensure that no individual is isolated in a group or bucket. Each record in the dataset must be indistinguishable from at least K-1 other individuals, based on a combination of generalized attributes (such as age, region, etc.). 2-anonymity means that the dataset we create should have at least 2 individuals who are identical are identical with respect to the attributes being considered. 3-anonymity means that the dataset we create should have at least 3 individuals who are identical are identical with respect to the attributes being considered i.e. each of the buckets we have should contain at least 3 individuals.

The grouping dimensions require careful data exploration and experimentation to identify the right set of attributes that meet such strict requirements. Once the minimum conditions are met, any new data added will only reinforce the criteria. As a best practice, you should always double-check the conditions with every dataset activation using this technique to ensure compliance.

Let us explore a few dimensions and see if our strategy for grouping satisfies 2-anonymity:

WITH GeneralizedHealthcare AS (
    SELECT 
        customer_id,
        -- Generalize age into broader age ranges (20-year groups)
        CASE 
            WHEN age BETWEEN 0 AND 19 THEN '0-19'
            WHEN age BETWEEN 20 AND 39 THEN '20-39'
            WHEN age BETWEEN 40 AND 59 THEN '40-59'
            WHEN age BETWEEN 60 AND 79 THEN '60-79'
            ELSE '80+' 
        END AS generalized_age,
        
        -- Generalize country into fewer, broader regions
        CASE 
            WHEN country IN ('Japan', 'China', 'Korea', 'India') THEN 'Asia'
            WHEN country IN ('Australia', 'New Zealand') THEN 'Oceania'
            WHEN country IN ('France', 'Germany', 'Italy', 'UK') THEN 'Europe'
            ELSE 'Other' 
        END AS region,
        
        diagnosis_code,
        
        -- Generalize prescription into broader categories (example categories)
        CASE 
            WHEN prescription IN ('Aspirin', 'Ibuprofen') THEN 'Painkillers'
            WHEN prescription IN ('Amoxicillin', 'Azithromycin') THEN 'Antibiotics'
            WHEN prescription IN ('Lisinopril', 'Amlodipine') THEN 'Blood Pressure Meds'
            ELSE 'Other Medications' 
        END AS generalized_prescription
    FROM healthcare_customers
)
SELECT 
    generalized_age,
    region,
    diagnosis_code,
    generalized_prescription
FROM 
    GeneralizedHealthcare
GROUP BY 
    generalized_age, region, generalized_prescription, diagnosis_code
HAVING COUNT(*) == 1;;

The query generalizes sensitive healthcare data to ensure privacy by grouping records based on broad categories. First, it generalizes the age into 20-year ranges (e.g., 0-19, 20-39), and the country is grouped into broad regions (e.g., Asia, Europe). The prescription field is also generalized into broader categories like Painkillers, Antibiotics, and Blood Pressure Meds, with any unlisted medications categorized as Other Medications. The dataset is then grouped by these generalized dimensions, including diagnosis_code. Our hope is that the HAVING COUNT(*) == 1 clause will return no results as no bucket of these grouping dimensions should have an individual in it.

The execution will show the following:

The current generalization of the prescription dimension hasn’t provided sufficient anonymity..Since we have already bucketed all the other dimensions, the diagnosis_code remains as the only ungrouped attribute and we may decide not to group it. If so, we may need to further generalize the existing dimensions (such as age, country) to better capture larger groups. This highlights an important tradeoff: you'll need to determine which dimension is least critical to the use case and can be generalized further, allowing it to include more individuals while still maintaining a balance between utility and privacy.

Let us try this:

WITH GeneralizedHealthcare AS (
    SELECT 
        customer_id,
        -- Generalize age into broader age ranges (three age buckets)
        CASE 
            WHEN age BETWEEN 0 AND 29 THEN '0-29'
            WHEN age BETWEEN 30 AND 59 THEN '30-59'
            ELSE '60+' 
        END AS generalized_age,
        
        -- Generalize country into larger, broader regions
        CASE 
            WHEN country IN ('Japan', 'China', 'Korea', 'India', 'Australia', 'New Zealand') THEN 'Asia-Pacific'
            WHEN country IN ('France', 'Germany', 'Italy', 'UK', 'Spain') THEN 'Europe'
            WHEN country IN ('USA', 'Canada', 'Brazil') THEN 'Americas'
            ELSE 'Other Regions' 
        END AS region,
        
        diagnosis_code,
        
        -- Generalize prescription into broader categories (example categories)
        CASE 
            WHEN prescription IN ('Aspirin', 'Ibuprofen') THEN 'Painkillers'
            WHEN prescription IN ('Amoxicillin', 'Azithromycin') THEN 'Antibiotics'
            WHEN prescription IN ('Lisinopril', 'Amlodipine') THEN 'Blood Pressure Meds'
            ELSE 'Other Medications' 
        END AS generalized_prescription
    FROM healthcare_customers
)
SELECT 
    generalized_age,
    region,
    diagnosis_code,
    generalized_prescription
FROM 
    GeneralizedHealthcare
GROUP BY 
    generalized_age, region, diagnosis_code, generalized_prescription
HAVING COUNT(*) == 1;

This returns:

Exercise: What other techniques from the previous sections could you apply to solve this? Remember, you can often achieve better results by combining multiple techniques.

Noise Injection for `age` and `total_spent`

Noise injection adds random noise to numeric data to obscure exact values while retaining overall trends. Please explore the techniques for differential privacy in the tutorial here.

Substitution of Sensitive Data

Substitution replaces specific sensitive values with predefined values consistently across the entire dataset.

WITH RandomizedDiagnosis AS (
    SELECT 
        diagnosis_code,
        CONCAT(
            CHAR(FLOOR(RAND() * 26) + 65),  -- First random letter (A-Z)
            CHAR(FLOOR(RAND() * 26) + 65),  -- Second random letter (A-Z)
            CHAR(FLOOR(RAND() * 26) + 65)   -- Third random letter (A-Z)
        ) AS random_code
    FROM 
        (SELECT DISTINCT diagnosis_code FROM healthcare_customers) AS distinct_codes
)
SELECT 
    hc.customer_id,
    rd.random_code AS substituted_diagnosis_code
FROM 
    healthcare_customers hc
JOIN 
    RandomizedDiagnosis rd 
ON 
    hc.diagnosis_code = rd.diagnosis_code;

The result will be the following:

This query consistently replaces each unique diagnosis_code in the healthcare_customers table with a randomly generated three-letter code. It uses a Common Table Expression (CTE), RandomizedDiagnosis, to generate a unique mapping for each distinct diagnosis code by creating a random three-letter string (using ASCII values for letters A-Z). The DISTINCT clause ensures that each diagnosis code only gets one random substitute. In the main query, the original table is joined with the CTE on the diagnosis_code, ensuring that every instance of the same diagnosis code across the dataset is consistently replaced with the same random string. This approach provides a secure and consistent substitution of sensitive diagnosis codes, allowing for privacy while maintaining consistency for analysis.

Note that we did not use a subquery but instead used a CTE (Common Table Expression) Observe the key benefit of using a CTE (Common Table Expression) over a subquery approach in readability and reusability. CTEs allow you to define a temporary result set that can be referenced multiple times within the same query, making the SQL easier to read and maintain, especially when dealing with complex queries.

For example, in the query provided, the CTE RandomizedDiagnosis allows the distinct diagnosis codes and their randomized substitutions to be computed once and then reused in the main query. This makes the code cleaner and separates the logic of generating random substitutions from the actual join operation. If you were to use a subquery, you'd potentially have to repeat the subquery each time it's needed, making the SQL harder to understand and more error-prone if changes are required in multiple places.

Full Masking of Address and Prescription Data

In some cases, you may want to fully mask sensitive data fields to ensure privacy.

SELECT
    customer_id,
    REPEAT('*', LENGTH(address)) AS masked_address,
    REPEAT('*', LENGTH(prescription)) AS masked_prescription
FROM
    healthcare_customers;

This query masks the address and prescription fields entirely by replacing each character with an asterisk (*), making the data unreadable.

Shuffling

Shuffling data between records makes it harder to link records to specific individuals while maintaining the overall data distribution.

WITH ShuffledData AS (
    SELECT
        customer_id,
        total_spent,
        ROW_NUMBER() OVER (ORDER BY customer_id) AS original_row,  --orig row order
        ROW_NUMBER() OVER (ORDER BY RAND()) AS shuffled_row  -- shuffled row order
    FROM
        healthcare_customers
)
SELECT
    original.customer_id,
    original.total_spent AS original_total_spent,
    shuffled.total_spent AS shuffled_total_spent
FROM
    ShuffledData original
JOIN
    ShuffledData shuffled
ON
    original.shuffled_row = shuffled.original_row;  -- Match original row to shuffled row

This query performs shuffling of the total_spent values in the healthcare_customers dataset while maintaining a clear tracking of how the values have been shuffled. It uses a Common Table Expression (CTE) to assign two row numbers: one based on the original order of the customers (original_row) and another based on a random order (shuffled_row).

By joining the CTE on these row numbers, the query reassigns the total_spent values according to the shuffled row while preserving the original values.

This query may take some time to execute, so be prepared for a possible timeout in Ad Hoc Query Mode (Data Distiller Exploration). It is recommended to use Batch Query Mode by employing the CREATE TABLE AS command instead.

Data Obfuscation

You can use encryption to obfuscate sensitive data (e.g., customer names, email addresses) to make it unreadable to unauthorized users. This helps protect data from being exposed in logs, intermediate processing, or unauthorized queries.

The aes_encrypt and aes_descrypt functions in Data Distiller are used for encryption and decryption of data using the Advanced Encryption Standard (AES). These functions can certainly be useful in obfuscation use cases, but they serve a broader purpose beyond just obfuscation. These functions are particularly useful for ensuring compliance with data security regulations such as GDPR or HIPAA, where data (e.g., PII, financial data, or medical records) needs to be encrypted when stored at rest or in transit.

Unlike hashing, which is a one-way process, AES encryption is reversible, allowing the data to be decrypted when needed by authorized users or systems. This is essential in use cases where you need to retrieve the original data later.

For example, encrypted customer records can be decrypted by an authorized user or system to retrieve the original information for processing.

You can read about this in the tutorial here.

Last updated 1 year ago