
ACT 300: Functions and Techniques for Handling Sensitive Data with Data Distiller

Powering Enterprise Use Cases While Keeping Sensitive Data in Safe Mode


Prerequisites

Download the file: healthcare_customers.csv (1 MB).

Ingest the data as the healthcare_customers dataset by following the steps in PREP 500: Ingesting CSV Data into Adobe Experience Platform.

Dataset Overview

In this tutorial, we'll demonstrate how to handle sensitive healthcare data by applying various anonymization and pseudonymization techniques, ensuring compliance with data privacy regulations like GDPR. We'll use SQL-based techniques to randomize, mask, and inject noise into the data, using the following strategies:

  • Randomization of names
  • Partial masking of phone numbers and emails
  • Pseudonymization (hashing) of email addresses
  • K-anonymity
  • Noise injection for age and total_spent
  • Substitution of sensitive data
  • Full masking of address and prescription data
  • Shuffling
  • Data obfuscation (encryption)

The dataset contains the following sensitive columns:

  • customer_id: Unique identifier for each customer.

  • customer_name: Name of the customer.

  • phone_number: Customer’s phone number.

  • email: Customer’s email address.

  • address: Physical address.

  • age: Customer’s age.

  • total_spent: Total healthcare spending.

  • subscription_status: Whether the customer has a subscription plan.

  • gender: Gender of the customer.

  • country: Country of the customer.

  • diagnosis_code: Medical condition code (e.g., ICD-10).

  • prescription: Prescription given to the customer.
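
Before applying any of the techniques, you can sanity-check the raw data with a quick preview:

SELECT * FROM healthcare_customers LIMIT 10;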

Randomization of Names

Randomization replaces sensitive data with random values. In this case, customer names will be randomized.

SELECT
    customer_id,
    CONCAT('User', CAST(FLOOR(RAND() * 10000) AS STRING)) AS randomized_name
FROM
    healthcare_customers;

This query replaces customer names with random identifiers like User1234, ensuring names are obfuscated. Note that RAND() is evaluated per row and per run, so the generated identifiers are neither guaranteed unique nor stable across executions; see the substitution section below for a consistent mapping.

Partial Masking of Phone Numbers and Emails

Partial masking hides sensitive information while retaining some of the original content, making it difficult to re-identify individuals.

SELECT
    customer_id,
    CONCAT('XXX-XXX-', SUBSTRING(phone_number, -4)) AS masked_phone,
    CONCAT(SUBSTRING(email, 1, 4), '****@domain.com') AS masked_email
FROM
    healthcare_customers;

This query partially masks the phone number by displaying only the last 4 digits, and obscures the email address by keeping its first 4 characters and substituting a fixed placeholder domain.

Pseudonymization (Hashing) of Email Addresses

Pseudonymization is a data protection technique that replaces identifiable information in a dataset with artificial identifiers or "pseudonyms," ensuring that the data can no longer be attributed to a specific individual without additional information. The pseudonymized data can still be analyzed and processed, but the link between the data and the original identity is severed unless the key to reverse the process (often called a "re-identification key") is available.

Pseudonymization transforms sensitive data into a hashed format using a cryptographic hash function, making it irreversible.

SELECT   
    customer_id,
    SHA2(email, 256) AS hashed_email
FROM
    healthcare_customers;

When you apply SHA2(email, 256), the email address is transformed into a unique, fixed-length string of characters using the SHA-256 cryptographic hash function. This process is one-way, meaning once the email is hashed, it’s virtually impossible to reverse the process and recover the original email. The output will always be 64 characters long, no matter the size of the input. This is like turning the email into a "digital fingerprint"—each email will have a distinct hash, but similar-looking emails (like customer1@domain.com and customer2@domain.com) will have completely different outputs. Hashing is used to protect sensitive information because it hides the original data while still allowing comparisons between hashed values.
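
A quick way to see this behavior, using literal values for illustration:

SELECT
    SHA2('customer1@domain.com', 256) AS hash_1,  -- 64-character hex string
    SHA2('customer2@domain.com', 256) AS hash_2,  -- completely different output
    SHA2('customer1@domain.com', 256) = SHA2('customer1@domain.com', 256) AS same_input_matches;  -- true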

SHA-256 would still work efficiently even if every person on Earth (say 10 billion people) had 1,000 email addresses, resulting in 10 trillion emails. This is because SHA-256 generates a fixed-length 64-character (256-bit) hash for any input, regardless of how many emails exist. The key strength of SHA-256 is that it provides an enormous number of possible hash values (2^256, or approximately 10^77)—far more than the total number of possible email addresses. This vast range minimizes the chance of collisions (two emails producing the same hash), making it highly reliable even for massive datasets like this.

Data Distiller also supports the MD5 function, which generates a 128-bit hash. For example:

SELECT customer_id, email, MD5(email) AS hashed_email FROM healthcare_customers;

This function is useful for use cases such as data integrity checks, quickly comparing large datasets, anonymizing data for non-security purposes, and creating partitioning or bucketing keys for efficient data distribution.

MD5 offers faster performance compared to stronger hashing algorithms like SHA-256, making it suitable for non-sensitive tasks where speed is a priority. However, MD5 should not be used for cryptographic purposes or to store sensitive data, as it is vulnerable to hash collisions and security attacks. For security-related applications, stronger algorithms such as SHA-256 are recommended.
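
As an example of the bucketing use case, here is a minimal sketch that derives a stable bucket key from the first 8 hex characters of the MD5 hash (the bucket count of 10 is an arbitrary choice for illustration):

SELECT
    customer_id,
    PMOD(CAST(CONV(SUBSTRING(MD5(email), 1, 8), 16, 10) AS BIGINT), 10) AS bucket_id  -- stable bucket 0-9
FROM
    healthcare_customers;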

K-Anonymity

K-anonymity is a privacy protection technique whose goal is to ensure that no individual is isolated in a group or bucket. Each record in the dataset must be indistinguishable from at least K-1 other records, based on a combination of generalized attributes (such as age, region, etc.). 2-anonymity means that every bucket of the generalized attributes contains at least 2 individuals who are identical with respect to the attributes being considered; 3-anonymity means that every bucket contains at least 3 such individuals.

The grouping dimensions require careful data exploration and experimentation to identify the right set of attributes that meet such strict requirements. Meeting the conditions once is not a permanent guarantee: newly ingested data can introduce fresh singleton buckets. As a best practice, you should re-verify the conditions with every dataset activation that uses this technique; a sketch of such a check follows.
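
A minimal compliance check, assuming the generalized dimensions from the query below have been persisted as a table (generalized_healthcare is a hypothetical name); set the HAVING threshold to your required K:

SELECT COUNT(*) AS violating_buckets
FROM (
    SELECT generalized_age, region, diagnosis_code, generalized_prescription
    FROM generalized_healthcare
    GROUP BY generalized_age, region, diagnosis_code, generalized_prescription
    HAVING COUNT(*) < 2  -- K = 2; raise this threshold for stricter K-anonymity
) AS singleton_buckets;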

Let us explore a few dimensions and see if our strategy for grouping satisfies 2-anonymity:

WITH GeneralizedHealthcare AS (
    SELECT 
        customer_id,
        -- Generalize age into broader age ranges (20-year groups)
        CASE 
            WHEN age BETWEEN 0 AND 19 THEN '0-19'
            WHEN age BETWEEN 20 AND 39 THEN '20-39'
            WHEN age BETWEEN 40 AND 59 THEN '40-59'
            WHEN age BETWEEN 60 AND 79 THEN '60-79'
            ELSE '80+' 
        END AS generalized_age,
        
        -- Generalize country into fewer, broader regions
        CASE 
            WHEN country IN ('Japan', 'China', 'Korea', 'India') THEN 'Asia'
            WHEN country IN ('Australia', 'New Zealand') THEN 'Oceania'
            WHEN country IN ('France', 'Germany', 'Italy', 'UK') THEN 'Europe'
            ELSE 'Other' 
        END AS region,
        
        diagnosis_code,
        
        -- Generalize prescription into broader categories (example categories)
        CASE 
            WHEN prescription IN ('Aspirin', 'Ibuprofen') THEN 'Painkillers'
            WHEN prescription IN ('Amoxicillin', 'Azithromycin') THEN 'Antibiotics'
            WHEN prescription IN ('Lisinopril', 'Amlodipine') THEN 'Blood Pressure Meds'
            ELSE 'Other Medications' 
        END AS generalized_prescription
    FROM healthcare_customers
)
SELECT 
    generalized_age,
    region,
    diagnosis_code,
    generalized_prescription
FROM 
    GeneralizedHealthcare
GROUP BY 
    generalized_age, region, generalized_prescription, diagnosis_code
HAVING COUNT(*) = 1;

The query generalizes sensitive healthcare data to ensure privacy by grouping records based on broad categories. First, it generalizes the age into 20-year ranges (e.g., 0-19, 20-39), and the country is grouped into broad regions (e.g., Asia, Europe). The prescription field is also generalized into broader categories like Painkillers, Antibiotics, and Blood Pressure Meds, with any unlisted medications categorized as Other Medications. The dataset is then grouped by these generalized dimensions, including diagnosis_code. Our hope is that the HAVING COUNT(*) = 1 clause will return no results, since no bucket of these grouping dimensions should contain a lone individual.

The execution shows 70 buckets that each contain a single record: 70 records that do not preserve 2-anonymity.

The current generalization of the prescription dimension hasn't provided sufficient anonymity. Since we have already bucketed all the other dimensions, diagnosis_code remains the only ungrouped attribute, and we may decide not to group it. If so, we need to further generalize the existing dimensions (such as age and country) so that each bucket captures larger groups. This highlights an important tradeoff: you'll need to determine which dimension is least critical to the use case and can be generalized further, allowing it to include more individuals while still maintaining a balance between utility and privacy.

Let us try this:

WITH GeneralizedHealthcare AS (
    SELECT 
        customer_id,
        -- Generalize age into broader age ranges (three age buckets)
        CASE 
            WHEN age BETWEEN 0 AND 29 THEN '0-29'
            WHEN age BETWEEN 30 AND 59 THEN '30-59'
            ELSE '60+' 
        END AS generalized_age,
        
        -- Generalize country into larger, broader regions
        CASE 
            WHEN country IN ('Japan', 'China', 'Korea', 'India', 'Australia', 'New Zealand') THEN 'Asia-Pacific'
            WHEN country IN ('France', 'Germany', 'Italy', 'UK', 'Spain') THEN 'Europe'
            WHEN country IN ('USA', 'Canada', 'Brazil') THEN 'Americas'
            ELSE 'Other Regions' 
        END AS region,
        
        diagnosis_code,
        
        -- Generalize prescription into broader categories (example categories)
        CASE 
            WHEN prescription IN ('Aspirin', 'Ibuprofen') THEN 'Painkillers'
            WHEN prescription IN ('Amoxicillin', 'Azithromycin') THEN 'Antibiotics'
            WHEN prescription IN ('Lisinopril', 'Amlodipine') THEN 'Blood Pressure Meds'
            ELSE 'Other Medications' 
        END AS generalized_prescription
    FROM healthcare_customers
)
SELECT 
    generalized_age,
    region,
    diagnosis_code,
    generalized_prescription
FROM 
    GeneralizedHealthcare
GROUP BY 
    generalized_age, region, diagnosis_code, generalized_prescription
HAVING COUNT(*) = 1;

This returns a single violating bucket: the broader generalization has reduced the count from 70 to 1.

Exercise: What other techniques from the previous sections could you apply to solve this? Remember, you can often achieve better results by combining multiple techniques.

Noise Injection for age and total_spent

Noise injection adds random noise to numeric data to obscure exact values while retaining overall trends. For related techniques based on differential privacy, see ACT 200: Dataset Activation: Anonymization, Masking & Differential Privacy Techniques.
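
A minimal sketch of uniform noise injection (the noise ranges chosen here are arbitrary, for illustration only):

SELECT
    customer_id,
    age + CAST(FLOOR(RAND() * 7) AS INT) - 3 AS noisy_age,                -- uniform noise in [-3, +3] years
    ROUND(total_spent * (0.95 + RAND() * 0.10), 2) AS noisy_total_spent   -- multiplicative noise of ±5%
FROM
    healthcare_customers;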

Substitution of Sensitive Data

Substitution replaces each sensitive value with a substitute value, applied consistently across the entire dataset.

WITH RandomizedDiagnosis AS (
    SELECT 
        diagnosis_code,
        CONCAT(
            CHAR(FLOOR(RAND() * 26) + 65),  -- First random letter (A-Z)
            CHAR(FLOOR(RAND() * 26) + 65),  -- Second random letter (A-Z)
            CHAR(FLOOR(RAND() * 26) + 65)   -- Third random letter (A-Z)
        ) AS random_code
    FROM 
        (SELECT DISTINCT diagnosis_code FROM healthcare_customers) AS distinct_codes
)
SELECT 
    hc.customer_id,
    rd.random_code AS substituted_diagnosis_code
FROM 
    healthcare_customers hc
JOIN 
    RandomizedDiagnosis rd 
ON 
    hc.diagnosis_code = rd.diagnosis_code;

The result shows that the codes are randomized but consistent: every occurrence of a given diagnosis_code maps to the same random three-letter code.

This query consistently replaces each unique diagnosis_code in the healthcare_customers table with a randomly generated three-letter code. It uses a Common Table Expression (CTE), RandomizedDiagnosis, to generate a unique mapping for each distinct diagnosis code by creating a random three-letter string (using ASCII values for letters A-Z). The DISTINCT clause ensures that each diagnosis code only gets one random substitute. In the main query, the original table is joined with the CTE on the diagnosis_code, ensuring that every instance of the same diagnosis code across the dataset is consistently replaced with the same random string. This approach provides a secure and consistent substitution of sensitive diagnosis codes, allowing for privacy while maintaining consistency for analysis.

Note that we did not use a subquery here but a CTE (Common Table Expression). The key benefit of a CTE over a subquery is readability and reusability: a CTE lets you define a temporary result set that can be referenced multiple times within the same query, making the SQL easier to read and maintain, especially when dealing with complex queries.

For example, in the query provided, the CTE RandomizedDiagnosis allows the distinct diagnosis codes and their randomized substitutions to be computed once and then reused in the main query. This makes the code cleaner and separates the logic of generating random substitutions from the actual join operation. If you were to use a subquery, you'd potentially have to repeat the subquery each time it's needed, making the SQL harder to understand and more error-prone if changes are required in multiple places.
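
For contrast, here is the same logic written with an inline subquery; the mapping logic is now buried inside the join, and any second use of it would force you to repeat the subquery:

SELECT
    hc.customer_id,
    rd.random_code AS substituted_diagnosis_code
FROM
    healthcare_customers hc
JOIN (
    SELECT
        diagnosis_code,
        CONCAT(
            CHAR(FLOOR(RAND() * 26) + 65),
            CHAR(FLOOR(RAND() * 26) + 65),
            CHAR(FLOOR(RAND() * 26) + 65)
        ) AS random_code
    FROM
        (SELECT DISTINCT diagnosis_code FROM healthcare_customers) AS distinct_codes
) AS rd
ON
    hc.diagnosis_code = rd.diagnosis_code;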

Full Masking of Address and Prescription Data

In some cases, you may want to fully mask sensitive data fields to ensure privacy.

SELECT
    customer_id,
    REPEAT('*', LENGTH(address)) AS masked_address,
    REPEAT('*', LENGTH(prescription)) AS masked_prescription
FROM
    healthcare_customers;

This query masks the address and prescription fields entirely by replacing each character with an asterisk (*), making the data unreadable.

Shuffling

Shuffling data between records makes it harder to link records to specific individuals while maintaining the overall data distribution.

WITH ShuffledData AS (
    SELECT
        customer_id,
        total_spent,
        ROW_NUMBER() OVER (ORDER BY customer_id) AS original_row,  --orig row order
        ROW_NUMBER() OVER (ORDER BY RAND()) AS shuffled_row  -- shuffled row order
    FROM
        healthcare_customers
)
SELECT
    original.customer_id,
    original.total_spent AS original_total_spent,
    shuffled.total_spent AS shuffled_total_spent
FROM
    ShuffledData original
JOIN
    ShuffledData shuffled
ON
    original.shuffled_row = shuffled.original_row;  -- Match original row to shuffled row

This query performs shuffling of the total_spent values in the healthcare_customers dataset while maintaining a clear tracking of how the values have been shuffled. It uses a Common Table Expression (CTE) to assign two row numbers: one based on the original order of the customers (original_row) and another based on a random order (shuffled_row).

By joining the CTE on these row numbers, the query reassigns the total_spent values according to the shuffled row while preserving the original values.

This query may take some time to execute, so be prepared for a possible timeout in Ad Hoc Query Mode (Data Distiller Exploration). It is recommended to use Batch Query Mode by employing the CREATE TABLE AS command instead.
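
A minimal sketch of the batch variant (the target table name shuffled_spend is a hypothetical choice):

CREATE TABLE shuffled_spend AS
WITH ShuffledData AS (
    SELECT
        customer_id,
        total_spent,
        ROW_NUMBER() OVER (ORDER BY customer_id) AS original_row,
        ROW_NUMBER() OVER (ORDER BY RAND()) AS shuffled_row
    FROM
        healthcare_customers
)
SELECT
    original.customer_id,
    shuffled.total_spent AS shuffled_total_spent
FROM
    ShuffledData original
JOIN
    ShuffledData shuffled
ON
    original.shuffled_row = shuffled.original_row;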

Data Obfuscation

You can use encryption to obfuscate sensitive data (e.g., customer names, email addresses) to make it unreadable to unauthorized users. This helps protect data from being exposed in logs, intermediate processing, or unauthorized queries.

The aes_encrypt and aes_decrypt functions in Data Distiller encrypt and decrypt data using the Advanced Encryption Standard (AES). They can certainly be used for obfuscation, but they serve a broader purpose: they are particularly useful for ensuring compliance with data security regulations such as GDPR or HIPAA, where data (e.g., PII, financial data, or medical records) needs to be encrypted when stored at rest or in transit.

Unlike hashing, which is a one-way process, AES encryption is reversible, allowing the data to be decrypted when needed by authorized users or systems. This is essential in use cases where you need to retrieve the original data later.

For example, encrypted customer records can be decrypted by an authorized user or system to retrieve the original information for processing.
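
A minimal round-trip sketch (the hardcoded 16-character key is for illustration only; in practice, manage keys securely, and see ACT 400: AES Data Encryption & Decryption with Data Distiller for a full treatment):

SELECT
    customer_id,
    BASE64(AES_ENCRYPT(email, '1234567890abcdef')) AS encrypted_email,  -- printable ciphertext
    CAST(AES_DECRYPT(AES_ENCRYPT(email, '1234567890abcdef'),
                     '1234567890abcdef') AS STRING) AS decrypted_email  -- recovers the original
FROM
    healthcare_customers;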

