ACT 300: Functions and Techniques for Handling Sensitive Data with Data Distiller
Powering Enterprise Use Cases While Keeping Sensitive Data in Safe Mode
Last updated
Powering Enterprise Use Cases While Keeping Sensitive Data in Safe Mode
Last updated
Download the file:
Ingest the data as healthcare_customers
dataset using this:
In this tutorial, we'll demonstrate how to handle sensitive healthcare data by applying various anonymization and pseudonymization techniques, ensuring compliance with data privacy regulations like GDPR. We'll use SQL-based techniques to randomize, mask, and inject noise into the data, using the following strategies:
The dataset contains sensitive columns:
customer_id
: Unique identifier for each customer.
customer_name
: Name of the customer.
phone_number
: Customer’s phone number.
email
: Customer’s email address.
address
: Physical address.
age
: Customer’s age.
total_spent
: Total healthcare spending.
subscription_status
: Whether the customer has a subscription plan.
gender
: Gender of the customer.
country
: Country of the customer.
diagnosis_code
: Medical condition code (e.g., ICD-10).
prescription
: Prescription given to the customer.
Randomization replaces sensitive data with random values. In this case, customer names will be randomized.
This query replaces customer names with random identifiers like User1234
, ensuring names are obfuscated.
Partial masking hides sensitive information while retaining some of the original content, making it difficult to re-identify individuals. This technique was used in the tutorial here.
This query partially masks the phone number by displaying only the last 4 digits and obscures part of the email address while keeping the domain intact.
Pseudonymization is a data protection technique that replaces identifiable information in a dataset with artificial identifiers or "pseudonyms," ensuring that the data can no longer be attributed to a specific individual without additional information. The pseudonymized data can still be analyzed and processed, but the link between the data and the original identity is severed unless the key to reverse the process (often called a "re-identification key") is available.
Pseudonymization transforms sensitive data into a hashed format using a cryptographic hash function, making it irreversible.
When you apply SHA2(email, 256), the email address is transformed into a unique, fixed-length string of characters using the SHA-256 cryptographic hash function. This process is one-way, meaning once the email is hashed, it’s virtually impossible to reverse the process and recover the original email. The output will always be 64 characters long, no matter the size of the input. This is like turning the email into a "digital fingerprint"—each email will have a distinct hash, but similar-looking emails (like customer1@domain.com
and customer2@domain.com
) will have completely different outputs. Hashing is used to protect sensitive information because it hides the original data while still allowing comparisons between hashed values.
SHA-256 would still work efficiently even if every person on Earth (say 10 billion people) had 1,000 email addresses, resulting in 10 trillion emails. This is because SHA-256 generates a fixed-length 64-character (256-bit) hash for any input, regardless of how many emails exist. The key strength of SHA-256 is that it provides an enormous number of possible hash values (about 22562^{256}2256 or approximately 107710^{77}1077)—far more than the total number of possible email addresses. This vast range minimizes the chance of collisions (two emails producing the same hash), making it highly reliable for even massive datasets like this,
Data Distiller also supports the MD5
function, which generates a 128-bit hash. For example:
SELECT customer_id, email, MD5(email) AS hashed_email FROM healthcare_customers;
This function is useful for use cases such as data integrity checks, quickly comparing large datasets, anonymizing data for non-security purposes, and creating partitioning or bucketing keys for efficient data distribution.
MD5
offers faster performance compared to stronger hashing algorithms like SHA-256,
making it suitable for non-sensitive tasks where speed is a priority. However, MD5
should not be used for cryptographic purposes or to store sensitive data, as it is vulnerable to hash collisions and security attacks. For security-related applications, stronger algorithms such as SHA-256
are recommended.
In K-anonymity, a privacy protection technique, the goal is to ensure that no individual is isolated in a group or bucket. Each record in the dataset must be indistinguishable from at least K-1 other individuals, based on a combination of generalized attributes (such as age, region, etc.). 2-anonymity means that the dataset we create should have at least 2 individuals who are identical are identical with respect to the attributes being considered. 3-anonymity means that the dataset we create should have at least 3 individuals who are identical are identical with respect to the attributes being considered i.e. each of the buckets we have should contain at least 3 individuals.
The grouping dimensions require careful data exploration and experimentation to identify the right set of attributes that meet such strict requirements. Once the minimum conditions are met, any new data added will only reinforce the criteria. As a best practice, you should always double-check the conditions with every dataset activation using this technique to ensure compliance.
Let us explore a few dimensions and see if our strategy for grouping satisfies 2-anonymity:
The query generalizes sensitive healthcare data to ensure privacy by grouping records based on broad categories. First, it generalizes the age
into 20-year ranges (e.g., 0-19
, 20-39
), and the country
is grouped into broad regions (e.g., Asia
, Europe
). The prescription
field is also generalized into broader categories like Painkillers, Antibiotics, and Blood Pressure Meds, with any unlisted medications categorized as Other Medications. The dataset is then grouped by these generalized dimensions, including diagnosis_code
. Our hope is that the HAVING COUNT(*) == 1
clause will return no results as no bucket of these grouping dimensions should have an individual in it.
The execution will show the following:
The current generalization of the prescription
dimension hasn’t provided sufficient anonymity..Since we have already bucketed all the other dimensions, the diagnosis_code
remains as the only ungrouped attribute and we may decide not to group it. If so, we may need to further generalize the existing dimensions (such as age, country
) to better capture larger groups. This highlights an important tradeoff: you'll need to determine which dimension is least critical to the use case and can be generalized further, allowing it to include more individuals while still maintaining a balance between utility and privacy.
Let us try this:
This returns:
Exercise: What other techniques from the previous sections could you apply to solve this? Remember, you can often achieve better results by combining multiple techniques.
age
and total_spent
Noise injection adds random noise to numeric data to obscure exact values while retaining overall trends. Please explore the techniques for differential privacy in the tutorial here.
Substitution replaces specific sensitive values with predefined values consistently across the entire dataset.
The result will be the following:
This query consistently replaces each unique diagnosis_code
in the healthcare_customers
table with a randomly generated three-letter code. It uses a Common Table Expression (CTE), RandomizedDiagnosis
, to generate a unique mapping for each distinct diagnosis code by creating a random three-letter string (using ASCII values for letters A-Z
). The DISTINCT
clause ensures that each diagnosis code only gets one random substitute. In the main query, the original table is joined with the CTE on the diagnosis_code
, ensuring that every instance of the same diagnosis code across the dataset is consistently replaced with the same random string. This approach provides a secure and consistent substitution of sensitive diagnosis codes, allowing for privacy while maintaining consistency for analysis.
Note that we did not use a subquery but instead used a CTE (Common Table Expression) Observe the key benefit of using a CTE (Common Table Expression) over a subquery approach in readability and reusability. CTEs allow you to define a temporary result set that can be referenced multiple times within the same query, making the SQL easier to read and maintain, especially when dealing with complex queries.
For example, in the query provided, the CTE RandomizedDiagnosis
allows the distinct diagnosis codes and their randomized substitutions to be computed once and then reused in the main query. This makes the code cleaner and separates the logic of generating random substitutions from the actual join operation. If you were to use a subquery, you'd potentially have to repeat the subquery each time it's needed, making the SQL harder to understand and more error-prone if changes are required in multiple places.
In some cases, you may want to fully mask sensitive data fields to ensure privacy.
This query masks the address
and prescription
fields entirely by replacing each character with an asterisk (*), making the data unreadable.
Shuffling data between records makes it harder to link records to specific individuals while maintaining the overall data distribution.
This query performs shuffling of the total_spent
values in the healthcare_customers
dataset while maintaining a clear tracking of how the values have been shuffled. It uses a Common Table Expression (CTE) to assign two row numbers: one based on the original order of the customers (original_row
) and another based on a random order (shuffled_row
).
By joining the CTE on these row numbers, the query reassigns the total_spent
values according to the shuffled row while preserving the original values.
This query may take some time to execute, so be prepared for a possible timeout in Ad Hoc Query Mode (Data Distiller Exploration). It is recommended to use Batch Query Mode by employing the CREATE TABLE
AS
command instead.
You can use encryption to obfuscate sensitive data (e.g., customer names, email addresses) to make it unreadable to unauthorized users. This helps protect data from being exposed in logs, intermediate processing, or unauthorized queries.
The aes_encrypt
and aes_descrypt
functions in Data Distiller are used for encryption and decryption of data using the Advanced Encryption Standard (AES). These functions can certainly be useful in obfuscation use cases, but they serve a broader purpose beyond just obfuscation. These functions are particularly useful for ensuring compliance with data security regulations such as GDPR or HIPAA, where data (e.g., PII, financial data, or medical records) needs to be encrypted when stored at rest or in transit.
Unlike hashing, which is a one-way process, AES encryption is reversible, allowing the data to be decrypted when needed by authorized users or systems. This is essential in use cases where you need to retrieve the original data later.
For example, encrypted customer records can be decrypted by an authorized user or system to retrieve the original information for processing.
You can read about this in the tutorial here.