ACT 200: Dataset Activation: Anonymization, Masking & Differential Privacy Techniques
Explore advanced differential privacy techniques to securely activate data while balancing valuable insights and individual privacy protection.
Download the file:
Ingest the data as the healthcare_customers dataset using this:
One of the biggest challenges in personalization is determining how far a company should go in leveraging customer data to create a highly tailored experience that wows the customer. The question arises: What machine learning algorithm will deliver the perfect offer at the right time? However, pursuing this goal comes with significant risks. Over-personalization can make customers uncomfortable, and even after removing personally identifiable information (PII), those designing the algorithms or offers can still infer personal details, such as recognizing a neighbor who shops with the company. This raises a crucial ethical dilemma—how far should we go to enhance the customer experience while also safeguarding their privacy?
The solution lies in recognizing that there’s a trade-off. To respect customer privacy, companies must be willing to sacrifice some degree of accuracy, and possibly some profit, to ensure customers feel secure when interacting with a brand. By embracing differential privacy techniques, like adding noise to datasets, we can protect individual identities while still gaining valuable insights. In doing so, companies demonstrate that they prioritize not only profits but also the privacy and trust of their customers.
Data Distiller enables a wide variety of use cases, such as data activation for enterprise reporting, feature engineering for machine learning, enriching the enterprise identity graph, and creating custom audiences in specialized formats. However, dataset activation requires responsible consideration of the data being shared. While techniques like stripping sensitive information, masking, and anonymization are all viable options, you still need enough behavioral data for meaningful downstream analysis. The challenge is ensuring that the data you activate is not so raw or transparent that someone within your company could reverse-engineer the identities of individuals. How do you balance utility with privacy to protect individuals while maintaining valuable insights?
Here are a few use cases from the Capability Matrix that you might consider approaching differently when activating datasets with Data Distiller:
Data Distiller Audiences with Privacy: When activating audiences from Data Distiller, you can use noisy datasets to segment customers based on behavior, demographics, or purchase history without exposing precise individual data. This approach safeguards sensitive customer information while still enabling effective segmentation for marketing campaigns.
A/B Testing with Privacy Enhancements: Use noisy data to perform A/B testing on customer interactions with different marketing strategies. Noise can help ensure that individual customers’ data points are less identifiable while still allowing you to measure the success of each strategy effectively.
Predictive Modeling with Protected Data: Develop models to predict customer behavior (e.g., churn prediction, purchase likelihood) where individual customer records are perturbed to protect privacy. You can still identify trends and make predictions for your marketing efforts.
Lookalike Modeling for Ad Targeting: Create lookalike audiences by training models on noisy data, which can help marketers find potential new customers who exhibit similar behaviors to existing high-value customers. Adding noise preserves privacy while still providing valuable insights for targeting.
Personalized Recommendations with Privacy: Generate privacy-preserving personalized product or content recommendations. By adding noise, you ensure that individual preferences are obscured, but trends can still drive relevant recommendations.
Customer Lifetime Value (CLV) Estimation with Noise: Calculate customer lifetime value using noisy datasets to avoid exposing sensitive financial or transactional details of individuals while still identifying trends and high-value customer segments for personalized marketing.
Privacy-Protected Attribution Modeling: You can analyze marketing attribution (which channels lead to conversions) using noisy data to protect user interactions while maintaining the overall effectiveness of attribution models to optimize campaign spend.
Cross-Device Tracking without Exact Data Matching: In marketing campaigns that track user journeys across devices, noise can help reduce the precision of cross-device matching, maintaining privacy while still enabling marketers to understand multi-touch attribution paths.
The key idea behind differential privacy is to ensure that the results of any analysis or query on a dataset remain almost identical, whether or not an individual's data is included. This means that no end user can "difference" two snapshots of the dataset and deduce who the individuals are. By maintaining this consistency, differential privacy prevents anyone from inferring significant details about any specific person, even if they know that the person's data is part of the dataset.
Consider a database that tracks whether people have a particular medical condition. A simple query might ask, "How many people in the dataset have the condition?" Suppose the true count is 100. Now, imagine that a new person with the condition is added, increasing the count to 101. As the data scientist, you know that your neighbor has been very ill and that there is only one medical care provider nearby. Without differential privacy, this information could allow you to deduce that your neighbor is included in the dataset.
To prevent this, we can add a small amount of random noise before revealing the count. Instead of reporting exactly 100, we might reveal 102 or 99. If someone joins or leaves the dataset, the count could shift to 103 or 100, for instance. This noise ensures that the presence or absence of any individual doesn't significantly impact the result.
In this way, you, as the data scientist, cannot confidently determine whether a specific person is part of the dataset based on the output. And that is a good thing - the individual's privacy is protected, as their contribution is "hidden" within the noise.
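To make this concrete, here is a minimal sketch of a noisy count query (not from the original text), assuming a hypothetical medical_conditions table, a sensitivity of 1, ε = 0.5, and Spark SQL-style RAND(), SIGN(), ABS(), and LN() functions:

```sql
-- Hypothetical sketch: a differentially private count of patients with a condition.
WITH c AS (
  SELECT COUNT(*) AS true_count            -- exact count, never released directly
  FROM medical_conditions
  WHERE has_condition = TRUE
),
n AS (
  SELECT RAND() - 0.5 AS u                 -- uniform random value in [-0.5, 0.5)
)
SELECT ROUND(
         c.true_count
         - (1.0 / 0.5) * SIGN(n.u) * LN(1 - 2 * ABS(n.u))  -- Laplace noise, scale = sensitivity / epsilon
       ) AS noisy_count
FROM c CROSS JOIN n;
```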
The key idea in adding noise to ensure differential privacy is to balance two competing objectives:
Privacy: Protecting individuals’ data by making it difficult to infer whether a particular individual is in the dataset.
Utility: Ensuring that the analysis results remain useful and accurate for personalization despite the noise.
The tradeoffs are:
High Privacy → Lower Utility: When you add a lot of noise to protect privacy, the accuracy and reliability of the data, and hence of your personalization, decrease.
High Utility → Lower Privacy: Conversely, if you reduce the noise to increase the accuracy (utility) of the data and of the personalization built on it, the dataset becomes more representative of the actual individuals, which increases the risk of identifying someone.
In differential privacy, sensitivity (denoted as Δf) refers to how much the result of a query could change if a single individual's data is added or removed. It's not about the variability of the data itself, but about the potential impact any individual’s presence can have on the output. The higher the sensitivity, the greater the change an individual’s data can introduce to the result.
Let’s revisit the example of the medical condition dataset. If the condition can only have one of two values (e.g., "has the condition" or "does not"), it means the data has low sensitivity—since adding or removing one person will change the count by at most 1. However, this low sensitivity makes it easier for someone, like a data scientist, to start guessing which of their neighbors is in the dataset by correlating other fields, like treatments or appointment times.
Even though the sensitivity is low (since the result can only change by a small amount), the signal is strong because there is limited variation in the data. This means the individual’s presence becomes easier to detect, which can compromise privacy. To protect against this, we need to compensate by adding carefully calibrated noise. The amount of noise depends on the sensitivity: low sensitivity may require less noise, but it’s still essential to add enough to prevent any inference about specific individuals based on the dataset’s output. The amount of noise added is determined by a key privacy parameter known as epsilon (𝜀).
This balance between sensitivity and noise ensures that the final result provides useful insights while protecting the privacy of individuals.
In practice, you must choose an appropriate value for epsilon (𝜀) based on your specific needs and risk tolerance. Higher epsilon values might be suitable when the accuracy of data is critical (e.g., scientific research use cases), while lower epsilon values would be more appropriate in sensitive applications where privacy is the top priority (e.g., health data).
Laplacian noise refers to random noise drawn from a Laplace distribution, which looks like a pointy curve centered at 0. This noise is used to obscure or mask the precise value of a result so that it's difficult for an attacker to infer whether a specific individual’s data is present or absent.
In most systems (like SQL or programming languages), random numbers are typically generated from a uniform distribution, meaning the random values are equally likely to be anywhere within a certain range, such as between -0.5 and 0.5. This uniform distribution is very different from the Laplace distribution, which is concentrated around 0. So, we need a way to convert uniform random numbers into Laplacian-distributed numbers. This conversion is done using a transformation involving the logarithm function. The transformation converts the uniform random number into a value that follows a Laplace distribution.
Generating Laplace noise requires converting uniformly distributed random numbers (generated using RAND()) into Laplace-distributed values, and this conversion relies on the inverse of the cumulative distribution function (CDF) of the Laplace distribution. This inverse transformation involves the logarithm function (LOG()).
To generate Laplace noise for a random variable, we need to:
Generate a uniformly distributed random number U in the range [−0.5, 0.5].
Apply the transformation noise = −(Δf / ε) × sign(U) × ln(1 − 2|U|). This transformation is necessary to convert the uniform distribution to a Laplace distribution with scale Δf / ε.
Where:
U is a random value between -0.5 and 0.5.
sign(U) ensures the noise is symmetrically distributed around 0 (positive or negative).
ln(1 − 2|U|) comes from inverting the Laplace CDF, and Δf / ε scales the noise to the sensitivity and the chosen privacy parameter.
The transformation is necessary because uniform random numbers are not naturally spread out like a Laplace distribution. Most values from a Laplace distribution cluster around 0, and fewer values are far from 0. By using the logarithm, we adjust the uniform distribution so that it has this same characteristic: most values are close to 0, but there is still some chance of larger positive or negative values.
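As a sketch (not taken from the original), this transformation could be written in Data Distiller SQL roughly as follows, assuming Spark SQL-style RAND(), SIGN(), ABS(), and LN() functions, a sensitivity of 1.0, and ε = 0.5:

```sql
-- Hypothetical sketch: convert one uniform random value per row into Laplace noise.
-- Scale b = sensitivity / epsilon = 1.0 / 0.5 = 2.0
WITH uniform AS (
  SELECT age,
         RAND() - 0.5 AS u                                   -- uniform value in [-0.5, 0.5)
  FROM healthcare_customers
)
SELECT age,
       -(1.0 / 0.5) * SIGN(u) * LN(1 - 2 * ABS(u)) AS laplace_noise,
       age + (-(1.0 / 0.5) * SIGN(u) * LN(1 - 2 * ABS(u))) AS noisy_age
FROM uniform;
```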
Deciding on the sensitivity of a set of Data Distiller Derived Attributes (combining numerical, Boolean, and categorical attributes) when applying differential privacy requires understanding how much the output of your query or function can change when a single individual's data is modified. The sensitivity will depend on the type of Derived Attribute and the function or model you are using.
The most common practice for finding the sensitivity of a derived attribute in a dataset is to examine the distribution of values that the derived attribute can take. This involves identifying the maximum change that can occur in the output of a query when a single individual’s data is added, removed, or modified. The sensitivity of the derived attribute is essentially the largest possible difference in the query result due to the presence or absence of one individual.
Let’s say you have a dataset of customers and you’re calculating a derived attribute called "total purchase amount" for each customer. This derived attribute is the sum of all purchases made by the customer over a specific period.
Step 1: Examine the distribution of the "purchase amount" attribute.
Suppose the purchase amounts range from $0 to $1,000.
Step 2: Determine the sensitivity by finding the maximum possible change in the derived attribute when one customer’s data is added or removed.
In this case, if a customer’s purchases are removed, the maximum change in the "total purchase amount" is $1,000 (if the customer made the maximum possible purchase of $1,000).
Thus, the sensitivity of the "total purchase amount" derived attribute is $1,000, because removing or adding a single customer could change the sum by that amount.
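One way to inspect that distribution in SQL is sketched below; the purchases table and purchase_amount column are hypothetical names used only for illustration:

```sql
-- Hypothetical sketch: bound the sensitivity of "total purchase amount"
-- by finding the largest per-customer total in the data.
SELECT MAX(customer_total) AS sensitivity_estimate
FROM (
  SELECT customer_id,
         SUM(purchase_amount) AS customer_total
  FROM purchases
  GROUP BY customer_id
) AS per_customer;
```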
Once you’ve determined the sensitivity of your derived attributes, the next step in applying differential privacy is to decide on the privacy parameter known as epsilon (𝜀). Epsilon controls the trade-off between privacy and utility: it dictates how much noise needs to be added to your query results based on the sensitivity, ensuring that individual data points are protected.
Let’s continue with the example from earlier where you are calculating the "total purchase amount" for each customer. You’ve determined that the sensitivity of this derived attribute is $1,000, meaning that the maximum change in the query result due to one individual's data is $1,000.
If you choose 𝜀 = 0.1, the noise added to your total purchase amount query will be significant, ensuring strong privacy. For instance, a query result of $10,000 might be distorted to something like $9,000 or $11,000 due to the noise.
If you choose 𝜀 = 1.0, the noise added will be much smaller, possibly resulting in the total purchase amount being reported as $9,950 or $10,050, providing more accuracy but slightly weaker privacy protection.
You can start with ε = 0.5 as a solid starting point because it provides a moderate balance between privacy and utility. It introduces enough noise to protect privacy in many use cases without overly distorting the data. From there, you can iterate by adjusting the value of epsilon, testing how it impacts both the privacy protection and the accuracy of your use cases. By gradually increasing or decreasing ε, you can find the optimal balance between privacy needs and the utility required for your specific analysis.
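To make the trade-off concrete, the Laplace noise scale is b = sensitivity / ε. The sketch below (not from the original) compares that scale for the $1,000 sensitivity from the example at a few candidate ε values:

```sql
-- Hypothetical sketch: compare noise scales b = sensitivity / epsilon for sensitivity = 1000.
SELECT 0.1 AS epsilon, 1000 / 0.1 AS laplace_scale_b   -- strong privacy, heavy noise
UNION ALL
SELECT 0.5, 1000 / 0.5                                 -- moderate starting point
UNION ALL
SELECT 1.0, 1000 / 1.0;                                -- weaker privacy, lighter noise
```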
Imagine you're analyzing healthcare customer data to segment patients based on age, total healthcare spending, and subscription status to healthcare services. These attributes are essential for tailoring healthcare plans, optimizing resource allocation, or delivering personalized healthcare recommendations. However, this data involves sensitive personal and health-related information, which requires a robust privacy-preserving approach.
The columns we'll include are:
PII Columns (to be dropped or anonymized):
customer_id: Unique identifier (anonymized).
name: Customer's full name (dropped).
phone_number: Contact information (anonymized).
email: Email address (anonymized).
address: Physical address (dropped).
Non-PII Columns (used for marketing/healthcare segmentation):
age: Numerical value representing customer age.
total_spent: Total healthcare spending by the customer (numerical).
subscription_status: Whether the customer has a healthcare subscription plan (boolean).
gender: Categorical data.
country: Categorical data representing the customer's location.
diagnosis_code: A code representing the medical condition (requires anonymization to protect patient data).
prescription: The name of the prescription medicine (requires anonymization).
Let us execute the following query:
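The exact statement isn't reproduced in this text; a minimal sketch that simply previews the columns listed above, assuming the healthcare_customers dataset ingested earlier, might look like:

```sql
-- Hypothetical sketch: preview the raw dataset before deciding what to drop, mask, or hash.
SELECT customer_id, name, phone_number, email, address,
       age, total_spent, subscription_status, gender, country,
       diagnosis_code, prescription
FROM healthcare_customers
LIMIT 10;
```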
The result will be:
PII (Personally Identifiable Information) columns such as name and address are typically dropped in differential privacy and anonymization processes for the following reasons.
Direct Identifiability:
Name: This is a direct identifier. Names can easily be linked back to specific individuals, making it impossible to protect privacy if they are included. Simply adding noise or anonymizing other attributes would not protect a person's identity if their full name is still present.
Address: Similarly, addresses are highly specific to individuals and can be easily used to trace back to a person. Even partial addresses or zip codes can be cross-referenced with public records or other data sources to identify someone.
Many privacy laws and regulations, such as GDPR in Europe and HIPAA in the United States, require the removal of identifiable data like names and addresses in datasets before sharing or using them for analytics. Keeping such columns in the dataset would violate these privacy regulations.
Even when other information is anonymized, attackers can perform linkage attacks by combining multiple datasets. For example, if an attacker knows a person’s address from another dataset, they could link that information with your dataset if the address is still present.
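The query itself isn't shown here; a sketch of dropping the direct identifiers (name and address) while keeping the remaining columns, assuming the same healthcare_customers dataset, could be:

```sql
-- Hypothetical sketch: drop direct identifiers, keep everything else for later anonymization.
SELECT customer_id, phone_number, email,
       age, total_spent, subscription_status, gender, country,
       diagnosis_code, prescription
FROM healthcare_customers;
```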
The result is:
Here are some decisions we will make on the remaining columns:
Customer ID (Anonymized via Hashing): The customer_id is often used as a key for uniquely identifying records, linking data across systems, or performing analysis without needing personal details like names. It is important for analytics purposes to track individuals in a dataset, but it should be anonymized to protect their identity.
Phone Number (Masked to Show Only the Last 4 Digits): The phone number can still provide some valuable information, such as area code for regional analysis, or the last 4 digits for certain use cases (e.g., verification of identity, identifying duplicate entries). Masking helps retain partial information for specific analyses.
Email (Anonymized via Hashing): Emails are often used for customer communication and identifying duplicates or tracking interactions. However, email addresses are highly sensitive because they can be linked to an individual, both within and outside the organization.
Hashing transforms the customer ID and Email into a unique but irreversible code, ensuring that the original ID/Email cannot be retrieved or linked back to the person. This allows the dataset to retain its uniqueness and analytic power while ensuring privacy.
By masking most of the digits and revealing only the last 4 digits, we ensure that the phone number is no longer personally identifiable. The last 4 digits alone are not sufficient to identify someone but may be useful for business logic purposes (e.g., verifying uniqueness).
Let us execute the following query:
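The original query isn't reproduced here; a hedged sketch, assuming Spark SQL-style SHA2, CAST, CONCAT, and SUBSTRING functions are available in Data Distiller, might look like:

```sql
-- Hypothetical sketch: hash the identifiers and mask the phone number.
SELECT SHA2(CAST(customer_id AS STRING), 256)          AS hashed_customer_id,   -- irreversible 64-character hash
       SHA2(email, 256)                                AS hashed_email,
       CONCAT('XXX-XXX-', SUBSTRING(phone_number, -4)) AS masked_phone_number,  -- keep only the last 4 digits
       age, total_spent, subscription_status, gender, country,
       diagnosis_code, prescription
FROM healthcare_customers;
```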
Observe the results very carefully:
SHA-256 (Secure Hash Algorithm 256-bit) is part of the SHA-2 family of cryptographic hash functions. It generates a 256-bit (32-byte) hash value, typically represented as a 64-character hexadecimal number.
SUBSTRING(phone_number, -4) extracts the last 4 characters of the phone number. The -4 index indicates that the function should start 4 characters from the end of the string.
We will leave some of these values untouched:
Diagnosis Code and Prescription: These columns are critical for certain types of healthcare segmentation (e.g., segmenting patients based on medical conditions or treatments).
Gender: Often used for segmentation (e.g., marketing or healthcare demographic analysis).
Subscription Status: Left unhashed because it is useful for segmentation and doesn't reveal personal identity.
Country: Typically used for geographic segmentation, which is important for understanding customer behavior or demographics in different regions.
Let us now compute the sensitivity and the epsilon for these two variables, Age and Total Spent.
Age
Epsilon for age: The formula uses ε = 0.5 for the age field.
Sensitivity for age: The sensitivity for age is assumed to be 1.0, as the maximum variation in age can be 1.0 across two snapshots of data.
Total Spent
Epsilon for total_spent: The same ε = 0.5 is used for the total_spent field.
Sensitivity for total_spent: The sensitivity for total_spent is 500, reflecting the assumption that one individual's spending could change the total by as much as $500.
Let us execute the following to generate the uniform random values for each column and each row:
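A sketch of such a query (not the original), assuming Spark SQL-style RAND() and the columns described above, could be:

```sql
-- Hypothetical sketch: one independent uniform draw in [-0.5, 0.5) per row for each noisy column.
SELECT age,
       total_spent,
       RAND() - 0.5 AS u_age,           -- uniform draw for the age column
       RAND() - 0.5 AS u_total_spent    -- independent uniform draw for total_spent
FROM healthcare_customers;
```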
If you execute:
You will get:
When dealing with categorical variables in the context of differential privacy, it’s important to consider both the sensitivity and the cardinality (i.e., the number of unique categories) of the variable. For high-cardinality categorical features, such as customer locations or product names, applying encoding techniques like one-hot encoding or feature hashing is common in machine learning tasks. One-hot encoding transforms each category into a binary vector, where each unique category becomes its own column, making the data more interpretable for machine learning models. However, this approach can lead to a large number of columns if the cardinality is high, potentially affecting performance and privacy.
In contrast, feature hashing (also known as the hashing trick) compresses high-cardinality categorical data by mapping categories to a fixed number of buckets using a hash function. While this reduces the number of columns and makes the dataset more manageable, it introduces collisions where different categories can be hashed into the same bucket. When applying differential privacy to categorical variables, it’s important to consider the sensitivity, which could be influenced by the number of possible categories. High-cardinality variables might require more noise to ensure privacy, or you could aggregate categories to reduce the cardinality and thus the required sensitivity.
A best practice in hashing is that the number of buckets should be at least equal to the cardinality (the number of unique category values).
For categorical variables, it is generally safe to assume that the sensitivity is 1 in the context of differential privacy. This assumption is commonly used when the query involves counting or querying the frequency of categories because the sensitivity reflects the maximum possible change in the query result when a single individual’s data is added or removed.
One-Hot Encoding Example for country column:
In one-hot encoding, each unique country will become its own binary column. For simplicity, let's assume we have three countries: USA, Canada, and Germany.
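A sketch of this in SQL (not the original query), using CASE expressions over the country column, might be:

```sql
-- Hypothetical sketch: one-hot encode the country column into three binary columns.
SELECT country,
       CASE WHEN country = 'USA'     THEN 1 ELSE 0 END AS country_usa,
       CASE WHEN country = 'Canada'  THEN 1 ELSE 0 END AS country_canada,
       CASE WHEN country = 'Germany' THEN 1 ELSE 0 END AS country_germany
FROM healthcare_customers;
```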
The result would be:
Feature Hashing Example for the country column:
In feature hashing, we map the country values to a fixed number of hash buckets. Let’s assume we want to map the country column to 3 hash buckets.
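A hedged sketch, assuming Spark SQL-style HASH() and PMOD() functions are available in Data Distiller, could map each country value to one of three buckets:

```sql
-- Hypothetical sketch: feature-hash the country column into 3 buckets.
-- PMOD keeps the bucket index non-negative even when HASH() returns a negative value.
SELECT country,
       PMOD(HASH(country), 3) AS country_bucket
FROM healthcare_customers;
```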
The result would be: