IDR 300: Understanding and Mitigating Profile Collapse in Identity Resolution with Data Distiller
Mastering profile cleanup transforms data chaos into clarity, enabling accurate, unified real-time customer profiles with 15+ algorithms.
For this tutorial, you will need to ingest the following dataset:
using the steps outlined in:
Here is a recap of the fundamental concepts of Adobe Experience Platform. Data enters the Adobe Experience Platform through edge, streaming, and batch methods. Regardless of the ingestion mode, all data must find its place in a dataset within the Platform Data Lake. The ingestible data falls into three categories: attributes, events, and lookups. The Real-Time Customer Profile operates with two concurrent stores: a Profile Store and an Identity Store. The Profile Store takes in and partitions data based on the storage key, which is the primary identity. Meanwhile, the Identity Store continuously looks for associations among the identities in each ingested record, including the primary one, and uses them to construct the identity graph. Together with the historical data in the Platform Data Lake, these two stores open avenues for advanced modeling, such as RFM, among other techniques.
Deterministic and probabilistic identity resolution are two methods used to match and merge customer identities across data sources, each with its own trade-offs.
Deterministic identity resolution relies on exact matches of unique identifiers, such as email addresses, phone numbers, or customer IDs. This approach is highly accurate when the data is consistent, ensuring a precise link between records. However, its limitation lies in the need for identical identifiers across systems, which can reduce match rates when variations in data exist. Adobe Experience Platform uses deterministic identity resolution.
On the other hand, probabilistic identity resolution uses statistical algorithms and data attributes (e.g., name, location, browsing behavior) to calculate the likelihood that different records belong to the same individual. While this approach increases the chances of finding connections even when identifiers differ, it introduces some uncertainty, as matches are based on probability rather than certainty.
Profile collapse in Adobe Experience Platform occurs when data from different individuals is mistakenly merged into a single customer profile. This typically happens due to identity management issues or data inconsistencies during the process of data ingestion and identity resolution.
Data can enter Adobe Experience Platform (AEP) via edge, streaming, or batch ingestion methods. Regardless of how it arrives, all data is ultimately stored within the Platform Data Lake and categorized as attributes, events, or lookups. To control how the identity algorithm processes data, you need to prepare the data before it enters the Identity Store.
There are two scenarios for data handling:
Batch Ingestion: In this case, data should be processed by Data Distiller and then ingested into both stores. This approach works seamlessly, as Data Distiller can handle these jobs in either full or incremental processing modes.
Real-Time Ingestion (Edge or Streaming): In this scenario, processing cannot occur immediately because you need to act on data in real time. However, all real-time data is stored in the Data Lake, allowing for periodic identity operations when system downtime is possible, preferably during low-traffic periods. To perform these operations, follow these steps:
Disable the existing dataflow for Dataset A, where real-time data is currently being ingested.
Unmark Dataset A for Profile.
Apply the identity algorithm operations to Dataset A, creating a new Dataset B and marking it for Profile. Ensure Dataset B is empty and marked for Profile before hydrating it.
Set up a new dataflow from the same source to Dataset B. This setup will ensure that both historical and new streaming data are ingested seamlessly.
Example: When multiple users share the same device or browser, the same Experience Cloud ID (ECID) is used for all interactions.
How It Causes Profile Collapse: Since the ECID remains constant across different users, AEP may mistakenly associate all interactions with a single profile, merging data from multiple individuals. For example, if two customers log in using the same device, their distinct CRM IDs may both get linked to the same ECID, leading to a merged profile in the Profile Store.
Example: A single user may appear under different identifiers across data sources or channels, such as different CRM IDs for B2B and B2C interactions, different emails, or multiple login credentials.
How It Causes Profile Collapse: When AEP tries to link these identities, it may merge data from different identifiers into a single profile. If the linking is done inconsistently or incorrectly, it can lead to a collapsed profile where data from different roles or personas of the same individual gets combined inappropriately.
Example: Poor data quality can lead to duplicate records where the same person is listed multiple times with slight variations in the data.
How It Causes Profile Collapse: If these duplicate records are not accurately distinguished, AEP's identity resolution may treat them as separate initially but later merge them into one profile, resulting in collapsed data.
Example: Users interact across multiple devices or channels, generating different identifiers (e.g., cookies, mobile IDs, ECIDs).
How It Causes Profile Collapse: If identity resolution does not accurately map these identifiers back to the same person, data from different users who share similar device or channel characteristics may be incorrectly merged, causing profile collapse.
Example: Multiple users on the same network (e.g., public Wi-Fi) can appear to share similar network-based identifiers, such as IP addresses.
How It Causes Profile Collapse: If network-based identifiers are used as part of the identity resolution, this can result in incorrect associations between different users on the same network, leading to merged profiles.
Using IP addresses as identifiers in profile resolution can be problematic because they do not uniquely represent individual users. Here are several reasons why IP addresses can lead to inaccurate identity resolution and profile collapse:
Shared Networks: Multiple users often share the same IP address when they are on the same network, such as public Wi-Fi in a coffee shop, office network, or a household router. In these scenarios, an IP-based identifier may incorrectly group different users as one, resulting in profile collapse.
Dynamic IP Addresses: Many internet service providers (ISPs) assign dynamic IP addresses, meaning a user’s IP can change over time. For instance, when a user reconnects to the internet or moves between networks, they might receive a new IP address. This variability can lead to fragmented profiles for the same user or incorrect matches with other users.
Proxy Servers and VPNs: Users who access the internet via a proxy server or a virtual private network (VPN) can share the same IP address across multiple devices and locations. This introduces further ambiguity, as users from entirely different networks can appear to have identical IP addresses, complicating identity resolution.
Geographic Misinterpretations: IP addresses can sometimes be misleading in terms of location, especially with mobile carriers or large ISPs that route traffic through different hubs. Users may appear to be connecting from one location even if they are in a completely different one, which can skew profile data and misrepresent user behavior.
IP Address Rotation and NAT: Many large networks, especially in enterprise or cellular environments, use Network Address Translation (NAT), allowing multiple devices to share a single public IP address. In these cases, hundreds or thousands of users may appear to be connecting from the same IP, making it an unreliable identifier for individual profiles.
Example: Integrating third-party data with different identity keys (e.g., hashed emails, mobile IDs) can complicate the identity resolution process.
How It Causes Profile Collapse: If the third-party identity keys are not consistently linked to primary identifiers in AEP, data from different individuals may be mistakenly merged.
Keeping prospect data separate from existing customer data in Adobe Experience Platform is crucial for several reasons, primarily around data accuracy, compliance, and targeted engagement. Here’s a breakdown of why this separation is important:
Different Data Quality and Attributes: Prospect data often lacks the depth and reliability of existing customer data, as it may come from third-party sources, form fills, or inferred behavior. Mixing it with verified customer data can dilute the accuracy of customer profiles and lead to mistaken assumptions or merges.
Targeted Engagement Strategies: Prospects and existing customers are at different stages in the customer journey and require distinct engagement strategies. Keeping them separate allows businesses to personalize messaging based on the user’s status—whether they are a potential customer needing nurturing or an existing customer who may benefit from cross-selling or loyalty programs.
Compliance and Privacy Requirements: Privacy regulations often impose stricter handling requirements for prospect data, as it may include inferred interests or demographics without direct consent. By isolating prospect data, you can manage it according to specific data handling policies, reducing the risk of inadvertently using unconsented data in customer-specific actions or analyses.
Avoiding Profile Collapse and Data Pollution: If prospect data is integrated with customer data prematurely, it increases the likelihood of profile collapse, where profiles may merge incorrectly due to weak or partial identifiers. This mixing can lead to inaccurate profiling, mistaken identity resolution, and ineffective targeting. Keeping the data separate helps maintain clean, verified customer records.
Flexible and Scalable Identity Resolution: By separating prospects, organizations can apply tailored identity resolution processes, especially if the prospect data has a different set of identifiers or is less frequently updated. This approach ensures that identity resolution can be scaled and adjusted for prospects without impacting customer data accuracy.
Example: Data sources may update at different intervals, such as real-time data streams versus weekly CRM updates.
How It Causes Profile Collapse: If identities are linked or unlinked based on out-of-sync information, the Identity Store may merge profiles incorrectly.
Example: Errors in the data ingestion process can send incorrect identity mapping information to AEP.
How It Causes Profile Collapse: These mistakes can lead to improper associations between identifiers, causing unrelated profiles to be merged.
Profile collapse reduces the accuracy and effectiveness of customer data in several ways:
Incorrect Personalization: Merging unrelated data results in irrelevant or misleading personalization, causing customers to receive content or offers that don't apply to them.
Data Quality and Analysis Issues: Aggregated metrics may reflect combined behaviors of different individuals, leading to skewed analysis and inaccurate insights.
Privacy Risks: Mixing data from multiple users can unintentionally expose personal information to the wrong individual.
To minimize the risk of profile collapse, consider:
Refining Identity Matching Rules: Use more precise matching criteria and multiple attributes to accurately resolve identities.
Improving Data Quality: Address duplicate records and inconsistencies at the source before ingestion.
Configuring Identity Graphs Carefully: Ensure identity resolution rules and graph configurations account for different identity sources and their unique characteristics.
Regular Monitoring and Auditing: Implement checks to detect and rectify potential profile collapses.
In the context of profile collapse, the trade-off between deterministic and probabilistic identity resolution revolves around balancing accuracy and coverage.
Deterministic methods offer high accuracy by linking profiles only when there are exact matches on unique identifiers, reducing the risk of mistakenly merging different individuals but potentially leaving some profiles fragmented if identifiers are inconsistent or missing.
On the other hand, probabilistic methods expand coverage by using statistical algorithms to match records based on similarities in behaviors, patterns, or non-unique identifiers. While this approach can unify more profiles even when exact matches are unavailable, it also increases the likelihood of false positives, leading to profile collapse where data from different individuals is mistakenly combined.
To mitigate this, typically a hybrid approach can be used, starting with deterministic matches and then leveraging probabilistic methods to fill in gaps, aiming to balance the need for comprehensive identity resolution with the risk of inaccurate profile merging.
First, we will identify the profiles that are collapsed, based on the presence of different CRM_IDs for the same ECID.
This query identifies the ECID values associated with multiple CRM_IDs, indicating potential collapsed profiles.
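The detection step can be sketched as follows. This is a minimal illustration that runs a GROUP BY / HAVING query through Python's built-in sqlite3 module rather than Data Distiller; the table name example_dataset and the column names come from the text, while the sample rows and the SQLite dialect are assumptions.

```python
import sqlite3

# Hypothetical sample rows: (ECID, CRM_ID, Device_Type, Login_Timestamp).
rows = [
    ("ecid-1", "B2C-100", "mobile", "2024-01-05 10:00:00"),
    ("ecid-1", "B2B-200", "desktop", "2024-02-01 09:30:00"),
    ("ecid-2", "B2C-300", "mobile", "2024-01-20 14:15:00"),
    ("ecid-2", "B2C-300", "desktop", "2024-03-02 08:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE example_dataset "
    "(ECID TEXT, CRM_ID TEXT, Device_Type TEXT, Login_Timestamp TEXT)"
)
conn.executemany("INSERT INTO example_dataset VALUES (?, ?, ?, ?)", rows)

# An ECID tied to more than one distinct CRM_ID is a candidate collapsed profile.
collapsed = conn.execute(
    """
    SELECT ECID, COUNT(DISTINCT CRM_ID) AS crm_id_count
    FROM example_dataset
    GROUP BY ECID
    HAVING COUNT(DISTINCT CRM_ID) > 1
    ORDER BY ECID
    """
).fetchall()

print(collapsed)
```

In this sample only ecid-1 is reported, because ecid-2 appears on two devices but always with the same CRM_ID.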
When cleaning datasets to resolve collapsed profiles caused by multiple CRM_IDs linked to the same ECID, there are various rule-based strategies that can be employed. The choice of strategy depends on business requirements and the characteristics of the data. Here are some possible rule-based strategies to resolve such profile collapses:
Description: This strategy focuses on retaining the most recent record for each ECID by using the Login_Timestamp to identify the latest activity.
Use Case: This approach is valuable when the latest activity is considered the most accurate or relevant representation of a user profile. By keeping only the most recent data, you ensure that the profile reflects the most current user information, excluding outdated or redundant records that may no longer be valid.
Implementation Example: Execute the following code blocks sequentially to implement this method, making sure to run each step individually to maintain data integrity:
Running the collapse check against the cleaned dataset should return no rows.
For the algorithms below, we won't show the empty result screen, but an empty result from that check is the expected outcome whenever an algorithm works correctly.
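The most-recent-record rule can be sketched outside Data Distiller as follows. This is a minimal Python illustration, not the tutorial's SQL; only the column names ECID, CRM_ID, and Login_Timestamp come from the text, while the sample rows and the helper names parse_ts and keep_latest_per_ecid are hypothetical.

```python
from datetime import datetime

# Hypothetical sample rows using the column names from the text.
rows = [
    {"ECID": "ecid-1", "CRM_ID": "B2C-100", "Login_Timestamp": "2024-01-05 10:00:00"},
    {"ECID": "ecid-1", "CRM_ID": "B2B-200", "Login_Timestamp": "2024-02-01 09:30:00"},
    {"ECID": "ecid-2", "CRM_ID": "B2C-300", "Login_Timestamp": "2024-01-20 14:15:00"},
]


def parse_ts(value):
    """Parse the assumed 'YYYY-MM-DD HH:MM:SS' timestamp format."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def keep_latest_per_ecid(records):
    """Keep, for each ECID, the record with the greatest Login_Timestamp."""
    latest = {}
    for record in records:
        current = latest.get(record["ECID"])
        if current is None or parse_ts(record["Login_Timestamp"]) > parse_ts(
            current["Login_Timestamp"]
        ):
            latest[record["ECID"]] = record
    return list(latest.values())


cleaned = keep_latest_per_ecid(rows)
for record in cleaned:
    print(record["ECID"], record["CRM_ID"])
```

Each ECID ends up with exactly one CRM_ID, so the collapse check finds nothing left to report for this sample.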
Description: This strategy retains only the earliest record for each ECID, determined by the Login_Timestamp.
Use Case: This approach is useful when the initial identification or first interaction with a profile is considered the most reliable, ensuring that any subsequent, potentially inconsistent updates do not affect the profile’s original data.
Implementation Example: Execute each code block one at a time to avoid conflicts and ensure data integrity:
Description: This strategy retains records based on a preferred CRM_ID type, giving priority to a specific type (e.g., always keeping B2C over B2B).
Use Case: This approach is helpful when certain customer relationship types are more important for analysis or operations. For example, retail customers (B2C) may be prioritized over business customers (B2B) to focus on individual consumer behavior in retail environments.
Implementation Example: Execute each code block sequentially to ensure the correct data is retained without conflicts:
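A type-priority rule of this kind might look like the sketch below. It is a Python illustration rather than Data Distiller SQL, and it assumes the CRM_ID type is encoded as a prefix before a dash (for example "B2C-100"); that encoding, the sample rows, and the names TYPE_PRIORITY, crm_type, and keep_preferred_type are all hypothetical.

```python
# Hypothetical rows; the CRM_ID type is assumed to be the prefix before the dash.
rows = [
    {"ECID": "ecid-1", "CRM_ID": "B2B-200"},
    {"ECID": "ecid-1", "CRM_ID": "B2C-100"},
    {"ECID": "ecid-2", "CRM_ID": "B2B-400"},
]

# Lower number wins: B2C is preferred over B2B, as in the example in the text.
TYPE_PRIORITY = {"B2C": 0, "B2B": 1}


def crm_type(crm_id):
    """Return the assumed type prefix of a CRM_ID."""
    return crm_id.split("-", 1)[0]


def keep_preferred_type(records):
    """Keep one record per ECID, choosing the best-ranked CRM_ID type."""
    best = {}
    for record in records:
        # Unknown types rank after every known type.
        rank = TYPE_PRIORITY.get(crm_type(record["CRM_ID"]), len(TYPE_PRIORITY))
        current = best.get(record["ECID"])
        if current is None or rank < current[0]:
            best[record["ECID"]] = (rank, record)
    return [record for _, record in best.values()]


cleaned = keep_preferred_type(rows)
print([(r["ECID"], r["CRM_ID"]) for r in cleaned])
```

An ECID that only has a B2B record keeps it; the priority only matters when both types compete for the same ECID. Ties within the same type fall back to the first record seen, which a real job would probably replace with a timestamp tie-break.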
Description: Apply a scoring system to rank profiles by various attributes (such as login recency, interaction count, or device type), then retain the highest-scoring record for each ECID.
Use Case: Ideal when determining the "best" record requires multiple criteria, offering a more refined method for resolving collapsed profiles.
Implementation Example: Execute each code block in sequence to ensure accurate and conflict-free data processing:
Description: Instead of discarding any records, merge the attributes from all records associated with the same ECID. This could involve creating lists of values, aggregating metrics, or applying other transformation rules.
Use Case: Suitable when all available information is valuable and should be preserved, such as combining multiple phone numbers or email addresses associated with a profile.
Implementation Example: Make sure you execute the code blocks one by one
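As a rough sketch of attribute merging, the Python below collects every CRM_ID and Device_Type seen for an ECID into lists and counts the contributing records. It is not the tutorial's Data Distiller code; the sample rows, the output field names, and the function name merge_attributes are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows using the column names from the text.
rows = [
    {"ECID": "ecid-1", "CRM_ID": "B2C-100", "Device_Type": "mobile"},
    {"ECID": "ecid-1", "CRM_ID": "B2B-200", "Device_Type": "desktop"},
    {"ECID": "ecid-1", "CRM_ID": "B2C-100", "Device_Type": "desktop"},
    {"ECID": "ecid-2", "CRM_ID": "B2C-300", "Device_Type": "mobile"},
]


def merge_attributes(records):
    """Merge all records for an ECID into one entry holding lists of values."""
    merged = defaultdict(
        lambda: {"CRM_IDs": set(), "Device_Types": set(), "record_count": 0}
    )
    for record in records:
        entry = merged[record["ECID"]]
        entry["CRM_IDs"].add(record["CRM_ID"])
        entry["Device_Types"].add(record["Device_Type"])
        entry["record_count"] += 1
    # Convert the sets to sorted lists so the output is stable.
    return {
        ecid: {
            "CRM_IDs": sorted(entry["CRM_IDs"]),
            "Device_Types": sorted(entry["Device_Types"]),
            "record_count": entry["record_count"],
        }
        for ecid, entry in merged.items()
    }


merged = merge_attributes(rows)
print(merged["ecid-1"])
```

Note that this approach preserves every CRM_ID on the merged row, so it produces one row per ECID without deciding which CRM_ID is the correct one.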
Description: If an ECID is associated with multiple CRM_ID types, keep the latest entry for each type.
Use Case: Useful in scenarios where having both B2B and B2C relationships is important for the same user, and the latest activity for each type is relevant.
Implementation Example: Make sure you execute the code blocks one by one
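The per-type rule can be illustrated by grouping on the pair (ECID, CRM_ID type) instead of on ECID alone. The sketch below is Python, not Data Distiller SQL, and again assumes the type is the prefix before the dash in CRM_ID; the sample rows and the name latest_per_type are hypothetical.

```python
# Hypothetical rows; the CRM_ID type is assumed to be the prefix before the dash.
rows = [
    {"ECID": "ecid-1", "CRM_ID": "B2C-100", "Login_Timestamp": "2024-01-05 10:00:00"},
    {"ECID": "ecid-1", "CRM_ID": "B2C-101", "Login_Timestamp": "2024-03-01 12:00:00"},
    {"ECID": "ecid-1", "CRM_ID": "B2B-200", "Login_Timestamp": "2024-02-01 09:30:00"},
]


def latest_per_type(records):
    """Keep the most recent record for each (ECID, CRM_ID type) pair."""
    latest = {}
    for record in records:
        key = (record["ECID"], record["CRM_ID"].split("-", 1)[0])
        current = latest.get(key)
        # Zero-padded 'YYYY-MM-DD HH:MM:SS' strings sort chronologically.
        if current is None or record["Login_Timestamp"] > current["Login_Timestamp"]:
            latest[key] = record
    return latest


latest = latest_per_type(rows)
for (ecid, kind), record in latest.items():
    print(ecid, kind, record["CRM_ID"])
```

The result deliberately keeps up to one CRM_ID per type for each ECID, so an ECID with both a B2B and a B2C relationship will still show two CRM_IDs afterward.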
Description: If an ECID is linked to different CRM_ID types that cannot be resolved through any of the above rules, these profiles can be flagged or removed for manual review.
Use Case: Suitable for cases where automation cannot confidently resolve the conflicts, requiring human oversight.
Implementation Example: Make sure you execute the code blocks one by one
Description: Reference an external dataset, such as a master customer list, to resolve which CRM_ID should be considered valid.
Use Case: This approach is useful when a trusted external source can guide the resolution, especially when there are well-maintained external records.
Implementation Example: Make sure you execute the code blocks one by one
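One way to picture the external-reference rule is a lookup against a set of trusted CRM_IDs. The Python sketch below is illustrative only; the master list contents, the sample rows, and the name resolve_with_master are hypothetical, and a real job would join against the external dataset in Data Distiller instead.

```python
# Hypothetical sample rows and a hypothetical trusted master list of CRM_IDs.
rows = [
    {"ECID": "ecid-1", "CRM_ID": "B2C-100"},
    {"ECID": "ecid-1", "CRM_ID": "B2B-200"},
    {"ECID": "ecid-2", "CRM_ID": "B2C-999"},
]
master_crm_ids = {"B2C-100", "B2C-300"}


def resolve_with_master(records, trusted_ids):
    """Keep records whose CRM_ID is trusted; report ECIDs with no trusted match."""
    by_ecid = {}
    for record in records:
        by_ecid.setdefault(record["ECID"], []).append(record)

    kept, unresolved = [], []
    for ecid, group in by_ecid.items():
        valid = [record for record in group if record["CRM_ID"] in trusted_ids]
        if valid:
            kept.extend(valid)
        else:
            unresolved.append(ecid)
    return kept, sorted(unresolved)


kept, unresolved = resolve_with_master(rows, master_crm_ids)
print(kept)
print(unresolved)
```

If an ECID matches more than one trusted CRM_ID, this sketch keeps all of them, so a second rule (such as recency) would still be needed to reach one record per ECID.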
Description: Assign confidence scores to each CRM_ID association based on various factors, such as the number of interactions, consistency of login information, or verification level. Higher confidence scores indicate stronger associations.
Use Case: Useful when multiple identifiers are present, and some are more reliable than others. This strategy helps prioritize the most trusted associations.
Implementation Example: Make sure you execute the code blocks one by one
This query creates a cleaned dataset by selecting the highest-confidence record for each unique ECID, effectively reducing profile overlap and potential data duplication. It starts by calculating a Confidence_Score for each record in the example_dataset based on the frequency of occurrences and device type, with desktop interactions receiving a higher weight. Next, it ranks each record within each ECID group according to this confidence score, assigning the top rank to the highest-scoring entry. Finally, the query keeps only the top-ranked record for each ECID, ensuring that each profile is represented by a single, most reliable entry. This approach helps to minimize profile collapse by excluding lower-confidence records and prioritizing the most relevant data for each unique customer.
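The scoring logic described above can be approximated in a few lines of Python. The exact formula used by the query is not recoverable from the text, so this sketch assumes a score equal to the number of rows for an (ECID, CRM_ID) pair multiplied by a device weight, with desktop weighted 1.5 and everything else 1.0; those weights, the sample rows, and the name top_confidence are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows: (ECID, CRM_ID, Device_Type).
rows = [
    ("ecid-1", "B2C-100", "mobile"),
    ("ecid-1", "B2C-100", "mobile"),
    ("ecid-1", "B2B-200", "desktop"),
    ("ecid-2", "B2C-300", "desktop"),
]

# Assumed weights: desktop interactions count more than other device types.
DEVICE_WEIGHT = {"desktop": 1.5}


def top_confidence(records):
    """Return the highest-scoring (score, CRM_ID, Device_Type) for each ECID."""
    frequency = Counter((ecid, crm_id) for ecid, crm_id, _ in records)
    best = {}
    for ecid, crm_id, device in records:
        score = frequency[(ecid, crm_id)] * DEVICE_WEIGHT.get(device, 1.0)
        if ecid not in best or score > best[ecid][0]:
            best[ecid] = (score, crm_id, device)
    return best


best = top_confidence(rows)
print(best)
```

Here the repeated mobile association (score 2.0) outranks the single desktop association (score 1.5) for ecid-1, which shows how frequency and device weight trade off against each other under these assumed weights.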
Description: Retain records based on a time-based rule, such as keeping data within a specific time window (e.g., last six months). Older data can be archived or discarded.
Use Case: Suitable when the most recent data is more relevant than historical data, or when you want to reduce data volume while keeping the latest information.
Implementation Example: Make sure you execute the code blocks one by one
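A time-window rule can be sketched as a simple split into recent and archived records. The Python below is illustrative rather than Data Distiller SQL; it approximates "six months" as 180 days and uses a fixed reference date so the result is reproducible. The sample rows, the 180-day figure, and the name split_by_window are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows using the column names from the text.
rows = [
    {"ECID": "ecid-1", "CRM_ID": "B2C-100", "Login_Timestamp": "2023-06-01 10:00:00"},
    {"ECID": "ecid-1", "CRM_ID": "B2B-200", "Login_Timestamp": "2024-05-01 09:30:00"},
    {"ECID": "ecid-2", "CRM_ID": "B2C-300", "Login_Timestamp": "2024-03-15 14:15:00"},
]


def split_by_window(records, as_of, window_days=180):
    """Split records into those inside the time window and those to archive."""
    cutoff = as_of - timedelta(days=window_days)
    recent, archived = [], []
    for record in records:
        ts = datetime.strptime(record["Login_Timestamp"], "%Y-%m-%d %H:%M:%S")
        if ts >= cutoff:
            recent.append(record)
        else:
            archived.append(record)
    return recent, archived


recent, archived = split_by_window(rows, as_of=datetime(2024, 6, 30))
print(len(recent), len(archived))
```

A window on its own does not guarantee one CRM_ID per ECID: two conflicting records inside the window both survive, so this rule is usually paired with one of the other strategies.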
Description: Aggregate certain fields for profiles that share the same ECID, combining data using aggregation functions (e.g., summing transaction counts, taking the latest email address).
Use Case: Useful for combining behavioral data while still allowing for unique identifiers to coexist within a merged profile.
Implementation Example: Make sure you execute the code blocks one by one
Description: Use business-specific rules to decide which records to keep. For example, prioritize records from specific CRM systems (e.g., always keep records from a particular system if multiple sources are integrated).
Use Case: Effective in organizations with a well-defined hierarchy of data sources, where some data sources are known to be more reliable than others.
Implementation Example: Make sure you execute the code blocks one by one
Description: Combine multiple fields (e.g., ECID + Device_Type) to create a composite identifier for deduplication. This strategy adds additional granularity to the process.
Use Case: Helpful in distinguishing between profiles that share the same ECID but differ in other aspects like device type or location.
Implementation Example: Make sure you execute the code blocks one by one
The above query aims to clean up a dataset to prevent profile collapse by ensuring that each ECID (the Experience Cloud ID) is represented by only one record, even if there are multiple entries across different devices or CRM IDs. This is achieved through a two-step deduplication process. In the first step, we use a subquery to partition the data by both ECID and Device_Type, ordering by Login_Timestamp in descending order to keep only the most recent record for each unique combination of ECID and Device_Type. This intermediate result ensures that for each ECID, only the latest activity per device type is retained. In the second step, an outer query applies another ROW_NUMBER() partitioned solely by ECID, ordering again by Login_Timestamp in descending order. By selecting only the top-ranked record (final_rn = 1), the query retains only the latest entry per ECID across all device types. This two-layered approach effectively removes duplicate entries and prevents profile collapse by consolidating each ECID into a single, most recent record, providing a clean and unified dataset.
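The two-step process described above can be mirrored in Python to make the intermediate result visible. This is a sketch of the same idea, not the query itself; the sample rows and the helper name latest_by are hypothetical, and the dictionary-based grouping stands in for the ROW_NUMBER() partitions.

```python
# Hypothetical sample rows using the column names from the text.
rows = [
    {"ECID": "ecid-1", "CRM_ID": "B2C-100", "Device_Type": "mobile",
     "Login_Timestamp": "2024-01-05 10:00:00"},
    {"ECID": "ecid-1", "CRM_ID": "B2B-200", "Device_Type": "mobile",
     "Login_Timestamp": "2024-02-01 09:30:00"},
    {"ECID": "ecid-1", "CRM_ID": "B2C-100", "Device_Type": "desktop",
     "Login_Timestamp": "2024-03-10 08:00:00"},
    {"ECID": "ecid-2", "CRM_ID": "B2C-300", "Device_Type": "mobile",
     "Login_Timestamp": "2024-01-20 14:15:00"},
]


def latest_by(records, key_fields):
    """Keep the most recent record for each distinct combination of key_fields."""
    latest = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        current = latest.get(key)
        # Zero-padded 'YYYY-MM-DD HH:MM:SS' strings sort chronologically.
        if current is None or record["Login_Timestamp"] > current["Login_Timestamp"]:
            latest[key] = record
    return list(latest.values())


# Step 1: latest record per (ECID, Device_Type) composite key.
per_device = latest_by(rows, ("ECID", "Device_Type"))
# Step 2: latest record per ECID across all device types.
final = latest_by(per_device, ("ECID",))

print(len(per_device), [(r["ECID"], r["CRM_ID"]) for r in final])
```

The intermediate per_device result is useful on its own when one record per device should be preserved; the second pass is what reduces the data to a single record per ECID.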
Description: Use clustering algorithms (e.g., k-means, hierarchical clustering) to group similar profiles and resolve which ones are likely to represent the same individual.
Use Case: Useful when there are subtle differences in data attributes across profiles, and statistical methods can help identify groups that should be merged.
Implementation Example: This approach would be executed with Data Distiller's advanced statistics and machine learning features.
To address profile collapse, our clustering should identify profiles that are likely to belong to the same individual based on shared or similar identifiers. Here’s an approach focused on preventing profile collapse:
Device ID Consistency:
Count the number of unique Device_Type values associated with each ECID and CRM_ID.
Profiles with a wide variety of device types may indicate multiple users on shared devices, which could contribute to profile collapse.
device_type_variety: Counts the distinct Device_Type values used by each ECID. High variety here may indicate a shared device or cross-identity usage.
Shared CRM Identifiers:
Calculate the frequency of each CRM_ID per ECID. High frequencies might indicate cases where multiple profiles with the same CRM_ID are collapsed into one.
A high count of distinct CRM_ID values per ECID suggests possible identity collisions (e.g., a B2B and a B2C CRM_ID on the same ECID).
crm_id_count: Counts distinct CRM_ID values per ECID, helping to detect cases where multiple identities are merged into one, indicating a potential profile collapse.
Login Recency:
While not as direct, recency can still help detect anomalies where profile collapse might have caused inactive or mismatched data.
login_recency: While not a direct indicator, it can provide context for activity levels, which may be useful if certain collapsed profiles appear inactive compared to expected activity.
By clustering based on these features, you're more likely to identify clusters where ECIDs are artificially merged due to overlapping CRM_IDs or device types. Profiles within the same cluster that have multiple CRM_IDs could be flagged for further review or disambiguation, effectively isolating cases where profile collapse is likely occurring.
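The three features named above can be computed per ECID before any clustering is applied. The Python below is a sketch of the feature-building step only, not the clustering itself and not Data Distiller code; the sample rows, the fixed reference date, the choice of whole days for login_recency, and the name build_features are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (ECID, CRM_ID, Device_Type, Login_Timestamp).
rows = [
    ("ecid-1", "B2C-100", "mobile", "2024-01-05 10:00:00"),
    ("ecid-1", "B2B-200", "desktop", "2024-02-01 09:30:00"),
    ("ecid-2", "B2C-300", "mobile", "2024-01-20 14:15:00"),
]


def build_features(records, as_of):
    """Compute device_type_variety, crm_id_count and login_recency per ECID."""
    devices = defaultdict(set)
    crm_ids = defaultdict(set)
    last_login = {}
    for ecid, crm_id, device, ts in records:
        devices[ecid].add(device)
        crm_ids[ecid].add(crm_id)
        parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if ecid not in last_login or parsed > last_login[ecid]:
            last_login[ecid] = parsed
    return {
        ecid: {
            "device_type_variety": len(devices[ecid]),
            "crm_id_count": len(crm_ids[ecid]),
            # Whole days between the latest login and the reference date.
            "login_recency": (as_of - last_login[ecid]).days,
        }
        for ecid in last_login
    }


features = build_features(rows, as_of=datetime(2024, 3, 1))
print(features)
```

A clustering algorithm would then take these numeric rows as input; features on different scales, such as day counts versus small integer counts, generally need normalizing first.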
Description: Train a machine learning model (e.g., logistic regression, decision trees) to predict whether two profiles should be merged based on features like similarity of CRM_IDs, login patterns, or device types.
Use Case: Ideal for complex datasets where simple rules don't capture the nuances, and historical labeled data is available to train a model.
Implementation Example: Use a machine learning framework to train the model, then apply the model predictions in SQL to clean up the dataset.
Assume that:
We have historical data with labeled pairs of ECIDs and their features (CRM_ID similarity, device type overlap, login frequency similarity, etc.).
We use logistic regression for binary classification to predict “merge” or “do not merge” for each profile pair.
Let us first generate the feature engineering dataset:
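As an illustration of what pairwise features could look like, the Python sketch below builds one feature row per pair of ECIDs. The specific measures are assumptions, since the text does not define them: Jaccard overlap for CRM_ID similarity and device type overlap, and a min/max ratio for login frequency similarity. The profile summaries and the names jaccard, ratio, and pair_features are hypothetical.

```python
from itertools import combinations

# Hypothetical per-ECID summaries.
profiles = {
    "ecid-1": {"crm_ids": {"B2C-100", "B2B-200"}, "devices": {"mobile", "desktop"}, "logins": 10},
    "ecid-2": {"crm_ids": {"B2C-100"}, "devices": {"mobile"}, "logins": 5},
    "ecid-3": {"crm_ids": {"B2C-900"}, "devices": {"tablet"}, "logins": 20},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ratio(x, y):
    """Smaller count divided by larger count (0.0 if both are zero)."""
    largest = max(x, y)
    return min(x, y) / largest if largest else 0.0


def pair_features(data):
    """Build one feature dictionary for every unordered pair of ECIDs."""
    features = {}
    for left, right in combinations(sorted(data), 2):
        a, b = data[left], data[right]
        features[(left, right)] = {
            "crm_id_similarity": jaccard(a["crm_ids"], b["crm_ids"]),
            "device_type_overlap": jaccard(a["devices"], b["devices"]),
            "login_frequency_similarity": ratio(a["logins"], b["logins"]),
        }
    return features


features = pair_features(profiles)
print(features[("ecid-1", "ecid-2")])
```

Each feature row would be joined with its historical "merge" or "do not merge" label before training. Generating all pairs grows quadratically with the number of ECIDs, so a real pipeline would likely restrict pairs to candidates that share at least one identifier.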
Train the logistic regression model:
Description: Identify high-risk profile collapses (e.g., large discrepancies in profile attributes) and flag them for manual review.
Use Case: When automated processes cannot resolve all cases with high confidence, manual intervention may be required for certain profiles.
Implementation Example: Make sure you execute the code blocks one by one
Description: Combine multiple strategies by assigning confidence levels to each and merging profiles based on the highest combined confidence.
Use Case: Ideal when no single rule can address all cases effectively, allowing a multi-strategy approach to improve resolution accuracy.
Implementation Example: Create multiple confidence scores based on different rules, then aggregate these to determine the final outcome.
Description: Apply fuzzy matching on attributes like CRM_ID or name fields to identify records that are close matches but not exact. This can help clean up cases where variations in identifiers exist.
Use Case: Helpful when there are data entry errors or slight differences in CRM_ID formats that contribute to profile collapses.
Implementation Example: Use Data Distiller to preprocess the data with fuzzy matching techniques, then apply the cleaned values in SQL. There is a detailed tutorial for this located here.
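To give a feel for fuzzy matching on identifiers, the sketch below normalizes CRM_ID strings and compares them with SequenceMatcher from Python's standard difflib module. It is not the Data Distiller technique covered in the linked tutorial; the sample identifiers and the helper names normalize and similarity are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical CRM_ID variants: formatting differences and a digit transposition.
crm_ids = ["B2C-10045", "b2c_10045", "B2C-10054", "B2B-77001"]


def normalize(value):
    """Upper-case the value and drop everything except letters and digits."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


def similarity(a, b):
    """Similarity ratio between two normalized identifiers, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


reference = crm_ids[0]
for candidate in crm_ids[1:]:
    print(candidate, round(similarity(reference, candidate), 3))
```

Formatting-only variants such as "b2c_10045" become identical after normalization, while "B2C-10054" scores high but below 1.0. A high score on an identifier is only a hint: two similar CRM_IDs can belong to different customers, so a threshold should be validated against known matches before it drives any merge.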
Description: Implement auditing rules where profiles with certain characteristics (e.g., frequent logins from different devices) are automatically flagged for reconciliation checks.
Use Case: Ensures ongoing monitoring of data integrity by routinely checking for signs of potential profile collapses.
Implementation Example: Make sure you execute the code blocks one by one
The appropriate strategy will depend on factors such as:
Data Quality: How consistent and accurate is the CRM_ID data?
Business Rules: Are there clear rules for prioritizing certain records over others?
Data Usage: Will all the attributes be needed, or can some be safely discarded?
Combining multiple strategies may also be necessary, such as merging attributes for certain profiles while prioritizing the latest records for others. Testing different approaches on sample datasets can help in choosing the optimal strategy for your specific scenario.
To ensure the cleanup worked, always run a check to see if any ECID still has multiple CRM_IDs:
If this query returns no results, the cleanup was successful.
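The same check can be expressed as a small Python helper, shown here as a sketch rather than the Data Distiller query; the sample data and the function name ecids_with_multiple_crm_ids are hypothetical.

```python
from collections import defaultdict


def ecids_with_multiple_crm_ids(records):
    """Return the ECIDs that are still linked to more than one distinct CRM_ID."""
    crm_ids = defaultdict(set)
    for ecid, crm_id in records:
        crm_ids[ecid].add(crm_id)
    return sorted(ecid for ecid, ids in crm_ids.items() if len(ids) > 1)


# Hypothetical (ECID, CRM_ID) pairs after and before cleanup.
cleaned = [("ecid-1", "B2B-200"), ("ecid-2", "B2C-300")]
original = cleaned + [("ecid-1", "B2C-100")]

print(ecids_with_multiple_crm_ids(cleaned))
print(ecids_with_multiple_crm_ids(original))
```

An empty list for the cleaned data corresponds to the query returning no rows. Running the check on the original data as well confirms that it can actually detect a collapse.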
If the cleaned dataset meets your criteria, you can replace the original dataset with the cleaned version or store it as a new dataset for further processing.