IDR 300: Understanding and Mitigating Profile Collapse in Identity Resolution with Data Distiller
Mastering profile cleanup transforms data chaos into clarity, enabling accurate, unified real-time customer profiles with 15+ algorithms.
Prerequisites
For this tutorial, you will need to ingest the following dataset:
using the steps outlined in:
PREP 500: Ingesting CSV Data into Adobe Experience PlatformUnderstanding Profile Collapse

Here is a recap of the fundamental concepts of Adobe Experience Platform. Data enters the Adobe Experience Platform through edge, streaming, and batch methods. Regardless of the ingestion mode, all data must find its place in a dataset within the Platform Data Lake. The ingestible data falls into three categories: attributes, events, and lookups. The Real-Time Customer Profile operates with two concurrent stores – a Profile Store and an Identity Store. The Profile Store takes in and partitions data based on the storage key, which is the primary identity. Meanwhile, the Identity Store continuously seeks associations in among identities, including the primary one, within the ingested record, utilizing this information to construct the identity graph. These two stores, accessing historical data in the Platform Data Lake, open avenues for advanced modeling, such as RFM, among other techniques.
Profile collapse in Adobe Experience Platform occurs when data from different individuals is mistakenly merged into a single customer profile. This typically happens due to identity management issues or data inconsistencies during the process of data ingestion and identity resolution.
Data can enter Adobe Experience Platform (AEP) via edge, streaming, or batch ingestion methods. Regardless of how it arrives, all data is ultimately stored within the Platform Data Lake and categorized as attributes, events, or lookups. To control how the identity algorithm processes data, you need to prepare the data before it enters the Identity Store.
There are two scenarios for data handling:
Batch Ingestion: In this case, data should be processed by Data Distiller and then ingested into both stores. This approach works seamlessly, as Data Distiller can handle these jobs in either full or incremental processing modes.
Real-Time Ingestion (Edge or Streaming): In this scenario, processing cannot occur immediately because you need to act on data in real time. However, all real-time data is stored in the Data Lake, allowing for periodic identity operations when system downtime is possible, preferably during low-traffic periods. To perform these operations, follow these steps:
Disable the existing dataflow for Dataset A, where real-time data is currently being ingested.
Unmark Dataset A for Profile.
Apply the identity algorithm operations to Dataset A, creating a new Dataset B and marking it for Profile. Ensure Dataset B is empty and marked for Profile before hydrating it.
Set up a new dataflow from the same source to Dataset B. This setup will ensure that both historical and new streaming data are ingested seamlessly.
Shared Devices or Browsers
Example: When multiple users share the same device or browser, the same Experience Cloud ID (ECID) is used for all interactions.
How It Causes Profile Collapse: Since the ECID remains constant across different users, AEP may mistakenly associate all interactions with a single profile, merging data from multiple individuals. For example, if two customers log in using the same device, their distinct CRM IDs may both get linked to the same ECID, leading to a merged profile in the Profile Store.
Identity Fragmentation (Multiple Identities for the Same Individual)
Example: A single user may appear under different identifiers across data sources or channels, such as different CRM IDs for B2B and B2C interactions, different emails, or multiple login credentials.
How It Causes Profile Collapse: When AEP tries to link these identities, it may merge data from different identifiers into a single profile. If the linking is done inconsistently or incorrectly, it can lead to a collapsed profile where data from different roles or personas of the same individual gets combined inappropriately.
Duplicate Records
Example: Poor data quality can lead to duplicate records where the same person is listed multiple times with slight variations in the data.
How It Causes Profile Collapse: If these duplicate records are not accurately distinguished, AEP's identity resolution may treat them as separate initially but later merge them into one profile, resulting in collapsed data..
Cross-Device or Cross-Channel Tracking
Example: Users interact across multiple devices or channels, generating different identifiers (e.g., cookies, mobile IDs, ECIDs).
How It Causes Profile Collapse: If identity resolution does not accurately map these identifiers back to the same person, data from different users who share similar device or channel characteristics may be incorrectly merged, causing profile collapse.
Shared Network Environments (e.g., Public Wi-Fi)
Example: Multiple users on the same network (e.g., public Wi-Fi) can appear to share similar network-based identifiers, such as IP addresses.
How It Causes Profile Collapse: If network-based identifiers are used as part of the identity resolution, this can result in incorrect associations between different users on the same network, leading to merged profiles.
Third-Party Data with Inconsistent Identity Keys
Example: Integrating third-party data with different identity keys (e.g., hashed emails, mobile IDs) can complicate the identity resolution process.
How It Causes Profile Collapse: If the third-party identity keys are not consistently linked to primary identifiers in AEP, data from different individuals may be mistakenly merged.
Different Data Sources with Varying Update Frequencies
Example: Data sources may update at different intervals, such as real-time data streams versus weekly CRM updates.
How It Causes Profile Collapse: If identities are linked or unlinked based on out-of-sync information, the Identity Store may merge profiles incorrectly.
Incorrectly Implemented Identity Resolution Logic in Ingested Data
Example: Errors in the data ingestion process can send incorrect identity mapping information to AEP.
How It Causes Profile Collapse: These mistakes can lead to improper associations between identifiers, causing unrelated profiles to be merged.
Why Profile Collapse Matters
Profile collapse impacts the accuracy and effectiveness of customer data by:
Incorrect Personalization: Merging unrelated data results in irrelevant or misleading personalization, causing customers to receive content or offers that don't apply to them.
Data Quality and Analysis Issues: Aggregated metrics may reflect combined behaviors of different individuals, leading to skewed analysis and inaccurate insights.
Privacy Risks: Mixing data from multiple users can unintentionally expose personal information to the wrong individual.
To minimize the risk of profile collapse, consider:
Refining Identity Matching Rules: Use more precise matching criteria and multiple attributes to accurately resolve identities.
Improving Data Quality: Address duplicate records and inconsistencies at the source before ingestion.
Configuring Identity Graphs Carefully: Ensure identity resolution rules and graph configurations account for different identity sources and their unique characteristics.
Regular Monitoring and Auditing: Implement checks to detect and rectify potential profile collapses.
Querying the Data to Identify Collapsed Profiles
First, we will identify the profiles that are collapsed, based on the presence of different CRM_IDs
for the same ECID
.
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;
This query identifies the ECID
values associated with multiple CRM_IDs
, indicating potential collapsed profiles.

Mitigating Profile Collapse Strategies
When cleaning datasets to resolve collapsed profiles caused by multiple CRM_IDs linked to the same ECID, there are various rule-based strategies that can be employed. The choice of strategy depends on business requirements and the characteristics of the data. Here are some possible rule-based strategies to resolve such profile collapses:
Keeping the Latest Record
Description: This strategy focuses on retaining the most recent record for each ECID
by using the Login_Timestamp
to identify the latest activity.
Use Case: This approach is valuable when the latest activity is considered the most accurate or relevant representation of a user profile. By keeping only the most recent data, you ensure that the profile reflects the most current user information, excluding outdated or redundant records that may no longer be valid.
Implementation Example: Execute the following code blocks sequentially to implement this method, making sure to run each step individually to maintain data integrity:
DROP TEMP TABLE IF EXISTS cleaned_dataset;
-- Latest Record Algorithm
CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset a
WHERE Login_Timestamp = (
SELECT MAX(Login_Timestamp)
FROM example_dataset b
WHERE a.ECID = b.ECID
);
--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;
Result is not shown as it should return no rows as shown below:

In the algorithms below, we won’t display the screen with empty rows, but if the algorithms function correctly, a screen with no rows returned as shown above is the expected outcome.
Keeping the Earliest Record
Description: This strategy retains only the earliest record for each
ECID
, determined by theLogin_Timestamp
.Use Case: This approach is useful when the initial identification or first interaction with a profile is considered the most reliable, ensuring that any subsequent, potentially inconsistent updates do not affect the profile’s original data.
Implementation Example: Execute each code block one at a time to avoid conflicts and ensure data integrity:
DROP TEMP TABLE IF EXISTS cleaned_dataset;
-- Earliest Record Algorithm
CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset a
WHERE Login_Timestamp = (
SELECT MIN(Login_Timestamp)
FROM example_dataset b
WHERE a.ECID = b.ECID
);
--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;
Prioritizing a Specific CRM_ID Type
Description: This strategy retains records based on a preferred
CRM_ID
type, giving priority to a specific type (e.g., always keepingB2C
overB2B
).Use Case: This approach is helpful when certain customer relationship types are more important for analysis or operations. For example, retail customers (
B2C
) may be prioritized over business customers (B2B
) to focus on individual consumer behavior in retail environments.Implementation Example: Execute each code block sequentially to ensure the correct data is retained without conflicts:
DROP TEMP TABLE IF EXISTS cleaned_dataset;
-- Prioritizing a Specific Identity Type
CREATE TABLE cleaned_dataset AS
SELECT *
FROM example_dataset
WHERE CRM_ID LIKE 'B2C_%';
--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;
Resolving Based on a Scoring System
Description: Apply a scoring system to rank profiles by various attributes (such as login recency, interaction count, or device type), then retain the highest-scoring record for each
ECID
.Use Case: Ideal when determining the "best" record requires multiple criteria, offering a more refined method for resolving collapsed profiles.
Implementation Example: Execute each code block in sequence to ensure accurate and conflict-free data processing:
DROP TEMP TABLE IF EXISTS cleaned_dataset;
CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Score DESC) AS rn
FROM (
SELECT *,
-- Scoring logic: more recent logins and 'Desktop' device type score higher
CASE
WHEN Device_Type = 'Desktop' THEN 10 ELSE 5 END +
DATEDIFF(day, '2024-10-01', Login_Timestamp) AS Score
FROM example_dataset
) scored_data
) ranked_data
WHERE rn = 1;
--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;
Merging Attributes from Collapsed Records
Description: Instead of discarding any records, merge the attributes from all records associated with the same
ECID
. This could involve creating lists of values, aggregating metrics, or applying other transformation rules.Use Case: Suitable when all available information is valuable and should be preserved, such as combining multiple phone numbers or email addresses associated with a profile.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE cleaned_dataset AS SELECT ECID, STRING_AGG(CRM_ID, ', ') AS CRM_ID, MAX(Login_Timestamp) AS Last_Login, COUNT(DISTINCT Device_Type) AS Unique_Device_Count FROM example_dataset GROUP BY ECID; --Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
Keeping a Combination of the Latest Record by Type
Description: If an
ECID
is associated with multipleCRM_ID
types, keep the latest entry for each type.Use Case: Useful in scenarios where having both B2B and B2C relationships is important for the same user, and the latest activity for each type is relevant.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE cleaned_dataset AS SELECT ECID, CRM_ID, Device_Type, Login_Timestamp FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY CASE WHEN SUBSTRING(CRM_ID, 1, 3) = 'B2C' THEN 1 ELSE 2 END, Login_Timestamp DESC) AS rn FROM example_dataset ) filtered_data WHERE rn = 1; -- Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
Removing Profiles with Ambiguous Identity Mapping
Description: If an
ECID
is linked to differentCRM_ID
types that cannot be resolved through any of the above rules, these profiles can be flagged or removed for manual review.Use Case: Suitable for cases where automation cannot confidently resolve the conflicts, requiring human oversight.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; -- Step 1: Identify ambiguous profiles (those with multiple CRM_IDs per ECID) CREATE TEMP TABLE ambiguous_profiles AS SELECT ECID FROM example_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1; -- Step 2: Create the cleaned dataset by excluding ambiguous profiles CREATE TEMP TABLE cleaned_dataset AS SELECT * FROM example_dataset WHERE ECID NOT IN (SELECT ECID FROM ambiguous_profiles); -- Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
Utilizing External Reference Data for Resolution
Description: Reference an external dataset, such as a master customer list, to resolve which
CRM_ID
should be considered valid.Use Case: This approach is useful when a trusted external source can guide the resolution, especially when there are well-maintained external records.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE cleaned_dataset AS SELECT a.* FROM example_dataset a JOIN master_customer_list b ON a.CRM_ID = b.Valid_CRM_ID; --Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
Using Confidence Scores for Identity Resolution
Description: Assign confidence scores to each
CRM_ID
association based on various factors, such as the number of interactions, consistency of login information, or verification level. Higher confidence scores indicate stronger associations.Use Case: Useful when multiple identifiers are present, and some are more reliable than others. This strategy helps prioritize the most trusted associations.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE cleaned_dataset AS SELECT ECID, CRM_ID, Device_Type, Login_Timestamp FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Confidence_Score DESC) AS rn FROM ( SELECT *, -- Example confidence scoring based on frequency and consistency COUNT(*) * CASE WHEN Device_Type = 'Desktop' THEN 1.5 ELSE 1 END AS Confidence_Score FROM example_dataset GROUP BY ECID, CRM_ID, Device_Type, Login_Timestamp ) scored_data ) ranked_data WHERE rn = 1; --Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
This query creates a cleaned dataset by selecting the highest-confidence record for each unique ECID
, effectively reducing profile overlap and potential data duplication. It starts by calculating a Confidence_Score
for each record in the example_dataset
based on the frequency of occurrences and device type, with desktop interactions receiving a higher weight. Next, it ranks each record within each ECID
group according to this confidence score, assigning the top rank to the highest-scoring entry. Finally, the query filters out only the top-ranked record for each ECID
, ensuring that each profile is represented by a single, most reliable entry. This approach helps to minimize profile collapse by excluding lower-confidence records and prioritizing the most relevant data for each unique customer.
Time-Based Segmentation
Description: Retain records based on a time-based rule, such as keeping data within a specific time window (e.g., last six months). Older data can be archived or discarded.
Use Case: Suitable when the most recent data is more relevant than historical data, or when you want to reduce data volume while keeping the latest information.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE cleaned_dataset AS SELECT ECID, CRM_ID, Device_Type, Login_Timestamp FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Login_Timestamp DESC) AS rn FROM example_dataset WHERE Login_Timestamp >= DATEADD(month, -12, CURRENT_DATE) ) ranked_data WHERE rn = 1; -- Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
Combining Data Using Aggregation Rules
Description: Aggregate certain fields for profiles that share the same
ECID
, combining data using aggregation functions (e.g., summing transaction counts, taking the latest email address).Use Case: Useful for combining behavioral data while still allowing for unique identifiers to coexist within a merged profile.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE aggregated_dataset AS SELECT ECID, MAX(Login_Timestamp) AS Last_Login, COUNT(*) AS Interaction_Count, STRING_AGG(DISTINCT Device_Type, ', ') AS Device_Types FROM example_dataset GROUP BY ECID; --Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
Applying Domain-Specific Rules
Description: Use business-specific rules to decide which records to keep. For example, prioritize records from specific CRM systems (e.g., always keep records from a particular system if multiple sources are integrated).
Use Case: Effective in organizations with a well-defined hierarchy of data sources, where some data sources are known to be more reliable than others.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE cleaned_dataset AS SELECT * FROM example_dataset WHERE CRM_ID IN ( SELECT CRM_ID FROM trusted_sources --make sure you have a trusted source table ); --Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
Creating Composite Identifiers for Resolution
Description: Combine multiple fields (e.g.,
ECID
+Device_Type
) to create a composite identifier for deduplication. This strategy adds additional granularity to the process.Use Case: Helpful in distinguishing between profiles that share the same
ECID
but differ in other aspects like device type or location.Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE cleaned_dataset AS SELECT ECID, CRM_ID, Device_Type, Login_Timestamp FROM ( SELECT ECID, CRM_ID, Device_Type, Login_Timestamp, ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Login_Timestamp DESC) AS final_rn FROM ( -- First deduplication by ECID and Device_Type SELECT ECID, CRM_ID, Device_Type, Login_Timestamp, ROW_NUMBER() OVER (PARTITION BY ECID, Device_Type ORDER BY Login_Timestamp DESC) AS device_rn FROM example_dataset ) deduped_by_device WHERE device_rn = 1 ) final_deduped WHERE final_rn = 1; -- Test to see if there is profile collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1;
The above query aims to clean up a dataset to prevent profile collapse by ensuring that each ECID
(a unique customer identifier) is represented by only one record, even if there are multiple entries across different devices or CRM IDs. This is achieved through a two-step deduplication process. In the first step, we use a subquery to partition the data by both ECID
and Device_Type
, ordering by Login_Timestamp
in descending order to keep only the most recent record for each unique combination of ECID
and Device_Type
. This intermediate result ensures that for each ECID
, only the latest activity per device type is retained. In the second step, an outer query applies another ROW_NUMBER()
partitioned solely by ECID
, ordering again by Login_Timestamp
in descending order. By selecting only the top-ranked record (final_rn = 1
), the query retains only the latest entry per ECID
across all device types. This two-layered approach effectively removes duplicate entries and prevents profile collapse by consolidating each ECID
into a single, most recent record, providing a clean and unified dataset.
Clustering Techniques for Grouping Similar Profiles
Description: Use clustering algorithms (e.g., k-means, hierarchical clustering) to group similar profiles and resolve which ones are likely to represent the same individual.
Use Case: Useful when there are subtle differences in data attributes across profiles, and statistical methods can help identify groups that should be merged.
Implementation Example: This approach would be executed inside Data Distiller Advanced Statistics and Machine Learning features.
To address profile collapse, our clustering should identify profiles that are likely to belong to the same individual based on shared or similar identifiers. Here’s an approach focused on preventing profile collapse:
Device ID Consistency:
Count the number of unique
Device_Type
values associated with eachECID
andCRM_ID
.Profiles with a wide variety of device types may indicate multiple users on shared devices, which could contribute to profile collapse.
device_type_variety
: Counts the distinctDevice_Type
s used by eachECID
. High variety here may indicate a shared device or cross-identity usage.
Shared CRM Identifiers:
Calculate the frequency of each
CRM_ID
perECID
. High frequencies might indicate cases where multiple profiles with the sameCRM_ID
are collapsed into one.A high count of distinct
CRM_ID
s perECID
suggests possible identity collisions (e.g., a B2B and a B2CCRM_ID
on the sameECID
).crm_id_count
: Counts distinctCRM_ID
s perECID
, helping to detect cases where multiple identities are merged into one, indicating a potential profile collapse.
Login Recency:
While not as direct, recency can still help detect anomalies where profile collapse might have caused inactive or mismatched data.
login_recency
: While not a direct indicator, it can provide context for activity levels, which may be useful if certain collapsed profiles appear inactive compared to expected activity.
By clustering based on these features, you’re more likely to identify clusters where ECID
s are artificially merged due to overlapping CRM_ID
s or device types. Profiles within the same cluster that have multiple CRM_ID
s could be flagged for further review or disambiguation, effectively isolating cases where profile collapse is likely occurring.
-- Upper threshold on the number of clusters
SELECT COUNT(DISTINCT ECID) FROM example_dataset;
-- Create the K-Means Model
CREATE OR REPLACE MODEL profile_collapse_clustering
TRANSFORM(vector_assembler(array(device_type_variety, crm_id_count, login_recency)) AS features)
OPTIONS (
MODEL_TYPE = 'KMEANS',
NUM_CLUSTERS = 1000, -- Adjust the number of clusters based on dataset size
MAX_ITER = 20 -- Set maximum iterations
)
AS
SELECT
device_type_variety,
crm_id_count,
login_recency
FROM
(SELECT
ECID,
COUNT(DISTINCT Device_Type) AS device_type_variety,
COUNT(DISTINCT CRM_ID) AS crm_id_count,
DATEDIFF(day, MAX(Login_Timestamp), CURRENT_DATE) AS login_recency
FROM example_dataset
GROUP BY ECID);
-- Create a table to store the clusters and predictions for detecting profile collapse
CREATE TABLE IF NOT EXISTS ecid_cluster AS
SELECT *
FROM MODEL_PREDICT(profile_collapse_clustering, 1, (
SELECT
ECID,
COUNT(DISTINCT Device_Type) AS device_type_variety, -- Number of unique devices per ECID
COUNT(DISTINCT CRM_ID) AS crm_id_count, -- Number of unique CRM_IDs per ECID
DATEDIFF(day, MAX(Login_Timestamp), CURRENT_DATE) AS login_recency -- Recency of last login
FROM example_dataset
GROUP BY ECID
));
-- Read the values from ecid_cluster
SELECT * FROM ecid_cluster;


Machine Learning-Based Resolution
Description: Train a machine learning model (e.g., logistic regression, decision trees) to predict whether two profiles should be merged based on features like similarity of CRM_IDs, login patterns, or device types.
Use Case: Ideal for complex datasets where simple rules don't capture the nuances, and historical labeled data is available to train a model.
Implementation Example: Use a machine learning framework to train the model, then apply the model predictions in SQL to clean up the dataset.
Assume that:
We have historical data with labeled pairs of
ECID
s and their features (CRM_ID similarity
,device type overlap
,login frequency similarity
, etc.).We use logistic regression for binary classification to predict “merge” or “do not merge” for each profile pair.
Let us first generate the feature engineering dataset:
-- Step 1: Calculate login counts per ECID
CREATE TEMP TABLE profile_logins AS
SELECT
ECID,
CRM_ID,
Device_Type,
Login_Timestamp,
COUNT(*) OVER (PARTITION BY ECID) AS login_count
FROM example_dataset;
-- Step 2: Generate labeled pairs by self-joining on ECID pairs
CREATE TEMP TABLE labeled_pairs AS
SELECT
p1.ECID AS ECID1,
p2.ECID AS ECID2,
-- Feature: CRM_ID Similarity (1 if similar, 0 if not)
CASE WHEN p1.CRM_ID = p2.CRM_ID THEN 1 ELSE 0 END AS crm_id_similarity,
-- Feature: Device Type Overlap (1 if similar, 0 if not)
CASE WHEN p1.Device_Type = p2.Device_Type THEN 1 ELSE 0 END AS device_type_overlap,
-- Feature: Login Frequency Similarity (absolute difference)
ABS(p1.login_count - p2.login_count) AS login_frequency_diff,
-- Example Merge Label (1 if CRM_IDs match, 0 otherwise)
CASE WHEN p1.CRM_ID = p2.CRM_ID THEN 1 ELSE 0 END AS merge_label
FROM profile_logins p1
JOIN profile_logins p2
ON p1.ECID < p2.ECID -- Ensure unique pairs and avoid self-pairing
ORDER BY p1.ECID, p2.ECID;
-- Step 3: Check the result
SELECT * FROM labeled_pairs LIMIT 10;
Train the logistics regression model:
CREATE OR REPLACE MODEL profile_merge_model
TRANSFORM(vector_assembler(array(crm_id_similarity, device_type_overlap, login_frequency_diff)) AS features)
OPTIONS (
MODEL_TYPE = 'LOGISTIC_REG',
LABEL = 'merge_label' -- Logistic Regression for binary classification
) AS
SELECT
crm_id_similarity,
device_type_overlap,
login_frequency_diff,
merge_label
FROM labeled_pairs;
Manual Review for High-Risk Collapses
Description: Identify high-risk profile collapses (e.g., large discrepancies in profile attributes) and flag them for manual review.
Use Case: When automated processes cannot resolve all cases with high confidence, manual intervention may be required for certain profiles.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; -- Step 1: Test for Profile Collapse SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count FROM cleaned_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 1; -- Step 2: If profile collapse is detected, create a table of affected ECIDs CREATE TEMP TABLE collapsed_profiles AS SELECT ECID FROM example_dataset GROUP BY ECID HAVING COUNT(DISTINCT CRM_ID) > 2; -- Step 3: Compile records needing manual review based on identified ECIDs CREATE TEMP TABLE manual_review AS SELECT * FROM example_dataset WHERE ECID IN (SELECT ECID FROM collapsed_profiles);
Confidence-Based Merging of Multiple Strategies
Description: Combine multiple strategies by assigning confidence levels to each and merging profiles based on the highest combined confidence.
Use Case: Ideal when no single rule can address all cases effectively, allowing a multi-strategy approach to improve resolution accuracy.
Implementation Example: Create multiple confidence scores based on different rules, then aggregate these to determine the final outcome.
Fuzzy Matching Techniques for Inexact Data
Description: Apply fuzzy matching on attributes like
CRM_ID
or name fields to identify records that are close matches but not exact. This can help clean up cases where variations in identifiers exist.Use Case: Helpful when there are data entry errors or slight differences in CRM_ID formats that contribute to profile collapses.
Implementation Example: Use Data Distiller preprocess the data with fuzzy matching techniques and then apply cleaned values in SQL. There is a detailed tutorial for this located here.
Audit-Based Reconciliation
Description: Implement auditing rules where profiles with certain characteristics (e.g., frequent logins from different devices) are automatically flagged for reconciliation checks.
Use Case: Ensures ongoing monitoring of data integrity by routinely checking for signs of potential profile collapses.
Implementation Example: Make sure you execute the code blocks one by one
DROP TEMP TABLE IF EXISTS cleaned_dataset; CREATE TEMP TABLE flagged_for_audit AS SELECT ECID, COUNT(DISTINCT Device_Type) AS Distinct_Device_Count FROM example_dataset GROUP BY ECID HAVING Distinct_Device_Count > 3;
Choosing the Right Strategy
The appropriate strategy will depend on factors such as:
Data Quality: How consistent and accurate is the CRM_ID data?
Business Rules: Are there clear rules for prioritizing certain records over others?
Data Usage: Will all the attributes be needed, or can some be safely discarded?
Combining multiple strategies may also be necessary, such as merging attributes for certain profiles while prioritizing the latest records for others. Testing different approaches on sample datasets can help in choosing the optimal strategy for your specific scenario
Validating the Cleanup
To ensure the cleanup worked, always run a check to see if any ECID
still has multiple CRM_IDs
:
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;
If this query returns no results, the cleanup was successful.
Finalizing the Derived Dataset
If the cleaned dataset meets your criteria, you can replace the original dataset with the cleaned version or store it as a new dataset for further processing.
Last updated