IDR 300: Understanding and Mitigating Profile Collapse in Identity Resolution with Data Distiller

Mastering profile cleanup transforms data chaos into clarity, enabling accurate, unified real-time customer profiles with 15+ algorithms.

Prerequisites

For this tutorial, you will need to ingest the following dataset:

404KB

example_dataset.csv

using the steps outlined in:

PREP 500: Ingesting CSV Data into Adobe Experience Platform

Understanding Profile Collapse

Here is a recap of the fundamental concepts of Adobe Experience Platform. Data enters the Adobe Experience Platform through edge, streaming, and batch methods. Regardless of the ingestion mode, all data must find its place in a dataset within the Platform Data Lake. The ingestible data falls into three categories: attributes, events, and lookups. The Real-Time Customer Profile operates with two concurrent stores – a Profile Store and an Identity Store. The Profile Store takes in and partitions data based on the storage key, which is the primary identity. Meanwhile, the Identity Store continuously seeks associations in among identities, including the primary one, within the ingested record, utilizing this information to construct the identity graph. These two stores, accessing historical data in the Platform Data Lake, open avenues for advanced modeling, such as RFM, among other techniques.

Deterministic and probabilistic identity resolution are two methods used to match and merge customer identities across data sources, each with its own trade-offs.

Deterministic identity resolution relies on exact matches of unique identifiers, such as email addresses, phone numbers, or customer IDs. This approach is highly accurate when the data is consistent, ensuring a precise link between records. However, its limitation lies in the need for identical identifiers across systems, which can reduce match rates when variations in data exist. Adobe Experience Platform uses deterministic identity resolution.

On the other hand, probabilistic identity resolution uses statistical algorithms and data attributes (e.g., name, location, browsing behavior) to calculate the likelihood that different records belong to the same individual. While this approach increases the chances of finding connections even when identifiers differ, it introduces some uncertainty, as matches are based on probability rather than certainty.

Profile collapse in Adobe Experience Platform occurs when data from different individuals is mistakenly merged into a single customer profile. This typically happens due to identity management issues or data inconsistencies during the process of data ingestion and identity resolution.

Data can enter Adobe Experience Platform (AEP) via edge, streaming, or batch ingestion methods. Regardless of how it arrives, all data is ultimately stored within the Platform Data Lake and categorized as attributes, events, or lookups. To control how the identity algorithm processes data, you need to prepare the data before it enters the Identity Store.

There are two scenarios for data handling:

Batch Ingestion: In this case, data should be processed by Data Distiller and then ingested into both stores. This approach works seamlessly, as Data Distiller can handle these jobs in either full or incremental processing modes.
Real-Time Ingestion (Edge or Streaming): In this scenario, processing cannot occur immediately because you need to act on data in real time. However, all real-time data is stored in the Data Lake, allowing for periodic identity operations when system downtime is possible, preferably during low-traffic periods. To perform these operations, follow these steps:
- Disable the existing dataflow for Dataset A, where real-time data is currently being ingested.
- Unmark Dataset A for Profile.
- Apply the identity algorithm operations to Dataset A, creating a new Dataset B and marking it for Profile. Ensure Dataset B is empty and marked for Profile before hydrating it.
- Set up a new dataflow from the same source to Dataset B. This setup will ensure that both historical and new streaming data are ingested seamlessly.

Shared Devices or Browsers

Example: When multiple users share the same device or browser, the same Experience Cloud ID (ECID) is used for all interactions.
How It Causes Profile Collapse: Since the ECID remains constant across different users, AEP may mistakenly associate all interactions with a single profile, merging data from multiple individuals. For example, if two customers log in using the same device, their distinct CRM IDs may both get linked to the same ECID, leading to a merged profile in the Profile Store.

Identity Fragmentation (Multiple Identities for the Same Individual)

Example: A single user may appear under different identifiers across data sources or channels, such as different CRM IDs for B2B and B2C interactions, different emails, or multiple login credentials.
How It Causes Profile Collapse: When AEP tries to link these identities, it may merge data from different identifiers into a single profile. If the linking is done inconsistently or incorrectly, it can lead to a collapsed profile where data from different roles or personas of the same individual gets combined inappropriately.

Duplicate Records

Example: Poor data quality can lead to duplicate records where the same person is listed multiple times with slight variations in the data.
How It Causes Profile Collapse: If these duplicate records are not accurately distinguished, AEP's identity resolution may treat them as separate initially but later merge them into one profile, resulting in collapsed data..

Cross-Device or Cross-Channel Tracking

Example: Users interact across multiple devices or channels, generating different identifiers (e.g., cookies, mobile IDs, ECIDs).
How It Causes Profile Collapse: If identity resolution does not accurately map these identifiers back to the same person, data from different users who share similar device or channel characteristics may be incorrectly merged, causing profile collapse.

Shared Network Environments (e.g., Public Wi-Fi)

Example: Multiple users on the same network (e.g., public Wi-Fi) can appear to share similar network-based identifiers, such as IP addresses.
How It Causes Profile Collapse: If network-based identifiers are used as part of the identity resolution, this can result in incorrect associations between different users on the same network, leading to merged profiles.

Using IP addresses as identifiers in profile resolution can be problematic because they do not uniquely represent individual users. Here are several reasons why IP addresses can lead to inaccurate identity resolution and profile collapse:

Shared Networks: Multiple users often share the same IP address when they are on the same network, such as public Wi-Fi in a coffee shop, office network, or a household router. In these scenarios, an IP-based identifier may incorrectly group different users as one, resulting in profile collapse.
Dynamic IP Addresses: Many internet service providers (ISPs) assign dynamic IP addresses, meaning a user’s IP can change over time. For instance, when a user reconnects to the internet or moves between networks, they might receive a new IP address. This variability can lead to fragmented profiles for the same user or incorrect matches with other users.
Proxy Servers and VPNs: Users who access the internet via a proxy server or a virtual private network (VPN) can share the same IP address across multiple devices and locations. This introduces further ambiguity, as users from entirely different networks can appear to have identical IP addresses, complicating identity resolution.
Geographic Misinterpretations: IP addresses can sometimes be misleading in terms of location, especially with mobile carriers or large ISPs that route traffic through different hubs. Users may appear to be connecting from one location even if they are in a completely different one, which can skew profile data and misrepresent user behavior.
IP Address Rotation and NAT: Many large networks, especially in enterprise or cellular environments, use Network Address Translation (NAT), allowing multiple devices to share a single public IP address. In these cases, hundreds or thousands of users may appear to be connecting from the same IP, making it an unreliable identifier for individual profiles.

Third-Party Data with Inconsistent Identity Keys

Example: Integrating third-party data with different identity keys (e.g., hashed emails, mobile IDs) can complicate the identity resolution process.
How It Causes Profile Collapse: If the third-party identity keys are not consistently linked to primary identifiers in AEP, data from different individuals may be mistakenly merged.

Keeping prospect data separate from existing customer data in Adobe Experience Platform is crucial for several reasons, primarily around data accuracy, compliance, and targeted engagement. Here’s a breakdown of why this separation is important:

Different Data Quality and Attributes: Prospect data often lacks the depth and reliability of existing customer data, as it may come from third-party sources, form fills, or inferred behavior. Mixing it with verified customer data can dilute the accuracy of customer profiles and lead to mistaken assumptions or merges.
Targeted Engagement Strategies: Prospects and existing customers are at different stages in the customer journey and require distinct engagement strategies. Keeping them separate allows businesses to personalize messaging based on the user’s status—whether they are a potential customer needing nurturing or an existing customer who may benefit from cross-selling or loyalty programs.
Compliance and Privacy Requirements: Privacy regulations often impose stricter handling requirements for prospect data, as it may include inferred interests or demographics without direct consent. By isolating prospect data, you can manage it according to specific data handling policies, reducing the risk of inadvertently using unconsented data in customer-specific actions or analyses.
Avoiding Profile Collapse and Data Pollution: If prospect data is integrated with customer data prematurely, it increases the likelihood of profile collapse, where profiles may merge incorrectly due to weak or partial identifiers. This mixing can lead to inaccurate profiling, mistaken identity resolution, and ineffective targeting. Keeping the data separate helps maintain clean, verified customer records.
Flexible and Scalable Identity Resolution: By separating prospects, organizations can apply tailored identity resolution processes, especially if the prospect data has a different set of identifiers or is less frequently updated. This approach ensures that identity resolution can be scaled and adjusted for prospects without impacting customer data accuracy.

Different Data Sources with Varying Update Frequencies

Example: Data sources may update at different intervals, such as real-time data streams versus weekly CRM updates.
How It Causes Profile Collapse: If identities are linked or unlinked based on out-of-sync information, the Identity Store may merge profiles incorrectly.

Incorrectly Implemented Identity Resolution Logic in Ingested Data

Example: Errors in the data ingestion process can send incorrect identity mapping information to AEP.
How It Causes Profile Collapse: These mistakes can lead to improper associations between identifiers, causing unrelated profiles to be merged.

Why Profile Collapse Matters

Profile collapse impacts the accuracy and effectiveness of customer data by:

Incorrect Personalization: Merging unrelated data results in irrelevant or misleading personalization, causing customers to receive content or offers that don't apply to them.
Data Quality and Analysis Issues: Aggregated metrics may reflect combined behaviors of different individuals, leading to skewed analysis and inaccurate insights.
Privacy Risks: Mixing data from multiple users can unintentionally expose personal information to the wrong individual.

To minimize the risk of profile collapse, consider:

Refining Identity Matching Rules: Use more precise matching criteria and multiple attributes to accurately resolve identities.
Improving Data Quality: Address duplicate records and inconsistencies at the source before ingestion.
Configuring Identity Graphs Carefully: Ensure identity resolution rules and graph configurations account for different identity sources and their unique characteristics.
Regular Monitoring and Auditing: Implement checks to detect and rectify potential profile collapses.

In the context of profile collapse, the trade-off between deterministic and probabilistic identity resolution revolves around balancing accuracy and coverage.

Deterministic methods offer high accuracy by linking profiles only when there are exact matches on unique identifiers, reducing the risk of mistakenly merging different individuals but potentially leaving some profiles fragmented if identifiers are inconsistent or missing.

On the other hand, probabilistic methods expand coverage by using statistical algorithms to match records based on similarities in behaviors, patterns, or non-unique identifiers. While this approach can unify more profiles even when exact matches are unavailable, it also increases the likelihood of false positives, leading to profile collapse where data from different individuals is mistakenly combined.

To mitigate this, typically a hybrid approach can be used, starting with deterministic matches and then leveraging probabilistic methods to fill in gaps, aiming to balance the need for comprehensive identity resolution with the risk of inaccurate profile merging.

Querying the Data to Identify Collapsed Profiles

First, we will identify the profiles that are collapsed, based on the presence of different CRM_IDs for the same ECID.

SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

This query identifies the ECID values associated with multiple CRM_IDs, indicating potential collapsed profiles.

Mitigating Profile Collapse Strategies

When cleaning datasets to resolve collapsed profiles caused by multiple CRM_IDs linked to the same ECID, there are various rule-based strategies that can be employed. The choice of strategy depends on business requirements and the characteristics of the data. Here are some possible rule-based strategies to resolve such profile collapses:

Keeping the Latest Record

Description: This strategy focuses on retaining the most recent record for each ECID by using the Login_Timestamp to identify the latest activity.

Use Case: This approach is valuable when the latest activity is considered the most accurate or relevant representation of a user profile. By keeping only the most recent data, you ensure that the profile reflects the most current user information, excluding outdated or redundant records that may no longer be valid.

Implementation Example: Execute the following code blocks sequentially to implement this method, making sure to run each step individually to maintain data integrity:

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Latest Record Algorithm
CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset a
WHERE Login_Timestamp = (
    SELECT MAX(Login_Timestamp)
    FROM example_dataset b
    WHERE a.ECID = b.ECID
);

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Result is not shown as it should return no rows as shown below:

In the algorithms below, we won’t display the screen with empty rows, but if the algorithms function correctly, a screen with no rows returned as shown above is the expected outcome.

Keeping the Earliest Record

Description: This strategy retains only the earliest record for each ECID, determined by the Login_Timestamp.
Use Case: This approach is useful when the initial identification or first interaction with a profile is considered the most reliable, ensuring that any subsequent, potentially inconsistent updates do not affect the profile’s original data.
Implementation Example: Execute each code block one at a time to avoid conflicts and ensure data integrity:

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Earliest Record Algorithm
CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset a
WHERE Login_Timestamp = (
    SELECT MIN(Login_Timestamp)
    FROM example_dataset b
    WHERE a.ECID = b.ECID
);

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Prioritizing a Specific CRM_ID Type

Description: This strategy retains records based on a preferred CRM_ID type, giving priority to a specific type (e.g., always keeping B2C over B2B).
Use Case: This approach is helpful when certain customer relationship types are more important for analysis or operations. For example, retail customers (B2C) may be prioritized over business customers (B2B) to focus on individual consumer behavior in retail environments.
Implementation Example: Execute each code block sequentially to ensure the correct data is retained without conflicts:

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Prioritizing a Specific Identity Type
CREATE TABLE cleaned_dataset AS
SELECT *
FROM example_dataset
WHERE CRM_ID LIKE 'B2C_%';

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Resolving Based on a Scoring System

Description: Apply a scoring system to rank profiles by various attributes (such as login recency, interaction count, or device type), then retain the highest-scoring record for each ECID.
Use Case: Ideal when determining the "best" record requires multiple criteria, offering a more refined method for resolving collapsed profiles.
Implementation Example: Execute each code block in sequence to ensure accurate and conflict-free data processing:

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *, 
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Score DESC) AS rn
    FROM (
        SELECT *, 
               -- Scoring logic: more recent logins and 'Desktop' device type score higher
               CASE 
                   WHEN Device_Type = 'Desktop' THEN 10 ELSE 5 END + 
                   DATEDIFF(day, '2024-10-01', Login_Timestamp) AS Score
        FROM example_dataset
    ) scored_data
) ranked_data
WHERE rn = 1;

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Merging Attributes from Collapsed Records

Description: Instead of discarding any records, merge the attributes from all records associated with the same ECID. This could involve creating lists of values, aggregating metrics, or applying other transformation rules.
Use Case: Suitable when all available information is valuable and should be preserved, such as combining multiple phone numbers or email addresses associated with a profile.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, 
       STRING_AGG(CRM_ID, ', ') AS CRM_ID,
       MAX(Login_Timestamp) AS Last_Login,
       COUNT(DISTINCT Device_Type) AS Unique_Device_Count
FROM example_dataset
GROUP BY ECID;

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Keeping a Combination of the Latest Record by Type

Description: If an ECID is associated with multiple CRM_ID types, keep the latest entry for each type.
Use Case: Useful in scenarios where having both B2B and B2C relationships is important for the same user, and the latest activity for each type is relevant.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *, 
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY 
                CASE 
                    WHEN SUBSTRING(CRM_ID, 1, 3) = 'B2C' THEN 1
                    ELSE 2
                END, 
                Login_Timestamp DESC) AS rn
    FROM example_dataset
) filtered_data
WHERE rn = 1;

-- Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Removing Profiles with Ambiguous Identity Mapping

Description: If an ECID is linked to different CRM_ID types that cannot be resolved through any of the above rules, these profiles can be flagged or removed for manual review.
Use Case: Suitable for cases where automation cannot confidently resolve the conflicts, requiring human oversight.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Step 1: Identify ambiguous profiles (those with multiple CRM_IDs per ECID)
CREATE TEMP TABLE ambiguous_profiles AS
SELECT ECID
FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

-- Step 2: Create the cleaned dataset by excluding ambiguous profiles
CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset
WHERE ECID NOT IN (SELECT ECID FROM ambiguous_profiles);

-- Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Utilizing External Reference Data for Resolution

Description: Reference an external dataset, such as a master customer list, to resolve which CRM_ID should be considered valid.
Use Case: This approach is useful when a trusted external source can guide the resolution, especially when there are well-maintained external records.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT a.*
FROM example_dataset a
JOIN master_customer_list b ON a.CRM_ID = b.Valid_CRM_ID;

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Using Confidence Scores for Identity Resolution

Description: Assign confidence scores to each CRM_ID association based on various factors, such as the number of interactions, consistency of login information, or verification level. Higher confidence scores indicate stronger associations.
Use Case: Useful when multiple identifiers are present, and some are more reliable than others. This strategy helps prioritize the most trusted associations.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *, 
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Confidence_Score DESC) AS rn
    FROM (
        SELECT *, 
               -- Example confidence scoring based on frequency and consistency
               COUNT(*) * 
               CASE 
                   WHEN Device_Type = 'Desktop' THEN 1.5 ELSE 1 END AS Confidence_Score
        FROM example_dataset
        GROUP BY ECID, CRM_ID, Device_Type, Login_Timestamp
    ) scored_data
) ranked_data
WHERE rn = 1;

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

This query creates a cleaned dataset by selecting the highest-confidence record for each unique ECID, effectively reducing profile overlap and potential data duplication. It starts by calculating a Confidence_Score for each record in the example_dataset based on the frequency of occurrences and device type, with desktop interactions receiving a higher weight. Next, it ranks each record within each ECID group according to this confidence score, assigning the top rank to the highest-scoring entry. Finally, the query filters out only the top-ranked record for each ECID, ensuring that each profile is represented by a single, most reliable entry. This approach helps to minimize profile collapse by excluding lower-confidence records and prioritizing the most relevant data for each unique customer.

Time-Based Segmentation

Description: Retain records based on a time-based rule, such as keeping data within a specific time window (e.g., last six months). Older data can be archived or discarded.
Use Case: Suitable when the most recent data is more relevant than historical data, or when you want to reduce data volume while keeping the latest information.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Login_Timestamp DESC) AS rn
    FROM example_dataset
    WHERE Login_Timestamp >= DATEADD(month, -12, CURRENT_DATE)
) ranked_data
WHERE rn = 1;

-- Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Combining Data Using Aggregation Rules

Description: Aggregate certain fields for profiles that share the same ECID, combining data using aggregation functions (e.g., summing transaction counts, taking the latest email address).
Use Case: Useful for combining behavioral data while still allowing for unique identifiers to coexist within a merged profile.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE aggregated_dataset AS
SELECT ECID, 
       MAX(Login_Timestamp) AS Last_Login,
       COUNT(*) AS Interaction_Count,
       STRING_AGG(DISTINCT Device_Type, ', ') AS Device_Types
FROM example_dataset
GROUP BY ECID;

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Applying Domain-Specific Rules

Description: Use business-specific rules to decide which records to keep. For example, prioritize records from specific CRM systems (e.g., always keep records from a particular system if multiple sources are integrated).
Use Case: Effective in organizations with a well-defined hierarchy of data sources, where some data sources are known to be more reliable than others.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT *
FROM example_dataset
WHERE CRM_ID IN (
    SELECT CRM_ID FROM trusted_sources --make sure you have a trusted source table
);

--Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

Creating Composite Identifiers for Resolution

Description: Combine multiple fields (e.g., ECID + Device_Type) to create a composite identifier for deduplication. This strategy adds additional granularity to the process.
Use Case: Helpful in distinguishing between profiles that share the same ECID but differ in other aspects like device type or location.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM (
    SELECT ECID, CRM_ID, Device_Type, Login_Timestamp,
           ROW_NUMBER() OVER (PARTITION BY ECID ORDER BY Login_Timestamp DESC) AS final_rn
    FROM (
        -- First deduplication by ECID and Device_Type
        SELECT ECID, CRM_ID, Device_Type, Login_Timestamp,
               ROW_NUMBER() OVER (PARTITION BY ECID, Device_Type ORDER BY Login_Timestamp DESC) AS device_rn
        FROM example_dataset
    ) deduped_by_device
    WHERE device_rn = 1
) final_deduped
WHERE final_rn = 1;

-- Test to see if there is profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

The above query aims to clean up a dataset to prevent profile collapse by ensuring that each ECID (a unique customer identifier) is represented by only one record, even if there are multiple entries across different devices or CRM IDs. This is achieved through a two-step deduplication process. In the first step, we use a subquery to partition the data by both ECID and Device_Type, ordering by Login_Timestamp in descending order to keep only the most recent record for each unique combination of ECID and Device_Type. This intermediate result ensures that for each ECID, only the latest activity per device type is retained. In the second step, an outer query applies another ROW_NUMBER() partitioned solely by ECID, ordering again by Login_Timestamp in descending order. By selecting only the top-ranked record (final_rn = 1), the query retains only the latest entry per ECID across all device types. This two-layered approach effectively removes duplicate entries and prevents profile collapse by consolidating each ECID into a single, most recent record, providing a clean and unified dataset.

Clustering Techniques for Grouping Similar Profiles

Description: Use clustering algorithms (e.g., k-means, hierarchical clustering) to group similar profiles and resolve which ones are likely to represent the same individual.
Use Case: Useful when there are subtle differences in data attributes across profiles, and statistical methods can help identify groups that should be merged.
Implementation Example: This approach would be executed inside Data Distiller Advanced Statistics and Machine Learning features.

To address profile collapse, our clustering should identify profiles that are likely to belong to the same individual based on shared or similar identifiers. Here’s an approach focused on preventing profile collapse:

Device ID Consistency:
- Count the number of unique Device_Type values associated with each ECID and CRM_ID.
- Profiles with a wide variety of device types may indicate multiple users on shared devices, which could contribute to profile collapse.
- device_type_variety: Counts the distinct Device_Types used by each ECID. High variety here may indicate a shared device or cross-identity usage.
Shared CRM Identifiers:
- Calculate the frequency of each CRM_ID per ECID. High frequencies might indicate cases where multiple profiles with the same CRM_ID are collapsed into one.
- A high count of distinct CRM_IDs per ECID suggests possible identity collisions (e.g., a B2B and a B2C CRM_ID on the same ECID).
- crm_id_count: Counts distinct CRM_IDs per ECID, helping to detect cases where multiple identities are merged into one, indicating a potential profile collapse.
Login Recency:
- While not as direct, recency can still help detect anomalies where profile collapse might have caused inactive or mismatched data.
- login_recency: While not a direct indicator, it can provide context for activity levels, which may be useful if certain collapsed profiles appear inactive compared to expected activity.

By clustering based on these features, you’re more likely to identify clusters where ECIDs are artificially merged due to overlapping CRM_IDs or device types. Profiles within the same cluster that have multiple CRM_IDs could be flagged for further review or disambiguation, effectively isolating cases where profile collapse is likely occurring.

-- Upper threshold on the number of clusters
SELECT COUNT(DISTINCT ECID) FROM example_dataset;

-- Create the K-Means Model
CREATE OR REPLACE MODEL profile_collapse_clustering
TRANSFORM(vector_assembler(array(device_type_variety, crm_id_count, login_recency)) AS features)
OPTIONS (
    MODEL_TYPE = 'KMEANS',
    NUM_CLUSTERS = 1000,             -- Adjust the number of clusters based on dataset size
    MAX_ITER = 20                    -- Set maximum iterations
)
AS 
SELECT 
    device_type_variety,
    crm_id_count,
    login_recency
FROM 
    (SELECT 
        ECID,
        COUNT(DISTINCT Device_Type) AS device_type_variety,
        COUNT(DISTINCT CRM_ID) AS crm_id_count,
        DATEDIFF(day, MAX(Login_Timestamp), CURRENT_DATE) AS login_recency
    FROM example_dataset
    GROUP BY ECID);

-- Create a table to store the clusters and predictions for detecting profile collapse
CREATE TABLE IF NOT EXISTS ecid_cluster AS
SELECT *
FROM MODEL_PREDICT(profile_collapse_clustering, 1, (
    SELECT 
        ECID,
        COUNT(DISTINCT Device_Type) AS device_type_variety,    -- Number of unique devices per ECID
        COUNT(DISTINCT CRM_ID) AS crm_id_count,                -- Number of unique CRM_IDs per ECID
        DATEDIFF(day, MAX(Login_Timestamp), CURRENT_DATE) AS login_recency  -- Recency of last login
    FROM example_dataset
    GROUP BY ECID
));

-- Read the values from ecid_cluster
SELECT * FROM ecid_cluster;

Machine Learning-Based Resolution

Description: Train a machine learning model (e.g., logistic regression, decision trees) to predict whether two profiles should be merged based on features like similarity of CRM_IDs, login patterns, or device types.
Use Case: Ideal for complex datasets where simple rules don't capture the nuances, and historical labeled data is available to train a model.
Implementation Example: Use a machine learning framework to train the model, then apply the model predictions in SQL to clean up the dataset.
Assume that:
- We have historical data with labeled pairs of ECIDs and their features (CRM_ID similarity, device type overlap, login frequency similarity, etc.).
- We use logistic regression for binary classification to predict “merge” or “do not merge” for each profile pair.

Let us first generate the feature engineering dataset:

-- Step 1: Calculate login counts per ECID
CREATE TEMP TABLE profile_logins AS
SELECT 
    ECID,
    CRM_ID,
    Device_Type,
    Login_Timestamp,
    COUNT(*) OVER (PARTITION BY ECID) AS login_count
FROM example_dataset;

-- Step 2: Generate labeled pairs by self-joining on ECID pairs
CREATE TEMP TABLE labeled_pairs AS
SELECT 
    p1.ECID AS ECID1,
    p2.ECID AS ECID2,
    
    -- Feature: CRM_ID Similarity (1 if similar, 0 if not)
    CASE WHEN p1.CRM_ID = p2.CRM_ID THEN 1 ELSE 0 END AS crm_id_similarity,
    
    -- Feature: Device Type Overlap (1 if similar, 0 if not)
    CASE WHEN p1.Device_Type = p2.Device_Type THEN 1 ELSE 0 END AS device_type_overlap,
    
    -- Feature: Login Frequency Similarity (absolute difference)
    ABS(p1.login_count - p2.login_count) AS login_frequency_diff,
    
    -- Example Merge Label (1 if CRM_IDs match, 0 otherwise)
    CASE WHEN p1.CRM_ID = p2.CRM_ID THEN 1 ELSE 0 END AS merge_label

FROM profile_logins p1
JOIN profile_logins p2 
    ON p1.ECID < p2.ECID  -- Ensure unique pairs and avoid self-pairing
ORDER BY p1.ECID, p2.ECID;

-- Step 3: Check the result
SELECT * FROM labeled_pairs LIMIT 10;

Train the logistics regression model:

CREATE OR REPLACE MODEL profile_merge_model
TRANSFORM(vector_assembler(array(crm_id_similarity, device_type_overlap, login_frequency_diff)) AS features)
OPTIONS (
    MODEL_TYPE = 'LOGISTIC_REG', 
    LABEL = 'merge_label'       -- Logistic Regression for binary classification
) AS
SELECT 
    crm_id_similarity,
    device_type_overlap,
    login_frequency_diff,
    merge_label
FROM labeled_pairs;

Manual Review for High-Risk Collapses

Description: Identify high-risk profile collapses (e.g., large discrepancies in profile attributes) and flag them for manual review.
Use Case: When automated processes cannot resolve all cases with high confidence, manual intervention may be required for certain profiles.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

-- Step 1: Test for Profile Collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

-- Step 2: If profile collapse is detected, create a table of affected ECIDs
CREATE TEMP TABLE collapsed_profiles AS
SELECT ECID
FROM example_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 2;

-- Step 3: Compile records needing manual review based on identified ECIDs
CREATE TEMP TABLE manual_review AS
SELECT *
FROM example_dataset
WHERE ECID IN (SELECT ECID FROM collapsed_profiles);

Confidence-Based Merging of Multiple Strategies

Description: Combine multiple strategies by assigning confidence levels to each and merging profiles based on the highest combined confidence.
Use Case: Ideal when no single rule can address all cases effectively, allowing a multi-strategy approach to improve resolution accuracy.
Implementation Example: Create multiple confidence scores based on different rules, then aggregate these to determine the final outcome.

Fuzzy Matching Techniques for Inexact Data

Description: Apply fuzzy matching on attributes like CRM_ID or name fields to identify records that are close matches but not exact. This can help clean up cases where variations in identifiers exist.
Use Case: Helpful when there are data entry errors or slight differences in CRM_ID formats that contribute to profile collapses.
Implementation Example: Use Data Distiller preprocess the data with fuzzy matching techniques and then apply cleaned values in SQL. There is a detailed tutorial for this located here.

Audit-Based Reconciliation

Description: Implement auditing rules where profiles with certain characteristics (e.g., frequent logins from different devices) are automatically flagged for reconciliation checks.
Use Case: Ensures ongoing monitoring of data integrity by routinely checking for signs of potential profile collapses.

Implementation Example: Make sure you execute the code blocks one by one

DROP TEMP TABLE IF EXISTS cleaned_dataset;

CREATE TEMP TABLE flagged_for_audit AS
SELECT ECID, COUNT(DISTINCT Device_Type) AS Distinct_Device_Count
FROM example_dataset
GROUP BY ECID
HAVING Distinct_Device_Count > 3;

Choosing the Right Strategy

The appropriate strategy will depend on factors such as:

Data Quality: How consistent and accurate is the CRM_ID data?
Business Rules: Are there clear rules for prioritizing certain records over others?
Data Usage: Will all the attributes be needed, or can some be safely discarded?

Combining multiple strategies may also be necessary, such as merging attributes for certain profiles while prioritizing the latest records for others. Testing different approaches on sample datasets can help in choosing the optimal strategy for your specific scenario

Validating the Cleanup

To ensure the cleanup worked, always run a check to see if any ECID still has multiple CRM_IDs:

SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

If this query returns no results, the cleanup was successful.

Finalizing the Derived Dataset

If the cleaned dataset meets your criteria, you can replace the original dataset with the cleaned version or store it as a new dataset for further processing.

Last updated 9 months ago