IDR 301: Using Levenshtein Distance for Fuzzy Matching in Identity Resolution with Data Distiller

Learn how to apply fuzzy matching with Data Distiller to improve accuracy in identity resolution and profile management.


Prerequisites

For this tutorial, you will need to ingest the example_dataset.csv file (404 KB) into Adobe Experience Platform using the steps outlined in PREP 500: Ingesting CSV Data into Adobe Experience Platform.

We will also be using DBVisualizer to extract large volumes of data directly onto our machine from the Data Distiller backend; the setup is covered in PREP 400: DBVisualizer SQL Editor Setup for Data Distiller.

Overview

Fuzzy matching is a technique used to identify and link records that are similar but not exact matches, such as CRM_IDs or name fields with slight variations. By applying fuzzy matching, you can address discrepancies arising from data entry errors or inconsistent formatting, which are common causes of profile collapse in customer databases. When records are mistakenly treated as separate due to small variations, fuzzy matching can help consolidate them, leading to more accurate identity resolution.

Fuzzy matching is particularly valuable when dealing with data entry errors or subtle differences in identifier formats that cause profile fragmentation or collapse. For instance, CRM_IDs that differ by one or two characters may represent the same individual, but these small discrepancies prevent a system from recognizing them as such. By applying fuzzy matching, you can detect and merge these near-duplicates, improving the quality and continuity of the customer profiles.

The implementation involves preprocessing the data with fuzzy matching techniques using Data Distiller. This initial processing phase standardizes and identifies close matches within CRM_IDs or name fields. The cleaned and standardized values are then applied in SQL to create a consolidated dataset, resulting in more accurate profile records and reducing instances of incorrect merges or data fragmentation. This multi-step approach combines fuzzy matching capabilities with SQL’s power to efficiently handle large datasets, enhancing overall data accuracy and reliability.

Levenshtein Distance

The Levenshtein distance is a way of measuring how different two words (or strings) are from each other. Imagine you have two words, and you want to see how many small changes it would take to turn one word into the other.

Each change could be:

  1. Inserting a new letter (e.g., turning "cat" into "cart" by adding "r").

  2. Deleting a letter (e.g., turning "cart" into "cat" by removing "r").

  3. Replacing one letter with another (e.g., turning "cat" into "bat" by changing "c" to "b").

The Levenshtein distance counts the minimum number of these small changes needed to make the two words exactly the same. So, if two words are very similar, like "cat" and "bat," the distance is small (only one change). If they're quite different, like "cat" and "elephant," the distance is much larger because you'd need many changes.

In essence, Levenshtein distance gives us a simple way to measure how "close" or "far" two words are from each other based on the number of changes required. It's often used in spell-checkers or in finding similar records in databases to help match up entries that might be slightly different due to typos or inconsistencies.
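
To see these distances concretely, you can call the LEVENSHTEIN function (the same function used in the next section) directly on string literals. The following is a quick illustrative query; it assumes the function takes two string arguments, as it is used later in this tutorial:

SELECT
    LEVENSHTEIN('cat', 'cart')     AS one_insertion,    -- 1: insert "r"
    LEVENSHTEIN('cart', 'cat')     AS one_deletion,     -- 1: remove "r"
    LEVENSHTEIN('cat', 'bat')      AS one_substitution, -- 1: replace "c" with "b"
    LEVENSHTEIN('cat', 'elephant') AS many_edits;       -- 6: one substitution plus five insertions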

Identify Fuzzy Matches in CRM_ID using LEVENSHTEIN distance

In the SQL example below, fuzzy matching is applied to identify records with CRM_IDs that are nearly identical, even if they differ slightly due to minor data entry errors or formatting inconsistencies. The query leverages the Levenshtein distance function to calculate the edit distance between pairs of CRM_IDs, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to make one CRM_ID identical to another. By setting a threshold of 2, the query identifies CRM_ID pairs that have minor variations—indicating that they may refer to the same entity but were inconsistently recorded. This approach is particularly useful in cases where exact matches fail to capture all duplicates due to slight discrepancies. By storing these potential matches in a temporary table, fuzzy_matches, the process allows for a detailed review or automated cleanup to merge or consolidate profiles, ultimately improving the accuracy and integrity of the dataset.
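
A minimal sketch of such a query is shown below. It assumes example_dataset exposes the ECID and CRM_ID columns referenced in the later steps, self-joins the table to compare pairs of distinct CRM_ID values, and keeps only pairs whose edit distance is at most 2:

-- Step 1 (sketch): Identify fuzzy matches between CRM_IDs using the Levenshtein distance
CREATE TEMP TABLE fuzzy_matches AS
SELECT
    a.ECID   AS ECID1,
    b.ECID   AS ECID2,
    a.CRM_ID AS CRM_ID1,
    b.CRM_ID AS CRM_ID2,
    LEVENSHTEIN(a.CRM_ID, b.CRM_ID) AS crm_id_similarity_score
FROM example_dataset a
JOIN example_dataset b
  ON a.CRM_ID <> b.CRM_ID                      -- compare different CRM_ID values only
WHERE LEVENSHTEIN(a.CRM_ID, b.CRM_ID) <= 2;    -- threshold of 2 edits, as described above

SELECT * FROM fuzzy_matches;

Note that a self-join like this compares every pair of rows, which is acceptable for a 10,000-row sample but would need blocking or pre-filtering on much larger datasets.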

Select Best Match Based on the Highest Similarity Score

In the query below, the goal is to identify the best match for each ECID based on the highest similarity between CRM_IDs, using the fuzzy matching results from Step 1. The query operates on the principle of minimizing the Levenshtein distance (also called the edit distance) between CRM_ID pairs within each ECID group. By finding the smallest possible crm_id_similarity_score, we capture the closest match, meaning the pair that requires the fewest character edits to make the CRM_IDs identical.

The subquery (SELECT MIN(crm_id_similarity_score)...) determines this closest match by selecting the smallest crm_id_similarity_score for each ECID1, representing the record with the highest similarity. The primary query then filters fuzzy_matches to include only those pairs whose similarity score is equal to this minimum value, effectively creating best_matches. This temporary table stores each ECID and its closest matching record, allowing for precise consolidation based on the closest possible CRM_ID values. By focusing on the minimum edit distance, the query ensures that only the best match is selected for each ECID, thus refining identity resolution and reducing the chance of incorrect profile merges.

-- Step 2: Select the best match for each ECID based on the lowest edit distance (highest similarity)
CREATE TEMP TABLE best_matches AS
SELECT ECID1, ECID2, CRM_ID1, CRM_ID2
FROM fuzzy_matches
WHERE crm_id_similarity_score = (
    SELECT MIN(crm_id_similarity_score)
    FROM fuzzy_matches f
    WHERE f.ECID1 = fuzzy_matches.ECID1
);

SELECT * FROM best_matches;

Create a Cleaned Dataset

To create the cleaned dataset, use the following:

-- Step 3: Create a new cleaned dataset with preferred CRM_ID for each ECID
CREATE TEMP TABLE cleaned_dataset AS
SELECT 
    a.ECID,
    COALESCE(b.CRM_ID1, a.CRM_ID) AS CRM_ID,  -- Use preferred CRM_ID from best_matches if available
    a.Device_Type,
    a.Login_Timestamp
FROM example_dataset a
LEFT JOIN best_matches b ON a.ECID = b.ECID2;

SELECT * FROM cleaned_dataset;

In the query above, the goal is to create a cleaned dataset where each ECID is associated with a preferred CRM_ID, selected based on the closest match identified in the best_matches table.

The query works on the principle of data standardization and preference selection—it uses fuzzy matching results to replace potentially inconsistent or duplicate CRM_IDs with the most representative one for each ECID. Here’s how it achieves this:

  1. COALESCE Selection: The query applies the COALESCE(b.CRM_ID1, a.CRM_ID) function, which takes the CRM_ID from best_matches (if available) as the preferred identifier. COALESCE ensures that if there is no match in best_matches, the original CRM_ID from example_dataset (a.CRM_ID) is retained. This means that for each ECID, the system first looks for a refined CRM_ID and defaults to the original one if no match exists.

  2. LEFT JOIN: By performing a LEFT JOIN between example_dataset (a) and best_matches (b) on ECID, the query ensures that all records in example_dataset are preserved. Only records with a corresponding ECID in best_matches will have the CRM_ID replaced, making the cleaned dataset comprehensive while preserving unmatched entries.

  3. Resulting Cleaned Dataset: The cleaned_dataset now contains records where each ECID is linked to the best possible CRM_ID, improving data consistency by standardizing identifiers based on the closest match.

If you preview cleaned_dataset at this point, you will find that it still contains duplicate records for some ECIDs.

To remove duplicate records in the cleaned_dataset based on ECID while retaining only one record per ECID, you can use a ROW_NUMBER() function to rank the records within each ECID group, then select only the top-ranked record. This method ensures that duplicates are filtered out, leaving only one preferred CRM_ID for each ECID.

-- Step 3a: Create a dataset with preferred CRM_IDs and rank duplicates for removal
CREATE TEMP TABLE ranked_cleaned_dataset AS
SELECT 
    a.ECID,
    COALESCE(b.CRM_ID1, a.CRM_ID) AS CRM_ID,  -- Use preferred CRM_ID from best_matches if available
    a.Device_Type,
    a.Login_Timestamp,
    ROW_NUMBER() OVER (PARTITION BY a.ECID ORDER BY a.Login_Timestamp DESC) AS rn  -- Rank records within each ECID by Login_Timestamp
FROM example_dataset a
LEFT JOIN best_matches b ON a.ECID = b.ECID2;

-- Step 3b: Keep only the top-ranked record for each ECID to remove duplicates
DROP TABLE IF EXISTS cleaned_dataset;  -- remove the Step 3 version of the table before re-creating it
CREATE TEMP TABLE cleaned_dataset AS
SELECT ECID, CRM_ID, Device_Type, Login_Timestamp
FROM ranked_cleaned_dataset
WHERE rn = 1;

Verify the Cleaned Dataset

You can preview the results, which is practical for a small dataset like this one.
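
A minimal preview query, using the cleaned_dataset table created above, is:

SELECT * FROM cleaned_dataset;  -- scrolling the result set shows 9,458 rows, down from the original 10,000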

The reduced row count in the result set (9,458 instead of the original 10,000) indicates that some records were filtered out as duplicates when the ROW_NUMBER() function was applied. Many ECIDs in the original example_dataset had multiple records, either with different CRM_IDs or with duplicate entries differing only by Login_Timestamp. Keeping only the rn = 1 record retains one row per ECID, the one with the most recent Login_Timestamp, so the 10,000 original rows collapse to 9,458 unique ECIDs in cleaned_dataset.
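
Because the deduplication keeps exactly one row per ECID, the cleaned row count should equal the number of distinct ECIDs in the original data. A quick cross-check, sketched with the table names used above:

SELECT COUNT(DISTINCT ECID) AS distinct_ecids FROM example_dataset;  -- expect 9,458
SELECT COUNT(*) AS cleaned_rows FROM cleaned_dataset;                -- expect 9,458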

-- Step 4: Verify the result by checking for profile collapse
SELECT ECID, COUNT(DISTINCT CRM_ID) AS Distinct_CRM_Count
FROM cleaned_dataset
GROUP BY ECID
HAVING COUNT(DISTINCT CRM_ID) > 1;

An empty result set confirms that no ECID is linked to more than one CRM_ID: each of the 9,458 identities in cleaned_dataset now carries a single, preferred CRM_ID.
