
IDR 302: Algorithmic Approaches to B2B Contacts - Unifying and Standardizing Across Sales Orgs

Learn algorithmic techniques for merging, deduplicating, and enriching B2B contact data to create unified, accurate profiles using Data Distiller

Last updated 5 months ago

Prerequisites

Download the following file: contacts_list.csv (1MB)

Ingest the file by following this tutorial: PREP 500: Ingesting CSV Data into Adobe Experience Platform

We will also be using: IDR 301: Using Levenshtein Distance for Fuzzy Matching in Identity Resolution with Data Distiller

Case Study Overview

The case study is about a financial company that is facing significant challenges with contact records across various systems, particularly multiple instances of Salesforce, Adobe CDP, and Marketo. The challenges arise from data fragmentation, lack of standardization, duplication, and governance issues, impacting both marketing and sales teams. Here’s an in-depth look at each challenge and how it affects operations, followed by potential solutions for addressing them using SQL on a contact list dataset:

  1. Fragmented Data Across Multiple Salesforce Instances: Contacts are stored in multiple Salesforce instances. Some users only have access to a subset of instances, resulting in incomplete data visibility. This limits the ability of users (especially in sales and customer service) to see the full history or context of a contact, impacting customer interactions and leading to missed opportunities.

  2. Duplicate Contacts: Duplicate contacts exist within and across multiple instances, creating conflicting data points. Duplicates lead to confusion as different versions of the same contact may contain conflicting details, such as job titles, roles, and companies, which affect marketing and sales targeting.

  3. Data Footprint in Adobe CDP: Only a small subset of users has access to the digital footprint data captured in Adobe CDP. Sales and customer-facing teams lack critical insights derived from digital interactions, reducing their ability to personalize engagements effectively.

  4. Multiple Email Addresses Per User: Adobe CDP has identified multiple email addresses for some users, merging them into a single user record, but this data is not synchronized back to sales systems. Inconsistent email addresses create a fragmented view in sales systems, which may lead to duplicated outreach or incomplete activity history.

  5. Disconnected Marketing Technology Stacks: Multiple marketing stacks (e.g., associated with different instances) are not integrated, preventing a cohesive enterprise-wide campaign view. Marketing messages may become redundant or disconnected as customers are targeted by individual product campaigns instead of a unified brand campaign.

  6. Lack of Governance for Contact Creation: Users can create new contacts in core Salesforce as long as the email is different, even if the contact exists in another instance. This results in scattered records for the same contact across systems, making it challenging to maintain an accurate, centralized customer profile.

  7. Multiple Roles for Contacts: Contacts may hold multiple roles across organizations, which impacts the type of messaging they should receive. Inconsistent role-based communication strategies can cause confusion, as the same contact might be messaged for multiple roles within different campaigns, leading to mixed messaging.

  8. Tracking Contact Promotions and Role Changes: There’s no enterprise-wide tracking of contacts' role changes or promotions. When a contact transitions to a new role (e.g., promotion or job change), users may inadvertently lose the contact's activity history, limiting the personalization potential for future interactions.

  9. Difficulty with Attribution in Marketing: Contacts spread across multiple Salesforce instances make it hard to track and attribute marketing activities accurately. Inaccurate attribution data impacts budget allocation decisions, making it challenging for marketing teams to understand the effectiveness of their campaigns or optimize spending.

Dataset Strategy for Supporting Sales Team-Specific Rules and Marketing-Level Cohesion

When working with diverse sales teams, each with its unique business rules and priorities, a robust dataset strategy must balance the need for individualized algorithms with the overarching goal of enabling marketing to look across all sales organizations cohesively. This strategy ensures that the data remains harmonized at the schema level but allows for flexibility in processing and prioritizing information to suit both localized needs and enterprise-wide insights. Here’s how such a strategy can be designed:

Custom Algorithms for Each Sales Organization

Each sales team operates with specific business rules and requirements. To support this, we implement custom algorithms for processing the harmonized dataset for each sales organization. These algorithms allow for:

  • Tailoring how data is aggregated, deduplicated, and prioritized based on the sales team's operational focus.

  • Generating attributes unique to that sales organization, such as region-specific metrics or custom scoring models for leads.

  • Aligning with local sales strategies while ensuring the outputs conform to a standard schema for cross-organizational interoperability.

For example, Sales Org A might prioritize customer engagement metrics, while Sales Org B focuses on product affinity scores. These differences are captured in their respective datasets.

Harmonization with a Single Schema

Although the datasets for each sales organization are processed with different algorithms, the outputs adhere to a common schema. This standardized schema ensures that attributes across datasets are aligned and comparable. For instance:

  • Attributes like email, first_name, and purchase_history remain consistent across all datasets.

  • Unique business rules are applied at the processing level but do not compromise the schema's integrity.

This harmonization enables the Profile Store to ingest and manage all datasets seamlessly while retaining each sales organization's specific details.

Ingesting Datasets into the Profile Store

The datasets for each sales organization are ingested into the Profile Store, which serves as the central repository for customer data. Each dataset is preserved as an independent layer within the Profile Store, ensuring that:

  • Marketing and sales teams can reference the datasets individually or collectively.

  • Data lineage is maintained, enabling traceability of attributes back to their originating sales organization.

Dynamic Data Selection Using Merge Policies

The Profile Store enables dynamic selection of datasets through merge policies, which define how datasets are prioritized and combined:

  • Merge Policies by Sales Organization: Individual sales teams can specify merge policies that prioritize their dataset when creating audiences or running segmentation.

  • Cross-Organization Merge Policies: Marketing can define merge policies that aggregate and harmonize datasets using Data Distiller across sales organizations, providing a unified view of customer data.

For example, one merge policy might prioritize the most recent dataset from a specific sales org, while another aggregates the highest-value attributes across all sales orgs.

Flexible Segmentation and Personalization Contexts

Merge policies allow for dynamic person audience creation at a granular level. This ensures:

  • Sales-Specific Views: Sales teams can work within their own datasets, creating audiences and personalization rules that align with their business priorities.

  • Marketing-Level Insights: Marketing can look across all sales organizations by selecting merge policies that harmonize datasets, enabling segmentation and personalization at the enterprise level.

By switching merge policies, teams can seamlessly transition between localized and global perspectives without altering the underlying datasets.

Profile Snapshots for Merge Policies

To ensure operational flexibility and avoid conflicts between teams, Profile Snapshots are created for each merge policy. These snapshots capture:

  • The state of the unified profile dataset under a specific merge policy at a given time.

  • A static, reproducible dataset for segmentation, analysis, or experimentation without disrupting live datasets.

For example, marketing can generate a snapshot using a cross-organization merge policy for a campaign, while a sales org uses a snapshot of its own dataset for a regional initiative.

A Unified Yet Flexible Dataset Strategy

This strategy allows each sales organization to operate within its unique business rules while maintaining a harmonized data foundation. The use of custom algorithms ensures localized relevance, while the standardized schema and Profile Store enable enterprise-wide cohesion. Merge policies and Profile Snapshots provide the flexibility needed for segmentation and personalization at both the sales and marketing levels. This approach empowers the organization to balance tailored sales strategies with holistic marketing insights, ensuring consistent data integrity, adaptability, and alignment across the business.

Data Harmonization

Harmonization of data is critical in the above case study because fragmented and inconsistent datasets hinder the ability to gain a unified view of their customers, leading to inefficiencies, missed opportunities, and poor decision-making. When data is spread across multiple systems, such as separate Salesforce instances, discrepancies like duplicate records, conflicting contact details, and incomplete histories arise. This fragmentation not only causes confusion among teams but also impacts the accuracy of customer insights, making it challenging to deliver personalized experiences or coordinate cohesive campaigns. By harmonizing data into a single schema, organizations can resolve inconsistencies, preserve valuable historical context, and ensure that every team—from marketing to sales and customer service—has access to a complete and accurate profile for each customer. This unified approach enables better customer engagement, more targeted marketing strategies, and improved operational efficiency, ensuring data integrity and business relevance at every level.

The diagram illustrates a simple example where fragmented data from multiple Salesforce instances is integrated into a unified view by harmonizing it into a single schema, referred to as the Profile Person Dataset. Each Salesforce instance, such as Salesforce Instance 1 and Salesforce Instance 2, operates independently and often contains overlapping or duplicate records with inconsistent attributes like contact details, roles, or activity history. This fragmentation creates challenges for teams such as marketing, sales, and customer service, which require complete and accurate data to deliver personalized customer experiences.

The first step in the process involves exporting data from each Salesforce instance into corresponding datasets, labeled Dataset 1 and Dataset 2 in the diagram. These datasets reflect the unique structures and attributes of their respective source systems. However, the extracted datasets are often misaligned in terms of format, schema, and content, necessitating a transformation process to create a consistent view.

The transformation and harmonization occur in the Process stage, which is powered by Data Distiller’s ETL (Extract, Transform, Load) capabilities. During this stage, the datasets are extracted, deduplicated, and standardized into a unified schema. Advanced techniques such as fuzzy matching (e.g., using Levenshtein distance) are employed to group similar records and resolve duplicates. Business rules are applied to prioritize data, with recent and strategically significant records given higher weight. This ensures that the most relevant and accurate information is retained for each contact. The harmonized data is then loaded into a unified dataset.

The final output of this process is the Profile Person Dataset, represented by the large red box in the diagram. This dataset consolidates all contact records into a single, comprehensive schema, combining historical context, role changes, and marketing relevance. By harmonizing the data, organizations eliminate silos, ensure data integrity, and provide a complete 360-degree view of each contact. This unified dataset enables teams to engage in more effective marketing campaigns, optimize sales strategies, and improve customer service interactions.

The harmonized dataset schema, supplied as the CSV file for our example, contains the following fields:

  • contact_id: Unique identifier for each contact record.

  • instance_id: Identifies the Salesforce instance (e.g., Salesforce_A, Salesforce_B, Salesforce_C) from which the contact originates, suggesting that the same contact may appear across multiple instances.

  • first_name and last_name: Contact’s names, useful for identifying potential duplicates along with the email field.

  • email: Assumed to be the primary identifier for deduplication, as it is likely unique for each contact in real-world scenarios.

  • role: Job role or position of the contact; contacts may have multiple roles across different instances.

  • company: The company associated with each contact; may vary if the contact changes jobs.

  • marketing_stack: Represents the marketing technology stack; certain stacks may be prioritized for marketing purposes.

  • product: Products associated with the contact, potentially reflecting products they interact with or oversee.

  • data_source: Source system providing the contact data (e.g., Salesforce, Adobe CDP, Marketo).

  • last_updated: The last date the contact record was updated, useful for identifying the most recent information.

  • contact_status: Indicates whether the contact is Active or Inactive.

The data looks like the following after harmonization:

Assumptions for Deduplication and Prioritization

  • Email as Primary Identifier: email will be treated as the primary key for deduplication. If multiple records share the same email, they represent the same contact, albeit with potential differences in roles, stacks, or companies.

  • Recent Updates Preferred: The most recent information is more likely to be accurate, so records with a more recent last_updated date will be prioritized.

  • Stack Priority:

    • Suppose we assign weights to stacks: Stack_A > Stack_B > Stack_C > Stack_D. This weighting could reflect strategic priorities in marketing or sales, such as favoring contacts associated with certain technologies or markets.

  • Role Relevance: The role field is important for targeting, especially if the contact holds a decision-making role. When a contact has multiple roles across records, we can aggregate these roles in a historical sequence.

  • Active vs. Inactive Status: Records marked as Active are prioritized for deduplication and analysis, while Inactive contacts can be retained for reference or historical purposes but not for active engagement.

Let us first execute a basic exploratory query:

SELECT * FROM contact_list;

The results show the patterns we would expect:

  • Potential Duplicates: Since contacts can exist across multiple instances (instance_id) and may have similar first_name, last_name, or email, there is a possibility of duplicate entries that need to be merged.

  • Multiple Roles and History: Contacts may have different roles or companies over time, which necessitates tracking role and company history for a complete profile.

  • Prioritization Opportunity: With the marketing_stack, last_updated, and contact_status fields, prioritization can be applied to keep the most relevant and recent records for each contact.

Identify and Remove Duplicates Based on Prioritization and Ranking with Email as Deduplication Dimension

The prioritization and ranking process for contact deduplication leverages a composite scoring method to ensure that the most relevant and recent information is retained for each unique email. This method assigns a priority_score to each record by combining two key factors: stack weight and recency. Marketing stacks are assigned weights based on importance, with higher values for preferred stacks (e.g., Stack_A > Stack_B > others), indicating their strategic relevance. Additionally, records are prioritized by last_updated date, where more recent records receive a higher score. By calculating a priority_score as a function of these weights, each contact's records are ranked within their email group. The record with the highest score is selected as the primary entry for each email, thus retaining the most critical and timely data while removing redundant or lower-priority entries. This process ensures that data integrity and business relevance are maximized across the contact list.

-- Prioritize and deduplicate contacts based on stack weight and recency
WITH scored_contacts AS (
    SELECT *,
           -- Assign weights based on marketing_stack
           CASE 
               WHEN marketing_stack = 'Stack_A' THEN 4
               WHEN marketing_stack = 'Stack_B' THEN 3
               WHEN marketing_stack = 'Stack_C' THEN 2
               WHEN marketing_stack = 'Stack_D' THEN 1
               ELSE 0  -- Default weight for unknown stacks
           END AS stack_weight,
           
           -- Days since last update (larger values mean older records; subtracted below so recent records score higher)
           DATEDIFF(CURRENT_DATE, last_updated) AS recency_weight,
           
           -- Calculate the priority score: combine stack weight and recency
           (CASE 
               WHEN marketing_stack = 'Stack_A' THEN 4
               WHEN marketing_stack = 'Stack_B' THEN 3
               WHEN marketing_stack = 'Stack_C' THEN 2
               WHEN marketing_stack = 'Stack_D' THEN 1
               ELSE 0
           END * 10) - DATEDIFF(CURRENT_DATE, last_updated) AS priority_score
           
    FROM contact_list
),
ranked_contacts AS (
    -- Rank records by priority score within each email group
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY priority_score DESC) AS rank
    FROM scored_contacts
)
-- Select the top-ranked record for each email
SELECT *
FROM ranked_contacts
WHERE rank = 1;
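The same score-and-rank logic can be sketched outside SQL to make the arithmetic concrete. The following is a minimal Python illustration, not Data Distiller code; the sample records, reduced field set, and fixed `today` anchor date are assumptions for the example:

```python
from datetime import date

# Stack weights mirroring the CASE expression in the query above
STACK_WEIGHTS = {"Stack_A": 4, "Stack_B": 3, "Stack_C": 2, "Stack_D": 1}

def priority_score(record, today=date(2024, 6, 1)):
    """Composite score: stack weight x 10, minus days since last update."""
    weight = STACK_WEIGHTS.get(record["marketing_stack"], 0)
    days_old = (today - record["last_updated"]).days
    return weight * 10 - days_old

def dedupe_by_email(records, today=date(2024, 6, 1)):
    """Keep the highest-scoring record per email (the ROW_NUMBER ... WHERE rank = 1 step)."""
    best = {}
    for rec in records:
        score = priority_score(rec, today)
        if rec["email"] not in best or score > best[rec["email"]][0]:
            best[rec["email"]] = (score, rec)
    return {email: rec for email, (score, rec) in best.items()}

contacts = [
    {"email": "a@x.com", "marketing_stack": "Stack_A", "last_updated": date(2024, 5, 1)},
    {"email": "a@x.com", "marketing_stack": "Stack_B", "last_updated": date(2024, 5, 30)},
]
winners = dedupe_by_email(contacts)  # the fresher Stack_B record wins (score 28 vs 9)
```

Note how recency can overturn stack weight: the Stack_A record's 10-point weight advantage is erased by its 28 extra days of staleness.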

Resolve Multiple Roles

Contacts may have multiple roles in different companies or contexts. We can aggregate the roles associated with each contact's email and then apply the deduplication algorithm from the previous section.

-- Step 1: Aggregate roles for each email address
WITH role_aggregated_contacts AS (
    SELECT email,
           ARRAY_AGG(role) AS role_history  -- Aggregate roles in an array for each email
    FROM contact_list
    GROUP BY email
),

-- Step 2: Join aggregated roles back to original contacts and calculate priority score
scored_contacts AS (
    SELECT c.*,
           ra.role_history,
           -- Assign weights based on marketing_stack
           CASE 
               WHEN c.marketing_stack = 'Stack_A' THEN 4
               WHEN c.marketing_stack = 'Stack_B' THEN 3
               WHEN c.marketing_stack = 'Stack_C' THEN 2
               WHEN c.marketing_stack = 'Stack_D' THEN 1
               ELSE 0  -- Default weight for unknown stacks
           END AS stack_weight,
           
           -- Days since last update (larger values mean older records; subtracted below so recent records score higher)
           DATEDIFF(CURRENT_DATE, c.last_updated) AS recency_weight,
           
           -- Calculate the priority score: combine stack weight and recency
           (CASE 
               WHEN c.marketing_stack = 'Stack_A' THEN 4
               WHEN c.marketing_stack = 'Stack_B' THEN 3
               WHEN c.marketing_stack = 'Stack_C' THEN 2
               WHEN c.marketing_stack = 'Stack_D' THEN 1
               ELSE 0
           END * 10) - DATEDIFF(CURRENT_DATE, c.last_updated) AS priority_score
    FROM contact_list AS c
    JOIN role_aggregated_contacts AS ra
    ON c.email = ra.email
),

-- Step 3: Rank and select top-ranked record for each email
ranked_contacts AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY priority_score DESC) AS rank
    FROM scored_contacts
)

-- Select the top-ranked record for each email
SELECT *
FROM ranked_contacts
WHERE rank = 1;

The ARRAY_AGG function consolidates all roles associated with each unique email into a single array (role_history) before deduplication, ensuring that no role information is lost. This aggregated array captures a comprehensive role history, allowing us to keep all roles linked to a contact, even if they appeared in different records. By using ARRAY_AGG early in the process, we retain the full context of each contact’s various roles in one place, enabling deeper insights and use cases like historical tracking or personalization, while streamlining the dataset to one deduplicated record per email with all relevant roles intact.
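The aggregation step is easy to picture in plain Python. This is an illustrative sketch of what ARRAY_AGG ... GROUP BY email produces (the sample rows are invented for the example):

```python
from collections import defaultdict

def aggregate_roles(records):
    """Mirror ARRAY_AGG(role) GROUP BY email: collect every role seen per email."""
    history = defaultdict(list)
    for rec in records:
        history[rec["email"]].append(rec["role"])
    return dict(history)

rows = [
    {"email": "a@x.com", "role": "Analyst"},
    {"email": "a@x.com", "role": "Manager"},
    {"email": "b@x.com", "role": "Director"},
]
role_history = aggregate_roles(rows)  # a@x.com keeps both roles in one array
```

Because the roles are gathered before deduplication collapses the rows, no role is lost when only one record per email survives.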

Prioritize Active Contacts

To prioritize active contact records while retaining role history and deprioritizing inactive records, we can adjust the algorithm by modifying the priority_score calculation to give preference to active records. This ensures that active records are ranked higher, while inactive ones are only retained if there are no active records available for the same email.

-- Step 1: Aggregate roles for each email address
WITH role_aggregated_contacts AS (
    SELECT email,
           ARRAY_AGG(role) AS role_history  -- Aggregate roles in an array for each email
    FROM contact_list
    GROUP BY email
),

-- Step 2: Join aggregated roles back to original contacts and calculate priority score with active status prioritized
scored_contacts AS (
    SELECT c.*,
           ra.role_history,
           -- Assign weights based on marketing_stack
           CASE 
               WHEN c.marketing_stack = 'Stack_A' THEN 4
               WHEN c.marketing_stack = 'Stack_B' THEN 3
               WHEN c.marketing_stack = 'Stack_C' THEN 2
               WHEN c.marketing_stack = 'Stack_D' THEN 1
               ELSE 0  -- Default weight for unknown stacks
           END AS stack_weight,
           
           -- Days since last update (larger values mean older records; subtracted below so recent records score higher)
           DATEDIFF(CURRENT_DATE, c.last_updated) AS recency_weight,
           
           -- Calculate the priority score with additional weight for active status
           ((CASE 
               WHEN c.marketing_stack = 'Stack_A' THEN 4
               WHEN c.marketing_stack = 'Stack_B' THEN 3
               WHEN c.marketing_stack = 'Stack_C' THEN 2
               WHEN c.marketing_stack = 'Stack_D' THEN 1
               ELSE 0
           END * 10) - DATEDIFF(CURRENT_DATE, c.last_updated)) +
           CASE 
               WHEN c.contact_status = 'Active' THEN 100  -- Add a high weight for active contacts
               ELSE 0
           END AS priority_score
    FROM contact_list AS c
    JOIN role_aggregated_contacts AS ra
    ON c.email = ra.email
),

-- Step 3: Rank and select top-ranked record for each email
ranked_contacts AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY priority_score DESC) AS rank
    FROM scored_contacts
)

-- Select the top-ranked record for each email
SELECT *
FROM ranked_contacts
WHERE rank = 1;
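The effect of the active-status bonus can be checked with a small Python sketch (sample dates and the fixed `today` anchor are assumptions for illustration):

```python
from datetime import date

STACK_WEIGHTS = {"Stack_A": 4, "Stack_B": 3, "Stack_C": 2, "Stack_D": 1}
ACTIVE_BONUS = 100  # mirrors the +100 CASE branch for Active contacts

def priority_score(record, today=date(2024, 6, 1)):
    """Stack weight x 10, minus days since last update, plus the active bonus."""
    score = STACK_WEIGHTS.get(record["marketing_stack"], 0) * 10
    score -= (today - record["last_updated"]).days
    if record["contact_status"] == "Active":
        score += ACTIVE_BONUS
    return score

# An older, low-weight Active record still outranks a fresh Inactive one
stale_active = {"marketing_stack": "Stack_D", "last_updated": date(2024, 5, 1),
                "contact_status": "Active"}          # 10 - 31 + 100 = 79
fresh_inactive = {"marketing_stack": "Stack_A", "last_updated": date(2024, 5, 31),
                  "contact_status": "Inactive"}      # 40 - 1 = 39
```

One caveat of this weighting: because the recency penalty is unbounded, an Active record more than roughly 100 days staler than an Inactive one can still lose; tune the bonus if active status must always win.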

Handle Contact Promotion and Track History

To track contact promotions and role changes before deduplication, we can create a history array that captures both the role and the timestamp (last_updated) for each change. This allows us to maintain a historical view of each contact's roles and companies over time. We’ll use ARRAY_AGG to create this array with a combination of role, company, and last_updated for each unique email.

-- Identify contact promotions with role history in chronological order
WITH ordered_contacts AS (
    SELECT email, first_name, last_name, company, role, last_updated
    FROM contact_list
    ORDER BY email, last_updated ASC
)
SELECT email, first_name, last_name,
       company,
       MIN(last_updated) AS first_seen,
       MAX(last_updated) AS last_seen,
       ARRAY_AGG(role) AS role_history  -- Collect roles in already sorted order
FROM ordered_contacts
GROUP BY email, first_name, last_name, company;
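The grouping above can be sketched in Python to show how first_seen, last_seen, and the ordered role history are derived (sample rows and ISO-string dates, which sort chronologically, are assumptions for the example; grouping is reduced to email and company for brevity):

```python
from collections import defaultdict

def promotion_history(records):
    """Group by (email, company); collect roles chronologically with first/last seen dates."""
    groups = defaultdict(list)
    # ISO-8601 date strings sort lexicographically in chronological order
    for rec in sorted(records, key=lambda r: r["last_updated"]):
        groups[(rec["email"], rec["company"])].append(rec)
    return {
        key: {
            "first_seen": recs[0]["last_updated"],
            "last_seen": recs[-1]["last_updated"],
            "role_history": [r["role"] for r in recs],
        }
        for key, recs in groups.items()
    }

rows = [
    {"email": "a@x.com", "company": "Acme", "role": "Manager", "last_updated": "2024-03-01"},
    {"email": "a@x.com", "company": "Acme", "role": "Analyst", "last_updated": "2023-01-15"},
]
hist = promotion_history(rows)  # role_history reads Analyst -> Manager: a promotion
```

Reading the role array in order surfaces the promotion: the contact appears as an Analyst first and a Manager later within the same company.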

Fuzzy Match on Email, First Name, and Last Name

The query below first performs fuzzy matching using the Levenshtein distance on first_name, last_name, and email to group similar contact records. It then aggregates role and company history, preserving a timeline of roles and company affiliations. Priority is assigned based on active status and marketing stack, with the most recent and strategically significant stacks weighted higher. The query captures the most recent first_name and last_name by joining back on the latest last_updated date, ensuring that the deduplicated contact retains up-to-date personal information. Finally, it ranks each grouped entry by priority score and selects only the top-ranked record for each unique email, providing a single, comprehensive, and prioritized profile for each contact. This solution consolidates contact records efficiently while retaining valuable historical context and ensuring data integrity for accurate customer engagement.

-- Step 1: Identify similar records based on fuzzy matching for first_name, last_name, and email
WITH fuzzy_matched_contacts AS (
    SELECT a.contact_id AS id_a, b.contact_id AS id_b,
           a.email AS email_a, b.email AS email_b,
           a.first_name, a.last_name,
           -- Calculate Levenshtein distances for fuzzy matching
           LEVENSHTEIN(a.first_name, b.first_name) AS first_name_distance,
           LEVENSHTEIN(a.last_name, b.last_name) AS last_name_distance,
           LEVENSHTEIN(a.email, b.email) AS email_distance
    FROM contact_list a
    JOIN contact_list b
    ON a.contact_id < b.contact_id  -- Avoid self-joins and duplicates
    WHERE LEVENSHTEIN(a.first_name, b.first_name) < 3  -- Example threshold
      AND LEVENSHTEIN(a.last_name, b.last_name) < 3
      AND LEVENSHTEIN(a.email, b.email) < 3
),

-- Step 2: Aggregate roles, company history, and calculate priority score
aggregated_contacts AS (
    SELECT a.email AS email,
           ARRAY_AGG(DISTINCT a.role ORDER BY a.last_updated) AS role_history,  -- Aggregate unique roles over time
           ARRAY_AGG(STRUCT(a.company, a.last_updated) ORDER BY a.last_updated) AS company_history,  -- Track company changes over time
           
           -- Prioritize active status, assign stack weight, and calculate recency
           MAX(CASE WHEN a.contact_status = 'Active' THEN 1 ELSE 0 END) AS is_active,
           MAX(
               CASE 
                   WHEN a.marketing_stack = 'Stack_A' THEN 4
                   WHEN a.marketing_stack = 'Stack_B' THEN 3
                   WHEN a.marketing_stack = 'Stack_C' THEN 2
                   WHEN a.marketing_stack = 'Stack_D' THEN 1
                   ELSE 0
               END * 10 - DATEDIFF(CURRENT_DATE, a.last_updated)
           ) AS priority_score,
           
           -- Use the latest `last_updated` to identify the most recent names
           MAX(a.last_updated) AS latest_last_updated
    FROM contact_list a
    JOIN fuzzy_matched_contacts fmc
    ON a.contact_id = fmc.id_a OR a.contact_id = fmc.id_b
    GROUP BY a.email
),

-- Step 3: Join the original contact list again to get the most recent names based on `latest_last_updated`
final_contacts AS (
    SELECT ac.email,
           ac.role_history,
           ac.company_history,
           ac.is_active,
           ac.priority_score,
           c.first_name AS final_first_name,
           c.last_name AS final_last_name
    FROM aggregated_contacts ac
    JOIN contact_list c ON ac.email = c.email AND ac.latest_last_updated = c.last_updated
),

-- Step 4: Select the highest-priority merged record for each email group
ranked_merged_contacts AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY priority_score DESC) AS rank
    FROM final_contacts
)

-- Final selection of deduplicated, merged records
SELECT 
    email,
    final_first_name AS first_name,
    final_last_name AS last_name,
    role_history,
    company_history,
    is_active,
    priority_score
FROM ranked_merged_contacts
WHERE rank = 1;
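For intuition about what the LEVENSHTEIN calls in Step 1 compute, here is a minimal Python sketch of the classic edit-distance algorithm and the three-field matching condition from the WHERE clause (the sample records are invented for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance: minimum inserts/deletes/substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the shorter row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (free if equal)
        prev = curr
    return prev[-1]

def is_fuzzy_match(rec_a, rec_b, threshold=3):
    """Mirror the WHERE clause: all three field distances must fall below the threshold."""
    return (levenshtein(rec_a["first_name"], rec_b["first_name"]) < threshold
            and levenshtein(rec_a["last_name"], rec_b["last_name"]) < threshold
            and levenshtein(rec_a["email"], rec_b["email"]) < threshold)

a = {"first_name": "Jon", "last_name": "Smith", "email": "jon.smith@x.com"}
b = {"first_name": "John", "last_name": "Smyth", "email": "john.smith@x.com"}
# Each field differs by one edit, so the pair groups together under threshold 3
```

The threshold of 3 used in the query is a tuning choice: lower values reduce false merges of genuinely different people, higher values catch more typos at the cost of over-grouping short names.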

There is a detailed tutorial on this method in IDR 301: Using Levenshtein Distance for Fuzzy Matching in Identity Resolution with Data Distiller.