IDR 302: Algorithmic Approaches to B2B Contacts - Unifying and Standardizing Across Sales Orgs
Learn algorithmic techniques for merging, deduplicating, and enriching B2B contact data to create unified, accurate profiles using Data Distiller
Last updated
Prerequisites
Download the following file:
Ingest the file by following this tutorial:
Also we will be using:
Case Study Overview
The case study is about a financial company that is facing significant challenges with contact records across various systems, particularly multiple instances of Salesforce, Adobe CDP, and Marketo. The challenges arise from data fragmentation, lack of standardization, duplication, and governance issues, impacting both marketing and sales teams. Here’s an in-depth look at each challenge and how it affects operations, followed by potential solutions for addressing them using SQL on a contact list dataset:
Fragmented Data Across Multiple Salesforce Instances: Contacts are stored in multiple Salesforce instances. Some users only have access to a subset of instances, resulting in incomplete data visibility. This limits the ability of users (especially in sales and customer service) to see the full history or context of a contact, impacting customer interactions and leading to missed opportunities.
Duplicate Contacts: Duplicate contacts exist within and across multiple instances, creating conflicting data points. Duplicates lead to confusion as different versions of the same contact may contain conflicting details, such as job titles, roles, and companies, which affect marketing and sales targeting.
Data Footprint in Adobe CDP: Only a small subset of users has access to the digital footprint data captured in Adobe CDP. Sales and customer-facing teams lack critical insights derived from digital interactions, reducing their ability to personalize engagements effectively.
Multiple Email Addresses Per User: Adobe CDP has identified multiple email addresses for some users, merging them into a single user record, but this data is not synchronized back to sales systems. Inconsistent email addresses create a fragmented view in sales systems, which may lead to duplicated outreach or incomplete activity history.
Disconnected Marketing Technology Stacks: Multiple marketing stacks (e.g., associated with different instances) are not integrated, preventing a cohesive enterprise-wide campaign view. Marketing messages may become redundant or disconnected as customers are targeted by individual product campaigns instead of a unified brand campaign.
Lack of Governance for Contact Creation: Users can create new contacts in core Salesforce, as long as the email is different, even if the contact exists in another instance.This results in scattered records for the same contact across systems, making it challenging to maintain an accurate, centralized customer profile.
Multiple Roles for Contacts: Contacts may hold multiple roles across organizations, which impacts the type of messaging they should receive. Inconsistent role-based communication strategies can cause confusion, as the same contact might be messaged for multiple roles within different campaigns, leading to mixed messaging.
Tracking Contact Promotions and Role Changes: There’s no enterprise-wide tracking of contacts' role changes or promotions. When a contact transitions to a new role (e.g., promotion or job change), users may inadvertently lose the contact's activity history, limiting the personalization potential for future interactions.
Difficulty with Attribution in Marketing: Contacts spread across multiple Salesforce instances make it hard to track and attribute marketing activities accurately. Inaccurate attribution data impacts budget allocation decisions, making it challenging for marketing teams to understand the effectiveness of their campaigns or optimize spending.
Dataset Strategy for Supporting Sales Team-Specific Rules and Marketing-Level Cohesion
When working with diverse sales teams, each with its unique business rules and priorities, a robust dataset strategy must balance the need for individualized algorithms with the overarching goal of enabling marketing to look across all sales organizations cohesively. This strategy ensures that the data remains harmonized at the schema level but allows for flexibility in processing and prioritizing information to suit both localized needs and enterprise-wide insights. Here’s how such a strategy can be designed:
Custom Algorithms for Each Sales Organization
Each sales team operates with specific business rules and requirements. To support this, we implement custom algorithms for processing the harmonized dataset for each sales organization. These algorithms allow for:
Tailoring how data is aggregated, deduplicated, and prioritized based on the sales team's operational focus.
Generating attributes unique to that sales organization, such as region-specific metrics or custom scoring models for leads.
Aligning with local sales strategies while ensuring the outputs conform to a standard schema for cross-organizational interoperability.
For example, Sales Org A might prioritize customer engagement metrics, while Sales Org B focuses on product affinity scores. These differences are captured in their respective datasets.
Harmonization with a Single Schema
Although the datasets for each sales organization are processed with different algorithms, the outputs adhere to a common schema. This standardized schema ensures that attributes across datasets are aligned and comparable. For instance:
Attributes like email, first_name, and purchase_history remain consistent across all datasets.
Unique business rules are applied at the processing level but do not compromise the schema's integrity.
This harmonization enables the Profile Store to ingest and manage all datasets seamlessly while retaining each sales organization's specific details.
Ingesting Datasets into the Profile Store
The datasets for each sales organization are ingested into the Profile Store, which serves as the central repository for customer data. Each dataset is preserved as an independent layer within the Profile Store, ensuring that:
Marketing and sales teams can reference the datasets individually or collectively.
Data lineage is maintained, enabling traceability of attributes back to their originating sales organization.
Dynamic Data Selection Using Merge Policies
The Profile Store enables dynamic selection of datasets through merge policies, which define how datasets are prioritized and combined:
Merge Policies by Sales Organization: Individual sales teams can specify merge policies that prioritize their dataset when creating audiences or running segmentation.
Cross-Organization Merge Policies: Marketing can define merge policies that aggregate and harmonize datasets using Data Distiller across sales organizations, providing a unified view of customer data.
For example, one merge policy might prioritize the most recent dataset from a specific sales org, while another aggregates the highest-value attributes across all sales orgs.
Flexible Segmentation and Personalization Contexts
Merge policies allow for dynamic person audience creation at a granular level. This ensures:
Sales-Specific Views: Sales teams can work within their own datasets, creating audiences and personalization rules that align with their business priorities.
Marketing-Level Insights: Marketing can look across all sales organizations by selecting merge policies that harmonize datasets, enabling segmentation and personalization at the enterprise level.
By switching merge policies, teams can seamlessly transition between localized and global perspectives without altering the underlying datasets.
Profile Snapshots for Merge Policies
To ensure operational flexibility and avoid conflicts between teams, Profile Snapshots are created for each merge policy. These snapshots capture:
The state of the unified profile dataset under a specific merge policy at a given time.
A static, reproducible dataset for segmentation, analysis, or experimentation without disrupting live datasets.
For example, marketing can generate a snapshot using a cross-organization merge policy for a campaign, while a sales org uses a snapshot of its own dataset for a regional initiative.
A Unified Yet Flexible Dataset Strategy
This strategy allows each sales organization to operate within its unique business rules while maintaining a harmonized data foundation. The use of custom algorithms ensures localized relevance, while the standardized schema and Profile Store enable enterprise-wide cohesion. Merge policies and Profile Snapshots provide the flexibility needed for segmentation and personalization at both the sales and marketing levels. This approach empowers the organization to balance tailored sales strategies with holistic marketing insights, ensuring consistent data integrity, adaptability, and alignment across the business.
Data Harmonization
Harmonization of data is critical in the above case study because fragmented and inconsistent datasets hinder the ability to gain a unified view of their customers, leading to inefficiencies, missed opportunities, and poor decision-making. When data is spread across multiple systems, such as separate Salesforce instances, discrepancies like duplicate records, conflicting contact details, and incomplete histories arise. This fragmentation not only causes confusion among teams but also impacts the accuracy of customer insights, making it challenging to deliver personalized experiences or coordinate cohesive campaigns. By harmonizing data into a single schema, organizations can resolve inconsistencies, preserve valuable historical context, and ensure that every team—from marketing to sales and customer service—has access to a complete and accurate profile for each customer. This unified approach enables better customer engagement, more targeted marketing strategies, and improved operational efficiency, ensuring data integrity and business relevance at every level.
The diagram illustrates a simple example where fragmented data from multiple Salesforce instances is integrated into a unified view by harmonizing it into a single schema, referred to as the Profile Person Dataset. Each Salesforce instance, such as Salesforce Instance 1 and Salesforce Instance 2, operates independently and often contains overlapping or duplicate records with inconsistent attributes like contact details, roles, or activity history. This fragmentation creates challenges for teams such as marketing, sales, and customer service, which require complete and accurate data to deliver personalized customer experiences.
The first step in the process involves exporting data from each Salesforce instance into corresponding datasets, labeled Dataset 1 and Dataset 2 in the diagram. These datasets reflect the unique structures and attributes of their respective source systems. However, the extracted datasets are often misaligned in terms of format, schema, and content, necessitating a transformation process to create a consistent view.
The transformation and harmonization occur in the Process stage, which is powered by Data Distiller’s ETL (Extract, Transform, Load) capabilities. During this stage, the datasets are extracted, deduplicated, and standardized into a unified schema. Advanced techniques such as fuzzy matching (e.g., using Levenshtein distance) are employed to group similar records and resolve duplicates. Business rules are applied to prioritize data, with recent and strategically significant records given higher weight. This ensures that the most relevant and accurate information is retained for each contact. The harmonized data is then loaded into a unified dataset.
The final output of this process is the Profile Person Dataset, represented by the large red box in the diagram. This dataset consolidates all contact records into a single, comprehensive schema, combining historical context, role changes, and marketing relevance. By harmonizing the data, organizations eliminate silos, ensure data integrity, and provide a complete 360-degree view of each contact. This unified dataset enables teams to engage in more effective marketing campaigns, optimize sales strategies, and improve customer service interactions.
The harmonized dataset schema looks like the following that has been supplied as the CSV file for our example:
contact_id: Unique identifier for each contact record.
instance_id: Identifies the Salesforce instance (e.g., Salesforce_A, Salesforce_B, Salesforce_C) from which the contact originates, suggesting that the same contact may appear across multiple instances.
first_name and last_name: Contact’s names, useful for identifying potential duplicates along with the email field.
email: Assumed to be the primary identifier for deduplication, as it is likely unique for each contact in real-world scenarios.
role: Job role or position of the contact; contacts may have multiple roles across different instances.
company: The company associated with each contact; may vary if the contact changes jobs.
marketing_stack: Represents the marketing technology stack; certain stacks may be prioritized for marketing purposes.
product: Products associated with the contact, potentially reflecting products they interact with or oversee.
data_source: Source system providing the contact data (e.g., Salesforce, Adobe CDP, Marketo).
last_updated: The last date the contact record was updated, useful for identifying the most recent information.
contact_status: Indicates whether the contact is Active or Inactive.
The data looks like the following after harmonization:
Assumptions for Deduplication and Prioritization
Email as Primary Identifier: email will be treated as the primary key for deduplication. If multiple records share the same email, they represent the same contact, albeit with potential differences in roles, stacks, or companies.
Recent Updates Preferred: The most recent information is more likely to be accurate, so records with a more recent last_updated date will be prioritized.
Stack Priority:
Suppose we assign weights to stacks: Stack_A > Stack_B > Stack_C > Stack_D. This weighting could reflect strategic priorities in marketing or sales, such as favoring contacts associated with certain technologies or markets.
Role Relevance: The role field is important for targeting, especially if the contact holds a decision-making role. When a contact has multiple roles across records, we can aggregate these roles in a historical sequence.
Active vs. Inactive Status: Records marked as Active are prioritized for deduplication and analysis, while Inactive contacts can be retained for reference or historical purposes but not for active engagement.
Let us first execute a basic exploratory query:
SELECT*FROM contact_list;
The above shows some patterns as we would expect:
Potential Duplicates: Since contacts can exist across multiple instances (instance_id) and may have similar first_name, last_name, or email, there is a possibility of duplicate entries that need to be merged.
Multiple Roles and History: Contacts may have different roles or companies over time, which necessitates tracking role and company history for a complete profile.
Prioritization Opportunity: With the marketing_stack, last_updated, and contact_status fields, prioritization can be applied to keep the most relevant and recent records for each contact.
Identify and Remove Duplicates Based on Prioritization and Ranking with Email as Deduplication Dimension
The prioritization and ranking process for contact deduplication leverages a composite scoring method to ensure that the most relevant and recent information is retained for each unique email. This method assigns a priority_score to each record by combining two key factors: stack weight and recency. Marketing stacks are assigned weights based on importance, with higher values for preferred stacks (e.g., Stack_A > Stack_B > others), indicating their strategic relevance. Additionally, records are prioritized by last_updated date, where more recent records receive a higher score. By calculating a priority_score as a function of these weights, each contact's records are ranked within their email group. The record with the highest score is selected as the primary entry for each email, thus retaining the most critical and timely data while removing redundant or lower-priority entries. This process ensures that data integrity and business relevance are maximized across the contact list.
-- Prioritize and deduplicate contacts based on stack weight and recencyWITH scored_contacts AS (SELECT*,-- Assign weights based on marketing_stackCASEWHEN marketing_stack ='Stack_A'THEN4WHEN marketing_stack ='Stack_B'THEN3WHEN marketing_stack ='Stack_C'THEN2WHEN marketing_stack ='Stack_D'THEN1ELSE0-- Default weight for unknown stacksENDAS stack_weight,-- Calculate recency weight (inverted so that more recent dates are higher)DATEDIFF(CURRENT_DATE, last_updated) AS recency_weight,-- Calculate the priority score: combine stack weight and recency (CASEWHEN marketing_stack ='Stack_A'THEN4WHEN marketing_stack ='Stack_B'THEN3WHEN marketing_stack ='Stack_C'THEN2WHEN marketing_stack ='Stack_D'THEN1ELSE0END*10) -DATEDIFF(CURRENT_DATE, last_updated) AS priority_scoreFROM contact_list),ranked_contacts AS (-- Rank records by priority score within each email groupSELECT*,ROW_NUMBER() OVER (PARTITIONBY email ORDER BY priority_score DESC) AS rankFROM scored_contacts)-- Select the top-ranked record for each emailSELECT*FROM ranked_contactsWHERE rank =1;
Resolve Multiple Roles
Contacts may have multiple roles in different companies or contexts. We can aggregate the roles associated with each contact's email and then apply the deduplication algorithm from the previous section.
-- Step 1: Aggregate roles for each email addressWITH role_aggregated_contacts AS (SELECT email, ARRAY_AGG(role) AS role_history -- Aggregate roles in an array for each emailFROM contact_listGROUP BY email),-- Step 2: Join aggregated roles back to original contacts and calculate priority scorescored_contacts AS (SELECT c.*, ra.role_history,-- Assign weights based on marketing_stackCASEWHEN c.marketing_stack ='Stack_A'THEN4WHEN c.marketing_stack ='Stack_B'THEN3WHEN c.marketing_stack ='Stack_C'THEN2WHEN c.marketing_stack ='Stack_D'THEN1ELSE0-- Default weight for unknown stacksENDAS stack_weight,-- Calculate recency weight (inverted so that more recent dates are higher)DATEDIFF(CURRENT_DATE, c.last_updated) AS recency_weight,-- Calculate the priority score: combine stack weight and recency (CASEWHEN c.marketing_stack ='Stack_A'THEN4WHEN c.marketing_stack ='Stack_B'THEN3WHEN c.marketing_stack ='Stack_C'THEN2WHEN c.marketing_stack ='Stack_D'THEN1ELSE0END*10) -DATEDIFF(CURRENT_DATE, c.last_updated) AS priority_scoreFROM contact_list AS cJOIN role_aggregated_contacts AS raON c.email = ra.email),-- Step 3: Rank and select top-ranked record for each emailranked_contacts AS (SELECT*,ROW_NUMBER() OVER (PARTITIONBY email ORDER BY priority_score DESC) AS rankFROM scored_contacts)-- Select the top-ranked record for each emailSELECT*FROM ranked_contactsWHERE rank =1;
The ARRAY_AGGfunction consolidates all roles associated with each unique email into a single array (role_history) before deduplication, ensuring that no role information is lost. This aggregated array captures a comprehensive role history, allowing us to keep all roles linked to a contact, even if they appeared in different records. By using ARRAY_AGG early in the process, we retain the full context of each contact’s various roles in one place, enabling deeper insights and use cases like historical tracking or personalization, while streamlining the dataset to one deduplicated record per email with all relevant roles intact.
Prioritize Active Contacts
To prioritize active contact records while retaining role history and deprioritizing inactive records, we can adjust the algorithm by modifying the priority_score calculation to give preference to active records. This ensures that active records are ranked higher, while inactive ones are only retained if there are no active records available for the same email.
-- Step 1: Aggregate roles for each email addressWITH role_aggregated_contacts AS (SELECT email, ARRAY_AGG(role) AS role_history -- Aggregate roles in an array for each emailFROM contact_listGROUP BY email),-- Step 2: Join aggregated roles back to original contacts and calculate priority score with active status prioritizedscored_contacts AS (SELECT c.*, ra.role_history,-- Assign weights based on marketing_stackCASEWHEN c.marketing_stack ='Stack_A'THEN4WHEN c.marketing_stack ='Stack_B'THEN3WHEN c.marketing_stack ='Stack_C'THEN2WHEN c.marketing_stack ='Stack_D'THEN1ELSE0-- Default weight for unknown stacksENDAS stack_weight,-- Calculate recency weight (inverted so that more recent dates are higher)DATEDIFF(CURRENT_DATE, c.last_updated) AS recency_weight,-- Calculate the priority score with additional weight for active status ((CASEWHEN c.marketing_stack ='Stack_A'THEN4WHEN c.marketing_stack ='Stack_B'THEN3WHEN c.marketing_stack ='Stack_C'THEN2WHEN c.marketing_stack ='Stack_D'THEN1ELSE0END*10) -DATEDIFF(CURRENT_DATE, c.last_updated)) +CASEWHEN c.contact_status ='Active'THEN100-- Add a high weight for active contactsELSE0ENDAS priority_scoreFROM contact_list AS cJOIN role_aggregated_contacts AS raON c.email = ra.email),-- Step 3: Rank and select top-ranked record for each emailranked_contacts AS (SELECT*,ROW_NUMBER() OVER (PARTITIONBY email ORDER BY priority_score DESC) AS rankFROM scored_contacts)-- Select the top-ranked record for each emailSELECT*FROM ranked_contactsWHERE rank =1;
Handle Contact Promotion and Track History
To track contact promotions and role changes before deduplication, we can create a history array that captures both the role and the timestamp (last_updated) for each change. This allows us to maintain a historical view of each contact's roles and companies over time. We’ll use ARRAY_AGG to create this array with a combination of role, company, and last_updated for each unique email.
-- Identify contact promotions with role history in chronological orderWITH ordered_contacts AS (SELECT email, first_name, last_name, company, role, last_updatedFROM contact_listORDER BY email, last_updated ASC)SELECT email, first_name, last_name, company,MIN(last_updated) AS first_seen,MAX(last_updated) AS last_seen, ARRAY_AGG(role) AS role_history -- Collect roles in already sorted orderFROM ordered_contactsGROUP BY email, first_name, last_name, company;
Fuzzy Match on email, First Name and Last Name
There is a detailed tutorial on the this method here.
The query below first performs fuzzy matching using the Levenshtein distance on first_name, last_name, and emailto group similar contact records. It then aggregates role andcompany history, preserving a timeline of roles and company affiliations. Priority is assigned based on active status and marketing stack, with the most recent and strategically significant stacks weighted higher. The query captures the most recent first_name and last_name by joining back on the latest last_updated date, ensuring that the deduplicated contact retains up-to-date personal information. Finally, it ranks each grouped entry by priority score and selects only the top-ranked record for each unique email, providing a single, comprehensive, and prioritized profile for each contact. This solution consolidates contact records efficiently while retaining valuable historical context and ensuring data integrity for accurate customer engagement.
-- Step 1: Identify similar records based on fuzzy matching for first_name, last_name, and emailWITH fuzzy_matched_contacts AS (SELECT a.contact_id AS id_a, b.contact_id AS id_b, a.email AS email_a, b.email AS email_b, a.first_name, a.last_name,-- Calculate Levenshtein distances for fuzzy matching LEVENSHTEIN(a.first_name, b.first_name) AS first_name_distance, LEVENSHTEIN(a.last_name, b.last_name) AS last_name_distance, LEVENSHTEIN(a.email, b.email) AS email_distanceFROM contact_list aJOIN contact_list bON a.contact_id < b.contact_id -- Avoid self-joins and duplicatesWHERE LEVENSHTEIN(a.first_name, b.first_name) <3-- Example thresholdAND LEVENSHTEIN(a.last_name, b.last_name) <3AND LEVENSHTEIN(a.email, b.email) <3),-- Step 2: Aggregate roles, company history, and calculate priority scoreaggregated_contacts AS (SELECT a.email AS email, ARRAY_AGG(DISTINCT a.role ORDER BY a.last_updated) AS role_history, -- Aggregate unique roles over time ARRAY_AGG(STRUCT(a.company, a.last_updated) ORDER BY a.last_updated) AS company_history, -- Track company changes over time-- Prioritize active status, assign stack weight, and calculate recencyMAX(CASEWHEN a.contact_status ='Active'THEN1ELSE0END) AS is_active,MAX(CASEWHEN a.marketing_stack ='Stack_A'THEN4WHEN a.marketing_stack ='Stack_B'THEN3WHEN a.marketing_stack ='Stack_C'THEN2WHEN a.marketing_stack ='Stack_D'THEN1ELSE0END*10-DATEDIFF(CURRENT_DATE, a.last_updated) ) AS priority_score,-- Use the latest `last_updated` to identify the most recent namesMAX(a.last_updated) AS latest_last_updatedFROM contact_list aJOIN fuzzy_matched_contacts fmcON a.contact_id = fmc.id_a OR a.contact_id = fmc.id_bGROUP BY a.email),-- Step 3: Join the original contact list again to get the most recent names based on `latest_last_updated`final_contacts AS (SELECT ac.email, ac.role_history, ac.company_history, ac.is_active, ac.priority_score, c.first_name AS final_first_name, c.last_name AS final_last_nameFROM aggregated_contacts acJOIN contact_list c ON ac.email = c.email AND ac.latest_last_updated = c.last_updated),-- Step 4: Select the highest-priority merged record for each email groupranked_merged_contacts AS (SELECT*,ROW_NUMBER() OVER (PARTITIONBY email ORDER BY priority_score DESC) AS rankFROM final_contacts)-- Final selection of deduplicated, merged recordsSELECT email, final_first_name AS first_name, final_last_name AS last_name, role_history, company_history, is_active, priority_scoreFROM ranked_merged_contactsWHERE rank =1;