Adobe Data Distiller Guide
  • Adobe Data Distiller Guide
  • What is Data Distiller?
  • UNIT 1: GETTING STARTED
    • PREP 100: Why was Data Distiller Built?
    • PREP 200: Data Distiller Use Case & Capability Matrix Guide
    • PREP 300: Adobe Experience Platform & Data Distiller Primers
    • PREP 301: Leveraging Data Loops for Real-Time Personalization
    • PREP 302: Key Topics Overview: Architecture, MDM, Personas
    • PREP 303: What is Data Distiller Business Intelligence?
    • PREP 304: The Human Element in Customer Experience Management
    • PREP 305: Driving Transformation in Customer Experience: Leadership Lessons Inspired by Lee Iacocca
    • PREP 400: DBVisualizer SQL Editor Setup for Data Distiller
  • PREP 500: Ingesting CSV Data into Adobe Experience Platform
  • PREP 501: Ingesting JSON Test Data into Adobe Experience Platform
  • PREP 600: Rules vs. AI with Data Distiller: When to Apply, When to Rely, Let ROI Decide
  • Prep 601: Breaking Down B2B Data Silos: Transform Marketing, Sales & Customer Success into a Revenue
  • Unit 2: DATA DISTILLER DATA EXPLORATION
    • EXPLORE 100: Data Lake Overview
    • EXPLORE 101: Exploring Ingested Batches in a Dataset with Data Distiller
    • EXPLORE 200: Exploring Behavioral Data with Data Distiller - A Case Study with Adobe Analytics Data
    • EXPLORE 201: Exploring Web Analytics Data with Data Distiller
    • EXPLORE 202: Exploring Product Analytics with Data Distiller
    • EXPLORE 300: Exploring Adobe Journey Optimizer System Datasets with Data Distiller
    • EXPLORE 400: Exploring Offer Decisioning Datasets with Data Distiller
    • EXPLORE 500: Incremental Data Extraction with Data Distiller Cursors
  • UNIT 3: DATA DISTILLER ETL (EXTRACT, TRANSFORM, LOAD)
    • ETL 200: Chaining of Data Distiller Jobs
    • ETL 300: Incremental Processing Using Checkpoint Tables in Data Distiller
    • [DRAFT]ETL 400: Attribute-Level Change Detection in Profile Snapshot Data
  • UNIT 4: DATA DISTILLER DATA ENRICHMENT
    • ENRICH 100: Real-Time Customer Profile Overview
    • ENRICH 101: Behavior-Based Personalization with Data Distiller: A Movie Genre Case Study
    • ENRICH 200: Decile-Based Audiences with Data Distiller
    • ENRICH 300: Recency, Frequency, Monetary (RFM) Modeling for Personalization with Data Distiller
    • ENRICH 400: Net Promoter Scores (NPS) for Enhanced Customer Satisfaction with Data Distiller
  • Unit 5: DATA DISTILLER IDENTITY RESOLUTION
    • IDR 100: Identity Graph Overview
    • IDR 200: Extracting Identity Graph from Profile Attribute Snapshot Data with Data Distiller
    • IDR 300: Understanding and Mitigating Profile Collapse in Identity Resolution with Data Distiller
    • IDR 301: Using Levenshtein Distance for Fuzzy Matching in Identity Resolution with Data Distiller
    • IDR 302: Algorithmic Approaches to B2B Contacts - Unifying and Standardizing Across Sales Orgs
  • Unit 6: DATA DISTILLER AUDIENCES
    • DDA 100: Audiences Overview
    • DDA 200: Build Data Distiller Audiences on Data Lake Using SQL
    • DDA 300: Audience Overlaps with Data Distiller
  • Unit 7: DATA DISTILLER BUSINESS INTELLIGENCE
    • BI 100: Data Distiller Business Intelligence: A Complete Feature Overview
    • BI 200: Create Your First Data Model in the Data Distiller Warehouse for Dashboarding
    • BI 300: Dashboard Authoring with Data Distiller Query Pro Mode
    • BI 400: Subscription Analytics for Growth-Focused Products using Data Distiller
    • BI 500: Optimizing Omnichannel Marketing Spend Using Marginal Return Analysis
  • Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING
    • STATSML 100: Python & JupyterLab Setup for Data Distiller
    • STATSML 101: Learn Basic Python Online
    • STATSML 200: Unlock Dataset Metadata Insights via Adobe Experience Platform APIs and Python
    • STATSML 201: Securing Data Distiller Access with Robust IP Whitelisting
    • STATSML 300: AI & Machine Learning: Basic Concepts for Data Distiller Users
    • STATSML 301: A Concept Course on Language Models
    • STATSML 302: A Concept Course on Feature Engineering Techniques for Machine Learning
    • STATSML 400: Data Distiller Basic Statistics Functions
    • STATSML 500: Generative SQL with Microsoft GitHub Copilot, Visual Studio Code and Data Distiller
    • STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models
    • STATSML 601: Building a Period-to-Period Customer Retention Model Using Logistics Regression
    • STATSML 602: Techniques for Bot Detection in Data Distiller
    • STATSML 603: Predicting Customer Conversion Scores Using Random Forest in Data Distiller
    • STATSML 604: Car Loan Propensity Prediction using Logistic Regression
    • STATSML 700: Sentiment-Aware Product Review Search with Retrieval Augmented Generation (RAG)
    • STATSML 800: Turbocharging Insights with Data Distiller: A Hypercube Approach to Big Data Analytics
  • UNIT 9: DATA DISTILLER ACTIVATION & DATA EXPORT
    • ACT 100: Dataset Activation with Data Distiller
    • ACT 200: Dataset Activation: Anonymization, Masking & Differential Privacy Techniques
    • ACT 300: Functions and Techniques for Handling Sensitive Data with Data Distiller
    • ACT 400: AES Data Encryption & Decryption with Data Distiller
  • UNIT 9: DATA DISTILLER FUNCTIONS & EXTENSIONS
    • FUNC 300: Privacy Functions in Data Distiller
    • FUNC 400: Statistics Functions in Data Distiller
    • FUNC 500: Lambda Functions in Data Distiller: Exploring Similarity Joins
    • FUNC 600: Advanced Statistics & Machine Learning Functions
  • About the Authors
Powered by GitBook
On this page
  • Prerequisites
  • Scenario
  • Exploring the Map Structure using Data Distiller: The Identity Map
  • Extract Identities from the Map Structure to Create an Identity Lookup Table
  1. Unit 5: DATA DISTILLER IDENTITY RESOLUTION

IDR 200: Extracting Identity Graph from Profile Attribute Snapshot Data with Data Distiller

An identity lookup table is a database table used to store identities associated with various identity namespaces in the Real-Time Customer Profile.

Last updated 8 months ago

Prerequisites

You need to get familiar with Profile attribute snapshot dataset explorations. Please complete or browse the following sections before proceeding:

DDA 300: Audience Overlaps with Data Distiller

Scenario

The profile attribute snapshot dataset is a daily export from the Real-Time Customer Profile. This dataset also contains the identities for the profile data. The identities are encapsulated in a map data structure with identity types (called namespaces in AEP parlance) and identity values. There can be multiple identity types (email, cookie, cross-device such as CRM, etc.) and there can be multiple values within each. You are tasked with transforming this data in the map structure to a relational structure that can function as a lookup table. This lookup table will serve as the foundation of all analytics you will do on Real-Time Customer Profile data and beyond. Whether it is Customer 360, SQL Traits, or ML workflows, this lookup table will be invaluable for those use cases.

Exploring the Map Structure using Data Distiller: The Identity Map

Before you extract the identity information, you need to understand the map structure. Use the to_json feature in Data Distiller to get in place information about the schema structure without having to go to the Schemas UI screen. This is an extremely important feature with deeply nested data structures and maps.

SELECT to_json(identityMap) FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

If you execute the following, you will see the following:

What you see above is the identity map structure. The map has the identity namespace (email, crmid) as the unique key to index the values. Note that this is an array of values. If you need to extract the identities, you will need to use the following code:

SELECT identityMap.email.id FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

The results look like the following:

Typically we would explode this array

SELECT explode(identityMap.email.id) FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

identityMap.email.id gives you the array of email identity values and not just an email identity value and these arrays are great for explode functions.

But there is a problem - plain vanilla "explode" will get rid of rows that do not have any identity values. That is a problem because the absence of an identity is a signal and is possibly associated with other identities that the person may have. Let us fix the issue by using explode_outer from Data Distiller:

SELECT explode_outer(identityMap.email.id) AS email  FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

Extract Identities from the Map Structure to Create an Identity Lookup Table

But just doing an explode_outer on a single identity namespace is of no use. We have destroyed the association between identities within that same namespace. Let us generate a synthetic UUID for each profile row to keep track of our identities as we explode them. If we do this right, we will generate a table of UUIDs and related identities as a columnar table that gives us leverage to use it for a variety of operations.

If we take the first element of the email namespace and concatenate that with the first element of the crmid namespace, we are guaranteed a unique identifier. If not, it would mean that two rows in the profile export dataset have that identity in common. The identity stitching in Real-Time Customer Profile should have taken care of it and merged it into a single row.

Let us now generate the unique UUIDs along with the schema-friendly identity map structure

SELECT concat(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,to_json(identityMap) from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

The results will look like:

But concat string function does a poor job of concatenating NULL values. We have to remedy that with COALESCE:

SELECT COALESCE(identityMap.email.id[0],identityMap.crmid.id[0]) as unique_id,to_json(identityMap) from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

This works and you will get:

If there are two identity namespaces, then the explode_outer operator works on each, one at a time. Make sure you remove the to_json as we do not need it anymore:

SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM
(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)

You will get:

We need to explode this one more time for crmid and we should get:

SELECT unique_id, email, explode_outer(crmid) as crmid FROM (
SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM
(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)

The results would be:

Using this table, I can do some very interesting analysis. For example, we can look at the histogram of email IDs in the system:

SELECT bucket, count(unique_id) FROM(
SELECT unique_id, count(CASE WHEN email IS NULL THEN 0 ELSE email END) as bucket FROM(
SELECT unique_id, email, explode_outer(crmid) as crmid FROM (
SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM
(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
))
GROUP BY unique_id 
ORDER BY bucket ASC)
GROUP BY bucket

The answer looks like this:

Why are we seeing 976 as a bucket and 0 counts for unique_ids? The sum of 8225 and 976 adds up to the number of profiles. Why did we even get this result in the first place? Is there something wrong with the way the SQL is written or is there something more profound happening in the identity graph?

Still cannot figure it out? What if the email and crmid identities were non-existent? You have null rows in the table that you should clean before you do any analytics. For the identity lookup use case which is the goal of this section, it would not matter,

Identity maps structure
Array of identity values
Exploding the array helps you extract the identities.
explode_outer retains the nulls for the identity namespaces
Generate the unique ID without llosing identity associations.
COALESCE function retrieves the first non-zero value in a list and is a good choice for a unique ID for the profile in our example.
email identities have been separated into separate rows without breaking profile association.
Identity lookup table in relational form
Page cover image