IDR 200: Extracting Identity Graph from Profile Attribute Snapshot Data with Data Distiller
An identity lookup table is a database table used to store identities associated with various identity namespaces in the Real-Time Customer Profile.
Last updated
Prerequisites
You need to get familiar with Profile attribute snapshot dataset explorations. Please complete or browse the following sections before proceeding:
Scenario
The profile attribute snapshot dataset is a daily export from the Real-Time Customer Profile. This dataset also contains the identities for the profile data. The identities are encapsulated in a map data structure with identity types (called namespaces in AEP parlance) and identity values. There can be multiple identity types (email, cookie, cross-device such as CRM, etc.) and there can be multiple values within each. You are tasked with transforming this data in the map structure to a relational structure that can function as a lookup table. This lookup table will serve as the foundation of all analytics you will do on Real-Time Customer Profile data and beyond. Whether it is Customer 360, SQL Traits, or ML workflows, this lookup table will be invaluable for those use cases.
Exploring the Map Structure using Data Distiller: The Identity Map
Before you extract the identity information, you need to understand the map structure. Use the to_json feature in Data Distiller to get in place information about the schema structure without having to go to the Schemas UI screen. This is an extremely important feature with deeply nested data structures and maps.
SELECT to_json(identityMap) FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;
If you execute the following, you will see the following:
What you see above is the identity map structure. The map has the identity namespace (email, crmid) as the unique key to index the values. Note that this is an array of values. If you need to extract the identities, you will need to use the following code:
SELECT identityMap.email.id FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;
The results look like the following:
Typically we would explode this array
SELECT explode(identityMap.email.id) FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;
identityMap.email.id gives you the array of email identity values and not just an email identity value and these arrays are great for explode functions.
But there is a problem - plain vanilla "explode" will get rid of rows that do not have any identity values. That is a problem because the absence of an identity is a signal and is possibly associated with other identities that the person may have. Let us fix the issue by using explode_outer from Data Distiller:
SELECT explode_outer(identityMap.email.id) AS email FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
Extract Identities from the Map Structure to Create an Identity Lookup Table
But just doing an explode_outer on a single identity namespace is of no use. We have destroyed the association between identities within that same namespace. Let us generate a synthetic UUID for each profile row to keep track of our identities as we explode them. If we do this right, we will generate a table of UUIDs and related identities as a columnar table that gives us leverage to use it for a variety of operations.
If we take the first element of the email namespace and concatenate that with the first element of the crmid namespace, we are guaranteed a unique identifier. If not, it would mean that two rows in the profile export dataset have that identity in common. The identity stitching in Real-Time Customer Profile should have taken care of it and merged it into a single row.
Let us now generate the unique UUIDs along with the schema-friendly identity map structure
SELECT concat(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,to_json(identityMap) from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
The results will look like:
But concatstring function does a poor job of concatenating NULL values. We have to remedy that with COALESCE:
SELECT COALESCE(identityMap.email.id[0],identityMap.crmid.id[0]) as unique_id,to_json(identityMap) from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
This works and you will get:
If there are two identity namespaces, then the explode_outeroperator works on each, one at a time. Make sure you remove the to_jsonas we do not need it anymore:
SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
You will get:
We need to explode this one more time forcrmid and we should get:
SELECT unique_id, email, explode_outer(crmid) as crmid FROM (SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
)
The results would be:
Using this table, I can do some very interesting analysis. For example, we can look at the histogram of email IDs in the system:
SELECT bucket, count(unique_id) FROM(SELECT unique_id, count(CASEWHEN email ISNULLTHEN0ELSE email END) as bucket FROM(SELECT unique_id, email, explode_outer(crmid) as crmid FROM (SELECT unique_id, explode_outer(identityMap.email.id) AS email, identityMap.crmid.id AS crmid FROM(SELECT coalesce(identityMap.email.id[0],identityMap.crmid.id[0]) AS unique_id,identityMap from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903)
))GROUP BY unique_id ORDER BY bucket ASC)GROUP BY bucket
The answer looks like this:
Why are we seeing 976 as a bucket and 0 counts for unique_ids? The sum of 8225 and 976 adds up to the number of profiles. Why did we even get this result in the first place? Is there something wrong with the way the SQL is written or is there something more profound happening in the identity graph?
Still cannot figure it out? What if the email and crmid identities were non-existent? You have null rows in the table that you should clean before you do any analytics. For the identity lookup use case which is the goal of this section, it would not matter,