ID 101: Channel Identity Lookup Table from Profile Attribute Snapshot Data
An identity lookup table is a database table used to store identities associated with various identity namespaces in the Real-Time Customer Profile.
Prerequisites
You need to get familiar with Profile attribute snapshot dataset explorations. Please complete or browse the following sections before proceeding:
pageAUD 101: Build Segment Overlaps using Profile Attribute Snapshot DatasetsScenario
The profile attribute snapshot dataset is a daily export from the Real-Time Customer Profile. This dataset also contains the identities for the profile data. The identities are encapsulated in a map data structure with identity types (called namespaces in AEP parlance) and identity values. There can be multiple identity types (email, cookie, cross-device such as CRM, etc.) and there can be multiple values within each. You are tasked with transforming this data in the map structure to a relational structure that can function as a lookup table. This lookup table will serve as the foundation of all analytics you will do on Real-Time Customer Profile data and beyond. Whether it is Customer 360, SQL Traits, or ML workflows, this lookup table will be invaluable for those use cases.
Exploring the Map Structure using Data Distiller: The Identity Map
Before you extract the identity information, you need to understand the map structure. Use the to_json feature in Data Distiller to get in place information about the schema structure without having to go to the Schemas UI screen. This is an extremely important feature with deeply nested data structures and maps.
If you execute the following, you will see the following:
What you see above is the identity map structure. The map has the identity namespace (email, crmid
) as the unique key to index the values. Note that this is an array of values. If you need to extract the identities, you will need to use the following code:
The results look like the following:
Typically we would explode this array
identityMap.email.id
gives you the array of email identity values and not just an email identity value and these arrays are great for explode functions.
But there is a problem - plain vanilla "explode" will get rid of rows that do not have any identity values. That is a problem because the absence of an identity is a signal and is possibly associated with other identities that the person may have. Let us fix the issue by using explode_outer from Data Distiller:
Extract Identities from the Map Structure to Create an Identity Lookup Table
But just doing an explode_outer
on a single identity namespace is of no use. We have destroyed the association between identities within that same namespace. Let us generate a synthetic UUID for each profile row to keep track of our identities as we explode them. If we do this right, we will generate a table of UUIDs and related identities as a columnar table that gives us leverage to use it for a variety of operations.
If we take the first element of the email namespace and concatenate that with the first element of the crmid
namespace, we are guaranteed a unique identifier. If not, it would mean that two rows in the profile export dataset have that identity in common. The identity stitching in Real-Time Customer Profile should have taken care of it and merged it into a single row.
Let us now generate the unique UUIDs along with the schema-friendly identity map structure
The results will look like:
But concat
string function does a poor job of concatenating NULL values. We have to remedy that with COALESCE
:
This works and you will get:
If there are two identity namespaces, then the explode_outer
operator works on each, one at a time. Make sure you remove the to_json
as we do not need it anymore:
You will get:
We need to explode this one more time for crmid
and we should get:
The results would be:
Using this table, I can do some very interesting analysis. For example, we can look at the histogram of email
IDs in the system:
The answer looks like this:
Why are we seeing 976 as a bucket and 0 counts for unique_ids? The sum of 8225 and 976 adds up to the number of profiles. Why did we even get this result in the first place? Is there something wrong with the way the SQL is written or is there something more profound happening in the identity graph?
Still cannot figure it out? What if the email and crmid identities were non-existent? You have null rows in the table that you should clean before you do any analytics. For the identity lookup use case which is the goal of this section, it would not matter,
Last updated