ENRICH 200: Decile-Based Audiences with Data Distiller

Bucketing is a technique used by marketers to split their audience along a dimension and use that to fine-tune the targeting.

Why Enrich the Real-Time Customer Profile?

Let us take a step back and understand the building blocks of being able to personalize or even deliver a plain vanilla experience:

Data gathering: You need a mechanism to collect the data about the customer from as many channels as possible.
Identity resolution: You will need to resolve the identities across the channel data so that you can make sense of the Profile.
Segmentation: Queries that group profiles based on various conditions.
Activation: Send the qualified profiles out as soon as possible with the appropriate metadata for personalization whenever applicable.

The data that you gather will contain attributes, time-stamped behaviors, and pre-existing segment memberships (possibly from another system). Raw behavioral data mostly constitutes up to 99% of all data that you will gather. If you pump this into any database whether it be a warehouse or a NoSQL database, your segmentation queries will overwhelm the system. If it does not overwhelm the system, be ready for a fat bill from the vendor.

To address this, we need a strategy to architect a trade-off: real-time computation vs. cost.

Real-Time Segmentation: The Real-Time Customer Profile store is a NoSQL database that is optimized for near real-time segmentation on attributes, behaviors, and segment memberships. Real-time segmentation implies that the conditions required for grouping the profiles are simple enough to be evaluated fast. Most of these conditions are with short time frames involving counts of events that occurred and attributes. At the minimum, for the real-time segmentation path to work, we need to make sure that those events are available within the database.
Batch Segmentation: For more complex queries, most real-time systems will compute these offline or in batch. Batch segmentation happens in the Real-Time Customer Profile on a daily basis. The same applies to most warehouse implementations as well. We could pre-compute the micro-conditions in the batch segmentation logic as SQL-based attributes and just feed these attributes to the batch-processing segmentation engine. By doing so, we have reduced the size of the data that we are pumping into the database thereby lowering our costs.
1. Batch Processing on Database: This technique is very common in the industry with vendors using terms such as computed attributes, calculated traits, SQL traits, etc. However, most vendors require the computation of these traits on the database itself thereby increasing the costs. Warehousing engines or even NoSQL databases are just not built for batch processing scale on the behavioral data that you will encounter in the domain of a CDP.
2. Batch Processing on Data Lake: Instead of using the compute resources of the database (warehouse or otherwise) which are expensive for complex queries, we are using the compute resources of the Data Distiller's batch processing engine on a data lake to reduce the cost by an order of magnitude. We can use our savings to compute newer kinds of attributes that can further give us more options for segmentation. As we are developing these newer attributes, we can work closely with the data science team to design profile features as well.

Scenario

We will be using Data Distiller to generate yearly purchase values for profiles in the Real-Time Customer Profile. We will use that information to segment this population into ten buckets and then bring that information back into the Profile for segmentation and targeting. Also, by creating such computational attributes or SQL traits, you are compressing the pattern of behavior into a derived characteristic of the individual thus reducing the need to store all of the behavioral data in Real-Time Customer Profile. The complex computation encapsulates the essence of the behavior which is also easy for a marketer to grasp and use.

Prerequisites

You need to have Adobe Real-Time CDP set up and operating so that you can execute and access the example. The example relies on data generated by the Real-Time Customer Profile.

You will also need to make sure you have completed this section before proceeding further. At the very least, you should be familiar with how Profile Attribute snapshot datasets work.

Generate a Randomized Yearly Purchases Dataset

First, we will extract the email and CRM identities from all of the identity maps. We will be using this as the key for our random dataset:

SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] 
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903

You will get a result that looks like the following:

If you do not have access to a profile snapshot dataset, you can use the dummy data in the CSV file below as a substitute for the table above:

Your queries will change and look simpler. You just need to replace the code fragment that we did above with:

SELECT email, crmid FROM identity_data

If you want a tutorial on how to ingest CSV data, please consult this example:

Let us generate the randomized yearly purchase values

SELECT email, crmid, round(10000*abs(randn(1))) AS YearlyPurchase_Dollars 
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] 
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903);

The results will be:

Let us carry out some basic cleaning operations to remove null identities (email and crmid) in the dataset.

SELECT * FROM (SELECT email, crmid, round(10000*abs(randn(1))) AS YearlyPurchase_Dollars 
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] 
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) 
WHERE email !='') 
WHERE crmid !='';

The results will be:

Create the Decile Buckets

We need to use the NTILE window function that lets you split the yearly purchases attribute after sorting them into 10 equal-sized buckets and adding this bucket number as a new attribute. I cvan change this to any number of buckets I want.

SELECT *, NTILE(10) 
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, round(10000*abs(randn(1))) AS YearlyPurchase_Dollars 
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] 
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) 
WHERE email !='') 
WHERE crmid !='');

Note that we did not use the partition dimension clause which is all but a grouping dimension to split/partition the dataset and apply the NTILE logic on each of the partitions. In our case, we have a single dataset and no grouping dimensions such as location. If used such a partitioning dimension such as location, then the decile computation would be done for each partition.

The results are:

Let us verify that the decile bucket logic is working as designed. Let us first find the total number of records:

SELECT DISTINCT COUNT(*) FROM
(SELECT *, NTILE(10) 
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, round(10000*abs(randn(1))) AS YearlyPurchase_Dollars 
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] 
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) 
WHERE email !='') 
WHERE crmid !='')
);

The result will be 6000 records. Let me count the number of records per decile bucket and also find the minimum and maximum values for the yearly purchase data for each of the buckets.

SELECT decile_bucket, COUNT(decile_bucket), MIN(YearlyPurchase_Dollars),MAX(YearlyPurchase_Dollars) FROM
(SELECT *, NTILE(10) 
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, round(10000*abs(randn(1))) AS YearlyPurchase_Dollars 
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] 
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) 
WHERE email !='') 
WHERE crmid !='')
)
GROUP BY decile_bucket ORDER BY decile_bucket ASC ;

Here are the results of that query:

If I could figure out a way to ingest this attribute data into a Real-Time Customer Profile, the minimum and maximum values of the thresholds give me enough flexibility to define an audience that stacks up to a maximum size of 6000 members. If I use the yearly purchase conditions from 4.0 to 6820.0, I should get 5x600=3000 members. So, by using this decile technique, I now have full control of the reach while monitoring for the focus targeting via the yearly purchase dimension. Therefore, with a single attribute dimension, the focus and reach of your campaign are inversely proportional. Also, note that the decile buckets are labeled as numbers. It pays to sit down with marketing and define more intuitive names for these buckets that they all rally around.

Enrich Real-Time Customer Profile with Derived Fields

A Data Distiller Derived Attribute is a field in a dataset that is not directly observed or collected but is created or computed from existing data. The derived attribute values are typically generated through transformations, calculations, or by combining multiple existing field values to offer new insights or improve model performance. In some cases, derived attributes are simple pass-throughs of existing fields, where no transformation or calculation is needed. These fields retain the original values but are reorganized to fit specific analytical or modeling purposes.

The dataset we want to create is an attribute dataset for Real-Time Customer Profiles. To make this happen, we will need to create a custom schema of Individual Profile Schema type and add the following custom fields as shown below. Plus, we will also need to at least specify a primary identity and mark the schema for Profile. Marking the schema in this specific way notifies the database of the layout of the data. Alternatively, we can create a schema that mimics the schema on the fly (called ad hoc schema in Data Distiller) that gives you the flexibility to define these schemas in SQL code within the SQL editor.

The action of marking a schema for a Real-Time Customer Profile cannot be undone. What this means is that if you are not careful about how you go about creating schemas and adding them, you will end up with a lot of 'deadwood" schemas that will clutter up the Union view. With this risk in mind, we should use the UI or API to create the definitive schemas, populate datasets and then mark them for Profile. Creating ad hoc schemas are useful for quick prototyping or creating intermediate datasets but remember, that with great power comes great responsibility. In any situation, where you creating a final set of datasets for an app within the Adobe ecosystem or elsewhere, pay attention to your schema design. At the very least, have them defined well.

There is more flexibility with datasets as they can be marked and unmarked for Profile. Marking a dataset for Profile means that the database monitors for new data from that point onwards. If you delete an attribute dataset, Real-Time Customer Profile will delete the attributes. The same is true with event data as well. The same is true if TTL or dataset expiration is applied to these Profille-marked datasets. These actions have different consequences for the Identity Store - deletion of datasets results in the cleaning of the identity graph on a daily basis. TTL on Profile-marked datasets does not propagate to Identity Graph.

Observe the data types of the various fields -the Yearly_Purchases_Dollars field is of integer type.

Please check the guardrails for Real-Time Customer Profile based on the entitlement you have. There are recommendations provided for the number of attribute (20) and event (20) datasets that can be attached to it. Also, there is a limit on the number of batches of data that can be ingested per day (90) into the Profile database as well. These constraints can be addressed by using a pipeline architecture that consolidates datasets and running them on the same schedule to create fewer batches of data.

There are two ways for me to create a dataset and mark it for Real-Time Customer Profile:

Create a brand new dataset with every single update of the yearly purchases data: If our use case was to accommodate rolling 365-day purchases with more weightage to the recent purchases, then we have no choice but to create a new table with every run i.e. daily. In this case, you would DROP the table every day and automate the addition of this data to the profile.
Insert and append into an existing dataset with every run for new updates of the yearly purchase data. If we want to retire the old or updated data, it will require some new data techniques (timestamping and snapshotting) that we will not cover in this example.

In both cases, as long as the attribute dataset has been marked for Profile, the Real-Time Customer Profile will keep monitoring for new batches of attribute data from the data lake. Marking a dataset for Profile means that we have to do this manually in the dataset UI. If we drop the dataset or delete it, we would need to do this manual step every single time. This leaves up with a 3-part strategy:

Create a one-time empty dataset or do this on a periodic basis so that we can manage dataset size.
Mark the empty dataset for Profile.
Append new attribute data into this dataset. New attribute data for the same profile will be overwritten even though multiple records of the same data now exist in the data lake. As a reminder, Adobe Experience Platform today supports append-only semantics. Update operations are not yet supported.

Create an Empty Dataset for Real-Time Customer Profile

We need to create the empty dataset first because the Profile store only monitors new batches of data after we mark the dataset. If we inserted data into the dataset and then marked it for Profile store, the batches of data will not be ingested into Profile.

Warning: The reason why we are not creating a dataset from the UI by going into Workflows->Create dataset from Schema is because of the limitation in Adobe Experience Platform that these datasets cannot be dropped (using DROP TABLE) in Data Distiller.

Here is the code for creating an empty dataset:

DROP TABLE IF EXISTS decile_attribute_dataset_example;
CREATE TABLE decile_attribute_dataset_example WITH (schema='Derived_Attributes_Deciles', Label='PROFILE') AS 
(SELECT struct(email AS email,
               crmid AS crmid, 
               YearlyPurchase_Dollars AS Yearly_Purchase_Dollars, 
               decile_bucket decile_bucket) AS _pfreportingonprod
FROM (
SELECT *, NTILE(10) 
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, CAST(round(10000*abs(randn(1))) AS INT) AS YearlyPurchase_Dollars 
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] 
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) 
WHERE email !='') WHERE crmid !=''))
WHERE FALSE
);

This code will execute successfully and you will see:

Let us analyze the code:

DROP TABLE: Since we are creating a brand new dataset for every single run i.e. daily or weekly, we should delete the previous dataset and create a new one. Note the limitation that DROP TABLE will not work f you create an empty dataset by using Workflows->Create dataset from schema. You should only use Data Distiller to create the empty dataset if you want to be able to drop it.
struct: The structure maps the input fields such as YearlyPurchase_Dollars from the select query below to the schema field Yearly_Purchase_Dollars. You could create any hierarchy of objects by using this mapping. For example, we could have also created a custom schema that had two objects in it such as a purchases object and a identity_fields object. In that case, the code would have been:

........
SELECT
  STRUCT(
    STRUCT(
        STRUCT(
             YearlyPurchase_Dollars Yearly_Purchase_Dollars
            ) AS purchases
        email as email,
        crmid as crmid) AS identity_fields 
    ) AS _pfreportingonprod,
.......

It is imperative during the prototyping stage that you double-check that the dataset was created

SELECT * FROM decile_attribute_dataset;

schema='Derived_Attributes_Deciles' specifies that the data layout must conform to our created XDM schema.
CAST(round(10000*abs(randn(1))) AS INT): We added this in the core code to match the integer data type of Yearly_Purchase_Dollars in the schema.
WHERE 1=2: Note the code in the last line where we are creating a contradiction to create the empty dataset. Make sure you execute the following command:

SELECT * FROM decile_attribute_dataset_example;

Mark the Empty Dataset for Real-Time Customer Profile

Go into the Dataset UI and mark the dataset for Profile:

Append Data to an Existing Dataset for Real-Time Customer Profile

Let us now insert data into the empty table:

INSERT INTO decile_attribute_dataset_example
(SELECT struct(email AS email,
               crmid AS crmid, 
               YearlyPurchase_Dollars AS Yearly_Purchase_Dollars, 
               decile_bucket decile_bucket) AS _pfreportingonprod
FROM (
SELECT *, NTILE(10) 
OVER (ORDER BY YearlyPurchase_Dollars ASC) AS decile_bucket FROM
(SELECT * FROM (SELECT email, crmid, CAST(round(10000*abs(randn(1))) AS INT) AS YearlyPurchase_Dollars 
FROM (SELECT identitymap.email.id[0] AS email, identitymap.crmid.id[0] 
AS crmid from profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903) 
WHERE email !='') WHERE crmid !='')));

Verify that Data is Loaded into Real-Time Customer Profile

Let us retrieve the first row of the dataset we created for Profile Store:

SELECT to_json(_pfreportingonprod) FROM decile_attribute_dataset LIMIT 1;

The results are:

The primary identity for this attribute record is [email protected]. The decile bucket is 1 and Yearly_Purchase_Dollars is 1.

If we interrogate the Profile store by doing the following:

The results are the following:

Exercise

Now that you uploaded the entire dataset into Real-Time Customer Profile, which dimension can you use to split this dataset into multiple audiences based on the decile bucketing?

Appendix: Reconciling Identities Linked to the Same Profile in Multiple Purchase Datasets

The above modeling assumes that each row is unique and that a yearly purchase can be assigned to each of them. This will be the case when you have an email address or the CRM ID acting as a source of truth for reconciled data from the backend systems. Or you have used the Identity Lookup table as mentioned in the appendix section below to reconcile the identities of the same customer across multiple datasets.

If that data is not reconciled, then you would need to reconcile that data. The more fragmented this information, the worse it gets for you and your company. Imagine an e-commerce system that tracks online transactions with an email address as the primary identifier and a CRM system that centralizes all of the transactions including those in the e-commerce system and those which are offline. You need to be careful to ensure you are not counting a purchase transaction twice. From a sales reporting standpoint, this would get even worse.

If you are reconciling purchase data from multiple datasets that have different identities, then you have to generate the identity lookup table from the Profile snapshot attribute data for any merge policy. As long as the merge policy selected has identity stitching enabled, the identity graph will be the same for all such merge policies as the Real-Time Customer Profile has a single identity graph in the system.

You will need to do multiple joins across these datasets with the identity lookup table while grouping the results by the unique profile ID that was generated. In fact, you can use custom logic to prioritize which values you want to ingest from the datasets as the source of truth. Please read the documentation on the creation of the identity lookup table:

IDR 200: Extracting Identity Graph from Profile Attribute Snapshot Data with Data Distiller

Last updated 5 months ago