
STATSML 604: Data Exploration for Customer AI in Real-Time Customer Data Platform

Prerequisites

EXPLORE 200: Exploring Behavioral Data with Data Distiller - A Case Study with Adobe Analytics Data

EXPLORE 201: Exploring Web Analytics Data with Data Distiller

ENRICH 300: Recency, Frequency, Monetary (RFM) Modeling for Personalization with Data Distiller

Hello World Query

Before we proceed, this is how the demo_data_intelligent_services_demo_midvalues dataset looks from an ingestion point of view:

Experience event dataset

Note that this dataset has not been enabled for Real-Time Customer Profile, meaning it is not being ingested into the profile.

This is how the schema looks:

The schema has standard field groups that resemble those in the Adobe Analytics schema. Some of these standard field groups will be used by Customer AI.

Run the following on the Experience Event dataset, ensuring the dataset adheres to either the Adobe Analytics Schema or the Consumer Event Schema. Keep in mind that Customer AI automatically generates features using standard field groups.

SELECT * FROM demo_data_intelligent_services_demo_midvalues;

Your result should look something like this:

A simple SELECT query does not reveal much.

To get the JSON structure of the output use:

SELECT to_json(STRUCT(*)) AS json_output
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;
JSON output

Accessing the Standard Field Groups Used in Customer AI

Customer AI uses standard field groups to automatically generate features such as recency, frequency, and engagement metrics without manual intervention. In addition to these standard field groups, custom events can be incorporated for advanced customization, allowing for more tailored insights. While it is not necessary for the data to include all field groups, having relevant ones significantly enhances model performance.

| Standard Field Group | Event Types | Common Fields | Description |
| --- | --- | --- | --- |
| Commerce | Purchases, Product Views, Checkouts | productListItems.SKU, commerce.order.purchaseID, commerce.purchases.value | Captures transaction-related data for commerce activities. |
| Web | Web Visits, Page Views, Link Clicks | web.webPageDetails.name, web.webInteraction.linkClicks.value | Tracks website interactions and user behaviors online. |
| Application | App Installs, Launches, Feature Usage | application.name, application.installs.value, application.featureUsages.value | Focuses on mobile or desktop application interactions. |
| Search | Search Queries | search.keywords | Logs search behavior and keywords used by customers. |
| Profile Attributes | Customer Demographics, Preferences | person.name, person.gender, person.birthDate | Provides demographic and user profile information. |
| Device | Device Details | device.type, device.operatingSystem.name | Identifies devices used by the customer during interactions. |
| IdentityMap | Identity Resolution | identityMap.ECID.id, identityMap.AAID.id | Links different identifiers for a unified customer view. |
| Experience Event Metadata | Metadata Tracking | timestamp, channel, environment | Provides contextual metadata about customer events. |
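Before relying on any of these field groups, it can help to check how often each one is actually populated. Here is a minimal sketch (assuming these top-level columns exist in your dataset, as they do in this lesson's schema; COUNT over a struct column counts non-null rows):

-- Sketch: count non-null rows per standard field group to see which ones
-- carry data in this dataset (COUNT(column) skips NULLs).
SELECT
    COUNT(*)        AS total_rows,
    COUNT(web)      AS web_rows,
    COUNT(commerce) AS commerce_rows,
    COUNT(search)   AS search_rows,
    COUNT(device)   AS device_rows
FROM demo_data_intelligent_services_demo_midvalues;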

You can access the standard fields by executing the following:

SELECT to_json(web) AS web_json
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;
SELECT to_json(productListItems) AS productListItems_json
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;
SELECT to_json(commerce) AS commerce_json
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;
commerce field group details
SELECT to_json(application) AS application_json
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;
SELECT to_json(search) AS search_json
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;

Flattening Standard Fields

First, try running a template query that references all of the standard fields:

SELECT
    -- Web Interaction Details
    web.webPageDetails.name AS page_name,
    web.webInteraction.linkClicks.value AS link_clicks,

    -- Commerce Details
    commerce.purchases.value AS purchase_value,
    commerce.order.purchaseID AS purchase_id,
    commerce.checkouts.value AS checkout_value,
    commerce.productListViews.value AS product_list_views,
    commerce.productListOpens.value AS product_list_opens,
    commerce.productListRemovals.value AS product_list_removals,
    commerce.productViews.value AS product_views,
    productListItems.SKU AS product_sku,

    -- Application Details
    application.name AS application_name,
    application.applicationCloses.value AS app_closes,
    application.crashes.value AS app_crashes,
    application.featureUsages.value AS feature_usages,
    application.firstLaunches.value AS first_launches,

    -- Search Information
    search.keywords AS search_keywords,

    -- Event Metadata
    meta.intendedToExtend AS intended_to_extend,

    -- Time Period
    startDate,
    endDate

FROM
    demo_data_intelligent_services_demo_midvalues;

In my case, I get an error that says:

Error: no viable alternative at input

The error message:

ErrorCode: 42601 ... no viable alternative at input 'commerce.order'

The key part to notice is no viable alternative at input 'commerce.order', which indicates that the commerce.order path does not exist in this dataset's schema.

Another error message that points to the same underlying problem is:

ErrorCode: 08P01 queryId: 83370942-ffd7-4aa3-9f54-22b1edd06c56 Unknown error encountered. Reason: [[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `meta`.`intendedtoextend` cannot be resolved. Did you mean one of the following? [`demo_data_intelligent_services_demo_midvalues`.`_id`, `demo_data_intelligent_services_demo_midvalues`.`web`, `demo_data_intelligent_services_demo_midvalues`.`media`, `demo_data_intelligent_services_demo_midvalues`.`device`, `demo_data_intelligent_services_demo_midvalues`.`search`].; line 12 pos 2; 'GlobalLimit 50000 +- 'LocalLimit 50000 +- 'Project [web#503298.webpagedetails

This pattern repeats for each missing column if you remove or comment them out one at a time.
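Instead of discovering missing paths one error at a time, you can reuse the to_json probes from the previous section to confirm which nested keys exist before referencing them. For example, if the JSON output for commerce lacks an order key, any reference to commerce.order.purchaseID will fail:

-- Inspect the commerce field group as JSON; a missing "order" key here
-- explains the unresolved-column errors above.
SELECT to_json(commerce) AS commerce_json
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;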

Let us execute the query after commenting out the missing columns:

SELECT
    -- Web Interaction Details
    web.webPageDetails.name AS page_name,
    web.webInteraction.linkClicks.value AS link_clicks,

    -- Commerce Details
    commerce.purchases.value AS purchase_value,
    -- commerce.order.purchaseID AS purchase_id,  -- COMMENTED OUT (missing column)
    commerce.checkouts.value AS checkout_value,
    commerce.productListViews.value AS product_list_views,
    commerce.productListOpens.value AS product_list_opens,
    commerce.productListRemovals.value AS product_list_removals,
    commerce.productViews.value AS product_views,
    productListItems.SKU AS product_sku,

    -- Application Details
    -- application.name AS application_name,
    -- application.applicationCloses.value AS app_closes,
    -- application.crashes.value AS app_crashes,
    -- application.featureUsages.value AS feature_usages,
    -- application.firstLaunches.value AS first_launches,

    -- Search Information
    search.keywords AS search_keywords

    -- Event Metadata
    -- meta.intendedToExtend AS intended_to_extend,

    -- Time Period
    -- startDate,
    -- endDate

FROM demo_data_intelligent_services_demo_midvalues;

Observe that the comma after search_keywords was removed, since it is now the last selected column.

There could be data quality issues that we need to investigate.

Data Quality Score

The Data Quality Score (DQS) is a composite metric designed to measure how reliable, complete, and consistent data is within a dataset. The goal is to quantify data quality so that issues can be identified and improvements can be tracked over time.

We evaluated data quality based on three core dimensions:

Completeness

  • The proportion of non-null (non-missing) values in the dataset. Missing data can skew analyses, leading to biased insights.

  • Formula:

\text{Completeness (\%)} = \left(1 - \frac{\text{Null Values}}{\text{Total Records}}\right) \times 100

Uniqueness

  • The proportion of distinct (unique) values relative to the total number of records. Ensures data is free from duplicates, which can distort aggregations or counts.

  • Formula:

\text{Uniqueness (\%)} = \frac{\text{Distinct Values}}{\text{Total Records}} \times 100

Validity

  • Measures if the data conforms to expected formats, ranges, or patterns. Invalid data (e.g., negative prices, malformed dates) can break business rules.

  • Formula:

\text{Validity (\%)} = \frac{\text{Valid Records}}{\text{Total Records}} \times 100

We average the data quality metrics to provide a balanced view, ensuring that no single metric dominates the overall score unless explicitly weighted. This approach maintains fairness across different dimensions of data quality. However, flexible weighting can be applied when necessary. In certain contexts, such as financial data, specific dimensions like validity might carry more weight due to their critical importance in ensuring data accuracy and compliance.
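As a hypothetical illustration (numbers invented for this example): if a column were 80% complete, 40% unique, and 90% valid, the equal-weight score would be

\text{DQS} = \frac{80 + 40 + 90}{3} = 70\%

whereas a validity-heavy weighting of 0.25/0.25/0.5 would instead give 0.25 \times 80 + 0.25 \times 40 + 0.5 \times 90 = 75\%.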

Here is the query that you should execute:

WITH data_quality AS (
    SELECT
        -- Web Interaction Details (Completeness & Uniqueness)
        (1 - (COUNT(CASE WHEN web.webPageDetails.name IS NULL THEN 1 END) / COUNT(*))) * 100 AS page_name_completeness,
        (COUNT(DISTINCT web.webPageDetails.name) / COUNT(*)) * 100 AS page_name_uniqueness,

        (1 - (COUNT(CASE WHEN web.webInteraction.linkClicks.value IS NULL THEN 1 END) / COUNT(*))) * 100 AS link_clicks_completeness,
        (COUNT(DISTINCT web.webInteraction.linkClicks.value) / COUNT(*)) * 100 AS link_clicks_uniqueness,

        -- Commerce Details (Completeness, Uniqueness, Validity)
        (1 - (COUNT(CASE WHEN commerce.purchases.value IS NULL THEN 1 END) / COUNT(*))) * 100 AS purchase_value_completeness,
        (COUNT(DISTINCT commerce.purchases.value) / COUNT(*)) * 100 AS purchase_value_uniqueness,
        (COUNT(CASE WHEN commerce.purchases.value >= 0 THEN 1 END) / COUNT(*)) * 100 AS purchase_value_validity,

        -- Commented Section for commerce.order
        -- (1 - (COUNT(CASE WHEN commerce.order.purchaseID IS NULL THEN 1 END) / COUNT(*))) * 100 AS purchase_id_completeness,
        -- (COUNT(DISTINCT commerce.order.purchaseID) / COUNT(*)) * 100 AS purchase_id_uniqueness,

        (1 - (COUNT(CASE WHEN commerce.checkouts.value IS NULL THEN 1 END) / COUNT(*))) * 100 AS checkout_value_completeness,
        (COUNT(DISTINCT commerce.checkouts.value) / COUNT(*)) * 100 AS checkout_value_uniqueness,

        (1 - (COUNT(CASE WHEN commerce.productListViews.value IS NULL THEN 1 END) / COUNT(*))) * 100 AS product_list_views_completeness,
        (COUNT(DISTINCT commerce.productListViews.value) / COUNT(*)) * 100 AS product_list_views_uniqueness,

        (1 - (COUNT(CASE WHEN commerce.productListOpens.value IS NULL THEN 1 END) / COUNT(*))) * 100 AS product_list_opens_completeness,
        (COUNT(DISTINCT commerce.productListOpens.value) / COUNT(*)) * 100 AS product_list_opens_uniqueness,

        (1 - (COUNT(CASE WHEN commerce.productListRemovals.value IS NULL THEN 1 END) / COUNT(*))) * 100 AS product_list_removals_completeness,
        (COUNT(DISTINCT commerce.productListRemovals.value) / COUNT(*)) * 100 AS product_list_removals_uniqueness,

        (1 - (COUNT(CASE WHEN commerce.productViews.value IS NULL THEN 1 END) / COUNT(*))) * 100 AS product_views_completeness,
        (COUNT(DISTINCT commerce.productViews.value) / COUNT(*)) * 100 AS product_views_uniqueness,

        (1 - (COUNT(CASE WHEN productListItems.SKU IS NULL THEN 1 END) / COUNT(*))) * 100 AS product_sku_completeness,
        (COUNT(DISTINCT productListItems.SKU) / COUNT(*)) * 100 AS product_sku_uniqueness,
        (COUNT(CASE WHEN SIZE(productListItems.SKU) > 0 THEN 1 END) / COUNT(*)) * 100 AS product_sku_validity,

        -- Search Information
        (1 - (COUNT(CASE WHEN search.keywords IS NULL THEN 1 END) / COUNT(*))) * 100 AS search_keywords_completeness,
        (COUNT(DISTINCT search.keywords) / COUNT(*)) * 100 AS search_keywords_uniqueness
    FROM demo_data_intelligent_services_demo_midvalues
)

SELECT 'page_name' AS column_name, (page_name_completeness + page_name_uniqueness) / 2 AS data_quality_score FROM data_quality
UNION ALL
SELECT 'link_clicks', (link_clicks_completeness + link_clicks_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'purchase_value', (purchase_value_completeness + purchase_value_uniqueness + purchase_value_validity) / 3 FROM data_quality
UNION ALL
SELECT 'checkout_value', (checkout_value_completeness + checkout_value_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_list_views', (product_list_views_completeness + product_list_views_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_list_opens', (product_list_opens_completeness + product_list_opens_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_list_removals', (product_list_removals_completeness + product_list_removals_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_views', (product_views_completeness + product_views_uniqueness) / 2 FROM data_quality
UNION ALL
SELECT 'product_sku', (product_sku_completeness + product_sku_uniqueness + product_sku_validity) / 3 FROM data_quality
UNION ALL
SELECT 'search_keywords', (search_keywords_completeness + search_keywords_uniqueness) / 2 FROM data_quality;

The results are:

Data Quality Score of the Fields

Recency, Frequency & Monetary Modeling

RFM modeling is a powerful customer segmentation technique that helps businesses understand and predict customer behavior based on three key metrics:

  • Recency (R): How recently a customer performed an action (e.g., last purchase, last visit).

  • Frequency (F): How often the customer performs the action within a specific timeframe.

  • Monetary (M): How much the customer has spent over a period of time.

In traditional marketing and customer analytics, these metrics help identify high-value customers, predict churn, and personalize marketing strategies.
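As a rough illustration of how these three metrics map onto the Experience Event dataset, here is a minimal sketch, not part of Customer AI itself. It assumes the identityMap.ECID.id, timestamp, and commerce.purchases.value paths used elsewhere in this lesson are populated:

-- Sketch: per-customer Recency, Frequency, and Monetary values.
SELECT
    identityMap.ECID.id                             AS ecid,
    DATEDIFF(CURRENT_DATE, MAX(TO_DATE(timestamp))) AS recency_days,  -- days since last event
    COUNT(*)                                        AS frequency,     -- events in the window
    SUM(COALESCE(commerce.purchases.value, 0))      AS monetary       -- total purchase value
FROM demo_data_intelligent_services_demo_midvalues
GROUP BY identityMap.ECID.id;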

Survival Analysis Principles and Propensity Modeling

In Customer AI, we're tasked with predicting the propensity of an event occurring within the next N days, such as a customer making a purchase or engaging with a product. At first glance, this might seem like a straightforward classification problem—did the customer convert or not? However, the underlying mechanics of how we compute this propensity are deeply influenced by survival analysis principles, even if we're not explicitly running survival models.

Survival analysis is fundamentally about estimating the probability that an event has not occurred yet by a certain time, represented by the survival function S(t). In the context of Customer AI, when we calculate the propensity to convert in the next N days, we're essentially working with 1−S(N)—the probability that the customer will convert within that time frame. This is where the illusion comes into play: while we might not explicitly model S(t), the features we engineer, such as Recency (R) and Frequency (F), are designed to behave as proxies that capture the dynamics of time-to-event data, just like survival analysis would.

Recency (R) acts as an implicit measure of the time since the last event, closely tied to the hazard function h(t) in survival analysis, which represents the instantaneous risk of an event occurring at time t. The more recent the engagement, the higher the implied hazard or conversion risk. Similarly, Frequency (F) reflects the accumulated risk over time, akin to the cumulative hazard function H(t). Customers with frequent engagements are treated as having a higher cumulative risk of conversion because their repeated actions signal strong intent.

By feeding R and F into machine learning models like XGBoost, we are essentially embedding these survival-based risk factors into the model’s decision-making process. The model learns to associate recent, frequent behaviors with higher propensities to convert, mimicking the effects of survival functions without explicitly modeling them. This approach allows us to handle large-scale behavioral data efficiently while still leveraging the time-sensitive nature of customer actions, which is the core strength of survival analysis. In essence, we’re creating an illusion of survival modeling—using its principles to shape our features and predictions, even though we’re technically solving a classification problem.
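For reference, these standard survival-analysis quantities relate as follows, with the Customer AI propensity corresponding to the final expression:

H(t) = \int_0^t h(u)\,du, \qquad S(t) = e^{-H(t)}, \qquad \text{Propensity within } N \text{ days} = 1 - S(N)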

A Note on Monetary Value (M)

While Monetary (M) is a critical component of traditional RFM (Recency, Frequency, Monetary) modeling, it is not used natively in Customer AI. This is because Customer AI is designed to predict future customer behavior, such as conversions or churn, with a strong emphasis on engagement patterns rather than historical spending. Behavioral signals like Recency (R) and Frequency (F) are more dynamic and time-sensitive, making them better aligned with predictive models that rely on survival analysis principles. Additionally, monetary data often suffers from inconsistency across platforms, especially when customers engage through multiple channels, making it less reliable for direct inclusion in propensity models.

However, if businesses wish to incorporate Monetary (M) into Customer AI for advanced segmentation, it can be added as a Profile Attribute. This approach is particularly useful for use cases like lifetime value (LTV) prediction or revenue-based customer segmentation, where understanding the financial impact of customer behavior is critical. By complementing the existing propensity models with monetary data, organizations can gain deeper insights into not just who is likely to convert, but also which customers are likely to bring the most value. This dual-layer analysis helps in optimizing marketing strategies, resource allocation, and personalized customer engagement.

Attribute Assessment for RFM

The table below shows which attributes are suitable for each RFM dimension:

| Field | Data Quality Score (%) | Recency Suitability | Frequency Suitability | Monetary Suitability |
| --- | --- | --- | --- | --- |
| page_name | 39.25 | ✅ Strong (Tracks last page viewed) | ✅ Strong (Counts page visits) | ❌ Not Applicable |
| product_sku | 38.29 | ✅ Strong (Last product interaction) | ✅ Strong (Product interaction counts) | ❌ Not Applicable |
| product_views | 3.22 | ⚠️ Moderate (Recent view may be missing) | ⚠️ Moderate (Some views might be missed) | ❌ Not Applicable |
| link_clicks | 3.12 | ⚠️ Moderate (Recent clicks tracked inconsistently) | ⚠️ Moderate (Click counts may be incomplete) | ❌ Not Applicable |
| product_list_views | 1.94 | ❌ Weak (Incomplete last view tracking) | ❌ Weak (Inconsistent counts) | ❌ Not Applicable |
| product_list_opens | 1.82 | ❌ Weak (Missing last open data) | ❌ Weak (Sparse event tracking) | ❌ Not Applicable |
| checkout_value | 1.22 | ❌ Poor (Sparse events, weak recency) | ❌ Poor (Low frequency, missing events) | ✅ Applicable (Tracks transaction amounts) |
| search_keywords | 1.09 | ❌ Weak (Limited search tracking) | ❌ Weak (Few search event records) | ❌ Not Applicable |
| purchase_value | 0.54 | ❌ Very Poor (Critical purchase data missing) | ❌ Very Poor (Few transactions captured) | ✅ Applicable (Key for monetary analysis) |
| product_list_removals | 0.11 | ❌ Extremely Poor (Unreliable recency data) | ❌ Extremely Poor (Event counts unreliable) | ❌ Not Applicable |

Monetary Value as a Profile Attribute

To calculate the Monetary (M) value and add it to the Profile, we do the following.

Based on the JSON structure, the relevant fields are:

  • commerce.productListViews.value

  • commerce.productListRemovals.value

  • commerce.order.purchaseID

  • productListItems.SKU

For Monetary (M), we will consider the commerce.order section, focusing on:

  • purchaseID (to identify transactions)

  • productListItems.SKU (to track purchased items)

  • commerce.purchases.value (if available) or aggregate values from transactions.

-- Step 1: Extract relevant transaction data
CREATE OR REPLACE VIEW order_data AS
SELECT 
    identityMap.ECID.id AS ecid,
    commerce.order.purchaseID AS purchase_id,
    productListItems.SKU AS sku,
    commerce.purchases.value AS purchase_value,
    TO_DATE(_acp_system_metadata.timestamp) AS purchase_date
FROM demo_data_intelligent_services_demo_midvalues
WHERE commerce.order.purchaseID IS NOT NULL;

-- Step 2: Aggregate the total monetary value per user
CREATE OR REPLACE VIEW monetary_aggregation AS
SELECT 
    ecid,
    SUM(CASE 
            WHEN purchase_value IS NOT NULL THEN purchase_value 
            ELSE 0 
        END) AS total_monetary_value
FROM order_data
GROUP BY ecid;

-- Step 3: Create the profile table to store monetary value
CREATE TABLE IF NOT EXISTS adls_profile_monetary (
    ecid TEXT PRIMARY IDENTITY NAMESPACE 'ECID',
    total_monetary_value DECIMAL(18, 2)
)
WITH (LABEL = 'PROFILE');

-- Step 4: Insert aggregated data into the profile table
INSERT INTO adls_profile_monetary
SELECT 
    STRUCT(
        ecid,
        total_monetary_value
    ) AS profile_data
FROM monetary_aggregation; 

Example Workflow

Download the following file and ingest it by following the steps in:

PREP 500: Ingesting CSV Data into Adobe Experience Platform

Then run the same workflow against the ingested commerce_data dataset:
-- Step 1: Extract relevant transaction data
CREATE OR REPLACE VIEW order_data AS
SELECT 
    ECID AS ecid,
    purchaseID AS purchase_id,
    SKU AS sku,
    purchase_value AS purchase_value,
    TO_DATE(timestamp) AS purchase_date
FROM commerce_data
WHERE purchaseID IS NOT NULL;

-- Step 2: Aggregate the total monetary value per user
CREATE OR REPLACE VIEW monetary_aggregation AS
SELECT 
    ecid,
    SUM(CASE 
            WHEN purchase_value IS NOT NULL THEN purchase_value 
            ELSE 0 
        END) AS total_monetary_value
FROM order_data
GROUP BY ecid;

-- Step 3: Create the profile table to store monetary value
CREATE TABLE IF NOT EXISTS adls_profile_monetary (
    ecid TEXT PRIMARY IDENTITY NAMESPACE 'ECID',
    total_monetary_value DECIMAL(18, 2)
)
WITH (LABEL = 'PROFILE');

-- Step 4: Insert aggregated data into the profile table
INSERT INTO adls_profile_monetary
SELECT 
    STRUCT(
        ecid,
        total_monetary_value
    ) AS profile_data
FROM monetary_aggregation;

In Step 3, the SQL code creates a table named adls_profile_monetary to store the aggregated monetary values for each customer. The ecid (Experience Cloud ID) serves as the primary identifier, ensuring each customer's data remains unique within the 'ECID' namespace. This is critical in Adobe Experience Platform, where identity resolution and profile unification rely on consistent identifiers. The total_monetary_value column captures the cumulative spending of each customer, formatted as a decimal to handle currency values accurately. The WITH (LABEL = 'PROFILE') clause designates the table as part of the Real-Time Customer Profile, enabling seamless integration with audience segmentation, personalization, and activation workflows.

In Step 4, the aggregated data from the monetary_aggregation view is inserted into the newly created profile table. The STRUCT function packages the ecid and its corresponding total_monetary_value into a structured format compatible with profile-based systems. This approach ensures that monetary values are not just stored but are readily available for real-time analytics and targeting. By centralizing this data at the profile level, marketers can effortlessly identify high-value customers, create personalized offers, and drive data-driven marketing strategies based on customers' historical spending behavior.
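As a quick sanity check after the insert, you might query the new profile table directly, for example to preview the highest-value customers (illustrative only):

-- Verify that the profile table was populated and preview the top spenders.
SELECT ecid, total_monetary_value
FROM adls_profile_monetary
ORDER BY total_monetary_value DESC
LIMIT 10;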
