STATSML 604: Data Exploration for Customer AI in Real-Time Customer Data Platform
Before we proceed, this is how the demo_data_intelligent_services_demo_midvalues dataset looks from an ingestion point of view:
Note that this dataset has not been enabled for Real-Time Customer Profile, meaning it is not being ingested into the profile.
This is how the schema looks:
Run the following query on the Experience Event dataset to check that it adheres to either the Adobe Analytics Schema or the Consumer Event Schema. Keep in mind that Customer AI automatically generates features using standard field groups.
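The query body is not reproduced here, so as a minimal sketch of the kind of check I mean, assuming the dataset name from earlier in this section:

```sql
-- Sketch: pull a few events to confirm the core Experience Event
-- fields (_id, timestamp, eventType) are present and populated.
SELECT _id, timestamp, eventType
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 10;
```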
Your result should look something like this:
To get the JSON structure of the output use:
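One way to do this in Query Service is to return complex (struct/map) fields as JSON strings via the auto_to_json session setting; treat the exact usage below as a sketch:

```sql
-- Render nested fields as JSON strings in the output, then grab one row.
set auto_to_json = true;

SELECT *
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 1;
```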
Customer AI uses standard field groups to automatically generate features such as recency, frequency, and engagement metrics without manual intervention. In addition to these standard field groups, custom events can be incorporated for advanced customization, allowing for more tailored insights. While it is not necessary for the data to include all field groups, having relevant ones significantly enhances model performance.
| Field Group | Example Events | Key Fields | Purpose |
| --- | --- | --- | --- |
| Commerce | Purchases, Product Views, Checkouts | productListItems.SKU, commerce.order.purchaseID, commerce.purchases.value | Captures transaction-related data for commerce activities. |
| Web | Web Visits, Page Views, Link Clicks | web.webPageDetails.name, web.webInteraction.linkClicks.value | Tracks website interactions and user behaviors online. |
| Application | App Installs, Launches, Feature Usage | application.name, application.installs.value, application.featureUsages.value | Focuses on mobile or desktop application interactions. |
| Search | Search Queries | search.keywords | Logs search behavior and keywords used by customers. |
| Profile Attributes | Customer Demographics, Preferences | person.name, person.gender, person.birthDate | Provides demographic and user profile information. |
| Device | Device Details | device.type, device.operatingSystem.name | Identifies devices used by the customer during interactions. |
| IdentityMap | Identity Resolution | identityMap.ECID.id, identityMap.AAID.id | Links different identifiers for a unified customer view. |
| Experience Event Metadata | Metadata Tracking | timestamp, channel, environment | Provides contextual metadata about customer events. |
You can access the standard fields by executing the following. First, try running a template query that includes all the fields:
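The original template is not reproduced here; a sketch assembled from the key fields in the table above might look like this (the column aliases are mine, chosen to match the attribute names used later in this section):

```sql
SELECT
    web.webPageDetails.name             AS page_name,
    productListItems.SKU                AS product_sku,
    commerce.productViews.value         AS product_views,
    web.webInteraction.linkClicks.value AS link_clicks,
    commerce.productListViews.value     AS product_list_views,
    commerce.productListOpens.value     AS product_list_opens,
    commerce.checkouts.value            AS checkout_value,
    commerce.purchases.value            AS purchase_value,
    commerce.productListRemovals.value  AS product_list_removals,
    search.keywords                     AS search_keywords,
    commerce.order.purchaseID           AS purchase_id
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 10;
```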
In my case, I get an error message:
This suggests that commerce.order does not exist in this dataset. The key part to notice is no viable alternative at input 'commerce.order'.
Another error message you may see, indicative of the same problem, is:
This pattern will repeat for each missing column if you keep removing or commenting them out manually.
Let us execute the query after commenting out the missing columns:
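Continuing the sketch from above, with the column that failed in my run commented out:

```sql
SELECT
    web.webPageDetails.name             AS page_name,
    productListItems.SKU                AS product_sku,
    commerce.productViews.value         AS product_views,
    web.webInteraction.linkClicks.value AS link_clicks,
    commerce.productListViews.value     AS product_list_views,
    commerce.productListOpens.value     AS product_list_opens,
    commerce.checkouts.value            AS checkout_value,
    commerce.purchases.value            AS purchase_value,
    commerce.productListRemovals.value  AS product_list_removals,
    search.keywords                     AS search_keywords
    -- commerce.order.purchaseID        AS purchase_id  -- not present in this dataset
FROM demo_data_intelligent_services_demo_midvalues
LIMIT 10;
```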
Observe that the comma after search_keywords was removed, as it is now the last column.
The Data Quality Score (DQS) is a composite metric designed to measure how reliable, complete, and consistent data is within a dataset. The goal is to quantify data quality so that issues can be identified and improvements can be tracked over time.
We evaluated data quality based on three core dimensions:
Completeness: the proportion of non-null (non-missing) values in the dataset. Missing data can skew analyses, leading to biased insights.
Formula: Completeness = (Non-null values / Total records) × 100
Uniqueness: the proportion of distinct (unique) values relative to the total number of records. This ensures the data is free from duplicates, which can distort aggregations or counts.
Formula: Uniqueness = (Distinct values / Total records) × 100
Validity: measures whether the data conforms to expected formats, ranges, or patterns. Invalid data (e.g., negative prices, malformed dates) can break business rules.
Formula: Validity = (Valid values / Total records) × 100
We average the data quality metrics to provide a balanced view, ensuring that no single metric dominates the overall score unless explicitly weighted: with equal weights, DQS = (Completeness + Uniqueness + Validity) / 3. This approach maintains fairness across different dimensions of data quality, though flexible weighting can be applied when necessary. In certain contexts, such as financial data, specific dimensions like validity might carry more weight due to their critical importance in ensuring data accuracy and compliance.
Here is the query that you should execute:
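The query itself is not reproduced above; as a sketch of the idea for a single column (purchase_value is used here purely as an illustration, with "non-negative" standing in for whatever validity rule fits the column):

```sql
-- Sketch: DQS for one column as the average of completeness,
-- uniqueness, and validity, all expressed as percentages.
SELECT
    completeness,
    uniqueness,
    validity,
    ROUND((completeness + uniqueness + validity) / 3, 2) AS dqs
FROM (
    SELECT
        COUNT(commerce.purchases.value) * 100.0 / COUNT(*)          AS completeness,
        COUNT(DISTINCT commerce.purchases.value) * 100.0 / COUNT(*) AS uniqueness,
        SUM(CASE WHEN commerce.purchases.value >= 0
                 THEN 1 ELSE 0 END) * 100.0 / COUNT(*)              AS validity
    FROM demo_data_intelligent_services_demo_midvalues
) AS q;
```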
The results are:
RFM modeling is a powerful customer segmentation technique that helps businesses understand and predict customer behavior based on three key metrics:
Recency (R): How recently a customer performed an action (e.g., last purchase, last visit).
Frequency (F): How often the customer performs the action within a specific timeframe.
Monetary (M): How much the customer has spent over a period of time.
In traditional marketing and customer analytics, these metrics help identify high-value customers, predict churn, and personalize marketing strategies.
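As a concrete illustration (not part of the original walkthrough), the three metrics could be derived per customer along these lines; the identityMap['ECID'][0].id access pattern is an assumption about how the identifier is stored:

```sql
-- Sketch: per-ECID Recency, Frequency, and Monetary aggregates.
SELECT
    identityMap['ECID'][0].id     AS ecid,
    MAX(timestamp)                AS last_event_ts,  -- Recency: time since this
    COUNT(*)                      AS event_count,    -- Frequency
    SUM(commerce.purchases.value) AS total_spend     -- Monetary
FROM demo_data_intelligent_services_demo_midvalues
GROUP BY identityMap['ECID'][0].id;
```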
In Customer AI, we're tasked with predicting the propensity of an event occurring within the next N days, such as a customer making a purchase or engaging with a product. At first glance, this might seem like a straightforward classification problem—did the customer convert or not? However, the underlying mechanics of how we compute this propensity are deeply influenced by survival analysis principles, even if we're not explicitly running survival models.
Survival analysis is fundamentally about estimating the probability that an event has not occurred yet by a certain time, represented by the survival function S(t). In the context of Customer AI, when we calculate the propensity to convert in the next N days, we're essentially working with 1−S(N)—the probability that the customer will convert within that time frame. This is where the illusion comes into play: while we might not explicitly model S(t), the features we engineer, such as Recency (R) and Frequency (F), are designed to behave as proxies that capture the dynamics of time-to-event data, just like survival analysis would.
Recency (R) acts as an implicit measure of the time since the last event, closely tied to the hazard function h(t) in survival analysis, which represents the instantaneous risk of an event occurring at time t. The more recent the engagement, the higher the implied hazard or conversion risk. Similarly, Frequency (F) reflects the accumulated risk over time, akin to the cumulative hazard function H(t). Customers with frequent engagements are treated as having a higher cumulative risk of conversion because their repeated actions signal strong intent.
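In standard survival-analysis notation (textbook definitions, stated here for reference), these quantities are related as:

$$
H(t) = \int_0^t h(u)\,du, \qquad S(t) = e^{-H(t)}, \qquad \Pr[\text{convert within } N \text{ days}] = 1 - S(N)
$$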
By feeding R and F into machine learning models like XGBoost, we are essentially embedding these survival-based risk factors into the model’s decision-making process. The model learns to associate recent, frequent behaviors with higher propensities to convert, mimicking the effects of survival functions without explicitly modeling them. This approach allows us to handle large-scale behavioral data efficiently while still leveraging the time-sensitive nature of customer actions, which is the core strength of survival analysis. In essence, we’re creating an illusion of survival modeling—using its principles to shape our features and predictions, even though we’re technically solving a classification problem.
While Monetary (M) is a critical component of traditional RFM (Recency, Frequency, Monetary) modeling, it is not used natively in Customer AI. This is because Customer AI is designed to predict future customer behavior, such as conversions or churn, with a strong emphasis on engagement patterns rather than historical spending. Behavioral signals like Recency (R) and Frequency (F) are more dynamic and time-sensitive, making them better aligned with predictive models that rely on survival analysis principles. Additionally, monetary data often suffers from inconsistency across platforms, especially when customers engage through multiple channels, making it less reliable for direct inclusion in propensity models.
However, if businesses wish to incorporate Monetary (M) into Customer AI for advanced segmentation, it can be added as a Profile Attribute. This approach is particularly useful for use cases like lifetime value (LTV) prediction or revenue-based customer segmentation, where understanding the financial impact of customer behavior is critical. By complementing the existing propensity models with monetary data, organizations can gain deeper insights into not just who is likely to convert, but also which customers are likely to bring the most value. This dual-layer analysis helps in optimizing marketing strategies, resource allocation, and personalized customer engagement.
A look at the table below shows which of the attributes are suitable or unsuitable for each RFM dimension:
| Attribute | DQS (%) | Recency (R) | Frequency (F) | Monetary (M) |
| --- | --- | --- | --- | --- |
| page_name | 39.25 | ✅ Strong (Tracks last page viewed) | ✅ Strong (Counts page visits) | ❌ Not Applicable |
| product_sku | 38.29 | ✅ Strong (Last product interaction) | ✅ Strong (Product interaction counts) | ❌ Not Applicable |
| product_views | 3.22 | ⚠️ Moderate (Recent view may be missing) | ⚠️ Moderate (Some views might be missed) | ❌ Not Applicable |
| link_clicks | 3.12 | ⚠️ Moderate (Recent clicks tracked inconsistently) | ⚠️ Moderate (Click counts may be incomplete) | ❌ Not Applicable |
| product_list_views | 1.94 | ❌ Weak (Incomplete last view tracking) | ❌ Weak (Inconsistent counts) | ❌ Not Applicable |
| product_list_opens | 1.82 | ❌ Weak (Missing last open data) | ❌ Weak (Sparse event tracking) | ❌ Not Applicable |
| checkout_value | 1.22 | ❌ Poor (Sparse events, weak recency) | ❌ Poor (Low frequency, missing events) | ✅ Applicable (Tracks transaction amounts) |
| search_keywords | 1.09 | ❌ Weak (Limited search tracking) | ❌ Weak (Few search event records) | ❌ Not Applicable |
| purchase_value | 0.54 | ❌ Very Poor (Critical purchase data missing) | ❌ Very Poor (Few transactions captured) | ✅ Applicable (Key for monetary analysis) |
| product_list_removals | 0.11 | ❌ Extremely Poor (Unreliable recency data) | ❌ Extremely Poor (Event counts unreliable) | ❌ Not Applicable |
To calculate the Monetary (M) value and add it to the Profile, we do the following. Based on the JSON structure, the fields of interest are:

- commerce.productListViews.value
- commerce.productListRemovals.value
- commerce.order.purchaseID
- productListItems.SKU
For Monetary (M), we will consider the commerce.order section, focusing on:

- purchaseID (to identify transactions)
- productListItems.SKU (to track purchased items)
- commerce.purchases.value (if available), or aggregate values from transactions
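Steps 3 and 4 below rely on a monetary_aggregation view. A sketch of how it could be defined from these fields, again assuming the identityMap['ECID'][0].id access pattern:

```sql
-- Sketch: cumulative spend per ECID from purchase events.
CREATE VIEW monetary_aggregation AS
SELECT
    identityMap['ECID'][0].id                             AS ecid,
    CAST(SUM(commerce.purchases.value) AS DECIMAL(18, 2)) AS total_monetary_value
FROM demo_data_intelligent_services_demo_midvalues
WHERE commerce.purchases.value IS NOT NULL
GROUP BY identityMap['ECID'][0].id;
```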
Download the following file:
by following the steps here:
In Step 3, the SQL code creates a table named adls_profile_monetary to store the aggregated monetary values for each customer. The ecid (Experience Cloud ID) serves as the primary identifier, ensuring each customer's data remains unique within the 'ECID' namespace; this is critical because identity resolution and profile unification rely on consistent identifiers. The total_monetary_value column captures the cumulative spending of each customer, formatted as a decimal to handle currency values accurately. The WITH (LABEL = 'PROFILE') clause designates the table as part of the Real-Time Customer Profile, enabling seamless integration with audience segmentation, personalization, and activation workflows.
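Based on that description, Step 3 plausibly looks something like the following; the identity-declaration syntax is my reconstruction of Query Service DDL, not a copy of the downloaded file's contents:

```sql
-- Sketch: profile-enabled table keyed by ECID.
CREATE TABLE adls_profile_monetary (
    ecid                 TEXT PRIMARY IDENTITY NAMESPACE 'ECID',
    total_monetary_value DECIMAL(18, 2)
) WITH (LABEL = 'PROFILE');
```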
In Step 4, the aggregated data from the monetary_aggregation view is inserted into the newly created profile table. The STRUCT function packages the ecid and its corresponding total_monetary_value into a structured format compatible with profile-based systems. This approach ensures that monetary values are not just stored but are readily available for real-time analytics and targeting. By centralizing this data at the profile level, marketers can effortlessly identify high-value customers, create personalized offers, and drive data-driven marketing strategies based on customers' historical spending behavior.
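And a sketch of what Step 4 might contain, assuming the STRUCT wrapping matches the table's schema:

```sql
-- Sketch: package each row as a struct and load it into the profile table.
INSERT INTO adls_profile_monetary
SELECT STRUCT(ecid, total_monetary_value)
FROM monetary_aggregation;
```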