
STATSML 603: Predicting Customer Conversion Scores Using Random Forest in Data Distiller

Transform Data Into Action: Predict, Personalize, Prosper!

Last updated 6 months ago

Prerequisites

Download the following datasets:

  • webevents_train.csv (379KB)

  • webevents_test.csv (108KB)

  • webevents_inference.csv (54KB)

Ingest the above datasets using PREP 500: Ingesting CSV Data into Adobe Experience Platform.

Make sure you have read STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models.

Overview

Businesses aim to optimize marketing efforts by identifying customer behaviors that lead to conversions (e.g., purchases). Using SQL-based feature engineering and a Random Forest model, we can analyze user interactions, extract actionable insights, and predict the likelihood of conversions.

A retail company tracks website activity, including page views, purchases, and campaign interactions. They want to:

  1. Understand Customer Behavior: Analyze aggregated session data such as visit frequency, page views, and campaign participation.

  2. Predict Conversions: Use historical data to predict whether a specific user interaction will result in a purchase.

  3. Optimize Engagement: Focus marketing campaigns and resources on high-conversion-probability customers to maximize ROI.

Random Forest Regression Model

A Random Forest is an ensemble machine learning algorithm that uses multiple decision trees to make predictions. It is a type of supervised learning algorithm widely employed for both classification and regression tasks. By combining the predictions of several decision trees, Random Forest enhances accuracy and reduces the risk of overfitting, making it a robust and reliable choice for a variety of machine learning problems.

The algorithm works by constructing multiple decision trees during training. Each tree is trained on a random subset of the data and features, a technique known as bagging. For classification problems, Random Forest aggregates the predictions of individual trees using majority voting. In regression problems, it averages the predictions across trees to determine the final output. By selecting random subsets of features for training, Random Forest reduces the correlation between individual trees, leading to improved overall prediction accuracy.
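The averaging step described above can be seen directly in a small sketch. This is illustrative scikit-learn code on synthetic data, not Data Distiller syntax: a `RandomForestRegressor`'s output for a sample is the mean of its individual trees' predictions.

```python
# Illustrative sketch (scikit-learn, synthetic data): a Random Forest's
# regression prediction is the average of its trees' predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))                                  # 4 synthetic features
y = X[:, 0] * 2 + X[:, 1] + rng.normal(0, 0.05, 200)      # noisy target

forest = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0)
forest.fit(X, y)

x_new = X[:1]
per_tree = np.array([tree.predict(x_new)[0] for tree in forest.estimators_])

# The ensemble prediction equals the mean over the 20 trees.
print(np.isclose(forest.predict(x_new)[0], per_tree.mean()))
```

Each tree here is trained on a bootstrap sample of the rows (bagging), which is why their individual predictions differ before being averaged.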

In this use case, the goal is to predict the score of user conversion based on web event data. Random Forest is particularly well-suited to this scenario for several reasons. First, it handles mixed data types seamlessly. The dataset contains both categorical features, such as browser and campaign IDs, and numerical features, like page views and purchases. Random Forest accommodates these variations without requiring extensive preprocessing.

Additionally, Random Forest is robust against noise and overfitting. Web activity data often contains irrelevant features or noisy observations. By averaging predictions across trees, the algorithm reduces the influence of noisy data and avoids overfitting, ensuring more reliable predictions. Furthermore, Random Forest provides valuable insights into feature importance, helping to identify which factors, such as page views or campaign IDs, contribute most significantly to user conversions.

Another advantage of Random Forest is its ability to model non-linear relationships. User conversion likelihood is often influenced by complex interactions between features. Random Forest captures these relationships effectively without requiring explicit feature engineering. The algorithm is also scalable, capable of handling large datasets with millions of user sessions, thanks to its parallel computation capabilities.

Random Forest is flexible for regression tasks, which is crucial for this use case where the target variable is a conversion score between 0 and 1. Its inherent design makes it ideal for predicting continuous outcomes. In contrast, a single decision tree, while simpler, is prone to overfitting, especially in datasets with many features and potential noise. Random Forest mitigates this limitation by averaging the predictions of multiple trees, delivering more generalizable and robust results.

Rule-Based Labeling for Conversion Scoring: Automating Data Annotation with Data Distiller

Using SQL transformations to encode features and prepare the dataset:

-- Create a transformed dataset
CREATE TABLE transformed_webevents AS
SELECT
    visit_id,
    UPPER(country_cd) AS country_encode,
    campaign_id,
    browser_id,
    operating_system_id,
    COUNT(*) AS visits,
    SUM(pageviews) AS total_pageviews,
    SUM(purchases) AS total_purchases,
    CASE 
        WHEN SUM(purchases) > 0 THEN 1
        ELSE 0
    END AS converted
FROM webevents_train
GROUP BY visit_id, country_cd, campaign_id, browser_id, operating_system_id;

Note that the CREATE MODEL statement in the next section applies three TRANSFORM functions to this dataset:

  • string_indexer encodes the categorical features (visit_id, country_encode, campaign_id, browser_id, operating_system_id) as numeric indices.

  • vector_assembler combines the encoded categorical features and the numerical features (visits, total_pageviews, total_purchases) into a single feature vector.

  • standard_scaler scales this feature vector to normalize values for training and enhance model performance.
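As a rough analogue of these three transform steps, here is a hedged scikit-learn sketch on synthetic rows (an assumption for illustration only, not the Data Distiller implementation): `OrdinalEncoder` plays the role of string_indexer, column stacking plays the role of vector_assembler, and `StandardScaler` plays the role of standard_scaler.

```python
# Illustrative sklearn analogue of the TRANSFORM pipeline (synthetic data;
# not Data Distiller code): index categoricals, assemble, then scale.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

categorical = np.array([["US", "cmp_1", "chrome"],
                        ["DE", "cmp_2", "firefox"],
                        ["US", "cmp_1", "safari"]])
numerical = np.array([[3, 12, 1],    # visits, total_pageviews, total_purchases
                      [1,  4, 0],
                      [5, 20, 2]], dtype=float)

indexed = OrdinalEncoder().fit_transform(categorical)   # ~ string_indexer
features = np.hstack([indexed, numerical])              # ~ vector_assembler
scaled = StandardScaler().fit_transform(features)       # ~ standard_scaler

print(scaled.shape)   # one 6-dimensional scaled feature vector per row
```

Scaling matters because the assembled vector mixes small category indices with much larger counts such as pageviews; normalizing puts them on a comparable range.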

Note that we are using a simple CASE statement to assign a binary conversion label in our data. This choice has trade-offs:

Loss of Nuance: By converting the target variable to a binary 0 or 1, we may lose information about the magnitude of purchases. For instance, a user with one purchase is treated the same as a user with multiple purchases. In cases where we want to predict the extent of engagement or the volume of purchases, this binary target may not capture the full range of user behavior.

Suitability for Regression: Since we are using random forest regression, which is typically better suited for continuous targets, applying it to a binary target might not be ideal. Random forest regression will still function, but it may not fully leverage the model’s strengths in predicting continuous outcomes. If our primary goal is to predict conversion likelihood (0 or 1), a classifier like random forest classification might be more appropriate.

Alternatives: If we have access to more granular data on the number of purchases, we could consider using a different target variable that reflects this information, such as the count of purchases or the monetary value of purchases. Using a continuous target with random forest regression could enable the model to capture the full range of behaviors, giving us insights into not just who is likely to convert but also to what extent they engage in purchases. Alternatively, if our primary objective is binary conversion prediction, we could use a random forest classifier to better align with the binary nature of our target.
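The loss of nuance described above is easy to see in miniature. In this hedged sketch (synthetic purchase counts, plain Python), the binary label mirrors the SQL CASE statement, while a count-valued target preserves purchase magnitude:

```python
# Synthetic purchase counts for four users (illustration only).
purchases = [0, 1, 1, 7]

# Mirrors the SQL: CASE WHEN SUM(purchases) > 0 THEN 1 ELSE 0 END
binary_label = [1 if p > 0 else 0 for p in purchases]

# Alternative continuous target: the purchase count itself.
count_label = list(purchases)

print(binary_label)   # [0, 1, 1, 1] -> the 1-purchase and 7-purchase users look identical
print(count_label)    # [0, 1, 1, 7] -> magnitude preserved
```

With the binary label, a regression model can only learn conversion likelihood; with the count (or monetary value), it could also learn the extent of engagement.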

Build the Random Forest Model

CREATE MODEL random_forest_model
TRANSFORM (
    string_indexer(visit_id) AS si_id,
    string_indexer(country_encode) AS country_code,
    string_indexer(campaign_id) AS campaign_encode,
    string_indexer(browser_id) AS browser_encode,
    string_indexer(operating_system_id) AS os_encode,
    vector_assembler(array(si_id, country_code, campaign_encode, browser_encode, os_encode, visits, total_pageviews, total_purchases)) AS features,
    standard_scaler(features) AS scaled_features
)
OPTIONS (
    MODEL_TYPE = 'random_forest_regression',
    NUM_TREES = 20,
    MAX_DEPTH = 5,
    LABEL = 'converted'
)
AS
SELECT *
FROM transformed_webevents;

The result is a trained model registered as random_forest_model.

Model Evaluation

Evaluate the model using test data:

SELECT * 
FROM model_evaluate(
    random_forest_model,
    1, -- Model version to evaluate
    SELECT
        visit_id,
        country_cd AS country_encode,
        campaign_id,
        browser_id,
        operating_system_id,
        COUNT(*) AS visits,
        SUM(pageviews) AS total_pageviews,
        SUM(purchases) AS total_purchases,
        CASE 
            WHEN SUM(purchases) > 0 THEN 1
            ELSE 0
        END AS converted
    FROM webevents_test
    GROUP BY 
        visit_id, 
        country_cd, 
        campaign_id, 
        browser_id, 
        operating_system_id
);

The evaluation returns two metrics. Here's what each means in the context of this Random Forest model:

Root Mean Squared Error (RMSE): RMSE is a metric that measures the average magnitude of the errors between the predicted values and the actual values in your test dataset. It is the square root of the average squared differences between predictions and actuals. In this case, an RMSE of 0.048 indicates that the model's predictions are, on average, about 0.048 away from the actual conversion likelihood values. Since RMSE is on the same scale as the target variable (in this case, a probability score between 0 and 1 for conversion likelihood), a lower RMSE suggests that the model's predictions are relatively accurate.

R-squared (R²): R², or the coefficient of determination, measures the proportion of variance in the dependent variable (conversion likelihood) that is predictable from the independent variables (features). An R² value of 0.9907 indicates that the model explains approximately 99.07% of the variance in the conversion likelihoods. This is a high R² value, which suggests that the model fits the data very well and that the features used in the model account for almost all of the variability in conversion outcomes.
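For reference, both metrics follow directly from the predicted and actual values. This numpy sketch uses small synthetic vectors (not the tutorial's actual predictions) to show the computation:

```python
# How RMSE and R-squared are computed (synthetic values for illustration).
import numpy as np

y_true = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
y_pred = np.array([0.05, 0.96, 0.98, 0.02, 0.94])

# RMSE: square root of the mean squared error, same scale as the target.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 minus (residual sum of squares / total sum of squares),
# i.e. the share of target variance the model explains.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

An R² near 1 with a small RMSE, as reported above, means the residuals are tiny relative to the spread of the target.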

Overall Evaluation

  • Model Accuracy: The combination of a low RMSE and a high R² value suggests that the Random Forest model is performing exceptionally well in predicting conversion likelihood.

  • Suitability for Use: These results indicate that the model is reliable for predicting conversions based on the test dataset and is likely capturing meaningful patterns in the data.

If this performance holds across additional data (e.g., an inference dataset or real-world data), the model can be a valuable tool for predicting user conversions and guiding targeted marketing efforts. However, it’s essential to validate the model with real-world data periodically, as models trained on historical data may degrade in accuracy over time.

Predictions

Use the model for prediction on new data:

SELECT * 
FROM model_predict(
    random_forest_model,
    1, -- Model version to use for prediction
    SELECT
        visit_id,
        country_cd AS country_encode,
        campaign_id,
        browser_id,
        operating_system_id,
        COUNT(*) AS visits,
        SUM(pageviews) AS total_pageviews,
        SUM(purchases) AS total_purchases,
        CASE 
            WHEN SUM(purchases) > 0 THEN 1
            ELSE 0
        END AS converted
    FROM webevents_inference
    GROUP BY 
        visit_id, 
        country_cd, 
        campaign_id, 
        browser_id, 
        operating_system_id
);