
STATSML 601: Building a Period-to-Period Customer Retention Model Using Logistic Regression

Unlocking Future Engagement: Data-Driven Retention Predictions for Smarter Personalization Strategies


Prerequisites

You need to download these datasets:

  • customer_activities.csv (10 MB)

  • test_customer_activities.csv (15 MB)

  • customer_inference_dataset.csv (8 KB)

To ingest the above datasets, you need to be familiar with this tutorial: PREP 500: Ingesting CSV Data into Adobe Experience Platform.

We will be using DBVisualizer for this tutorial, but you can use any SQL editor you like: PREP 400: DBVisualizer SQL Editor Setup for Data Distiller.

Case Study Overview

A major app company aimed to understand and improve user retention by forecasting future retention rates based on historical app usage patterns. By leveraging data-driven insights, the company sought to proactively address churn risks, optimize user engagement strategies, and ultimately increase long-term loyalty. Using SQL and logistic regression, a model was developed to predict which users would likely remain active in the coming period, enabling the company to take targeted actions to retain valuable users.

Customer retention is crucial for sustainable growth, especially in highly competitive markets. By accurately predicting whether customers will remain engaged from one period to the next, you can optimize marketing campaigns, personalize customer experiences, and focus retention efforts where they are most effective. This model provides such insights by examining customer activity patterns and creating retention forecasts for the next period, aligning with the company’s goal of boosting loyalty. By focusing on customers predicted to have a lower probability of retention, the company can take proactive actions—like sending personalized messages, offering discounts, or recommending relevant content—to improve engagement and loyalty.

Your Assignment

You will use historical activity data to create a dataset that captures customer activity over weekly periods. Using Data Distiller, you will transform this data to identify active customers for each period and calculate key features:

  • Current Customers (active in the current period)

  • Previous Customers (active in the prior period)

  • Retention Rate (ratio of retained customers to current customers)

You will use these features as inputs to a logistic regression model, which will predict retention as a probabilistic outcome, forecasting whether each customer will stay active in the next period.

Dataset Overview

  • Time Span: 3 years (from January 1, 2023, to December 28, 2025)

  • Frequency: Weekly activity records

  • Initial Customers: 20,000 customers at the start of the simulation

  • Total Weeks: 156 weeks (3 years * 52 weeks per year)

  • Total Records: The dataset contains approximately 2 to 3 million records, depending on retention and churn rates.

  • The dataset is stored in a CSV file with the following columns:

    • timestamp: A datetime object representing the start date of the week during which the customer was active.

      • Format: "YYYY-MM-DD"

      • Example: "2023-01-01", "2023-01-08"

    • customer_id: A unique identifier for each customer.

      • Format: Original customers: "CUST_<number>"

      • New customers added: "NEW_CUST_<week_number>_<number>"

      • Example: "CUST_1", "NEW_CUST_0_1", "NEW_CUST_52_150"

Logistic Regression and Its Connection to Probabilities

Imagine you want to predict whether a customer will stay with a company or leave next month. This is a yes-or-no question, so you’re looking for a result that tells you one of two things: "stay" or "leave." This is where logistic regression comes in. It's a statistical method that helps us predict outcomes like this by calculating the probability of each option.

Logistic regression is a technique used to predict binary outcomes — situations with only two possible results, like "yes" or "no," "win" or "lose," "stay" or "leave." Instead of giving you a number or a continuous outcome (like predicting a future sales amount), logistic regression tells you the likelihood (probability) of an outcome.

Why Not Use Regular Math or Linear Regression?

Here’s where logistic regression is clever. It doesn’t just use a straight line (like you would in simple math or linear regression). Instead, it applies a special mathematical function called the sigmoid function. This function takes any input, positive or negative, and compresses it into a range between 0 and 1. This transformation gives us probabilities, making the predictions easier to interpret as likelihoods.
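
As a concrete illustration, here is a minimal sketch that evaluates the sigmoid on a handful of inputs directly in SQL (the inline VALUES list is an assumption about your SQL dialect; Spark-style SQL supports it):

SELECT
    x,
    1.0 / (1.0 + EXP(-x)) AS sigmoid_x  -- always strictly between 0 and 1
FROM VALUES (-4.0), (-1.0), (0.0), (1.0), (4.0) AS t(x);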

You can frame logistic regression as a numerical (or continuous optimization) problem, but it fundamentally differs from standard numerical regression methods like linear regression because of its underlying assumptions and the type of outcome variable it predicts.

Difference Between Logistic and Numerical Regression

  1. Target Variable:

    • Logistic regression is used for categorical outcomes (e.g., 0 or 1).

    • Numerical regression (e.g., linear regression) predicts continuous outcomes (e.g., price, temperature).

  2. Model Assumptions:

    • Logistic regression assumes the log-odds of the target variable are linearly related to the features.

    • Linear regression assumes the dependent variable itself is linearly related to the features.

  3. Output:

    • Logistic regression outputs probabilities that are thresholded to classify outcomes.

    • Numerical regression outputs continuous predictions directly.

Framing Logistic Regression as Numerical Optimization

The core mechanism of logistic regression differs significantly from numerical regression: logistic regression optimizes a likelihood function, whereas numerical regression minimizes a squared-error loss such as root mean square error (RMSE). The objective function and predictions work as follows:

  1. Continuous Loss Function: Logistic regression minimizes a loss function called the negative log-likelihood (or equivalently, maximizes the likelihood of the observed data). This is a continuous optimization problem, typically solved using gradient-based methods like gradient descent. The loss function for binary logistic regression is:

L(\beta) = -\sum_{i=1}^{N} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]

where:

  • L(\beta): the total negative log-likelihood we aim to minimize.

  • N: the number of observations (samples) in the dataset.

  • y_i: the binary target variable for the i-th observation.

  • \hat{p}_i: the predicted probability that y_i = 1, based on the sigmoid function.

  • \mathbf{x}_i: the feature vector for the i-th observation.

  • \beta: the coefficient vector (the weights of the model).

  2. Predicting Probabilities: Logistic regression predicts probabilities using the sigmoid (logistic) function, which maps the linear combination of features to a value between 0 and 1:

\hat{p}_i = \frac{1}{1 + e^{-\mathbf{x}_i \cdot \beta}}

While the coefficients are fit numerically, the predicted outcomes themselves are probabilities, not continuous values.
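
For example, if the linear combination of a customer's features works out to \mathbf{x}_i \cdot \beta = 2, the model predicts

\hat{p}_i = \frac{1}{1 + e^{-2}} \approx 0.88,

i.e., roughly an 88% chance that the customer stays active in the next period.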

Data Exploration

Active Customers

Active Customers are customers who had some activity in the app during a specific time period (here, a week). This calculation helps us determine the number of unique customers interacting with the app in each period.

WITH activity_periods AS (
    SELECT
        customer_id,
        DATE_TRUNC('week', timestamp) AS period
    FROM customer_activities
    GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
active_customers AS (
    SELECT period, customer_id
    FROM activity_periods
)
SELECT 
    period,
    COUNT(DISTINCT customer_id) AS active_customers
FROM 
    active_customers
GROUP BY 
    period
ORDER BY 
    period;
  • activity_periods groups activity by each customer and weekly period.

  • active_customers then selects unique customer_ids for each period, representing those customers who were active during that week.

If you execute this query, you will get the active customer count for each weekly period.

Active Customers Lag

Active Customers Lag introduces the prior period data for each customer. Using Data Distiller's LAG function, we access each customer’s previous active period, allowing us to see which customers were active in both the current and prior periods.

WITH activity_periods AS (
    SELECT
        customer_id,
        DATE_TRUNC('week', timestamp) AS period
    FROM customer_activities
    GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
active_customers AS (
    SELECT period, customer_id
    FROM activity_periods
),
active_customers_lag AS (
    SELECT
        customer_id,
        period,
        LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS previous_period
    FROM active_customers
)
SELECT 
    customer_id,
    period,
    previous_period
FROM 
    active_customers_lag
ORDER BY 
    customer_id, period;
  • LAG(period) retrieves the prior active period for each customer, enabling comparisons across consecutive periods.

  • If previous_period is not null, the customer had been active in an earlier period; when previous_period falls exactly one week before period, the customer was active in consecutive weeks and is considered "retained."

If you execute this query, you should get each customer's current and previous active periods.

Customer Retention

Customer retention generally indicates that a customer has remained active or engaged with the business over multiple time periods. For our use case, retention is defined more specifically as a customer who is active in both the current period and the previous period. Here’s a closer look at how this can be interpreted and extended:

  1. Period-to-Period Retention (Binary Retention): In our case study, a customer is considered retained if they were active both in the current period and the immediately preceding period. This is calculated by checking if a customer’s last activity occurred both in the current period and the previous one, often using the SQL LAG() function to compare periods.

  2. Alternative Multi-Period Definitions:

    • Consecutive Period Retention: Some businesses define retention over multiple consecutive periods. For example, a customer might be retained if they were active for three consecutive periods.

    • Flexible Retention Periods: A customer may also be considered retained if they were active in any two out of the last three periods. This approach adds flexibility and can be beneficial for businesses where customer activity is less frequent.

  3. Cohort-Based Retention: Instead of looking at immediate consecutive periods, retention can be defined within monthly or quarterly cohorts. For instance, you may consider a customer retained if they were active in the current quarter after making at least one activity in the previous quarter, which can be valuable for tracking longer-term engagement.

  4. Retention Rate is typically defined as the ratio of customers retained (those active in consecutive periods) to the total number of customers in the previous period. This provides an overall measure of how well the app is retaining its customer base.

  5. Churn Rate captures the opposite side of retention: customers who were active in the previous period but not in the current one. This allows businesses to identify the rate at which customers stop engaging.

We need to combine the data on active customers and their previous periods to calculate:

  • Current Customers: Customers active in the current period.

  • Previous Customers: Customers active in the previous period.

  • Retained Customers: Customers active in both the current and previous periods.

These calculations yield metrics like churn rate (customers who did not return) and retention rate (customers who stayed from one period to the next):

  • Churn Rate: Calculated as 1.0 - (current_customers / previous_customers) to capture the percentage of customers lost in the current period.

  • Retention Rate: Calculated as retained_customers / current_customers to understand how many customers from the previous period were retained.

Note that Retention Rate and Churn Rate are not the same, although they are related metrics. They measure different aspects of customer behavior, and their calculations reflect this.

Churn Rate measures the percentage of customers lost from one period to the next. This metric focuses on the loss of customers. If many customers from the previous period are not active in the current period, the churn rate will be high. A churn rate of 0 indicates no customers were lost, while a churn rate of 1 indicates all customers were lost.

Retention Rate measures the percentage of customers from the previous period who are still active in the current period. This metric focuses on the retention of customers. It shows how well the business is keeping its customers engaged over time. A retention rate of 1 means all customers from the previous period stayed active, while a retention rate closer to 0 means most previous customers were lost.

Suppose:

  • previous_customers = 100

  • current_customers = 80

  • retained_customers = 60 (these are customers who were active in both periods)

Churn Rate = 1 − (80 / 100) = 20%

Retention Rate = 60 / 80 = 75%

In other words, 60 customers remained active from the previous period to the current one. Starting with 100 customers, this number decreased to 80 in the current period, meaning we lost 40 customers but also gained 20 new ones. Churn is the net loss of customers, calculated as the difference between the customers lost (40) and new customers gained (20), which is 20. Expressed as a percentage of the starting count (100), this gives us the churn rate of 20%. This calculation focuses on the overall change in the customer base, rather than specifically measuring customers retained from one period to the next.
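
As a standalone arithmetic check of this example (not tied to the dataset), you can evaluate the two formulas directly:

SELECT
    1.0 - (80.0 / 100.0) AS churn_rate,     -- 0.20: net loss relative to the previous period
    60.0 / 80.0          AS retention_rate; -- 0.75: retained relative to the current period

The full churn and retention calculation over the dataset follows: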

WITH activity_periods AS (
    SELECT
        customer_id,
        DATE_TRUNC('week', timestamp) AS period
    FROM customer_activities
    GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
active_customers AS (
    SELECT period, customer_id
    FROM activity_periods
),
active_customers_lag AS (
    SELECT
        customer_id,
        period,
        LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS previous_period
    FROM active_customers
),
churn_retention AS (
    SELECT
        period,
        COUNT(DISTINCT customer_id) AS current_customers,
        -- Retained: active this week and also active exactly one week earlier
        COUNT(DISTINCT CASE WHEN DATEDIFF(period, previous_period) = 7 THEN customer_id END) AS retained_customers
    FROM active_customers_lag
    GROUP BY period
),
churn_retention_lag AS (
    SELECT
        period,
        current_customers,
        -- Customers active in the immediately preceding period
        LAG(current_customers) OVER (ORDER BY period) AS previous_customers,
        retained_customers
    FROM churn_retention
)
SELECT
    period,
    current_customers,
    previous_customers,
    -- Calculate churn_rate as a float, bounded within [0, 1]
    CAST(CASE WHEN previous_customers IS NULL OR previous_customers = 0 THEN 0
              ELSE GREATEST(LEAST(1.0 - (current_customers * 1.0 / previous_customers), 1.0), 0.0) END AS FLOAT) AS churn_rate,
    -- Calculate retention_rate as a float, bounded within [0, 1]
    CAST(CASE WHEN current_customers = 0 THEN 0
              ELSE LEAST(retained_customers * 1.0 / current_customers, 1.0) END AS FLOAT) AS retention_rate
FROM churn_retention_lag
ORDER BY period;
  • current_customers counts the distinct customers active in each period.

  • previous_customers is derived with LAG over the per-period counts, giving the number of customers who were active in the immediately preceding period.

  • retained_customers counts customers active in both the current week and the week directly before it, which lets us compute the retention rate and churn rate. The result shows the retention rate alongside the current and previous customer counts for each period.

Feature Engineering

Our goal is to develop features at the individual customer level and monitor their metrics on a weekly basis. Designing these features requires some creativity, but with a solid understanding of customer analytics, you should be able to make insightful and effective choices.

  1. Tenure Weeks (tenure_weeks): The total number of weeks a customer has been active up to the current period: It reflects customer loyalty and experience with the service. Customers with longer tenures may have established habits or preferences influencing retention.

  2. Activity Frequency (activity_frequency): The number of activities or transactions the customer performed in the current week. It indicates the customer's engagement level during the current period. Higher activity may correlate with increased satisfaction and likelihood of retention.

  3. Average Engagement (average_engagement): The average number of activities per week up to the current period. It provides context for the customer's activity frequency. It helps identify deviations from typical behavior (e.g., a sudden drop in engagement).

  4. Time Since Last Activity (time_since_last_activity_days): The number of days since the customer's previous activity period. It measures recency of engagement. Longer durations may signal disengagement or risk of churn.

  5. Retained Next Week (retained_next_week): A binary label indicating whether the customer was active in the following week (1 for yes, 0 for no). It serves as the target variable for the classification model. It helps the model learn patterns associated with customer retention.

The main goal of feature engineering in this context is to transform raw activity data into meaningful features that can help predict customer retention. By capturing various aspects of customer behavior, we aim to:

  • Understand Engagement Patterns: Features like activity_frequency and average_engagement reflect how customers interact with the service over time.

  • Identify Risk of Churn: Features like time_since_last_activity_days can indicate when a customer is becoming less engaged.

  • Capture Customer Loyalty: tenure_weeks highlights how long a customer has been with the service, which may influence their retention behavior.

  • Provide Predictive Insights: These features allow the machine learning model to recognize patterns that precede retention or churn.

All the features are given below in a single query, but you can highlight and execute the SELECT queries within each CTE.

Also observe how the features we computed above are cast into numeric representations, since the logistic regression model in Data Distiller can only accept numerical values and we have to make that explicit:

    CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
    CAST(activity_frequency AS DOUBLE) AS activity_frequency,
    CAST(average_engagement AS DOUBLE) AS average_engagement,
    CAST(time_since_last_activity_days AS DOUBLE) AS time_since_last_activity_days,
    CAST(retained_next_week AS INT) AS retained_next_week

The features that we will extract are the following:

WITH activity_periods AS (
    SELECT
        customer_id,
        DATE_TRUNC('week', timestamp) AS period,
        COUNT(*) AS activity_count
    FROM customer_activities
    GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
customer_activity AS (
    SELECT
        customer_id,
        period,
        activity_count,
        LEAD(period) OVER (PARTITION BY customer_id ORDER BY period) AS next_period,
        LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS previous_period
    FROM activity_periods
),
customer_retention AS (
    SELECT
        customer_id,
        period,
        activity_count,
        CASE WHEN next_period = DATE_ADD(period, 7) THEN 1 ELSE 0 END AS retained_next_week,
        previous_period
    FROM customer_activity
),
customer_features AS (
    SELECT
        cr.customer_id,
        cr.period,
        cr.retained_next_week,
        -- Tenure Weeks
        COUNT(*) OVER (
            PARTITION BY cr.customer_id 
            ORDER BY cr.period
        ) AS tenure_weeks,
        -- Activity Frequency
        cr.activity_count AS activity_frequency,
        -- Average Engagement
        AVG(cr.activity_count) OVER (
            PARTITION BY cr.customer_id 
            ORDER BY cr.period
        ) AS average_engagement,
        -- Time Since Last Activity
        COALESCE(DATEDIFF(cr.period, cr.previous_period), 0) AS time_since_last_activity_days
    FROM customer_retention cr
)
SELECT
    customer_id,
    period,
    CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
    CAST(activity_frequency AS DOUBLE) AS activity_frequency,
    CAST(average_engagement AS DOUBLE) AS average_engagement,
    CAST(time_since_last_activity_days AS DOUBLE) AS time_since_last_activity_days,
    CAST(retained_next_week AS INT) AS retained_next_week
FROM customer_features
WHERE retained_next_week IS NOT NULL;
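
Before training, it is worth checking how balanced the retained_next_week label is, since a heavily skewed label can push a logistic regression toward always predicting the majority class. A minimal sketch, assuming you materialize the feature query above into a table first (customer_features_table is an illustrative name, not part of the tutorial datasets):

-- Materialize the feature query from above, e.g.:
-- CREATE TABLE customer_features_table AS <feature query above>;

-- Inspect the label balance and a feature summary per class:
SELECT
    retained_next_week,
    COUNT(*) AS num_rows,
    AVG(activity_frequency) AS avg_weekly_activity
FROM customer_features_table
GROUP BY retained_next_week
ORDER BY retained_next_week;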

Model Creation

Logistic regression is a supervised machine learning algorithm used for binary or multi-class classification tasks. Its primary goal is to predict the probability of an instance belonging to a specific class, typically modeled using the sigmoid function to map inputs to probabilities.

Parameters

  • MAX_ITER:

    • Maximum number of iterations for the optimization algorithm.

    • Possible Values: ≥0

    • Default: 100

  • REGPARAM:

    • Regularization parameter to prevent overfitting by penalizing large coefficients.

    • Possible Values: ≥0

    • Default: 0.0

  • ELASTICNETPARAM:

    • ElasticNet mixing parameter (α):

      • α = 0: Applies L2 regularization (Ridge).

      • α = 1: Applies L1 regularization (Lasso).

      • Possible Values: 0 ≤ α ≤ 1

      • Default: 0.0
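
These parameters are passed in the OPTIONS clause of CREATE MODEL. For example (the hyperparameter values below are illustrative starting points, not tuned settings):

OPTIONS(
    MODEL_TYPE='logistic_reg',
    LABEL='retained_next_week',
    MAX_ITER=200,         -- give the optimizer more iterations than the default 100
    REGPARAM=0.01,        -- mild penalty on large coefficients
    ELASTICNETPARAM=0.5   -- equal mix of L1 (Lasso) and L2 (Ridge)
)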

Logistic regression offers several benefits compared to other classifiers, making it a popular choice for many classification tasks. It is straightforward and highly interpretable, as it provides a clear relationship between input features and the predicted probability through its coefficients, which represent the impact of each feature on the log-odds of the outcome. Unlike some classifiers, such as decision trees, logistic regression outputs probabilities, allowing for flexible thresholding for classification or ranking tasks. It is particularly efficient on small to moderately sized datasets where linear separability is a reasonable assumption. Built-in support for L1 (Lasso) and L2 (Ridge) regularization makes logistic regression robust to overfitting, especially in high-dimensional datasets, while its computational cost is low compared to more complex models like random forests or neural networks. Often used as a baseline model, logistic regression offers simplicity and reasonable performance across a wide range of problems.

Let's take the code we previously created to extract the features and label, and simply add the CREATE MODEL, TRANSFORM, and OPTIONS clauses.

The code below creates a logistic regression model named Retention_model_logistic_reg designed to predict customer retention. Here's a breakdown of each part:

DROP MODEL IF EXISTS Retention_model_logistic_reg;

CREATE MODEL Retention_model_logistic_reg
TRANSFORM(
    vector_assembler(array(
        tenure_weeks,
        activity_frequency,
        average_engagement,
        time_since_last_activity_days
    )) AS features
)
OPTIONS(
    MODEL_TYPE='logistic_reg',
    LABEL='retained_next_week'
)
AS
WITH activity_periods AS (
    SELECT
        customer_id,
        DATE_TRUNC('week', timestamp) AS period,
        COUNT(*) AS activity_count
    FROM customer_activities
    GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
customer_activity AS (
    SELECT
        customer_id,
        period,
        activity_count,
        LEAD(period) OVER (PARTITION BY customer_id ORDER BY period) AS next_period,
        LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS previous_period
    FROM activity_periods
),
customer_retention AS (
    SELECT
        customer_id,
        period,
        activity_count,
        CASE WHEN next_period = DATE_ADD(period, 7) THEN 1 ELSE 0 END AS retained_next_week,
        previous_period
    FROM customer_activity
),
customer_features AS (
    SELECT
        cr.customer_id,
        cr.period,
        cr.retained_next_week,
        -- Tenure Weeks
        COUNT(*) OVER (
            PARTITION BY cr.customer_id 
            ORDER BY cr.period
        ) AS tenure_weeks,
        -- Activity Frequency
        cr.activity_count AS activity_frequency,
        -- Average Engagement
        AVG(cr.activity_count) OVER (
            PARTITION BY cr.customer_id 
            ORDER BY cr.period
        ) AS average_engagement,
        -- Time Since Last Activity
        COALESCE(DATEDIFF(cr.period, cr.previous_period), 0) AS time_since_last_activity_days
    FROM customer_retention cr
)
SELECT
    customer_id,
    period,
    CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
    CAST(activity_frequency AS DOUBLE) AS activity_frequency,
    CAST(average_engagement AS DOUBLE) AS average_engagement,
    CAST(time_since_last_activity_days AS DOUBLE) AS time_since_last_activity_days,
    CAST(retained_next_week AS INT) AS retained_next_week
FROM customer_features
WHERE retained_next_week IS NOT NULL;

1. CREATE MODEL Retention_model_logistic_reg

This statement creates a new machine learning model named Retention_model_logistic_reg. The model is intended for predicting customer retention, specifically whether a customer will be active in the next week. Naming the model allows you to reference it later for evaluation, prediction, or deployment tasks.

2. TRANSFORM(vector_assembler(array(tenure_weeks, activity_frequency, average_engagement, time_since_last_activity_days)) AS features)

This line defines how the input data is prepared before being fed into the logistic regression model. Here's a breakdown:

  • vector_assembler:

    • A function that combines multiple columns into a single feature vector.

    • Essential for machine learning algorithms that require input features in vector form.

  • array(tenure_weeks, activity_frequency, average_engagement, time_since_last_activity_days):

    • Specifies the columns to be combined into the feature vector. The columns are:

      1. tenure_weeks:

        • Represents the number of weeks the customer has been active up to the current period.

        • Reflects customer loyalty and tenure.

      2. activity_frequency:

        • The number of activities or transactions the customer had in the current period.

        • Indicates the current engagement level.

      3. average_engagement:

        • The average number of activities per period up to the current period.

        • Provides insight into the customer's typical engagement over time.

      4. time_since_last_activity_days:

        • The number of days since the customer's last activity period.

        • Measures recency of engagement, which can be a predictor of churn.

If you execute the query, you will get confirmation that the ML model has been created.

Model Evaluation

Let's run the model evaluation on the test_customer_activities dataset (make sure you have uploaded this dataset into AEP):

SELECT * 
FROM model_evaluate(Retention_model_logistic_reg, 1,
WITH activity_periods AS (
    SELECT
        customer_id,
        DATE_TRUNC('week', timestamp) AS period,
        COUNT(*) AS activity_count
    FROM test_customer_activities
    GROUP BY customer_id, DATE_TRUNC('week', timestamp)
),
customer_activity AS (
    SELECT
        customer_id,
        period,
        activity_count,
        LEAD(period) OVER (PARTITION BY customer_id ORDER BY period) AS next_period,
        LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS previous_period
    FROM activity_periods
),
customer_retention AS (
    SELECT
        customer_id,
        period,
        activity_count,
        CASE WHEN next_period = DATE_ADD(period, 7) THEN 1 ELSE 0 END AS retained_next_week,
        previous_period
    FROM customer_activity
),
customer_features AS (
    SELECT
        cr.customer_id,
        cr.period,
        cr.retained_next_week,
        -- Tenure Weeks
        COUNT(*) OVER (
            PARTITION BY cr.customer_id 
            ORDER BY cr.period
        ) AS tenure_weeks,
        -- Activity Frequency
        cr.activity_count AS activity_frequency,
        -- Average Engagement
        AVG(cr.activity_count) OVER (
            PARTITION BY cr.customer_id 
            ORDER BY cr.period
        ) AS average_engagement,
        -- Time Since Last Activity
        COALESCE(DATEDIFF(cr.period, cr.previous_period), 0) AS time_since_last_activity_days
    FROM customer_retention cr
)
SELECT
    customer_id,
    period,
    CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
    CAST(activity_frequency AS DOUBLE) AS activity_frequency,
    CAST(average_engagement AS DOUBLE) AS average_engagement,
    CAST(time_since_last_activity_days AS DOUBLE) AS time_since_last_activity_days,
    CAST(retained_next_week AS INT) AS retained_next_week
FROM customer_features
WHERE retained_next_week IS NOT NULL);

The model_evaluate function returns four evaluation metrics for the logistic regression model:

  1. AUC_ROC (Area Under the Receiver Operating Characteristic Curve): AUC-ROC is a measure of the model's ability to distinguish between classes (in this case, "retained" vs. "not retained"). This AUC-ROC score of 0.787 suggests that the model has a reasonably good ability to distinguish between the "retained" and "not retained" classes. An AUC-ROC of 0.5 would mean random guessing, so a score of 0.787 indicates the model is performing better than random but still has room for improvement in separating the two classes.

  2. Accuracy: Accuracy is the proportion of correct predictions out of the total predictions made by the model. With an accuracy of 85.61%, the model is correctly classifying a high percentage of instances overall. However, it’s worth noting that accuracy alone doesn’t capture the balance between false positives and false negatives, which is why precision and recall are also important.

  3. Precision: Precision (or Positive Predictive Value) measures the accuracy of the positive predictions. The precision score of 0.7283 indicates that when the model predicts a customer will be "retained," it is correct 72.83% of the time. This score is helpful in understanding the model’s reliability for positive predictions (i.e., predicting retention). It is calculated as:

\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}

  4. Recall: Recall (or Sensitivity) measures how well the model captures all actual positive instances. A recall score of 0.8627 means that the model correctly identifies 86.27% of all actual retained customers. This high recall indicates that the model is good at capturing most of the retained customers but might be allowing some false positives. It is calculated as:

\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

This model has a decent balance between precision and recall, with a solid AUC-ROC score suggesting effective classification. Fine-tuning (adjusting the various parameters in the options of this model) could further improve the balance between precision and recall, depending on whether it’s more critical to avoid false positives or false negatives in the retention context.
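
One derived metric worth tracking alongside these (not returned by model_evaluate, but computable from its output) is the F1 score, the harmonic mean of precision and recall:

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \cdot \frac{0.7283 \times 0.8627}{0.7283 + 0.8627} \approx 0.79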

Model Prediction

Once the model has been trained and evaluated, it can be used to make predictions on new data in the customer_inference_dataset:

SELECT *
FROM model_predict(Retention_model_logistic_reg, 1,
SELECT
    customer_id,
    CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
    CAST(activity_frequency AS DOUBLE) AS activity_frequency,
    CAST(average_engagement AS DOUBLE) AS average_engagement,
    CAST(time_since_last_activity_days AS DOUBLE) AS time_since_last_activity_days
FROM customer_inference_dataset);

The result is a retention prediction for each customer in the inference dataset.

You can go ahead and materialize the dataset if you like:

CREATE TABLE retention_prediction AS
SELECT *
FROM model_predict(Retention_model_logistic_reg, 1,
SELECT
    customer_id,
    CAST(tenure_weeks AS DOUBLE) AS tenure_weeks,
    CAST(activity_frequency AS DOUBLE) AS activity_frequency,
    CAST(average_engagement AS DOUBLE) AS average_engagement,
    CAST(time_since_last_activity_days AS DOUBLE) AS time_since_last_activity_days
FROM customer_inference_dataset);
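
To close the loop with the case study, you can pull out the customers the model does not expect to retain and target them with retention campaigns. A minimal sketch, assuming the prediction output column is named prediction (verify the actual column names returned by model_predict in your results):

SELECT customer_id
FROM retention_prediction
WHERE prediction = 0;  -- 'prediction' is an assumed column name; customers predicted to churn next period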