STATSML 601: Building a Period-to-Period Customer Retention Model Using Logistic Regression
Unlocking Future Engagement: Data-Driven Retention Predictions for Smarter Personalization Strategies
You need to download these datasets:
To ingest the above datasets, you need to be familiar with this tutorial:
We will be using DBVisualizer for this tutorial, but you can use any SQL client you like:
A major app company aimed to understand and improve user retention by forecasting future retention rates based on historical app usage patterns. By leveraging data-driven insights, the company sought to proactively address churn risks, optimize user engagement strategies, and ultimately increase long-term loyalty. Using SQL and logistic regression, a model was developed to predict which users would likely remain active in the coming period, enabling the company to take targeted actions to retain valuable users.
Customer retention is crucial for sustainable growth, especially in highly competitive markets. By accurately predicting whether customers will remain engaged from one period to the next, you can optimize marketing campaigns, personalize customer experiences, and focus retention efforts where they are most effective. This model provides such insights by examining customer activity patterns and creating retention forecasts for the next period, aligning with the company’s goal of boosting loyalty. By focusing on customers predicted to have a lower probability of retention, the company can take proactive actions—like sending personalized messages, offering discounts, or recommending relevant content—to improve engagement and loyalty.
You will use historical activity data to create a dataset that captures customer activity over weekly periods. Using Data Distiller, you will transform this data to identify active customers for each period and calculate key features:
Current Customers (active in the current period)
Previous Customers (active in the prior period)
Retention Rate (ratio of retained customers to current customers)
You will use these features as inputs for a logistic regression model, which will predict retention as a probabilistic outcome, offering a forecast of whether each customer would stay active in the next period.
Time Span: 3 years (from January 1, 2023, to December 28, 2025)
Frequency: Weekly activity records
Initial Customers: 20,000 customers at the start of the simulation
Total Weeks: 156 weeks (3 years * 52 weeks per year)
Total Records: The dataset contains approximately 2 to 3 million records, depending on retention and churn rates.
The dataset is stored in a CSV file with the following columns:
timestamp: A datetime object representing the start date of the week during which the customer was active.
Format: "YYYY-MM-DD"
Examples: "2023-01-01", "2023-01-08"
customer_id: A unique identifier for each customer.
Format: Original customers: "CUST_<number>"; new customers added: "NEW_CUST_<week_number>_<number>"
Examples: "CUST_1", "NEW_CUST_0_1", "NEW_CUST_52_150"
Imagine you want to predict whether a customer will stay with a company or leave next month. This is a yes-or-no question, so you’re looking for a result that tells you one of two things: "stay" or "leave." This is where logistic regression comes in. It's a statistical method that helps us predict outcomes like this by calculating the probability of each option.
Logistic regression is a technique used to predict binary outcomes — situations with only two possible results, like "yes" or "no," "win" or "lose," "stay" or "leave." Instead of giving you a number or a continuous outcome (like predicting a future sales amount), logistic regression tells you the likelihood (probability) of an outcome.
Here’s where logistic regression is clever. It doesn’t just use a straight line (like you would in simple math or linear regression). Instead, it applies a special mathematical function called the sigmoid function. This function takes any input, positive or negative, and compresses it into a range between 0 and 1. This transformation gives us probabilities, making the predictions easier to interpret as likelihoods.
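For intuition, here is the sigmoid function with a few sample values (rounded):

$$ \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma(-2) \approx 0.12, \quad \sigma(0) = 0.5, \quad \sigma(2) \approx 0.88 $$

Large positive inputs push the probability toward 1, large negative inputs push it toward 0, and an input of 0 maps exactly to 0.5.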
You can frame logistic regression as a numerical (or continuous optimization) problem, but it fundamentally differs from standard numerical regression methods like linear regression because of its underlying assumptions and the type of outcome variable it predicts.
Target Variable:
Logistic regression is used for categorical outcomes (e.g., 0 or 1).
Numerical regression (e.g., linear regression) predicts continuous outcomes (e.g., price, temperature).
Model Assumptions:
Logistic regression assumes the log-odds of the target variable are linearly related to the features.
Linear regression assumes the dependent variable itself is linearly related to the features.
Output:
Logistic regression outputs probabilities that are thresholded to classify outcomes.
Numerical regression outputs continuous predictions directly.
The core mechanism of logistic regression differs significantly from numerical regression: logistic regression optimizes a likelihood function, whereas numerical regression minimizes a squared-error loss such as mean squared error (MSE). The objective function and the predictions work as follows:
Continuous Loss Function: Logistic regression minimizes a loss function called the negative log-likelihood (or equivalently, maximizes the likelihood of the observed data). This is a continuous optimization problem, typically solved using gradient-based methods like gradient descent. The loss function for binary logistic regression is:
$$ L(\beta) = -\sum_{i=1}^{N} \Big[ \, y_i \log \hat{p}_i + (1 - y_i) \log\big(1 - \hat{p}_i\big) \Big], \qquad \hat{p}_i = \sigma(\beta^\top x_i) = \frac{1}{1 + e^{-\beta^\top x_i}} $$

where:
$L(\beta)$: The total negative log-likelihood we aim to minimize.
$N$: The number of observations (samples) in the dataset.
$y_i$: The binary target variable for the $i$-th observation.
$\hat{p}_i$: The predicted probability that $y_i = 1$, based on the sigmoid function.
$x_i$: The feature vector for the $i$-th observation.
$\beta$: The coefficient vector (the weights of the model).
Predicting Probabilities: Logistic regression predicts probabilities using the sigmoid (logistic) function, which maps the linear combination of features $\beta^\top x_i$ to a value between 0 and 1. While the coefficients are fit numerically, the predicted outcomes themselves are probabilities rather than unbounded continuous values.
Active Customers are customers who had some activity in the app during a specific time period (e.g., weekly). This calculation helps us determine the number of unique customers interacting with the app in each period.
activity_periods groups activity by each customer and weekly period.
active_customers then selects the unique customer_ids for each period, representing the customers who were active during that week.
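A minimal sketch of these two CTEs, assuming the ingested table is named customer_activities with the timestamp and customer_id columns described earlier:

```sql
-- Minimal sketch. Timestamps in this dataset are already week-start dates;
-- for raw event data you would truncate with DATE_TRUNC('week', timestamp).
WITH activity_periods AS (
    SELECT
        customer_id,
        timestamp AS period        -- weekly period bucket
    FROM customer_activities
    GROUP BY customer_id, timestamp
),
active_customers AS (
    SELECT DISTINCT customer_id, period
    FROM activity_periods
)
SELECT *
FROM active_customers
ORDER BY period, customer_id;
```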
If you execute this query, you will get:
Active Customers Lag introduces the prior-period data for each customer. Using Data Distiller's LAG function, we access each customer's previous active period, allowing us to see which customers were active in both the current and prior periods.
LAG(period) retrieves the prior active period for each customer, enabling comparisons across consecutive periods.
If previous_period is not null, it indicates the customer was active in both the current and previous periods, meaning they were "retained."
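A minimal sketch of this step, under the same customer_activities assumption:

```sql
-- Sketch: attach each customer's previous active period using LAG
WITH active_customers AS (
    SELECT DISTINCT
        customer_id,
        timestamp AS period   -- already a week-start date in this dataset
    FROM customer_activities
)
SELECT
    customer_id,
    period,
    LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS previous_period
FROM active_customers
ORDER BY customer_id, period;
```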
If you execute this query, you should get:
Customer retention generally indicates that a customer has remained active or engaged with the business over multiple time periods. For our use case, retention is defined more specifically as a customer who is active in both the current period and the previous period. Here’s a closer look at how this can be interpreted and extended:
Period-to-Period Retention (Binary Retention): In our case study, a customer is considered retained if they were active both in the current period and the immediately preceding period. This is calculated by checking if a customer’s last activity occurred both in the current period and the previous one, often using the SQL LAG() function to compare periods.
Alternative Multi-Period Definitions:
Consecutive Period Retention: Some businesses define retention over multiple consecutive periods. For example, a customer might be retained if they were active for three consecutive periods.
Flexible Retention Periods: A customer may also be considered retained if they were active in any two out of the last three periods. This approach adds flexibility and can be beneficial for businesses where customer activity is less frequent (see the sketch after this list).
Cohort-Based Retention: Instead of looking at immediate consecutive periods, retention can be defined within monthly or quarterly cohorts. For instance, you may consider a customer retained if they were active in the current quarter after making at least one activity in the previous quarter, which can be valuable for tracking longer-term engagement.
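As an illustration of the flexible definition, here is a hypothetical sketch that flags a customer as retained if they were active in at least two of the last three weeks; the table name and the anchor date are assumptions, not part of the case study's own queries:

```sql
-- Hypothetical sketch: "active in any 2 of the last 3 weeks" retention.
-- Week numbers are counted from the dataset's start date; timestamps in
-- this dataset are already week-start dates.
WITH active_weeks AS (
    SELECT DISTINCT
        customer_id,
        FLOOR(DATEDIFF(timestamp, DATE '2023-01-01') / 7) AS week_num
    FROM customer_activities
)
SELECT
    customer_id,
    week_num,
    -- rows exist only for active weeks, so counting rows whose week_num
    -- falls within the last 3 weeks counts the customer's active weeks
    CASE WHEN COUNT(*) OVER (
             PARTITION BY customer_id
             ORDER BY week_num
             RANGE BETWEEN 2 PRECEDING AND CURRENT ROW
         ) >= 2
         THEN 1 ELSE 0 END AS retained_flexible
FROM active_weeks;
```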
Retention Rate is typically defined as the ratio of customers retained (those active in consecutive periods) to the total number of customers in the previous period. This provides an overall measure of how well the app is retaining its customer base.
Churn Rate is often defined as the inverse of retention, i.e., customers who were active in the previous period but not in the current one. This allows businesses to identify the rate at which customers stop engaging.
We need to combine the data on active customers and their previous periods to calculate:
Current Customers: Customers active in the current period.
Previous Customers: Customers active in the previous period.
Retained Customers: Customers active in both the current and previous periods.
These calculations yield metrics like churn rate (customers who did not return) and retention rate (customers who stayed from one period to the next):
Churn Rate: Calculated as 1.0 - (current_customers / previous_customers) to capture the percentage of customers lost in the current period.
Retention Rate: Calculated as retained_customers / current_customers to understand how many customers from the previous period were retained.
Note that Retention Rate and Churn Rate are not the same, although they are related metrics. They measure different aspects of customer behavior, and their calculations reflect this.
Churn Rate measures the percentage of customers lost from one period to the next. This metric focuses on the loss of customers. If many customers from the previous period are not active in the current period, the churn rate will be high. A churn rate of 0 indicates no customers were lost, while a churn rate of 1 indicates all customers were lost.
Retention Rate measures the percentage of customers from the previous period who are still active in the current period. This metric focuses on the retention of customers. It shows how well the business is keeping its customers engaged over time. A retention rate of 1 means all customers from the previous period stayed active, while a retention rate closer to 0 means most previous customers were lost.
Suppose:
previous_customers = 100
current_customers = 80
retained_customers = 60 (these are customers who were active in both periods)
Churn Rate = 1 − (80 / 100) = 20%
Retention Rate = 60 / 80 = 75%
In other words, 60 customers remained active from the previous period to the current one. Starting with 100 customers, this number decreased to 80 in the current period, meaning we lost 40 customers but also gained 20 new ones. Churn is the net loss of customers, calculated as the difference between the customers lost (40) and new customers gained (20), which is 20. Expressed as a percentage of the starting count (100), this gives us the churn rate of 20%. This calculation focuses on the overall change in the customer base, rather than specifically measuring customers retained from one period to the next.
current_customers counts customers in each period.
previous_customers counts those who were active in the previous period (customers with a non-null previous_period).
retained_customers counts those who appear in both the current and previous periods, allowing us to compute the retention rate and churn rate.
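Putting these pieces together, a minimal sketch of the aggregation, under the same customer_activities assumption:

```sql
-- Sketch: period-level retention and churn metrics
WITH active_customers AS (
    SELECT DISTINCT
        customer_id,
        timestamp AS period   -- already a week-start date in this dataset
    FROM customer_activities
),
with_lag AS (
    SELECT
        customer_id,
        period,
        LAG(period) OVER (PARTITION BY customer_id ORDER BY period) AS previous_period
    FROM active_customers
),
period_metrics AS (
    SELECT
        period,
        COUNT(customer_id) AS current_customers,
        -- retained = active this week AND in the immediately preceding week
        SUM(CASE WHEN DATEDIFF(period, previous_period) = 7
                 THEN 1 ELSE 0 END) AS retained_customers
    FROM with_lag
    GROUP BY period
)
SELECT
    period,
    current_customers,
    LAG(current_customers) OVER (ORDER BY period) AS previous_customers,
    retained_customers / current_customers AS retention_rate,
    1.0 - current_customers
        / LAG(current_customers) OVER (ORDER BY period) AS churn_rate
FROM period_metrics
ORDER BY period;
```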
Our goal is to develop features at the individual customer level and monitor their metrics on a weekly basis. Designing these features requires some creativity, but with a solid understanding of customer analytics, you should be able to make insightful and effective choices.
Tenure Weeks (tenure_weeks): The total number of weeks a customer has been active up to the current period. It reflects customer loyalty and experience with the service; customers with longer tenures may have established habits or preferences influencing retention.
Activity Frequency (activity_frequency): The number of activities or transactions the customer performed in the current week. It indicates the customer's engagement level during the current period; higher activity may correlate with increased satisfaction and likelihood of retention.
Average Engagement (average_engagement): The average number of activities per week up to the current period. It provides context for the customer's activity frequency and helps identify deviations from typical behavior (e.g., a sudden drop in engagement).
Time Since Last Activity (time_since_last_activity_days): The number of days since the customer's previous activity period. It measures recency of engagement; longer durations may signal disengagement or risk of churn.
Retained Next Week (retained_next_week): A binary label indicating whether the customer was active in the following week (1 for yes, 0 for no). It serves as the target variable for the classification model, helping it learn patterns associated with customer retention.
The main goal of feature engineering in this context is to transform raw activity data into meaningful features that can help predict customer retention. By capturing various aspects of customer behavior, we aim to:
Understand Engagement Patterns: Features like activity_frequency and average_engagement reflect how customers interact with the service over time.
Identify Risk of Churn: Features like time_since_last_activity_days can indicate when a customer is becoming less engaged.
Capture Customer Loyalty: tenure_weeks highlights how long a customer has been with the service, which may influence their retention behavior.
Provide Predictive Insights: These features allow the machine learning model to recognize patterns that precede retention or churn.
All the features are given below in a single query, but you can highlight and execute the SELECT queries within each CTE.
Also observe how the features we computed above are cast into numeric representations: the logistic regression model in Data Distiller can only accept numerical values, so we have to make that explicit.
The features that we will extract are the following:
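tenure_weeks, activity_frequency, average_engagement, time_since_last_activity_days, and the retained_next_week label. A condensed sketch of such a feature-engineering query follows; the table name and CTE structure are assumptions, and the course's full query may organize things differently:

```sql
-- Condensed sketch of the feature-engineering query.
-- Assumes a raw table customer_activities(timestamp, customer_id), possibly
-- with multiple activity rows per customer per week.
WITH active_weeks AS (
    SELECT
        customer_id,
        timestamp AS period,          -- week-start date
        COUNT(*) AS weekly_activity   -- number of activities in this week
    FROM customer_activities
    GROUP BY customer_id, timestamp
)
SELECT
    customer_id,
    period,
    -- number of active weeks so far, a proxy for tenure
    CAST(ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY period) AS DOUBLE)
        AS tenure_weeks,
    CAST(weekly_activity AS DOUBLE) AS activity_frequency,
    -- running average of weekly activity up to the current week
    CAST(AVG(weekly_activity) OVER (PARTITION BY customer_id ORDER BY period
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS DOUBLE)
        AS average_engagement,
    -- days since the previous active week (NULL on a customer's first week)
    CAST(DATEDIFF(period,
         LAG(period) OVER (PARTITION BY customer_id ORDER BY period)) AS DOUBLE)
        AS time_since_last_activity_days,
    -- label: 1 if the customer is active again exactly one week later
    CAST(CASE WHEN DATEDIFF(
                  LEAD(period) OVER (PARTITION BY customer_id ORDER BY period),
                  period) = 7
              THEN 1 ELSE 0 END AS INT)
        AS retained_next_week
FROM active_weeks;
```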
Logistic regression is a supervised machine learning algorithm used for binary or multi-class classification tasks. Its primary goal is to predict the probability of an instance belonging to a specific class, typically modeled using the sigmoid function to map inputs to probabilities.
MAX_ITER:
Maximum number of iterations for the optimization algorithm.
Possible Values: ≥0
Default: 100
REGPARAM:
Regularization parameter to prevent overfitting by penalizing large coefficients.
Possible Values: ≥0
Default: 0.0
ELASTICNETPARAM:
ElasticNet mixing parameter (α):
α = 0: Applies L2 regularization (Ridge).
α = 1: Applies L1 regularization (Lasso).
Possible Values: 0 ≤ α ≤ 1
Default: 0.0
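For reference, an illustrative OPTIONS clause using these parameters might look like the following; the values are examples rather than tuned recommendations, and the exact option spelling should be confirmed against the Data Distiller documentation:

```sql
OPTIONS(
    MODEL_TYPE = 'logistic_reg',     -- selects logistic regression
    LABEL = 'retained_next_week',    -- target column
    MAX_ITER = 100,                  -- optimizer iteration cap
    REGPARAM = 0.1,                  -- regularization strength
    ELASTICNETPARAM = 0.0            -- 0 = pure L2 (Ridge), 1 = pure L1 (Lasso)
)
```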
Logistic regression offers several benefits compared to other classifiers, making it a popular choice for many classification tasks. It is straightforward and highly interpretable, as it provides a clear relationship between input features and the predicted probability through its coefficients, which represent the impact of each feature on the log-odds of the outcome. Unlike some classifiers, such as decision trees, logistic regression outputs probabilities, allowing for flexible thresholding for classification or ranking tasks. It is particularly efficient on small to moderately sized datasets where linear separability is a reasonable assumption. Built-in support for L1 (Lasso) and L2 (Ridge) regularization makes logistic regression robust to overfitting, especially in high-dimensional datasets, while its computational cost is low compared to more complex models like random forests or neural networks. Often used as a baseline model, logistic regression offers simplicity and reasonable performance across a wide range of problems.
Let's take the code we previously created to extract the features and label, and simply add the CREATE MODEL, TRANSFORM, and OPTIONS clauses.
The code below creates a logistic regression model named Retention_model_logistic_reg designed to predict customer retention. Here's a breakdown of each part:
1. CREATE MODEL Retention_model_logistic_reg
This statement creates a new machine learning model named Retention_model_logistic_reg. The model is intended for predicting customer retention, specifically whether a customer will be active in the next week. Naming the model allows you to reference it later for evaluation, prediction, or deployment tasks.
2. TRANSFORM(vector_assembler(array(tenure_weeks, activity_frequency, average_engagement, time_since_last_activity_days)) AS features)
This line defines how the input data is prepared before being fed into the logistic regression model. Here's a breakdown:
vector_assembler: A function that combines multiple columns into a single feature vector. This is essential for machine learning algorithms that require input features in vector form.
array(tenure_weeks, activity_frequency, average_engagement, time_since_last_activity_days): Specifies the columns to be combined into the feature vector. The columns are:
tenure_weeks: Represents the number of weeks the customer has been active up to the current period. Reflects customer loyalty and tenure.
activity_frequency: The number of activities or transactions the customer had in the current period. Indicates the current engagement level.
average_engagement: The average number of activities per period up to the current period. Provides insight into the customer's typical engagement over time.
time_since_last_activity_days: The number of days since the customer's last activity period. Measures recency of engagement, which can be a predictor of churn.
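Assembled, a sketch of the full statement could look like this, where customer_features stands in for the feature-engineering query built earlier (the name is an assumption):

```sql
-- Sketch of the full CREATE MODEL statement
CREATE MODEL Retention_model_logistic_reg
TRANSFORM(vector_assembler(array(tenure_weeks,
                                 activity_frequency,
                                 average_engagement,
                                 time_since_last_activity_days)) AS features)
OPTIONS(MODEL_TYPE = 'logistic_reg',
        LABEL = 'retained_next_week',
        MAX_ITER = 100)
AS SELECT tenure_weeks,
          activity_frequency,
          average_engagement,
          time_since_last_activity_days,
          retained_next_week
   FROM customer_features;   -- assumed name for the feature query's output
```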
If you execute the query, you will get the following:
Let's run the model evaluation on the test_customer_activities dataset (make sure you have uploaded this dataset onto AEP).
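A sketch of the evaluation call, following Data Distiller's model_evaluate(model, version, query) pattern and assuming the test dataset already contains the engineered feature columns and the label:

```sql
-- Evaluate version 1 of the model against the test dataset
SELECT *
FROM model_evaluate(
    Retention_model_logistic_reg,
    1,
    SELECT tenure_weeks,
           activity_frequency,
           average_engagement,
           time_since_last_activity_days,
           retained_next_week
    FROM test_customer_activities
);
```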
The model_evaluate function returns four evaluation metrics for the logistic regression model:
AUC_ROC (Area Under the Receiver Operating Characteristic Curve): AUC-ROC is a measure of the model's ability to distinguish between classes (in this case, "retained" vs. "not retained"). This AUC-ROC score of 0.787 suggests that the model has a reasonably good ability to distinguish between the "retained" and "not retained" classes. An AUC-ROC of 0.5 would mean random guessing, so a score of 0.787 indicates the model is performing better than random but still has room for improvement in separating the two classes.
Accuracy: Accuracy is the proportion of correct predictions out of the total predictions made by the model. With an accuracy of 85.61%, the model is correctly classifying a high percentage of instances overall. However, it’s worth noting that accuracy alone doesn’t capture the balance between false positives and false negatives, which is why precision and recall are also important.
Precision: Precision (or Positive Predictive Value) measures the accuracy of the positive predictions. The precision score of 0.7283 indicates that when the model predicts a customer will be "retained," it is correct 72.83% of the time. This score is helpful in understanding the model’s reliability for positive predictions (i.e., predicting retention). It is calculated as Precision = TP / (TP + FP), where TP is the count of true positives and FP the count of false positives.
Recall: Recall (or Sensitivity) measures how well the model captures all actual positive instances. A recall score of 0.8627 means that the model correctly identifies 86.27% of all actual retained customers. This high recall indicates that the model is good at capturing most of the retained customers but might be allowing some false positives. It is calculated as Recall = TP / (TP + FN), where FN is the count of false negatives.
This model has a decent balance between precision and recall, with a solid AUC-ROC score suggesting effective classification. Fine-tuning (adjusting the various parameters in the options of this model) could further improve the balance between precision and recall, depending on whether it’s more critical to avoid false positives or false negatives in the retention context.
Once the model has been trained and evaluated, it can be used to make predictions on new data in customer_inference_dataset.
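A sketch of the prediction call, mirroring the evaluation pattern and assuming the engineered feature columns are present in the inference dataset:

```sql
-- Score new customers with version 1 of the model
SELECT *
FROM model_predict(
    Retention_model_logistic_reg,
    1,
    SELECT customer_id,
           tenure_weeks,
           activity_frequency,
           average_engagement,
           time_since_last_activity_days
    FROM customer_inference_dataset
);
```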
The results would be:
You can go ahead and materialize the dataset if you like:
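For example, with a CREATE TABLE ... AS over the prediction query (the output table name retention_predictions is an assumption):

```sql
-- Persist the predictions as a new dataset
CREATE TABLE retention_predictions AS
SELECT *
FROM model_predict(
    Retention_model_logistic_reg,
    1,
    SELECT customer_id,
           tenure_weeks,
           activity_frequency,
           average_engagement,
           time_since_last_activity_days
    FROM customer_inference_dataset
);
```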