STATSML 604: Car Loan Propensity Prediction using Logistic Regression
Overview
Predicting whether a customer is likely to take a car loan can significantly improve a bank’s ability to design targeted campaigns, manage credit risk, and optimize resource allocation.
Stage 1: Awareness
The customer realizes a need — perhaps their current car is unreliable, or their lifestyle has changed (family, job move, etc.).
Signals:
Search behavior (Google/search logs)
Browsing car loan info on bank websites
Interactions with car dealerships or vehicle-related services
Stage 2: Research
They begin comparing options, calculating EMIs, and checking eligibility.
Signals:
Clicks on car loan calculators
App logins increase
Engaging with bank agents or chatbot loan FAQs
Increasing balance inquiries
Stage 3: Financial Readiness
Evaluating “Can I afford it?”
Signals:
Growth in monthly net income
Stable or rising credit score
High cash reserves (in savings + checking)
Existing auto loans closed or paid off
Stage 4: Application Intent
They are ready to apply.
Signals:
Clicking on “Apply Now” or loan inquiry forms
Uploading documents to the portal
In this tutorial, we build a simple logistic regression model to classify a customer's car loan propensity focusing on profile and bank transaction behavior.
Use Cases
Targeted Campaigns: Focus offers on high-propensity segments
Loan Eligibility Filtering: Pre-qualify candidates automatically
Customer Risk Profiling: Understand financial behavior deep.
Prerequisites
PREP 400: DBVisualizer SQL Editor Setup for Data DistillerSTATSML 600: Data Distiller Advanced Statistics & Machine Learning ModelsGoals
Build and deploy a classification model that predicts if a customer is likely to opt for a car loan, based on:
Demographic and behavioral data
Financial account balances
Derived income and loan features
Sample Dataset
Fields & Description
customer_id
STRING
Unique identifier
age
INT
Customer age
gender
STRING
Male / Female / Other
marital_status
STRING
Married / Single / Divorced
employment_status
STRING
Employed / Self-employed / Retired etc.
annual_income
DECIMAL
Yearly income
credit_score
INT
Credit bureau score
checking_balance
DECIMAL
Balance in checking account
savings_balance
DECIMAL
Balance in savings account
monthly_debit_volume
DECIMAL
Monthly average spending
monthly_credit_volume
DECIMAL
Monthly average income
loan_history
ARRAY<STRUCT>
Past loans (type, amount, status)
existing_auto_loan
BOOLEAN
Existing car loan
owns_vehicle
BOOLEAN
Whether customer owns a car
propensity_car_loan
FLOAT
Target: Likelihood (0-1) of taking a loan
If you were to go and explore the sample data, it should look like this:

Derived Features
These features are engineered to improve model performance:
savings_to_income_ratio
savings_balance / annual_income
debt_to_income_ratio
(monthly_debit_volume * 12) / annual_income
avg_monthly_net_income
monthly_credit_volume - monthly_debit_volume
loan_count
COUNT(loan_history)
previous_auto_loans
COUNT WHERE loan_type = 'Auto'
good_credit_flag
credit_score >= 700
high_cash_reserve_flag
checking_balance + savings_balance > 10000
Select
customer_id,
cast(age as int) age,
gender,
marital_status,
employment_status,
cast(annual_income as decimal(18,2)) annual_income,
cast(credit_score as Int) credit_score,
cast(checking_balance as decimal(18,2))checking_balance,
cast(savings_balance as decimal(18,2)) savings_balance,
cast(monthly_debit_volume as decimal(18,2)) monthly_debit_volume,
cast(monthly_credit_volume as decimal(18,2)) monthly_credit_volume,
--Derived Features
ROUND(savings_balance / annual_income, 3) AS savings_to_income_ratio,
ROUND((monthly_debit_volume * 12) / annual_income, 3) AS debt_to_income_ratio,
(monthly_credit_volume - monthly_debit_volume) AS avg_monthly_net_income,
CASE WHEN credit_score >= 700 THEN TRUE ELSE FALSE END AS good_credit_flag,
CASE WHEN (checking_balance + savings_balance) > 10000 THEN TRUE ELSE FALSE END AS high_cash_reserve_flag
from customer_bank_data;
This should look like the following:

Auto-Labeling of Training Data
Not having an auto loan
0.30
More likely to consider buying
Doesn’t own a vehicle
0.20
May need a car, hence loan
Good credit score
0.20
More eligible for credit
Income > $60K
0.10
Likely to get approved
Checking balance > $2K
0.10
Has funds for down payment
Net income > $2K
0.10
Better repayment capacity

Model Definition
DROP MODEL IF EXISTS customer_propensity_using_LogisticRegression;
CREATE MODEL customer_propensity_using_LogisticRegression
TRANSFORM (
vector_assembler(array(
age,
annual_income,
credit_score,
checking_balance,
savings_balance,
monthly_debit_volume,
monthly_credit_volume,
existing_auto_loan,
owns_vehicle,
savings_to_income_ratio,
debt_to_income_ratio,
avg_monthly_net_income,
good_credit_flag,
high_cash_reserve_flag
)) features
)
OPTIONS (
MODEL_TYPE = 'logistic_reg',
LABEL = 'propensity'
)
AS
SELECT
*,
CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
FROM vw_customer_bank_profile_data_train
ORDER BY RANDOM()
LIMIT 50000;
Model Evaluation
SELECT * FROM model_evaluate(
customer_propensity_using_LogisticRegression,
1,
(
SELECT
*,
CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
FROM vw_customer_bank_profile_data_train
)
);
Results
The query should return the folllowing:

AUC ROC
0.9362
Accuracy
0.9361
Precision
0.9367
Recall
0.9372
Predict on New Customers
SELECT * FROM model_predict(
customer_propensity_using_LogisticRegression,
1,
(
SELECT
*,
CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
FROM vw_customer_bank_profile_data_predict
)
);
The query looks like the foilowing:

Last updated