STATSML 604: Car Loan Propensity Prediction using Logistic Regression

Overview

Predicting whether a customer is likely to take a car loan can significantly improve a bank’s ability to design targeted campaigns, manage credit risk, and optimize resource allocation.

Stage 1: Awareness

The customer realizes a need — perhaps their current car is unreliable, or their lifestyle has changed (family, job move, etc.).
Signals:
- Search behavior (Google/search logs)
- Browsing car loan info on bank websites
- Interactions with car dealerships or vehicle-related services

Stage 2: Research

They begin comparing options, calculating EMIs, and checking eligibility.
Signals:
- Clicks on car loan calculators
- App logins increase
- Engaging with bank agents or chatbot loan FAQs
- Increasing balance inquiries

Stage 3: Financial Readiness

Evaluating “Can I afford it?”
Signals:
- Growth in monthly net income
- Stable or rising credit score
- High cash reserves (in savings + checking)
- Existing auto loans closed or paid off

Stage 4: Application Intent

They are ready to apply.
Signals:
- Clicking on “Apply Now” or loan inquiry forms
- Uploading documents to the portal

In this tutorial, we build a simple logistic regression model to classify a customer's car loan propensity focusing on profile and bank transaction behavior.

Use Cases

Targeted Campaigns: Focus offers on high-propensity segments
Loan Eligibility Filtering: Pre-qualify candidates automatically
Customer Risk Profiling: Understand financial behavior deep.

Prerequisites

PREP 400: DBVisualizer SQL Editor Setup for Data Distiller STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models

Goals

Build and deploy a classification model that predicts if a customer is likely to opt for a car loan, based on:

Demographic and behavioral data
Financial account balances
Derived income and loan features

Sample Dataset

Fields & Description

Field Name

Data Type

Description

customer_id

STRING

Unique identifier

age

INT

Customer age

gender

STRING

Male / Female / Other

marital_status

STRING

Married / Single / Divorced

employment_status

STRING

Employed / Self-employed / Retired etc.

annual_income

DECIMAL

Yearly income

credit_score

INT

Credit bureau score

checking_balance

DECIMAL

Balance in checking account

savings_balance

DECIMAL

Balance in savings account

monthly_debit_volume

DECIMAL

Monthly average spending

monthly_credit_volume

DECIMAL

Monthly average income

loan_history

ARRAY<STRUCT>

Past loans (type, amount, status)

existing_auto_loan

BOOLEAN

Existing car loan

owns_vehicle

BOOLEAN

Whether customer owns a car

propensity_car_loan

FLOAT

Target: Likelihood (0-1) of taking a loan

If you were to go and explore the sample data, it should look like this:

Derived Features

These features are engineered to improve model performance:

Feature

Formula / Logic

savings_to_income_ratio

savings_balance / annual_income

debt_to_income_ratio

(monthly_debit_volume * 12) / annual_income

avg_monthly_net_income

monthly_credit_volume - monthly_debit_volume

loan_count

COUNT(loan_history)

previous_auto_loans

COUNT WHERE loan_type = 'Auto'

good_credit_flag

credit_score >= 700

high_cash_reserve_flag

checking_balance + savings_balance > 10000

Select
customer_id,
cast(age as int) age,
gender,
marital_status,
employment_status,
cast(annual_income as decimal(18,2)) annual_income,
cast(credit_score as Int) credit_score,
cast(checking_balance as decimal(18,2))checking_balance,
cast(savings_balance as decimal(18,2)) savings_balance,
cast(monthly_debit_volume as decimal(18,2)) monthly_debit_volume,
cast(monthly_credit_volume as decimal(18,2)) monthly_credit_volume,
--Derived Features
ROUND(savings_balance / annual_income, 3) AS savings_to_income_ratio,
ROUND((monthly_debit_volume * 12) / annual_income, 3) AS debt_to_income_ratio,
(monthly_credit_volume - monthly_debit_volume) AS avg_monthly_net_income,
CASE WHEN credit_score >= 700 THEN TRUE ELSE FALSE END AS good_credit_flag,
CASE WHEN (checking_balance + savings_balance) > 10000 THEN TRUE ELSE FALSE END AS high_cash_reserve_flag
from customer_bank_data;

This should look like the following:

Auto-Labeling of Training Data

Feature

Weight

Explanation

Not having an auto loan

0.30

More likely to consider buying

Doesn’t own a vehicle

0.20

May need a car, hence loan

Good credit score

0.20

More eligible for credit

Income > $60K

0.10

Likely to get approved

Checking balance > $2K

0.10

Has funds for down payment

Net income > $2K

0.10

Better repayment capacity

Model Definition

DROP MODEL IF EXISTS customer_propensity_using_LogisticRegression;

CREATE MODEL customer_propensity_using_LogisticRegression
TRANSFORM (
  vector_assembler(array(
     age,
     annual_income,
     credit_score,
     checking_balance,
     savings_balance,
     monthly_debit_volume,
     monthly_credit_volume,
     existing_auto_loan,
     owns_vehicle,
     savings_to_income_ratio,
     debt_to_income_ratio,
     avg_monthly_net_income,
     good_credit_flag,
     high_cash_reserve_flag
  )) features
)
OPTIONS (
  MODEL_TYPE = 'logistic_reg',
  LABEL = 'propensity'
)
AS
SELECT
  *,
  CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
FROM vw_customer_bank_profile_data_train
ORDER BY RANDOM()
LIMIT 50000;

Model Evaluation

SELECT * FROM model_evaluate(
  customer_propensity_using_LogisticRegression, 
  1, 
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_train
  )
);

Results

The query should return the folllowing:

Metric

Value

AUC ROC

0.9362

Accuracy

0.9361

Precision

0.9367

Recall

0.9372

Predict on New Customers

SELECT * FROM model_predict(
  customer_propensity_using_LogisticRegression, 
  1,
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_predict
  )
);

The query looks like the foilowing:

Last updated 2 months ago