Adobe Data Distiller Guide

STATSML 604: Car Loan Propensity Prediction using Logistic Regression

Overview

Predicting whether a customer is likely to take out a car loan helps a bank design targeted campaigns, manage credit risk, and allocate resources. The customer's path to a loan can be viewed as four stages, each with its own observable signals.

Stage 1: Awareness

  • The customer realizes a need — perhaps their current car is unreliable, or their lifestyle has changed (family, job move, etc.).

  • Signals:

    • Search behavior (Google/search logs)

    • Browsing car loan info on bank websites

    • Interactions with car dealerships or vehicle-related services

Stage 2: Research

  • They begin comparing options, calculating EMIs, and checking eligibility.

  • Signals:

    • Clicks on car loan calculators

    • App logins increase

    • Engaging with bank agents or chatbot loan FAQs

    • Increasing balance inquiries

Stage 3: Financial Readiness

  • Evaluating “Can I afford it?”

  • Signals:

    • Growth in monthly net income

    • Stable or rising credit score

    • High cash reserves (in savings + checking)

    • Existing auto loans closed or paid off

Stage 4: Application Intent

  • They are ready to apply.

  • Signals:

    • Clicking on “Apply Now” or loan inquiry forms

    • Uploading documents to the portal

In this tutorial, we build a simple logistic regression model that classifies a customer's car loan propensity from profile data and bank transaction behavior.

Use Cases

  • Targeted Campaigns: Focus offers on high-propensity segments

  • Loan Eligibility Filtering: Pre-qualify candidates automatically

  • Customer Risk Profiling: Understand financial behavior in depth

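Prerequisites

  • PREP 400: DBVisualizer SQL Editor Setup for Data Distiller

  • STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models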

Goals

Build and deploy a classification model that predicts if a customer is likely to opt for a car loan, based on:

  • Demographic and behavioral data

  • Financial account balances

  • Derived income and loan features

Sample Dataset

Fields & Description

| Field Name            | Data Type     | Description                               |
|-----------------------|---------------|-------------------------------------------|
| customer_id           | STRING        | Unique identifier                         |
| age                   | INT           | Customer age                              |
| gender                | STRING        | Male / Female / Other                     |
| marital_status        | STRING        | Married / Single / Divorced               |
| employment_status     | STRING        | Employed / Self-employed / Retired etc.   |
| annual_income         | DECIMAL       | Yearly income                             |
| credit_score          | INT           | Credit bureau score                       |
| checking_balance      | DECIMAL       | Balance in checking account               |
| savings_balance       | DECIMAL       | Balance in savings account                |
| monthly_debit_volume  | DECIMAL       | Monthly average spending                  |
| monthly_credit_volume | DECIMAL       | Monthly average income                    |
| loan_history          | ARRAY<STRUCT> | Past loans (type, amount, status)         |
| existing_auto_loan    | BOOLEAN       | Existing car loan                         |
| owns_vehicle          | BOOLEAN       | Whether customer owns a car               |
| propensity_car_loan   | FLOAT         | Target: Likelihood (0-1) of taking a loan |

If you explore the sample data, it should look like this:

Derived Features

These features are engineered to improve model performance:

| Feature                 | Formula / Logic                              |
|-------------------------|----------------------------------------------|
| savings_to_income_ratio | savings_balance / annual_income              |
| debt_to_income_ratio    | (monthly_debit_volume * 12) / annual_income  |
| avg_monthly_net_income  | monthly_credit_volume - monthly_debit_volume |
| loan_count              | COUNT(loan_history)                          |
| previous_auto_loans     | COUNT WHERE loan_type = 'Auto'               |
| good_credit_flag        | credit_score >= 700                          |
| high_cash_reserve_flag  | checking_balance + savings_balance > 10000   |
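
To make the formulas in the table concrete, here is a minimal Python sketch that applies them to a single customer record. It is an illustration only, not part of the Data Distiller workflow: the function name `derive_features`, the sample values, and the `type` key inside each `loan_history` entry are assumptions made for this example.

```python
def derive_features(customer):
    """Apply the derived-feature formulas from the table to one customer dict."""
    income = customer["annual_income"]
    loans = customer.get("loan_history", [])
    return {
        # Ratios are left as None when income is zero to avoid division by zero.
        "savings_to_income_ratio": round(customer["savings_balance"] / income, 3) if income else None,
        "debt_to_income_ratio": round(customer["monthly_debit_volume"] * 12 / income, 3) if income else None,
        "avg_monthly_net_income": customer["monthly_credit_volume"] - customer["monthly_debit_volume"],
        "loan_count": len(loans),
        # Assumes each loan entry carries its loan type under the key "type".
        "previous_auto_loans": sum(1 for loan in loans if loan["type"] == "Auto"),
        "good_credit_flag": customer["credit_score"] >= 700,
        "high_cash_reserve_flag": customer["checking_balance"] + customer["savings_balance"] > 10000,
    }


# Hypothetical customer used only for illustration.
sample = {
    "annual_income": 80000.0,
    "credit_score": 720,
    "checking_balance": 3000.0,
    "savings_balance": 12000.0,
    "monthly_debit_volume": 2500.0,
    "monthly_credit_volume": 6000.0,
    "loan_history": [
        {"type": "Auto", "amount": 15000.0, "status": "Closed"},
        {"type": "Personal", "amount": 5000.0, "status": "Active"},
    ],
}

features = derive_features(sample)
for name, value in features.items():
    print(f"{name}: {value}")
```

Note that the sketch also computes `loan_count` and `previous_auto_loans`, which are listed in the table but are not part of the SQL query that follows.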

SELECT
  customer_id,
  CAST(age AS INT) AS age,
  gender,
  marital_status,
  employment_status,
  CAST(annual_income AS DECIMAL(18,2)) AS annual_income,
  CAST(credit_score AS INT) AS credit_score,
  CAST(checking_balance AS DECIMAL(18,2)) AS checking_balance,
  CAST(savings_balance AS DECIMAL(18,2)) AS savings_balance,
  CAST(monthly_debit_volume AS DECIMAL(18,2)) AS monthly_debit_volume,
  CAST(monthly_credit_volume AS DECIMAL(18,2)) AS monthly_credit_volume,
  -- Derived features
  ROUND(savings_balance / annual_income, 3) AS savings_to_income_ratio,
  ROUND((monthly_debit_volume * 12) / annual_income, 3) AS debt_to_income_ratio,
  (monthly_credit_volume - monthly_debit_volume) AS avg_monthly_net_income,
  CASE WHEN credit_score >= 700 THEN TRUE ELSE FALSE END AS good_credit_flag,
  CASE WHEN (checking_balance + savings_balance) > 10000 THEN TRUE ELSE FALSE END AS high_cash_reserve_flag
FROM customer_bank_data;

This should look like the following:

Auto-Labeling of Training Data

| Feature                 | Weight | Explanation                    |
|-------------------------|--------|--------------------------------|
| Not having an auto loan | 0.30   | More likely to consider buying |
| Doesn’t own a vehicle   | 0.20   | May need a car, hence loan     |
| Good credit score       | 0.20   | More eligible for credit       |
| Income > $60K           | 0.10   | Likely to get approved         |
| Checking balance > $2K  | 0.10   | Has funds for down payment     |
| Net income > $2K        | 0.10   | Better repayment capacity      |
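
The weights above sum to 1.0, which suggests that the propensity label is a weighted sum of rule-based signals. The following Python sketch shows one way such a score could be computed. It is an assumption about how the weights combine, not the tutorial's actual labeling query: the function names are hypothetical, "Good credit score" is taken to mean `credit_score >= 700` (as in `good_credit_flag`), "Income" is taken to be `annual_income`, and "Net income" is taken to be monthly credits minus monthly debits.

```python
# (weight, rule) pairs taken from the table; the rule definitions are assumptions.
RULES = [
    (0.30, lambda c: not c["existing_auto_loan"]),   # not having an auto loan
    (0.20, lambda c: not c["owns_vehicle"]),          # does not own a vehicle
    (0.20, lambda c: c["credit_score"] >= 700),       # good credit score
    (0.10, lambda c: c["annual_income"] > 60000),     # income > $60K
    (0.10, lambda c: c["checking_balance"] > 2000),   # checking balance > $2K
    (0.10, lambda c: c["monthly_credit_volume"] - c["monthly_debit_volume"] > 2000),  # net income > $2K
]


def auto_label_score(customer):
    """Sum the weights of every rule the customer satisfies (result in 0..1)."""
    return round(sum(weight for weight, rule in RULES if rule(customer)), 2)


def to_label(score, threshold=0.5):
    """Binarize the score the same way the model queries do (> 0.5 becomes 1)."""
    return 1 if score > threshold else 0


# Two hypothetical customers used only for illustration.
likely = {
    "existing_auto_loan": False, "owns_vehicle": False, "credit_score": 720,
    "annual_income": 80000.0, "checking_balance": 3000.0,
    "monthly_credit_volume": 6000.0, "monthly_debit_volume": 2500.0,
}
unlikely = {
    "existing_auto_loan": True, "owns_vehicle": True, "credit_score": 650,
    "annual_income": 50000.0, "checking_balance": 2500.0,
    "monthly_credit_volume": 3000.0, "monthly_debit_volume": 2500.0,
}

for name, customer in (("likely", likely), ("unlikely", unlikely)):
    score = auto_label_score(customer)
    print(name, score, to_label(score))
```

The first customer satisfies every rule and scores 1.0. The second satisfies only the checking-balance rule and scores 0.1, so it falls below the 0.5 threshold used to binarize the label.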

Model Definition

DROP MODEL IF EXISTS customer_propensity_using_LogisticRegression;

CREATE MODEL customer_propensity_using_LogisticRegression
TRANSFORM (
  vector_assembler(array(
     age,
     annual_income,
     credit_score,
     checking_balance,
     savings_balance,
     monthly_debit_volume,
     monthly_credit_volume,
     existing_auto_loan,
     owns_vehicle,
     savings_to_income_ratio,
     debt_to_income_ratio,
     avg_monthly_net_income,
     good_credit_flag,
     high_cash_reserve_flag
  )) features
)
OPTIONS (
  MODEL_TYPE = 'logistic_reg',
  LABEL = 'propensity'
)
AS
SELECT
  *,
  CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
FROM vw_customer_bank_profile_data_train
ORDER BY RANDOM()
LIMIT 50000;
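
For readers new to logistic regression, the sketch below shows the underlying idea in plain Python: a weighted sum of features is passed through a sigmoid to give a probability, and the weights are fitted by gradient descent on binary labels. This is a generic textbook illustration under simplifying assumptions (tiny toy data, no regularization, hypothetical function names). It does not describe how Data Distiller trains the `logistic_reg` model internally.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(weights, bias, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += error * xij
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data (made up): features are [good_credit_flag, owns_vehicle] as 0/1 values.
# In this toy set, customers who do not own a vehicle carry the positive label.
X = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, 1, 0, 0]

w, b = train_logistic(X, y)
print("weights:", w, "bias:", b)
print("no vehicle:", predict_proba(w, b, [1.0, 0.0]))
print("owns vehicle:", predict_proba(w, b, [0.0, 1.0]))
```

In the toy data the fitted weight for `owns_vehicle` comes out negative, so owning a vehicle lowers the predicted probability, which mirrors the direction of the auto-labeling rules.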

Model Evaluation

SELECT * FROM model_evaluate(
  customer_propensity_using_LogisticRegression, 
  1, 
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_train
  )
);
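
The evaluation metrics can be understood with a small Python sketch that computes accuracy, precision, recall, and AUC ROC from true labels and predicted scores. This uses the standard definitions of these metrics with hypothetical function names and made-up data. It is not the implementation behind `model_evaluate`.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def auc_roc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for six customers.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
y_pred = [1 if s > 0.5 else 0 for s in scores]  # same 0.5 cut-off as the label

metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, and recall depend on the 0.5 cut-off, while AUC ROC depends only on how well the scores rank positives above negatives.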

Results

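The query should return the following. The evaluation query reads from the same view the model was trained on (vw_customer_bank_profile_data_train), so these metrics are not a held-out estimate: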

| Metric    | Value  |
|-----------|--------|
| AUC ROC   | 0.9362 |
| Accuracy  | 0.9361 |
| Precision | 0.9367 |
| Recall    | 0.9372 |

Predict on New Customers

SELECT * FROM model_predict(
  customer_propensity_using_LogisticRegression, 
  1,
  (
    SELECT
      *,
      CASE WHEN propensity_car_loan > 0.5 THEN 1 ELSE 0 END AS propensity
    FROM vw_customer_bank_profile_data_predict
  )
);

The query results look like the following:


PREP 400: DBVisualizer SQL Editor Setup for Data Distiller
STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models
Figure captions, in order of appearance (images not included):

  • A simple data exploration query with DB Visualizer

  • Derived features along with the base features

  • Auto-labeling of training data

  • Query results on the evaluation

  • Predictions on the same dataset