Adobe Data Distiller Guide
  • Adobe Data Distiller Guide
  • What is Data Distiller?
  • UNIT 1: GETTING STARTED
    • PREP 100: Why was Data Distiller Built?
    • PREP 200: Data Distiller Use Case & Capability Matrix Guide
    • PREP 300: Adobe Experience Platform & Data Distiller Primers
    • PREP 301: Leveraging Data Loops for Real-Time Personalization
    • PREP 302: Key Topics Overview: Architecture, MDM, Personas
    • PREP 303: What is Data Distiller Business Intelligence?
    • PREP 304: The Human Element in Customer Experience Management
    • PREP 305: Driving Transformation in Customer Experience: Leadership Lessons Inspired by Lee Iacocca
    • PREP 400: DBVisualizer SQL Editor Setup for Data Distiller
  • PREP 500: Ingesting CSV Data into Adobe Experience Platform
  • PREP 501: Ingesting JSON Test Data into Adobe Experience Platform
  • PREP 600: Rules vs. AI with Data Distiller: When to Apply, When to Rely, Let ROI Decide
  • Prep 601: Breaking Down B2B Data Silos: Transform Marketing, Sales & Customer Success into a Revenue
  • Unit 2: DATA DISTILLER DATA EXPLORATION
    • EXPLORE 100: Data Lake Overview
    • EXPLORE 101: Exploring Ingested Batches in a Dataset with Data Distiller
    • EXPLORE 200: Exploring Behavioral Data with Data Distiller - A Case Study with Adobe Analytics Data
    • EXPLORE 201: Exploring Web Analytics Data with Data Distiller
    • EXPLORE 202: Exploring Product Analytics with Data Distiller
    • EXPLORE 300: Exploring Adobe Journey Optimizer System Datasets with Data Distiller
    • EXPLORE 400: Exploring Offer Decisioning Datasets with Data Distiller
    • EXPLORE 500: Incremental Data Extraction with Data Distiller Cursors
  • UNIT 3: DATA DISTILLER ETL (EXTRACT, TRANSFORM, LOAD)
    • ETL 200: Chaining of Data Distiller Jobs
    • ETL 300: Incremental Processing Using Checkpoint Tables in Data Distiller
    • [DRAFT]ETL 400: Attribute-Level Change Detection in Profile Snapshot Data
  • UNIT 4: DATA DISTILLER DATA ENRICHMENT
    • ENRICH 100: Real-Time Customer Profile Overview
    • ENRICH 101: Behavior-Based Personalization with Data Distiller: A Movie Genre Case Study
    • ENRICH 200: Decile-Based Audiences with Data Distiller
    • ENRICH 300: Recency, Frequency, Monetary (RFM) Modeling for Personalization with Data Distiller
    • ENRICH 400: Net Promoter Scores (NPS) for Enhanced Customer Satisfaction with Data Distiller
  • Unit 5: DATA DISTILLER IDENTITY RESOLUTION
    • IDR 100: Identity Graph Overview
    • IDR 200: Extracting Identity Graph from Profile Attribute Snapshot Data with Data Distiller
    • IDR 300: Understanding and Mitigating Profile Collapse in Identity Resolution with Data Distiller
    • IDR 301: Using Levenshtein Distance for Fuzzy Matching in Identity Resolution with Data Distiller
    • IDR 302: Algorithmic Approaches to B2B Contacts - Unifying and Standardizing Across Sales Orgs
  • Unit 6: DATA DISTILLER AUDIENCES
    • DDA 100: Audiences Overview
    • DDA 200: Build Data Distiller Audiences on Data Lake Using SQL
    • DDA 300: Audience Overlaps with Data Distiller
  • Unit 7: DATA DISTILLER BUSINESS INTELLIGENCE
    • BI 100: Data Distiller Business Intelligence: A Complete Feature Overview
    • BI 200: Create Your First Data Model in the Data Distiller Warehouse for Dashboarding
    • BI 300: Dashboard Authoring with Data Distiller Query Pro Mode
    • BI 400: Subscription Analytics for Growth-Focused Products using Data Distiller
    • BI 500: Optimizing Omnichannel Marketing Spend Using Marginal Return Analysis
  • Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING
    • STATSML 100: Python & JupyterLab Setup for Data Distiller
    • STATSML 101: Learn Basic Python Online
    • STATSML 200: Unlock Dataset Metadata Insights via Adobe Experience Platform APIs and Python
    • STATSML 201: Securing Data Distiller Access with Robust IP Whitelisting
    • STATSML 300: AI & Machine Learning: Basic Concepts for Data Distiller Users
    • STATSML 301: A Concept Course on Language Models
    • STATSML 302: A Concept Course on Feature Engineering Techniques for Machine Learning
    • STATSML 400: Data Distiller Basic Statistics Functions
    • STATSML 500: Generative SQL with Microsoft GitHub Copilot, Visual Studio Code and Data Distiller
    • STATSML 600: Data Distiller Advanced Statistics & Machine Learning Models
    • STATSML 601: Building a Period-to-Period Customer Retention Model Using Logistics Regression
    • STATSML 602: Techniques for Bot Detection in Data Distiller
    • STATSML 603: Predicting Customer Conversion Scores Using Random Forest in Data Distiller
    • STATSML 604: Car Loan Propensity Prediction using Logistic Regression
    • STATSML 700: Sentiment-Aware Product Review Search with Retrieval Augmented Generation (RAG)
    • STATSML 800: Turbocharging Insights with Data Distiller: A Hypercube Approach to Big Data Analytics
  • UNIT 9: DATA DISTILLER ACTIVATION & DATA EXPORT
    • ACT 100: Dataset Activation with Data Distiller
    • ACT 200: Dataset Activation: Anonymization, Masking & Differential Privacy Techniques
    • ACT 300: Functions and Techniques for Handling Sensitive Data with Data Distiller
    • ACT 400: AES Data Encryption & Decryption with Data Distiller
  • UNIT 9: DATA DISTILLER FUNCTIONS & EXTENSIONS
    • FUNC 300: Privacy Functions in Data Distiller
    • FUNC 400: Statistics Functions in Data Distiller
    • FUNC 500: Lambda Functions in Data Distiller: Exploring Similarity Joins
    • FUNC 600: Advanced Statistics & Machine Learning Functions
  • About the Authors
Powered by GitBook
On this page
  • Scenario
  • Prerequisites
  • Explore the Dataset
  • Count and Filter out NULL Identity Records
  • Identify if Duplicate Identity Records Exist
  • Movie Genre Popularity by State
  • Email List for State by Movie Genre Targeting
  1. UNIT 4: DATA DISTILLER DATA ENRICHMENT

ENRICH 101: Behavior-Based Personalization with Data Distiller: A Movie Genre Case Study

Here's a basic tutorial that displays the essential components of filtering, shaping, and data manipulation with Data Distiller.

Last updated 9 months ago

Scenario

The story starts with a US company called GitFlix, a new startup, that has been able to identify its list of users and their favorite movie genres. As a GitFlix marketer, your goal is to figure out the top genres that are popular by State and for each such combination, create a list of emails to run a campaign against.

One of the key learnings I want you to take away from this tutorial is that more than any tool or any concept such as segmentation or targeting, your understanding of data is key to unlocking value. Audiences are fluid because trends are everchanging. How you track the world and its tastes is through data. How that data is collected, managed, curated, and deployed responsibly is the ultimate act of providing great customer experience and service.

Prerequisites

Download & Setup DBVisualizer by Follow the instructions here:

Download the following file locally to your machine.

You need to also ingest CSV Files into Adobe Experience Platform by following the instructions here:

Explore the Dataset

  1. Let us write the simplest query to understand what

select * from movie_data;sql
  1. Let us count the number of records on the dataset. _id is a key that is unique and non-repeating that can be used to count the number of records. You should get 1000 in the result.

select count(distinct id) from movie_data;
  1. Since email is the primary identifier for the customers in the list, let us now find if the distinct values of the emails match the record number.

select count(distinct email) from movie_data;

The result you should get is 976. This means a couple of things:

  1. There are records that have emails as NULLs that need to be removed as they simply cannot be targeted. Note that COUNT with DISTINCT clause will not count all the NULLs as one unique value. This can happen if there were data quality issues upstream or the fact that such a record was created without requiring an email address at some point in time. We do not really know the cause of that issue.

  2. There are records that have the same email associated with them. This could happen if we allow our system to register multiple users on the same email address. If that is so, we could simply aggregate the movie genre information across these records i.e. give them all equal weight.

  3. There is another way to extract the same information using a relatively new feature in Data Distiller:

DROP TABLE IF EXISTS movie_stats;
ANALYZE TABLE movie_data COMPUTE STATISTICS as movie_stats;
SELECT * FROM movie_stats;

The results look like this:

Note that movie_stats is a TEMP table that is generated for the session per user. If you DROP this temp table in DBVisualizer, you have to reconnect to fetch the metadata from Data Distiller that this table has indeed been dropped. If you do not refresh, you will get an error that "movie_stats" exists. This limitation does not exist with the Data Distiller UI.

Most of the mathematical statistics do not show up as the datatype is of string type. But take a look at the approximate uniques. It gives you a sense of the cardinality of the various dimensions. The nullCount of24 for email shows that there are 24 records that do not have this ID. As an exercise, I still do this manually writing SQL below but just be aware that this approach also exists. And if you are wondering why I had to write two commands to get the statistics, this is because Data Distiller conforms to the PostgresSQL syntax.

Note that PostgreSQL is compliant with ANSI SQL standards. It is compatible with ANSI-SQL2008 and supports most of the major features of SQL:2016. However, the syntax accepted by PostgreSQL is slightly different from commercial engines. SQL is a popular relational database language that was first standardized in 1986 by the American National Standards Institute (ANSI). In 1987, the International Organization for Standardization (ISO) adopted SQL as an international standard.

Warning: The statistics feature is not yet supported on Accelerated Store tables. It is supported only on datasets/tables on Data Lake.

Count and Filter out NULL Identity Records

  1. Let us count the number of records that have the email field as NULL

select count(COALESCE(email, 'unknown'))-count(distinct email) AS number_null_values from movie_data;

COALESCE takes all the records that have email values as NULL and converts them into the string specified i.e. "unknown". COUNT on this coalesced field will count duplicate instances of non-null values in the system i.e. 1000 records. This number needs to be subtracted from the unique non-null values which will equal 24.

  1. To filter out the records with email values as NULL, we have:

select * from movie_data WHERE email!= '';

Identify if Duplicate Identity Records Exist

  1. Let us count the number of records that have a non-NULL email field but have duplicate emails

select COUNT (DISTINCT id)-COUNT(DISTINCT email) AS Duplicate_Values from (select * from movie_data WHERE email!= '');

First, we filter the dataset of all the NULLs and then we run COUNT DISTINCT on the id and the email fields to see if they are in line. The answer you should get here is 0 meaning that they are indeed unique.

Movie Genre Popularity by State

  1. We first group by State and movie genres without splitting the movie genres apart

select State, movie_genres, COUNT(DISTINCT email) AS CUSTOMER_COUNT from movie_data 
WHERE email!= ''
GROUP BY State, movie_genres
ORDER BY CUSTOMER_COUNT DESC

The results should look like this:

  1. We still have results such as Comedy|Drama that are counted separately from Comedy and Drama. We need to be able to add customers that have these joint movie genres to the audiences by state and movie genre. For that, I need to be able to use a regular expression function to turn the movie_genres field into an array and then use the EXPLODE command to make a row per every genre value.

First, we will split at the pipe separator and then explode the strings:

SELECT State, email, explode(split(movie_genres, '\\|', -1)) AS movie_genres from movie_data 
WHERE email!= '';

The results look like this:

  1. Remember, that we are giving equal credit to a customer for every genre that they are associated with. With that assumption, let us do a count by state for all the genres and we should see that the numbers are accurate for state and movie genre.

SELECT State, movie_genres, COUNT(email) as CUSTOMER_COUNT FROM (SELECT State, email, explode(split(movie_genres, '\\|', -1)) AS movie_genres from movie_data 
WHERE email!= '')
GROUP BY State, movie_genres
ORDER BY CUSTOMER_COUNT DESC

The results look like this:

Email List for State by Movie Genre Targeting

  1. Let us create an array of emails for each of these combinations:

SELECT State, movie_genres, COUNT(email) AS CUSTOMER_COUNT, array_agg(email) as email_list FROM (SELECT State, email, explode(split(movie_genres, '\\|', -1)) AS movie_genres from movie_data 
WHERE email!= '')
GROUP BY State, movie_genres
ORDER BY CUSTOMER_COUNT DESC

The results look like this:

  1. Since the campaigns have to be run by State and by movie genre, we need the resort to this by State column

SELECT State, movie_genres, COUNT(email) AS CUSTOMER_COUNT, array_agg(email) as email_list FROM (SELECT State, email, explode(split(movie_genres, '\\|', -1)) AS movie_genres from movie_data 
WHERE email!= '')
GROUP BY State, movie_genres
ORDER BY State, CUSTOMER_COUNT DESC
PREP 400: DBVisualizer SQL Editor Setup for Data Distiller
PREP 500: Ingesting CSV Data into Adobe Experience Platform
72KB
Movie_data.csv
Statistics computation on numerical columns.
Page cover image