
STATSML 101: Learn Basic Python Online

The goal of this module is to teach you enough basic Python to understand the code you will come across in these modules.


Scenario

Remember, the goal of connecting to Data Distiller from Python is to extract a table for analysis. This table is typically a sample stored as a "DataFrame" (a table within Python), and subsequent operations act on this local DataFrame. This is no different from downloading the results of a Data Distiller SQL query as a local file and then reading it into Python. For our training, we will assume that you have extracted this table locally as a CSV file.

Create an Account in Kaggle

Warning: Do not use Kaggle to upload any client or customer data, even if you are only sampling the data to prototype an algorithm. Kaggle is owned by Google and the data is kept in the cloud. Use Kaggle for learning Python with example data. If you want to prototype with customer data, your best option is a local installation of Python with JupyterLab as the frontend UI; but first make sure you know the governance policies of your organization or department.

Upload Test Data

To upload the data, click the "+" on the homepage and choose "New Dataset". Upload the dataset from your local machine and give it a name.

Create a New Notebook

Make sure you name the notebook and also add the dataset you uploaded.

To find the path to this file, click through on the dataset and note the path:

Warning: As you go through this tutorial, you will need to copy and paste the code line by line into a notebook so that the code works correctly. I have intentionally made it this way so that you do not skip key learning points.

What is Pandas

The word "pandas" originated from "panel data", a term used for describing multi-dimensional data that vary over time.

Pandas is perhaps the most important library for our purposes, as it enables data manipulation and analysis in the Python programming language.

The key features that we will be using:

  1. Data Structures: Pandas introduces two main data structures, the Series and the DataFrame. A Series is essentially a one-dimensional array with labeled indices, while a DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or a SQL table.

  2. SQL-Like Operations:

    1. Much like SQL engines, Pandas provides powerful tools for manipulating and transforming data. You can filter, sort, aggregate, and reshape data easily (a sketch follows this list).

    2. It has functions to handle missing data, duplicate values, and data type conversions, making data-cleaning tasks more manageable.

    3. You can combine multiple data sources through operations like merging, joining, and concatenating DataFrames.
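
To make these operations concrete, here is a minimal sketch using two small, made-up DataFrames (the table and column names are hypothetical, not from the tutorial dataset):

import pandas as pd

# Two small, made-up tables
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'region': ['East', 'West', 'East']
})
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 3],
    'amount': [120.0, 80.0, 200.0, 50.0]
})

# Filter, like a SQL WHERE clause
east = customers[customers['region'] == 'East']

# Join, like a SQL INNER JOIN
joined = customers.merge(orders, on='customer_id', how='inner')

# Aggregate, like SUM with GROUP BY
totals = joined.groupby('region')['amount'].sum()
print(totals)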

If you need to use Pandas locally, you'll need to install it first. You can install it using the following command in your Python environment:

pip install pandas

Tip: SQLAlchemy is similar in functionality to Pandas but is geared more toward SQL users with object-centric thinking, where SQL constructs like TABLE are first-class objects. Even if you love SQL, Pandas is important for you to learn.
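
To make the comparison concrete, here is a minimal sketch of reading a SQL query result into a Pandas DataFrame through SQLAlchemy. The SQLite file and the customers table are hypothetical stand-ins; actual Data Distiller connectivity is covered elsewhere in this unit.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical local SQLite database containing a 'customers' table
engine = create_engine('sqlite:///example.db')

# Pandas runs the SQL through the SQLAlchemy engine and
# returns the result set as a DataFrame
df_sql = pd.read_sql('SELECT * FROM customers', engine)
print(df_sql.head())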

Create a DataFrame

Execute the following piece of code, making sure the CSV file path is specified correctly as mentioned earlier:

import pandas as pd

# Create a variable by reading the CSV file
# (use the full Kaggle path to Python101Data.csv that you copied earlier)
data = pd.read_csv('Python101Data.csv')
# Create a DataFrame
df = pd.DataFrame(data)

# Print the full contents of the DataFrame
print(df)

You should see the results as:

Note the following:

  • import pandas: This part indicates that you want to use the functionality provided by the Pandas library in your program.

  • as pd: This part gives an alias to the imported library. By using pd, you're saying that instead of typing out "pandas" every time you want to use a Pandas function, you can use the shorter alias "pd."

  • DataFrame: As mentioned earlier, this is a class provided by the Pandas library that represents a tabular data structure, similar to a table in a database or a spreadsheet.

  • data: This is a variable that holds the data used to create the DataFrame. When you build a DataFrame by hand, it is usually a dictionary whose keys are the column names and whose values are the data for each column, such as Name, Age, and Gender (a minimal sketch follows this list). Here, data is the result of read_csv, so the column names come from the CSV header.

  • print(df) will display the entire DataFrame by default if it is not too large. However, if the DataFrame is large, Pandas will display a summarized view, showing only a subset of rows and columns with an ellipsis (...) to indicate that there is more data not shown.
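
As a minimal sketch of the dictionary form described above, here is a DataFrame built by hand with hypothetical Name, Age, and Gender columns rather than read from a CSV:

import pandas as pd

# Keys become column names; values become the column data
data_dict = {
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [34, 28, 45],
    'Gender': ['F', 'M', 'F']
}
df_manual = pd.DataFrame(data_dict)
print(df_manual)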

Show Statistics

Let us now execute:

print(df.describe())

The only column that will have statistics is the id column, since it is the only numerical column.

df.describe() generates statistics on the numerical columns of the DataFrame. This is very similar to the ANALYZE TABLE command for computing statistics in Data Distiller. It reports the following (a sketch of the standalone equivalents follows this list):

  • count: The number of non-null values in each column.

  • mean: The mean (average) value of each column.

  • std: The standard deviation, which measures the dispersion of values around the mean.

  • min: The minimum value in each column.

  • 25%: The 25th percentile (also known as the first quartile).

  • 50%: The 50th percentile (also known as the median).

  • 75%: The 75th percentile (also known as the third quartile).

  • max: The maximum value in each column.
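
Each of these statistics can also be computed on its own, which is a useful way to check your understanding. A minimal sketch, assuming df has the numeric id column as in this tutorial:

# Standalone equivalents of the describe() rows
print(df['id'].count())           # count: non-null values
print(df['id'].mean())            # mean
print(df['id'].std())             # std: sample standard deviation
print(df['id'].min())             # min
print(df['id'].quantile(0.25))    # 25th percentile
print(df['id'].quantile(0.50))    # 50th percentile (median)
print(df['id'].quantile(0.75))    # 75th percentile
print(df['id'].max())             # max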

Preview the DataFrame

Let us try to preview the first 10 rows by executing:

print(df.head(10))

Aggregation Functions

Let us count the number of each gender type in the population:

grouped_gender = df.groupby('gender').count()
print(grouped_gender)

Remember that grouped_gender is a DataFrame. When you use the groupby() function and then apply an aggregation function like count(), it returns a DataFrame with the count of occurrences for each gender type. The code above is very similar to a COUNT aggregation with GROUP BY in SQL.

The answer that you will get should look like this:

Other functions that you could have used in place of count() are sum(), mean(), std(), var(), min(), max(), and median().
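
If you want several of these aggregates at once, groupby() combines with agg(), much like listing multiple aggregate functions in a SQL SELECT. A minimal sketch using the numeric id column:

# Several aggregates in one pass, similar to
# SELECT gender, COUNT(id), MIN(id), MAX(id) FROM df GROUP BY gender
summary = df.groupby('gender')['id'].agg(['count', 'min', 'max'])
print(summary)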

Define a Function

Let us create a function that computes each gender type's percentage of the total:

# Define the function
def percent_of_total(column):
    return 100 * column / column.sum()

# Apply the function to the grouped gender counts
percent_of_total_gender = percent_of_total(grouped_gender)
print(percent_of_total_gender)

Note the following:

  1. The hash sign # is used to create comments.

  2. Note that the def line ends with a colon.

  3. return should be indented properly.

  4. The arithmetic in percent_of_total is vectorized: each element of the column is divided by the column's sum.

  5. percent_of_total_gender is also a DataFrame, as will be obvious from the answers below.

The answers you will get will look like this:

To just retrieve a single column, let us use:

print(percent_of_total_gender['id'])

This gives

Alternatively, we could have created a Series object instead of a DataFrame for percent_of_total_gender:

percent_of_total_gender = percent_of_total(grouped_gender['id'])
print(percent_of_total_gender)

And that would give us the exact same answer.
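
If you are ever unsure which of the two structures you are holding, Python's built-in type() function makes the distinction visible:

# A grouped count is a DataFrame; a single column of it is a Series
print(type(grouped_gender))        # <class 'pandas.core.frame.DataFrame'>
print(type(grouped_gender['id']))  # <class 'pandas.core.series.Series'>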

Let us persist these results, which are a Series object, into a new DataFrame:

percent_of_total_df = percent_of_total_gender.to_frame(name='Percentage')
print(percent_of_total_df)

The results are:

Generate a Randomized Yearly Purchase Column

We are going to emulate random number generation using Python's built-in random library, which is explored in more depth later in this module.

Also, let us take this new column and add it to the DataFrame. Let us execute this code:

import random

# Function to generate a random purchase amount
# (the row value passed in is ignored)
def generate_random_purchases(value):
    return random.randint(1000, 10000)  # Modify the range as needed

# Apply the function once per row to generate a purchase for each row
df['YearlyPurchases'] = df['id'].apply(generate_random_purchases)

print(df)

The results show that a new column was added:
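
As a side note, apply() calls the Python function once per row. For a simple random column, a vectorized alternative using NumPy generates the whole column in one call; a minimal sketch (note that NumPy's randint excludes the upper bound, hence 10001):

import numpy as np

# One call fills the entire column; no per-row Python function calls
df['YearlyPurchases'] = np.random.randint(1000, 10001, size=len(df))
print(df.head())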

Visualize the Results

Let us make our first foray into visualizing the histogram:

import matplotlib.pyplot as viz

viz.hist(df['gender'], bins=10, edgecolor='black')
viz.xlabel('Gender')
viz.ylabel('Frequency')
viz.title('Histogram of Gender')
viz.show()

The code is no different in spirit from what we used for creating a DataFrame: you first import the library under an alias and then access the functions within that library. The function names should be self-explanatory as to what they do.

The results look like this:

The histogram looks messy so let us clean this up:

import matplotlib.pyplot as viz

viz.hist(df['gender'], bins=10, edgecolor='black')
viz.xlabel('Gender')
viz.ylabel('Frequency')
viz.title('Histogram of Gender')
viz.tick_params(axis='x', rotation=45)
viz.tight_layout()
viz.show()

We added two extra function calls:

  • The viz.tick_params(axis='x', rotation=45) rotates the x-labels by 45 degrees

  • The viz.tight_layout() improves the spacing and layout of the plot elements to avoid overlapping.

(Extra Credit) Advanced Visualizations

There is one last thing we want to do here. What if we wanted to plot a histogram and a bar graph together at the same time?

If you have ever used MATLAB, the following code will look familiar:

# Create a 1x2 grid of plots
fig, axes = viz.subplots(1, 2, figsize=(12, 5))

# Histogram Plot
axes[0].hist(df['gender'], bins=10, edgecolor='black', color='skyblue')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Histogram of Gender')
axes[0].tick_params(axis='x', rotation=45)


# Bar Plot
gender_counts = df['gender'].value_counts()
axes[1].bar(gender_counts.index, gender_counts.values, color='salmon')
axes[1].set_xlabel('Gender')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Bar Plot of Gender')
axes[1].tick_params(axis='x', rotation=45)

# Adjust layout for better spacing
fig.tight_layout()

#Display
viz.show()

The results will look like:

Note the following in the code:

  1. The heart of this code is fig, axes = viz.subplots(1, 2, figsize=(12, 5)). Much like MATLAB, this function call creates a grid of subplots in a single figure.

    • 1: The number of rows in the grid.

    • 2: The number of columns in the grid.

    • figsize=(12, 5): This specifies the size of the entire figure in inches. (12, 5) means the figure will be 12 inches wide and 5 inches tall.

    • The function returns two objects:

      • fig: The figure object, which represents the entire figure.

      • axes: An array of subplot axes. Here, since the grid is one row by two columns, matplotlib returns a one-dimensional array containing two axes objects.

  2. fig.tight_layout() is called at the level of the entire figure rather than on individual charts. That is how this library has been designed.

(Extra Credit) Exploring the random library

Generating random data is a good skill to acquire, especially in the world of data science. The random library in Python is a built-in module that provides functions for generating random numbers and performing various random operations.

Here are some of the key functions provided by the random library:

  1. Generating Random Numbers:

    • random.random(): Generates a random float between 0 (inclusive) and 1 (exclusive).

    • random.randint(a, b): Generates a random integer between a and b (inclusive).

    • random.uniform(a, b): Generates a random float between a and b.

  2. Generating Random Sequences:

    • random.choice(sequence): Returns a random element from the given sequence.

    • random.sample(sequence, k): Returns a list of k unique random elements from the sequence.

    • random.shuffle(sequence): Shuffles the elements in the sequence randomly.

  3. Random Selection:

    • random.choices(population, weights=None, k=1): Returns a list of k elements randomly selected from the population, possibly with specified weights.

  4. Randomness Simulation:

    • random.seed(a=None): Initializes the random number generator with a seed. Providing the same seed will produce the same sequence of random numbers.

    • Note that all of these functions generate pseudo-random numbers, which appear random but are actually determined by the initial state (seed) of the random number generator.

Here is some example code to try out:

import random

libraries = ["NumPy", "Pandas", "Matplotlib", "TensorFlow", "Scikit-learn", "PyTorch"]

# Choose a random library from the list
random_library_choice = random.choice(libraries)

# Choose 2 random libraries without replacement (no duplicates)
random_library_sequence = random.sample(libraries, 2)

# Shuffle the list of libraries in place
random.shuffle(libraries)
random_library_shuffle = libraries

print("Randomly selected library:", random_library_choice)
print("Randomly selected sequence of libraries:", random_library_sequence)
print("Shuffled list of libraries:", random_library_shuffle)

# Set the seed for reproducibility
seed_value = 23
# Initialize the random number generator
random.seed(seed_value)
# Generate 9 random floats between 0 and 1
random_numbers = [random.random() for i in range(1, 10, 1)]

# Print the values
print("Random numbers generated with seed", seed_value, ":", random_numbers)

# Re-seeding with the same value reproduces the exact same sequence
random.seed(seed_value)
print("Generated again after re-seeding:", [random.random() for i in range(1, 10, 1)])

Remember the syntax of a for loop:

for element in iterable:
    # Code block to be executed for each element
    # Indentation is crucial in Python to define the block of code inside the loop

You can also use a for loop with the range() function to iterate over a sequence of numbers:

for number in range(1, 10, 1):
    print(number)
    # Code to be executed in each iteration
    # The starting value of the sequence is 1, and it is included
    # The ending value is 10, and it is excluded
    # The third argument is the step size; it's optional and defaults to 1

Appendix

Other Important Python Libraries

  1. Scientific

    1. NumPy: Similar in spirit to MATLAB. A fundamental package for scientific computing with Python. It provides support for arrays and matrices, along with mathematical functions to operate on these structures efficiently (a minimal sketch follows this list).

  2. Machine Learning

    • Scikit-learn: A machine learning library that provides simple and efficient tools for data mining and data analysis. It includes a wide variety of machine-learning algorithms and tools for tasks like classification, regression, clustering, and more.

    • TensorFlow: An open-source machine learning framework developed by Google. It's widely used for building and training deep learning models, especially neural networks.

    • PyTorch: Another popular open-source machine learning framework, developed by Facebook's AI Research lab. It's known for its dynamic computation graph and ease of use in building neural networks. It is very popular in the research community.

  3. SQLAlchemy: A SQL toolkit and Object-Relational Mapping (ORM) library for Python. It simplifies database interactions and allows you to work with databases in a more Pythonic way. This is required for Data Distiller.

  4. Requests: A simple and elegant HTTP library for making HTTP requests to interact with web services and APIs. This is useful for working with Adobe Experience Platform APIs.
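
As a tiny taste of NumPy mentioned above, here is a minimal sketch of its element-wise array operations (the values are made up):

import numpy as np

# Arrays support element-wise math without explicit loops
values = np.array([1200, 3400, 560, 7800])
print(values.mean())      # average of the array
print(values * 1.1)       # element-wise multiplication
print(np.sqrt(values))    # element-wise square root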

Download Notebook Code

To learn Python on the go, we will be using the notebook editor at Kaggle. All you need is an email address to log in.

To learn more about the random library, read the official Python documentation for the random module.

matplotlib.pyplot is a visualization library in Python. It is unfortunate that its name sounds very similar to MATLAB, which also has plot commands. To plot a chart like the histogram, the matplotlib documentation is a good reference.

Downloads:

  • Python101Data.csv (61KB)

  • python101.ipynb (92KB)