Adobe Data Distiller Guide

EXPLORE 200: Exploring Behavioral Data with Data Distiller - A Case Study with Adobe Analytics Data

Last updated 8 months ago

Prerequisites

You need to make sure you complete this module, which ingests Adobe Analytics web data into the Platform: PREP 501: Ingesting JSON Test Data into Adobe Experience Platform.

And of course, you should have completed: PREP 400: DBVisualizer SQL Editor Setup for Data Distiller.

Scenario

We are going to ingest LUMA data into our test environment. Luma is a fictitious online store created by Adobe.

The fastest way to understand what is happening on the Luma website is to check the Products tab. There are 3 categories of products for different (and all) personas. You can browse them, authenticate yourself, and add items to a cart. The data that we are ingesting into the Platform is test website traffic data that conforms to the Adobe Analytics schema.

We need to run some analytical queries on this dataset.

Exploratory 1-Dimensional Queries

The goal of this exercise is to explore every column of the dataset individually so that we get a deep understanding of the columns. Once we understand each column, we can then build 2-dimensional and even n-dimensional queries.

Let us first retrieve all the results:

SELECT * FROM luma_web_data;

You can see that there are complex nested objects. Instead of going into the XDM schemas, we can query the data in place by using to_json.

Let us dig into the web JSON object (or XDM field group):

Explore web Object

SELECT to_json(web) FROM luma_web_data;

Let us dig one level deeper into webPageDetails. We will use the dot notation to access any field in the hierarchy.

SELECT web.webPageDetails FROM luma_web_data;

We can apply to_json again:

SELECT to_json(web.webPageDetails) FROM luma_web_data;

pageViews is itself an object. Let us look inside it:

SELECT to_json(web.webPageDetails.pageViews) FROM luma_web_data;

You will get the following:

We can access the value by:

SELECT web.webPageDetails.pageViews.value FROM luma_web_data

And you will get the values, where value=1 indicates a single page view.
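The dot notation walks the nested structure much like chained key lookups on a parsed JSON document. A rough Python analogue, using a made-up event record shaped like the web field group, looks like this:

```python
import json

# Hypothetical event record shaped like the web field group (illustrative only)
event = json.loads("""
{
  "web": {
    "webPageDetails": {
      "name": "Home Page",
      "pageViews": {"value": 1}
    }
  }
}
""")

# SQL: web.webPageDetails.pageViews.value  ~  chained key lookups in Python
page_views = event["web"]["webPageDetails"]["pageViews"]["value"]
print(page_views)  # -> 1
```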

Explore marketing Object

Let us work on the marketing object:

SELECT to_json(marketing) FROM luma_web_data;

The results show information about campaigns:

Explore channel Object

If you execute the following code:

SELECT to_json(marketing), to_json(channel) FROM luma_web_data;

You will observe that there is duplication of data across these fields. The marketing object only truly owns the campaign name, while the remaining fields are also present in the channel object.

Let us extract the channel type that is in the type field of the channel object as it has values such as search, email, and social.

The code for this will be:

SELECT channel._id AS tracking_code, 
       regexp_extract(channel._type, '[^/]+$', 0) AS channel_type, 
       channel.mediaType AS channel_category 
FROM luma_web_data

The result will be:

Note the usage of the regular expression, which extracts the last path segment of the _type field, whose values look like _type":"https://ns.adobe.com/xdm/channel-types/XXX"

  • regexp_extract(channel._type, '[^/]+$', 0): This is the main part of the query where you use the regexp_extract function to perform regular expression extraction.

    • channel._type: This specifies the JSON field "_type" inside the channel JSON object.

    • '[^/]+$': This is a regular expression pattern. Let's break it down:

      • [^/]: This part matches any character except a forward slash ("/").

      • +: This indicates that the previous pattern ([^/]) should occur one or more times consecutively.

      • $: This anchors the pattern to the end of the string.

    • 0: This argument specifies the group index to return. In this case, 0 means that the entire match (the matched string) will be returned.
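To sanity-check the pattern outside of SQL, the same extraction can be reproduced with Python's re module (the sample URI below is representative of the _type values shown above):

```python
import re

# Example _type value as it appears in the channel object (illustrative)
channel_type_uri = "https://ns.adobe.com/xdm/channel-types/search"

# '[^/]+$' matches one or more non-slash characters anchored at the end of
# the string, i.e. everything after the final '/'
match = re.search(r"[^/]+$", channel_type_uri)
print(match.group(0))  # -> search
```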

Explore ProductListItems Array Object

Let us access the ProductListItems array:

SELECT to_json(productListItems) FROM luma_web_data;

Hint: A single page view for an Add to Cart event will have multiple product items.

To access the first elements of this array, use the following:

SELECT productListItems[0] FROM luma_web_data;

Arrays lend themselves to even more interesting SQL queries. Arrays can be exploded, i.e. each element of the array is put into a separate row of a new table while the other columns/fields are duplicated:

SELECT explode(productListItems) FROM luma_web_data;

Hint: You can also use the unnest function instead of explode.
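If you want to reason about the row-multiplication behavior of explode before running it, it can be sketched in plain Python (the rows below are made up for illustration):

```python
# Hypothetical events, each carrying an array of product items
rows = [
    {"event_id": "e1", "productListItems": [{"SKU": "A"}, {"SKU": "B"}]},
    {"event_id": "e2", "productListItems": [{"SKU": "C"}]},
]

# explode(): one output row per array element; scalar columns are duplicated
exploded = [
    {"event_id": row["event_id"], "item": item}
    for row in rows
    for item in row["productListItems"]
]
for out in exploded:
    print(out)
```

Event e1 produces two output rows (one per SKU), each repeating its event_id, while e2 produces one.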

Explore commerce Object

Let us now explore the commerce object:

SELECT to_json(commerce) FROM luma_web_data;

The commerce object shows some commerce-related actions, such as checkouts, that the webPageDetails object does not have.

Let us reformat this object so that we can extract the commerce event types such as productViews, productListAdds and checkouts as strings. I want to do this because I want to use GROUP BY on these event types later on. The fact that some of them are populated while some are not indicates that this is a nested structure and we will have no choice but to look at the commerce object itself in the XDM schema.

First, let us extract these fields as strings:

SELECT (CASE
        WHEN commerce.checkouts.`value`==1 THEN 'checkouts'
        WHEN commerce.productViews.`value`==1 THEN 'productViews'
        WHEN commerce.productListAdds.`value`==1 THEN 'productListAdds'
    END) AS commerce_event_type
FROM luma_web_data

The results are:

Note: In the syntax commerce.checkouts.`value`==1, value is wrapped in backticks. This is to avoid a conflict with value being a RESERVED keyword. The same applies to commerce.`order`.* as well.

But our string-based approach has a serious flaw. If you check the commerce field group, you will see a lot of commerce event types. There is no guarantee that we will only see the 3 event types that we identified above:

To extract an arbitrary field name of the structs present in the commerce object, we will use:

SELECT commerce_event_type[0] AS commerce_event_type FROM (SELECT json_object_keys(to_json(commerce)) AS commerce_event_type FROM luma_web_data);

The result will be:

Note the following:

  1. to_json converts the commerce JSON object to a string.

  2. json_object_keys extracts the top-level keys of the JSON object present in commerce.

  3. commerce_event_type[0] extracts the first and only element of this array.

  4. Note that different structs in the commerce object have different values. Page view type structs will have a value equal to 1 while purchase type structs will have purchase totals. This extraction only works for extracting the commerce event types but does not extrapolate to the metadata of those events.

Alternatively, we could have simplified this query by avoiding the outer SELECT, which will also help us later:

SELECT json_object_keys(to_json(commerce))[0] AS commerce_event_type FROM luma_web_data
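The to_json + json_object_keys combination behaves like serializing a struct and reading back its top-level keys. A rough Python analogue, with a made-up commerce record, looks like this:

```python
import json

# Hypothetical commerce struct where only one event-type key is populated
commerce = {"productListAdds": {"value": 1}}

# to_json(commerce) ~ serialize the struct to a JSON string
commerce_json = json.dumps(commerce)

# json_object_keys(...) ~ top-level keys of that JSON object
keys = list(json.loads(commerce_json).keys())

# [0] ~ the first (and here only) key is the commerce event type
print(keys[0])  # -> productListAdds
```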

Explore endUserIDs Object

Let us also check the endUserIDs object:

SELECT to_json(endUserIDs) FROM luma_web_data;

We can extract the email addresses by using:

SELECT endUserIDs._experience.emailid.id FROM luma_web_data;

We can extract the mcids by using:

SELECT endUserIDs._experience.mcid.id FROM luma_web_data;

The results are:

Create a Semi-Flat View of the Adobe Analytics Data

Let us take the queries that we built and put them all together into a single SQL query that creates a somewhat flat structure of the data, i.e. we will not expand productListItems.

CREATE TEMP TABLE Adobe_Analytics_View AS
SELECT _id AS event_id, 
       `timestamp` AS `TimeStamp`,   
       endUserIDs._experience.mcid.id AS mcid_id,
       endUserIDs._experience.emailid.id  AS email_id,
       web.webPageDetails.`name` AS WebPageName,
       json_object_keys(to_json(commerce))[0] AS commerce_event_type,
       productListItems AS productListItems,
       marketing.campaignName AS campaignName,
       channel._id AS campaign_tracking_code, 
       regexp_extract(channel._type, '[^/]+$', 0) AS channel_type,
       channel.mediaType AS channel_category
FROM luma_web_data;

SELECT * FROM Adobe_Analytics_View;

Note the following:

  1. We have assembled all the 1-dimensional SELECT queries into a view.

  2. The view is semi-flat because ProductListItems is not flattened i.e. put into separate rows or columns.

  3. We use CREATE TEMP TABLE to store this view instead of materializing the view immediately, because TEMP tables are cached in Data Distiller for fast exploration.

Tip: If you want fast exploration of data in the ad hoc query engine, just create a TEMP TABLE with the data that you want to explore. Remember that these temp tables are wiped after the user session ends, as the cache is ephemeral.

Warning: If you connect only intermittently, DBVisualizer will disconnect from Data Distiller. In that case, it will complain that the temp table does not exist, because your session needs to be reestablished. In situations where you cannot maintain connectivity for long periods of time, you are better off using CREATE TABLE, which materializes the data onto the Data Lake.

If you decide to use CREATE TABLE:

CREATE TABLE Adobe_Analytics_View AS
SELECT _id AS event_id, 
       `timestamp` AS `TimeStamp`,   
       endUserIDs._experience.mcid.id AS mcid_id,
       endUserIDs._experience.emailid.id  AS email_id,
       web.webPageDetails.`name` AS WebPageName,
       json_object_keys(to_json(commerce))[0] AS commerce_event_type,
       productListItems AS productListItems,
       marketing.campaignName AS campaignName,
       channel._id AS campaign_tracking_code, 
       regexp_extract(channel._type, '[^/]+$', 0) AS channel_type,
       channel.mediaType AS channel_category
FROM luma_web_data;

SELECT * FROM Adobe_Analytics_View;

Regardless of what you use, the results of the query look like:

With this view, you are now set to do any kind of analysis. The methodology shown above can be applied to any schemas we get from the Adobe Apps.

Appendix: Adobe App Schemas to Explore

The skills that you have learned in this module should set you up for success with any complex dataset that you will come across in the Adobe ecosystem.

Adobe Journey Optimizer

If you are interested in Adobe Journey Optimizer, you should explore this module:

Adobe Commerce

You should also explore Adobe Commerce with Adobe Experience Platform integration. Specifically, you need to be aware of this:

Adobe Campaign

Marketo Engage

Appendix: Array Operations

These are the array functions supported in Data Distiller:

  • size() to determine the number of elements in a list (array)

  • The bracket [] notation to access specific elements in arrays

  • transform() to apply a transformation to all elements in an array

  • explode() to transform elements in a list into single rows

  • posexplode() to transform elements in a list into single rows along with a column for the index the element had in the original list

  • array_contains() to determine if an array contains a specific element

  • array_distinct() to remove duplicates from an array

  • array_except() to subtract two arrays (elements in the first array that are not in the second)

  • array_intersect() to determine the intersection (overlapping elements) of two arrays

  • array_union() to determine the union of two arrays without duplicates

  • array_join() to concatenate the elements of an array using a delimiter

  • array_max() to get the largest element from an array

  • array_min() to get the smallest element from an array

  • array_position() to find the position of a specific element in an array, counting from 1

  • array_remove() to remove a specific element from an array

  • array_repeat() to repeat the elements of an array a specific number of times

  • array_sort() to sort an array

  • arrays_overlap() to check if two arrays have at least one common element
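Many of these array functions have direct analogues in plain Python, which can help when prototyping logic before writing Data Distiller SQL. A few illustrative examples on made-up arrays (note that Spark-style array_except/array_intersect also de-duplicate their results):

```python
a = [1, 2, 2, 3]
b = [3, 4]

print(len(a))                                  # ~ size(a) -> 4
print(a[0])                                    # ~ a[0] -> 1
print(sorted(set(a)))                          # ~ array_sort(array_distinct(a)) -> [1, 2, 3]
print([x for x in sorted(set(a)) if x not in b])  # ~ array_except(a, b) -> [1, 2]
print([x for x in sorted(set(a)) if x in b])   # ~ array_intersect(a, b) -> [3]
print(",".join(str(x) for x in a))             # ~ array_join(a, ',') -> 1,2,2,3
print(any(x in b for x in a))                  # ~ arrays_overlap(a, b) -> True
```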

A few notes on the Adobe App schemas listed in the appendix above:

  • The Adobe Commerce field groups (JSON fields as a group) are very similar to those of the Adobe Analytics schema.

  • There are some field groups that are unique to Adobe Commerce because of the nature of the storefront setup.

  • There can always be custom events that are unique to an industry or implementation.

  • You can bring Campaign V8 delivery logs into the Adobe Experience Platform.

  • You can bring in all of these datasets from Marketo, including custom activity data.