Adobe Data Distiller Guide

EXPLORE 201: Exploring Web Analytics Data with Data Distiller

Web analytics refers to the measurement, collection, analysis, and reporting of data related to website or web application usage.

Last updated 8 months ago

Prerequisites

Make sure you complete this module and its prerequisites: EXPLORE 200: Exploring Behavioral Data with Data Distiller - A Case Study with Adobe Analytics Data

Scenario

We are going to ingest LUMA data into our test environment. Luma is a fictitious online store created by Adobe.

The fastest way to understand what is happening on the Luma website is to check the Products tab. There are 3 categories of products for different (and all) personas. You can browse them, authenticate yourself, and add items to a cart. The data that we are ingesting into the Platform is the test website traffic data that conforms to the Adobe Analytics schema.

We need to run some analytical queries on this dataset.

Count the Number of Events in the AA Dataset

SELECT count(event_id) FROM  Adobe_Analytics_View 

The answer should be 733,265. This is also the web traffic volume.

Count of Visitors and Authenticated Visitors

SELECT COUNT(DISTINCT mcid_id) AS Cookie_Visitors, COUNT(DISTINCT email_id) AS Authenticated_Visitors FROM Adobe_Analytics_View

Both counts should be 30,000. Taken at face value, this suggests that every cookie is associated with an email, which should strike you as strange for real web traffic. But this is demo data, and we can assume that someone has already done the ID resolution for us for ALL mcids.
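Equal distinct counts are consistent with a one-to-one mapping between cookies and emails, but they do not prove it on their own. The following is a minimal Python sketch, not part of the original module, of how one could check the mapping directly. The function name `id_mapping_summary` and the sample rows are hypothetical; only the column names `mcid_id` and `email_id` come from the query above.

```python
from collections import defaultdict

def id_mapping_summary(rows):
    """Summarize (mcid_id, email_id) pairs; email_id may be None."""
    emails_per_mcid = defaultdict(set)
    mcids_per_email = defaultdict(set)
    for mcid, email in rows:
        if email is None:
            # Touch the key so the cookie is counted even without an email.
            emails_per_mcid[mcid]
            continue
        emails_per_mcid[mcid].add(email)
        mcids_per_email[email].add(mcid)
    one_to_one = (
        all(len(e) == 1 for e in emails_per_mcid.values())
        and all(len(m) == 1 for m in mcids_per_email.values())
    )
    return {
        "cookie_visitors": len(emails_per_mcid),
        "authenticated_visitors": len(mcids_per_email),
        "one_to_one": one_to_one,
    }

# Hypothetical rows: every cookie maps to exactly one email.
rows_ok = [("m1", "a@x.com"), ("m1", "a@x.com"), ("m2", "b@x.com")]
# Hypothetical rows: the counts still match, but m1 has two emails
# and m2 has none.
rows_bad = [("m1", "a@x.com"), ("m1", "b@x.com"), ("m2", None)]

summary_ok = id_mapping_summary(rows_ok)
summary_bad = id_mapping_summary(rows_bad)
print(summary_ok)
print(summary_bad)
```

In the second sample both counts are 2, yet the mapping is not one-to-one, which is why a pairwise check is the safer test on real data.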

Time Range of the Data

SELECT min(TimeStamp), max(TimeStamp) FROM  Adobe_Analytics_View 

The time range should be 2020-06-30 22:04:47 to 2021-01-29 23:47:04.

Most Popular Web Pages

SELECT WebPageName, count(WebPageName) AS WebPageCounts FROM  Adobe_Analytics_View 
GROUP BY WebPageName
ORDER BY WebPageCounts DESC

Count the Number of Visits/Sessions

One of the foundational concepts of web analytics is the idea of a session or a visit. When you visit a website, a timer starts ticking, and every page you view stays part of the same session as long as you do not go quiet for longer than a timeout, say 30 minutes. Sessions are great because they are the atomic unit of a journey. Customers interact with a channel or a medium as part of a session. What they do in the session has some intent or goal. If we can study what happens in these sessions, we can get a solid understanding of the users.

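SELECT mcid_id, `Timestamp`,
to_json(SESS_TIMEOUT(Timestamp, 60 * 30)
    OVER (PARTITION BY mcid_id
        ORDER BY `Timestamp`
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
    AS session
FROM Adobe_Analytics_View
-- Order by visitor, then time, so each visitor's sessions read top to bottom.
-- Note: 'Timestamp' in single quotes is a string literal and would not sort the rows.
ORDER BY mcid_id, `Timestamp` ASC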

Let us understand the code first:

  1. to_json(SESS_TIMEOUT(Timestamp, 60 * 30) OVER (...)): Here, the SESS_TIMEOUT function is applied to the Timestamp column with a timeout of 30 minutes (60 * 30 = 1,800 seconds). It groups a visitor's events into sessions: when the gap between two consecutive events runs past the timeout, the current session ends and a new one starts. The function returns an object of session fields, which to_json then converts to a JSON string.

  2. OVER (PARTITION BY mcid_id ORDER BY Timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW): This is a window function that operates on partitions of data defined by the mcid_id column. It orders the rows within each partition based on the Timestamp column in ascending order. The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW clause specifies that the window includes all rows from the beginning of the partition up to the current row.

The result is the following:

Let us now parse the results in the session object:

  1. If you look at the mcid_id column, all of the rows shown belong to the same person. Because the window is partitioned by mcid_id, sessionization always operates on a single mcid_id.

  2. timestamp_diff: The difference in time, in seconds, between the current record and the prior record. It is "0" for the first record of a visitor. depth indicates the position of the record within its session.

  3. num: A unique session number, starting at 1 for each mcid_id.

  4. is_new: A flag indicating whether the record is the start of a new session or not.
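To make the session fields concrete, here is a minimal Python sketch of timeout-based sessionization for a single visitor. It is an illustration of the idea under stated assumptions, not Adobe's implementation of SESS_TIMEOUT: the function name `sessionize`, the sample timestamps, and the choice to start a new session only when the gap is strictly greater than the timeout are all assumptions made for this example.

```python
from datetime import datetime

def sessionize(timestamps, timeout_seconds=60 * 30):
    """Assign session fields to one visitor's events, processed in time order."""
    rows = []
    prev = None   # timestamp of the previous event
    num = 0       # running session number for this visitor
    depth = 0     # position of the event within its session
    for ts in sorted(timestamps):
        diff = 0 if prev is None else int((ts - prev).total_seconds())
        # Assumption: a gap strictly greater than the timeout opens a new session.
        is_new = prev is None or diff > timeout_seconds
        if is_new:
            num += 1
            depth = 1
        else:
            depth += 1
        rows.append(
            {"timestamp_diff": diff, "num": num, "is_new": is_new, "depth": depth}
        )
        prev = ts
    return rows

# Hypothetical events for one visitor: three close together, then two more
# after a 70-minute pause.
events = [
    datetime(2020, 7, 1, 10, 0, 0),
    datetime(2020, 7, 1, 10, 5, 0),
    datetime(2020, 7, 1, 10, 20, 0),
    datetime(2020, 7, 1, 11, 30, 0),
    datetime(2020, 7, 1, 11, 31, 0),
]

result = sessionize(events)
for row in result:
    print(row)
```

The first three events fall into session 1 and the last two into session 2, because the 70-minute pause is longer than the 30-minute timeout. Running the function once per visitor mirrors the PARTITION BY mcid_id clause in the query.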

I can now extract the session number at a visitor level and also assign it a unique session number across all visitors by doing the following:

SELECT mcid_id, `Timestamp`, concat(mcid_id, '-', `session`.num) AS unique_session_number, `session`.num AS session_number_per_mcid
FROM
(
SELECT mcid_id, `Timestamp`, SESS_TIMEOUT(Timestamp, 60 * 30)
    OVER (PARTITION BY mcid_id
        ORDER BY `Timestamp`
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS session
FROM Adobe_Analytics_View
)
-- Sort the final result by visitor and time; 'Timestamp' in single quotes
-- is a string literal and would not sort anything.
ORDER BY mcid_id, `Timestamp` ASC

Warning: I have removed to_json in the code here as I need to access the fields within the session object. If I use to_json, it will create a string and the fields cannot be extracted.

The results are the following:

Let us compute the number of visits overall:

SELECT COUNT(DISTINCT unique_session_number) FROM (
SELECT mcid_id, `Timestamp`, concat(mcid_id, '-', `session`.num) AS unique_session_number, `session`.num AS session_number_per_mcid
FROM
(
SELECT mcid_id, `Timestamp`, SESS_TIMEOUT(Timestamp, 60 * 30)
    OVER (PARTITION BY mcid_id
        ORDER BY `Timestamp`
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS session
-- No ORDER BY is needed here: row order does not affect a distinct count.
FROM Adobe_Analytics_View))

The result should be 104,721.

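The average number of pages visited per visit is 733,265 / 104,721, or about 7. Strictly speaking, this is the number of events per visit, since the numerator is the total event count. This does agree with what we see when we inspect the results.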
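As a quick check of the arithmetic, the ratio can be computed directly. This is a small sketch using only the two totals quoted in this module; the variable names are illustrative.

```python
# Totals reported earlier in this module.
total_events = 733_265
total_sessions = 104_721

# Average number of events recorded per visit (session).
events_per_session = total_events / total_sessions
print(round(events_per_session, 2))  # 7.0
```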

Figures (images not reproduced here):

  1. The top web pages by counts from June 30, 2020 to Jan 29, 2021.

  2. Sessionization on the event data.

  3. Session number assigned at the visitor level and across all visitors.