
EXPLORE 101: Exploring Ingested Batches in a Dataset with Data Distiller

It is important for you to understand how the data ingestion process works and why interrogating the records ingested in a batch may be an important tool in your arsenal to address downstream issues.


Data Ingestion Primer

One of the key tasks you will need to perform at some point is verifying and validating the records within a batch that has been successfully ingested into Adobe Experience Platform.

Remember that a "batch" is a data ingestion concept: a collection of records, whether it arrives in a file or through batch or streaming ingestion, is materialized as a single "unit" on the data lake. In essence, it is a materialization construct used by Adobe Experience Platform (AEP).

Records that are ingested have to pass through several checks before such materialization can take place. This is handled during the mapping step of the data ingestion process. There are several categories of issues that can arise, and you need to be aware of them. They will manifest themselves as error codes if you peek into a dataset:

  1. Navigate to the Datasets pane and, if you are unlucky, you will find a batch that has failed. Click on it:

  2. You will see a set of errors that perhaps look like this:

Some bad things have happened to our data ingestion. Let us understand the error codes:

  1. ERROR: These are the most egregious errors possible, where the data was corrupted or did not conform to the expected format. Such failures are serious and the entire batch will fail.

  2. DCVS: Not seen in the above example, but these are less serious than data corruption issues; a typical example is a missing required field. The affected rows are simply skipped. A separate dataset containing such records is NOT available on the data lake; these records are kept in a separate location and are accessible through the error diagnostics tools (UI or API). The practical reality is that if those skipped records are critical for your use case, you will need to surgically identify them in the source system and re-ingest the data.

  3. MAPPER: These appear to be the least harmful of the three, but you need to pay attention to them because these rows do make it into the final dataset, BUT the data may have been altered in the process. The mapping process tries to convert the string data at its input to the output data type. When it cannot do so because of a malformed string, it writes NULLs into the result. If you were not paying attention, you now have a field full of NULLs that you could possibly have rectified. Batches with MAPPER warnings are therefore good candidates for some data exploration to see what is going on (see the sketch below).
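
For example, if a batch shows MAPPER warnings, a quick first check is to count how many rows ended up with a NULL in a field that the mapping should always populate. Below is a minimal sketch against the movie_data table used later in this lesson; the column release_year is a hypothetical placeholder for whatever field you expect to always be filled in:

-- Hedged sketch: spot-check NULLs the mapper may have introduced
-- (release_year is a placeholder column; substitute your own)
select count(1) as total_rows,
       sum(case when release_year is null then 1 else 0 end) as null_rows
from movie_data;

If null_rows is unexpectedly high, the dataset (and the batches flagged with MAPPER warnings) is a good candidate for the deeper exploration described in the rest of this lesson, where you will learn to narrow such queries down to a single batch.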

Accessing Dataset Batch Metadata

In order to see what system fields are available in the dataset, set the following in a session:

set drop_system_columns=false;
select * from movie_data;

By doing so, you will see two new columns appear at the far right: _acp_system_metadata and _ACP_BATCHID.
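
If you prefer not to scroll, you can project just these two system columns; a minimal sketch, assuming drop_system_columns is still set to false in your session:

-- Hedged sketch: inspect only the system columns for a few rows
select _acp_system_metadata, _ACP_BATCHID
from movie_data
limit 5;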

As data gets ingested into the platform, a logical partition is assigned to it based on the incoming data. _acp_system_metadata.sourceBatchId identifies that logical partition, while _ACP_BATCHID identifies the physical partition created after the data has been mastered into the data lake in Adobe Experience Platform.

Let us execute the following query:

select _acp_system_metadata, count(distinct _ACP_BATCHID) from movie_data
group by _acp_system_metadata;

The results are:

This means that the number of batches at the input need not correspond to the number of batches written. In fact, the system decides the most efficient way to batch and master the data onto the data lake. Let me explain this through an example below.

Let's run this on a different dataset. For those of you who are motivated, you will need to ingest the attached Drug_checkout_data.json file into Adobe Experience Platform using XDM mapping.

This file is a deeply nested set of 35,000 records that look like this:

select * from drug_orders

Let us generate some batch-based statistics on this dataset:

select _acp_system_metadata, count(distinct _ACP_BATCHID) as numoutputbatches, count(_ACP_BATCHID) as recordcount from drug_orders
group by _acp_system_metadata;

The answers look like this:

The above shows that I created 3 input batches, ingesting 2,000, 24,000, and 9,000 records respectively. However, when they got mastered, each input batch produced just one unique output batch.

Remember that all records visible within a dataset are the ones that were successfully ingested. That does not mean all the records sent from the source are present. You will need to look at the data ingestion failures to find the batches/records that did not make it in.
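
One practical reconciliation step, building on the query above, is to compare the mastered record count per input (logical) batch against the counts you expected the source to send; any shortfall points to skipped or failed records. A hedged sketch, assuming the nested sourceBatchId field can be addressed with dot notation as described earlier:

-- Hedged sketch: mastered record counts per logical input batch.
-- Compare these totals against what the source system was supposed to send.
select _acp_system_metadata.sourceBatchId as input_batch,
       count(distinct _ACP_BATCHID) as output_batches,
       count(1) as mastered_records
from drug_orders
group by _acp_system_metadata.sourceBatchId;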

Querying a Batch in a Dataset

  1. If you want to simulate the creation of a batch, go to the Movie Genre Targeting Example and complete the section on ingesting CSV files.

  2. If you open up the Datasets pane, you will see this:

  3. Copy the batch ID by going to the panel on the right:

  4. Now use the following query to retrieve all the records that made it into the dataset as part of that batch:

select * from movie_data
where _ACP_BATCHID='01H00BKCTCADYRFACAAKJTVQ8P'
LIMIT 1;

_ACP_BATCHID is the keyword used to filter on the batch ID. The LIMIT clause is useful if you want to restrict the number of rows displayed; a more selective filter condition is usually preferable.
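
If all you need is the number of records that landed in that batch rather than the rows themselves, a count query returns a single row and sidesteps any display limits; a minimal sketch using the same batch ID:

-- Hedged sketch: count the records mastered under a specific batch ID
select count(1) as record_count
from movie_data
where _ACP_BATCHID = '01H00BKCTCADYRFACAAKJTVQ8P';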

  5. If you execute this query in the Query Editor in Adobe Experience Platform, the results will be truncated at 100 rows; the editor was designed as a quick preview tool. To retrieve up to 50,000 rows, you need to use a third-party tool such as DBVisualizer (my favorite) or DBeaver, another widely used option. Keep in mind that these editor tools are advanced and mostly free.

Attached file: Drug_checkout_data.json (196KB)

Figure captions (screenshots not reproduced here):
  • The movie_data table now has the metadata columns available.
  • Grouping by source batches shows the number of output batches.
  • Preview of the first set of records in the JSON-based drug_orders dataset.
  • Distribution of how input batches were mastered, with record counts.