ACT 100: Dataset Activation with Data Distiller

Shipping your datasets to distant destinations for maximizing enterprise ROI

Introduction

In today's data-driven enterprises, activating datasets from a Customer Data Platform (CDP) plays a critical role in maximizing the value of AI/ML models, enterprise reporting, and Customer 360 initiatives. Dataset activation enables AI/ML algorithms to predict customer behavior, delivering highly personalized interactions across channels. In enterprise reporting, activated data provides real-time insights for performance tracking. For Customer 360, it unifies customer profiles, giving businesses a comprehensive view of their customers, ultimately driving better decision-making, precise targeting, and improved customer experiences across the organization.

Overview

Data Distiller offers a variety of cloud storage options accessible through the Destination UI:

  1. Cloud Storage Destinations (File-Based): Accessible via Destination UI, supporting 6 cloud storage options:

    • Amazon S3

    • Google Cloud Storage

    • Azure Data Lake Storage Gen 2

    • Azure Blob Storage

    • Data Landing Zone

    • SFTP

  2. Batch Export Options:

    • Incremental and first-time full export

    • Export frequencies: 3, 6, 8, 12, 24 hours

  3. Output Formats: JSON, Parquet. CSV is not supported.

  4. Data Job Export Limits:

    • Event data

      • Datasets conforming to the Experience Event schema (with an _id and a timestamp): maximum of 365 days of data

      • You can work around this limit by using a Profile Record schema

    • Volume: 10 billion records across all datasets in a single job

  5. DULE Enforcement: Derived datasets must be manually labeled in Data Distiller so that data governance policies can be enforced on export.

Data Distiller Derived Datasets vs. Raw Dataset Export

Data Distiller provides additional lake storage with a flexible and generous data retention policy, ensuring that large volumes of data can be stored and accessed over extended periods to meet enterprise requirements (check license entitlements for details). It converts raw datasets into optimized, derived datasets tailored for enterprise use cases like reporting, AI/ML model training, and business insights. This transformation ensures the data is structured, relevant, and analysis-ready, eliminating the need for complex processing and simplifying the extraction of actionable insights.

Ensure that the batch export schedule is set to run after the completion of Data Distiller jobs. Currently, there is no functionality to trigger a batch export immediately after a Data Distiller job finishes, so careful scheduling is required to prevent overlap or incomplete data export.

AI/ML Training Use Cases

Exporting a derived dataset, also referred to as a feature dataset in an AI/ML context, offers significant benefits compared to working with raw data, particularly in scenarios involving data analysis, reporting, or model training. Derived datasets consist of pre-processed, structured, and often enriched information that is ready for immediate use. This structured nature provides several critical advantages:

  1. Pre-Processed and Ready for Use: Derived datasets have undergone pre-processing to clean, transform, and enhance the raw data. This involves steps such as data normalization, outlier removal, handling missing values, and applying relevant transformations. By performing these tasks ahead of time, the dataset is ready for analysis or AI/ML model training without requiring additional preparation. This significantly reduces the time and effort needed for data cleaning and preprocessing, allowing teams to focus directly on extracting insights or building models.

  2. Feature Engineering: One of the key components of a derived dataset is the inclusion of engineered features. These features are specifically designed to capture important insights, trends, or patterns that may not be apparent in the raw data. For example, features could include customer behavior patterns, time-based aggregates (like rolling averages), or calculated metrics (like customer lifetime value). By incorporating these meaningful features, derived datasets eliminate the need for analysts or data scientists to manually engineer features from raw data, thereby streamlining the analytical process.

  3. Reduced Processing Time: Since the heavy lifting of data transformation has already been done, using a derived dataset greatly reduces the processing time for queries, model training, or reports. Raw data often requires multiple rounds of cleaning, joining, and transforming before it can be used effectively, which can be resource-intensive. Derived datasets provide all of the necessary transformations in advance, allowing business users and data scientists to bypass these steps and focus on the final analysis or model optimization.

  4. Consistency Across Analyses: Derived datasets ensure that all users are working with the same set of pre-calculated features and metrics, promoting consistency across different analyses and reports. By exporting a dataset that includes standard features and attributes, organizations can avoid discrepancies that often arise when different teams calculate metrics or derive features independently from raw data. This consistency not only reduces errors but also enhances collaboration by ensuring everyone is working with the same version of the data.

  5. Improved Performance for AI/ML Models: In machine learning workflows, derived datasets often lead to better model performance. This is because the features included in the dataset have been carefully engineered to highlight relevant patterns and relationships that are crucial for model training. Pre-processed data is cleaner, more relevant, and typically optimized for specific use cases. By providing models with high-quality features, organizations can improve prediction accuracy, reduce training time, and streamline hyperparameter tuning.

  6. Cleaner and More Relevant Data: Derived datasets are typically cleaner and more relevant to specific business problems. Raw data may contain irrelevant information, missing values, or noise that can skew results. Derived datasets, on the other hand, focus on key attributes and features that have been filtered and processed for accuracy and relevance. This results in datasets that are more aligned with business objectives, providing decision-makers with higher-quality information for driving insights and making decisions.

  7. Streamlined Decision-Making for Business Users: By delivering datasets that are pre-processed and enriched with meaningful features, business users can more easily extract insights without requiring in-depth technical knowledge of data processing. The simplified structure and curated features of a derived dataset allow for faster and more accurate decision-making, whether the data is used for creating dashboards, running reports, or feeding predictive models. This enables business teams to act quickly on data-driven insights without having to navigate the complexities of raw data transformation.

Enterprise Reporting

In enterprise reporting, exporting a derived dataset offers significant advantages over working with raw data. Derived datasets are the result of pre-processed data that integrates pre-calculated facts and meaningful attributes from a well-structured data model, such as a star schema. This structure, which combines fact tables (like sales, revenue, or transaction data) with enriched lookup tables (such as customer demographics or product categories), provides several key benefits:

  1. Simplified Data Structure: Derived datasets come pre-joined and pre-aggregated, meaning that the complex relationships between fact tables and dimension tables have already been resolved. This eliminates the need for additional joins or transformations during query time, reducing the complexity of data retrieval for reporting. Users and analysts can immediately work with the data without needing to understand its underlying relational structure, leading to faster time to insight.

  2. Enhanced Performance: Because the dataset is already enriched and pre-calculated, query execution is significantly faster. Raw data often requires multiple joins and real-time transformations, which can be time-consuming, especially with large datasets. By exporting a derived dataset that includes pre-aggregated metrics (such as total sales, revenue, or customer segments), enterprises can ensure that reporting dashboards, queries, and analytics tools perform optimally, even under heavy workloads or high concurrency.

  3. Consistency and Accuracy: Exporting derived datasets ensures that the same business logic and calculation methods are applied consistently across all use cases. Whether generating dashboards, building reports, or performing ad hoc analyses, the data remains consistent because the underlying facts and metrics have been calculated and validated ahead of time. This reduces the risk of discrepancies or inconsistencies that can arise when multiple teams perform their own calculations on raw data.

  4. Pre-Integrated Lookups for Richer Insights: Derived datasets can also include enriched lookups, such as customer demographics, product categories, and other contextual attributes. These lookup tables are already linked to fact tables, providing a richer, more meaningful view of the data. For example, sales data is not only presented as raw numbers but can also be segmented by customer age, location, or product type, which enables more granular and insightful analysis without requiring additional processing steps.

  5. Improved Dashboard Creation and Decision-Making: With pre-processed data that includes both metrics and contextual information, creating dashboards and performing real-time analytics becomes more straightforward. Decision-makers can rely on the fact that the data is immediately usable, accurate, and up-to-date, allowing them to focus on interpreting insights rather than preparing or cleaning data. This helps accelerate decision-making processes and ensures that the insights derived are trustworthy and actionable.

  6. Reduced Operational Overhead: Exporting derived datasets reduces the operational burden on data teams. By doing the heavy lifting of data transformation and enrichment upfront, enterprises can minimize the number of transformations required during reporting. This leads to fewer mistakes, reduces the need for frequent reprocessing, and frees up resources to focus on more strategic tasks like data governance or advanced analytics.

Adobe Analytics Batch Data Feed

Adobe Analytics data that has been imported into the Adobe Experience Platform (AEP) Data Lake can be further processed through Data Distiller and then exported in batches for more granular analysis. This processing involves several key steps that refine the raw data to provide more meaningful insights.

  1. Sessionization: One of the core processing steps is sessionization, which groups user activities into defined sessions. This can be achieved through a window function or a specialized Data Distiller function that segments interactions into time-bound sessions. For example, all user activities within a 30-minute window can be grouped as one session. Sessionization is crucial for understanding user journeys, behavior within defined periods, and the continuity of interactions.

  2. Attribution Functions: After sessionizing the data, attribution functions are applied. These functions help assign credit for specific conversions or events to the appropriate marketing channels, touchpoints, or user actions. By applying attribution models (such as first-touch, last-touch, or multi-touch attribution), businesses can understand which marketing efforts led to conversions or significant customer actions.

  3. Deep Insights into Behavior and Attribution Patterns: Processing the data through sessionization and attribution enables businesses to gain a deeper understanding of customer behavior and how different channels, campaigns, or touchpoints contribute to desired outcomes (such as purchases, sign-ups, or other conversions). This detailed insight helps to uncover trends and patterns that might be missed with raw, unprocessed data.

  4. Batch Export for Further Analysis: Once the data has been refined through sessionization and attribution, it can be exported in batches. The batch export allows businesses to perform additional analysis, reporting, or integration with other systems. This refined data is now enriched with session-based insights and attribution details, making it more actionable for decision-making and performance tracking.

Special Export Formats for Audiences

There are limitations in Profile or Audience activation exports within Adobe Real-Time Customer Data Platform regarding the structure of the output segment. Output segments are required to follow the essential structure of identity and attributes, mirroring what is present in the Real-Time Customer Profile. Any other custom audience formatting use cases fall under the domain of Data Distiller activation.

In certain cases, you may need to export audiences in a special format as required by a destination. These formats may be unique to the destination’s data integration needs and cannot be handled by the standard Adobe Experience Platform (AEP) Destination Framework.

In such scenarios, an audience format serves as a contract between AEP and the destination. This contract defines the structure and rules for how the dataset (audience) should be exported. Essentially, these formats represent custom ways of structuring audiences that are necessary for some destinations. While audiences are typically handled as part of AEP’s Data Distiller Derived Datasets, there are special cases where the export format of an audience becomes a more tailored requirement.

Key Benefits of Using the Destination Framework:

  1. Access to Non-Cloud Storage Locations: The Destination Framework allows the export of data to various types of storage systems, including on-premises, hybrid environments, or specialized non-cloud destinations.

  2. Audience Definition Integration: The framework enables the integration of audience definitions within the Real-Time Customer Profile, ensuring that audience segmentation aligns with the required format for destinations.

Data Landing Zone Destination

The Data Landing Zone Source is a staging area on the source side where external data sources can push their data, effectively mirroring the AEP data lake but outside the governance boundary. Each sandbox has its own Source Data Landing Zone, with datasets having a 7-day time-to-live before deletion. Similarly, on the destination side, there is a Data Landing Zone Destination where data can be picked up by external systems. This setup allows you to verify dataset exports and even segment data, making it a fast and reliable method for confirming what data is being exported, which we'll utilize in our tutorial.

Prerequisites

We will create a Developer Project and use Python to access credentials for the Data Landing Zone. After that, we’ll use Azure Storage Explorer to retrieve and examine the exported data. Before proceeding, make sure you have:

  • Set up the Developer Project.

  • Generated the access token in Python.

  • Downloaded and installed Azure Storage Explorer.

Access Data Landing Zone Destination Credentials

  1. Access the Data Landing Zone Destination credentials by executing the following code:

import requests

# access_token, client_id, and org_id come from your Developer Project
# (see the Prerequisites above for generating the access token in Python).

# Replace this with your sandbox name
sandbox_name = 'prod'

# The URL to access the Data Landing Zone Destination credentials
url = 'https://platform.adobe.io/data/foundation/connectors/landingzone/credentials?type=dlz_destination'

# Set the headers
headers = {
    "Authorization": f"Bearer {access_token}",
    "x-api-key": client_id,
    "x-gw-ims-org-id": org_id,
    "x-sandbox-name": sandbox_name,
    "Content-Type": "application/json"
}

# Send the GET request to retrieve the Data Landing Zone credentials, including the SAS URI
response = requests.get(url, headers=headers)

# Check the response status and output the result
if response.status_code == 200:
    credentials = response.json()
    print("Data Landing Zone Info:", credentials)
    print("Container Name:", credentials['containerName'])
    print("SAS Token:", credentials['SASToken'])
    print("Storage Account Name:", credentials['storageAccountName'])
    print("SAS URI:", credentials['SASUri'])
else:
    # Handle errors
    print(f"Failed to get credentials. Status Code: {response.status_code}, Response: {response.text}")

If you want the Data Landing Zone Source credentials instead, simply replace the url in the code above with:

url = 'https://platform.adobe.io/data/foundation/connectors/landingzone/credentials?type=user_drop_zone'

Set Up Azure Storage Explorer

  1. Download and install Azure Storage Explorer, then connect it to the Data Landing Zone by following the screenshots in sequence:

    1. Click on the plug (Connect) icon.

    2. Select ADLS Gen2 container or directory.

    3. Choose the SAS URI option.

    4. Paste the SAS URI returned by the Python code above and choose a name for the DLZ container.

    5. Click Connect. Once the connection is complete, you should see the exported files here.

Upload the Data Distiller Derived Dataset

We are going to use the derived dataset that we created in the RFM modeling tutorial. If you have not completed that tutorial, follow its steps to upload the CSV file (rfm_export.csv) and name the dataset RFM_MODEL_SEGMENT.

The CSV file is generated from the RFM_MODEL_SEGMENT view.

Set up the Activation

  1. Navigate to Connections->Destinations->Catalog->Cloud Storage->Data Landing Zone. Click Activate.

  2. Choose Datasets and Configure Destinations

  3. Configure the destination with the following parameters:

    1. Datatype: Choose Datasets

    2. Name: DLZ_Data_Distiller

    3. Description: Be descriptive or just use DLZ_Data_Distiller

    4. Folder Path:

    5. File Type: JSON

    6. Compressed Format: GZIP. Gzip (GNU zip) is a popular file compression and decompression tool used to reduce the size of files, making them easier to store and transmit. You can use any unzip facility in the destination system to retrieve the raw contents.

    7. Turn on all the alerts

  4. Click on the Destination Flow created:

  5. Click on Export Datasets

  6. Choose the RFM_MODEL_SEGMENT dataset to export

  7. Configure the Batch Schedule

    1. Frequency: Change it from the default Daily setting to Hourly.

    2. Pacing: Choose 3 hours

    3. Scheduled start time: It will automatically select the closest available time for you—please do not modify it. Keep in mind, all times are in UTC.

    4. Date: The current date will automatically be set to today's date.

    5. Click Next

    6. Incremental Export: Keep in mind that the data export is processed incrementally, with the first batch job uploading the complete file.

Adobe Experience Platform schedule times are always set in UTC (Coordinated Universal Time), which involves tradeoffs. UTC has these advantages:

  • Global Consistency: UTC provides a single, consistent reference point for time across all regions. This eliminates confusion when dealing with your users operating in different time zones.

  • Simplified Scheduling: Having a unified time standard simplifies scheduling, particularly for global teams, as you avoid needing to adjust for daylight saving time or other regional time changes.

  • Accurate Execution: Since UTC is not affected by time zone shifts, setting schedules in UTC ensures that processes, like data ingestion or activation, run accurately and consistently.

  • Easier Debugging: Using a single time zone for all scheduled events makes tracking, logging, and debugging system events much simpler, as all timestamps are directly comparable.

Disadvantages of using UTC include the need for time zone conversions, potential confusion for non-technical users, manual adjustments for Daylight Saving Time, and a higher risk of human error in scheduling.

By executing the following command in Data Distiller, your local time will be converted to UTC, giving you a clear idea of when the schedule will run:

SELECT from_unixtime(unix_timestamp()) AS utc_time;

The above query converts the current Unix timestamp to UTC, which is not affected by daylight saving time. UTC remains constant throughout the year, so this query will always return the time in UTC regardless of local time zone changes.

  8. Click Finish to complete the setup.

Monitor the Destination Flow

  1. Click on Destinations->Browse->DLZ_Data_Distiller flow

  2. You should see the following status:

Download the Data from Azure Storage Explorer

  1. If the data export confirmation in the Adobe Experience Platform (AEP) UI is successful but the data doesn't appear in Azure Storage Explorer, try refreshing your session first. If the data still isn't visible, attempt to reconnect to your Azure Storage account. If issues persist, simply relaunch the Azure Storage Explorer application to resolve any session problems and display the newly exported data.

  2. Your Storage Explorer UI should look like this:

3. Navigate down into the folders:

4. Download the files locally by selecting the files and clicking Download:

5. Manifest file looks like this:

  6. To open the other file on a Mac, simply double-click it or unzip it if it's compressed. This should result in a JSON file.

Manifest Files

The manifest file will look like the following:

{"flowRunId":"a28e30b1-07eb-4219-8d50-317ee82a5b38","scheduledTime":"2024-09-22T21:00:00Z","exportResults":[{"sinkPath":"/66f0631a95cb962aee9454aa/exportTime=20240922210000","name":"part-00000-tid-2828907778374757437-317df0e1-c96d-4951-8b52-0ececf1ddafd-4508827-1-c000.json.gz","size":21207}]}

The manifest file for the destination export provides key information about the data export process, which is highly useful for the end user in several ways:

  • Audit & Monitoring

    • flowRunId: This is a unique internal identifier for the export process or flow within the Adobe Experience Platform. It allows you to track and trace a specific export job. In case of issues or questions dealing with Adobe Support, the flowRunId can be used by them to find logs, retry the export, or analyze the performance of the flow.

    • scheduledTime: This field shows when the export was scheduled to occur (2024-09-22T21:00:00Z in this case). This is useful for auditing purposes, verifying that the export happened at the correct time, or ensuring the scheduling of exports aligns with your needs (e.g., daily or hourly exports).

  • Data Integrity & Validation

    • exportResults:

      • sinkPath: This is the destination path where the exported data has been stored. This helps the user quickly locate the data for further processing or analysis.

      • name: This is the name of the exported file. It typically encodes details, such as the export run and part number, that help the user identify the data contents and when they were exported.

      • size: This specifies the size of the exported file. Knowing the file size helps the user understand the volume of data being exported and can be useful for managing storage costs, transfer speeds, or estimating the data load. If a file size is unexpectedly small or large, you might want to investigate further to ensure no data was lost or duplicated.
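
As a quick sanity check when a file size looks off, you can compare it against a row count of the source dataset in Data Distiller (RFM_MODEL_SEGMENT is the dataset exported in this tutorial):

-- Count the rows in the exported dataset to compare against the downloaded file contents.
SELECT COUNT(*) AS row_count
FROM RFM_MODEL_SEGMENT;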

DULE: Data Export Marketing Action

DULE (Data Usage Labeling and Enforcement) is a data governance system in Adobe Experience Platform (AEP) that enables you to assign specific usage labels to datasets or individual fields within a dataset. You can create and apply rules, known as policies, which link these usage labels to actions—usually allowing or disallowing certain uses. Most prebuilt policies focus on controlling access to audiences or datasets, either restricting the entire audience or dataset, or specific fields within them.

Let us explore the Data Export Marketing Action in depth.

  1. Browse to Privacy->Policies->Marketing Actions->Data Export

The Data Export action (authored by Adobe) involves exporting data to any location or destination outside of Adobe products and services. Examples include downloading data to your local machine, copying data from the screen, scheduling data delivery to an external location, Customer Journey Analytics scheduled projects, downloading reports, using the Reporting API, and similar activities.

Whether this action will allow the data to be processed depends on its association with specific labels, as defined by a governing policy that determines whether the action is approved or restricted. Therefore, the marketing action itself has no inherent meaning other than enforcing a rule (in this case, an export) or not.

  2. Browse to Privacy->Policies->Labels->Data Export

The C2 contract label specifies that any marketing action associated with this usage label in a policy will result in the export of data being disallowed. This ensures that data governed by the C2 label cannot be exported for marketing purposes under any circumstances.

The C2 label is more restrictive than the C1 label, which only permits export in aggregated, anonymous form. Browse through these labels and they give you a sense of the kinds of governance policies you can impose on the data you have.

  3. Click on Privacy->Policies->Browse->3rd Party Export Restriction

It’s clear that the Data Export Marketing Action (under the Associated Marketing Action column) has been preconfigured by Adobe to automatically prevent the export of any dataset containing fields marked with the C2 contract label (under associated labels). This ensures that fields tagged with the C2 label are restricted from being exported to comply with contractual and governance rules. Thus, it is the labels associated with the marketing action that define the policy here.

  4. Browse to Datasets->Browse->RFM_MODEL_SEGMENT dataset. Click on Manage Data & Access Labels

  5. This brings us into the Data Governance tab. Click on the pencil icon

  6. Choose C2 contract label.

  7. You will notice that all the fields in this dataset are now associated with the C2 contract label, meaning none of the fields can be exported to a third-party destination if the marketing action is enabled for that flow. However, a different dataset with the same schema could still be exported, as the labeling is applied at the dataset level. This allows for dataset-specific control, giving you the flexibility to manage export permissions on a per-dataset basis.

  8. If you want to restrict the export of certain fields across all datasets that share the same schema, you can apply data usage labels at the field level rather than the dataset level. By labeling specific fields (such as those containing sensitive or personal information) with restrictive labels like the C2 contract label, these fields will be blocked from being exported across any dataset using the same schema. Click on Schemas, turn on Show ad hoc schemas, search for rfm, and click on the ad hoc schema.

The Create Dataset from CSV workflow generates both a dataset and a schema dynamically. Since there is no predefined knowledge about whether the schema represents an attribute schema (such as XDM Individual Profile), an event schema (such as XDM Experience Event), or any other standard schema, the resulting schema is referred to as an ad hoc schema. This is because the system does not automatically categorize the dataset under any specific schema type that would typically allow it to be used in predefined workflows or processes.

  9. Click on the schema and then click on Labels. Here you can choose an individual field and apply a C2 contract label to it. Click on the pencil icon

  10. Choose the C2 Contract Label and click Save

  11. You will see the labels applied

In general, all datasets that use this schema will have this field blocked from being exported out.

A Note on Adobe Analytics Data Feeds

Please read the tutorial in this guide on extracting the Adobe Analytics fields, as well as the tutorial on extracting data from nested structures such as arrays and maps (for example, the identities in an identityMap).

Hit-Level Data and Identification in Adobe Experience Platform

In Adobe Experience Platform, hit-level data, traditionally collected by Adobe Analytics, is stored as timestamped event data. This section provides guidance on how to map specific Adobe Analytics Data Feed columns to XDM fields in the Experience Platform. It also shows how hits, visits, and visitors are identified using XDM fields.

Hit Identification

In the Experience Platform, each "hit" represents an event triggered by a user action, such as a page view or link click, and is identified by a combination of hitid_high and hitid_low. These fields are essential for tracking each unique event or interaction.

| Data Feed Column | XDM Field | Type | Description |
| --- | --- | --- | --- |
| hitid_high + hitid_low | _id | string | A unique identifier for each hit. |
| hitid_low | _id | string | Used together with hitid_high for unique identification. |
| hitid_high | _id | string | Used together with hitid_low for unique identification. |
| hit_time_gmt | receivedTimestamp | string | The timestamp of the hit, in UNIX® time. |
| cust_hit_time_gmt | timestamp | string | This timestamp is used in timestamp-enabled datasets. |
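
To see these mappings on your own data, a quick Data Distiller query like the following can help. This is a minimal sketch: analytics_events is a placeholder for your Analytics dataset name, and field availability depends on your schema.

-- Verify the hit-level mapping on your own data.
-- analytics_events is a placeholder; substitute your Analytics dataset name.
SELECT
    _id,                -- hit identifier (hitid_high + hitid_low)
    timestamp,          -- cust_hit_time_gmt
    receivedTimestamp   -- hit_time_gmt
FROM analytics_events
LIMIT 10;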

Visit and Visitor Identification

Visits and visitors are identified using various identity fields in the Experience Platform. The combination of visid_high and visid_low forms a unique identifier for each visit. Additionally, customer-specific visitor IDs (e.g., cust_visid) and geolocation data are stored in the identityMap structure.

| Data Feed Column | XDM Field | Type | Description |
| --- | --- | --- | --- |
| visid_high + visid_low | identityMap | object | A unique identifier for a visit. |
| visid_high + visid_low | endUserIDs._experience.aaid.id | string | A unique identifier for a visit. |
| visid_high | endUserIDs._experience.aaid.primary | boolean | Used with visid_low to uniquely identify a visit. |
| visid_high | endUserIDs._experience.aaid.namespace.code | string | Used with visid_low to identify a visit uniquely. |
| visid_low | identityMap | object | Used with visid_high to identify a visit. |
| cust_visid | identityMap | object | The customer visitor ID. |
| cust_visid | endUserIDs._experience.aacustomid.id | object | The customer visitor ID. |
| cust_visid | endUserIDs._experience.aacustomid.primary | boolean | The customer visitor ID namespace code. |
| cust_visid | endUserIDs._experience.aacustomid.namespace.code | string | Used with visid_low to identify the customer visitor ID uniquely. |
| geo_* | placeContext.geo.* | string, number | Geolocation data, such as country, region, or city. |
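
A similar sketch for the visit and visitor fields (again, analytics_events is a placeholder dataset name):

-- Inspect visit/visitor identifiers and geolocation on the mapped XDM fields.
SELECT
    endUserIDs._experience.aaid.id        AS analytics_visitor_id,  -- visid_high + visid_low
    endUserIDs._experience.aacustomid.id  AS customer_visitor_id,   -- cust_visid
    placeContext.geo.city                 AS city                   -- one of the geo_* columns
FROM analytics_events
LIMIT 10;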

Commerce and Event Tracking

Commerce events such as purchases, product views, and checkouts are critical for e-commerce use cases. These events are represented in XDM fields such as commerce.purchases, and custom events can be tracked using _experience.analytics fields.

| Data Feed Column | XDM Field | Type | Description |
| --- | --- | --- | --- |
| event_list | commerce.purchases, commerce.productViews | string | Standard commerce and custom events triggered on the hit. |
| page_event | web.webInteraction.type | string | The type of hit (e.g., standard hit, download link, exit link, or custom link clicked). |
| page_event_var_1 | web.webInteraction.URL | string | A variable used in link tracking image requests. Contains the URL of the clicked link. |
| page_event_var_2 | web.webInteraction.name | string | A variable used in link tracking image requests. Lists the custom name of the link. |
| paid_search | search.isPaid | boolean | A flag that indicates whether the hit matches paid search detection. |
| ref_type | web.webReferrer.type | string | A numeric ID representing the type of referral for the hit. |
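
And a sketch for the interaction and commerce fields (analytics_events remains a placeholder; the exact commerce fields available depend on your schema):

-- Inspect the web interaction and commerce fields mapped from the data feed columns.
SELECT
    web.webInteraction.type    AS hit_type,           -- page_event
    web.webInteraction.URL     AS clicked_link_url,   -- page_event_var_1
    web.webInteraction.name    AS clicked_link_name,  -- page_event_var_2
    search.isPaid              AS is_paid_search,     -- paid_search
    commerce.purchases.value   AS purchases           -- derived from event_list
FROM analytics_events
LIMIT 10;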

Important: Post-Processing Columns

Adobe Analytics uses columns prefixed with post_ to represent data after processing. However, in the Experience Platform, there is no concept of post-processing fields for datasets collected through the Experience Platform Edge Network (Web SDK, Mobile SDK, Server API). Consequently, both pre- and post-processed data feed columns map to the same XDM field.

  • For example, both page_url and post_page_url map to web.webPageDetails.URL.

Post Processing of Data

Performing transformations like sessionization, attribution, and deduplication in your queries requires leveraging Data Distiller functions.

Sessionization

Sessionization is used to group individual hits into logical sessions based on user interactions within a given time frame. The key ideas that we shall use are the following:

  • Sessionization: The SESS_TIMEOUT function groups user events into sessions based on a timeout period (30 minutes in this case). A new session is started if no activity occurs within the timeout window.

  • Ingestion Time Tracking: The script tracks the start and end times of batch ingestion. It uses this information to process only new data and update the checkpoint logs for future reference.

  • Checkpoint Logs: This process logs the status of each batch in checkpoint_log, making it easy to track the state of data processing.

  • Anonymous Block: The entire sequence runs inside an anonymous block ($$ BEGIN ... END $$), so the statements execute together as a single Data Distiller job.

Sample code for Sessionization

$$ BEGIN
   -- Disable dropping system columns
   set drop_system_columns=false;

   -- Initialize variables
   SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;

   -- Get the last processed batch ingestion time
   SET @from_batch_ingestion_time = SELECT coalesce(last_batch_ingestion_time, 'HEAD')
      FROM checkpoint_log a
      JOIN (
            SELECT MAX(process_timestamp) AS process_timestamp
            FROM checkpoint_log
            WHERE process_name = 'data_feed'
            AND process_status = 'SUCCESSFUL'
      ) b
      ON a.process_timestamp = b.process_timestamp;

   -- Get the last batch ingestion time
   SET @to_batch_ingestion_time = SELECT MAX(_acp_system_metadata.ingestTime)
      FROM events_dataset;

   -- Sessionize the data and insert into data_feed.
   INSERT INTO data_feed
   SELECT *
   FROM (
      SELECT
            userIdentity,
            timestamp,
            SESS_TIMEOUT(timestamp, 60 * 30) OVER (
               PARTITION BY userIdentity
               ORDER BY timestamp
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS session_data,
            page_name,
            ingest_time
      FROM (
            SELECT
               userIdentity,
               timestamp,
               web.webPageDetails.name AS page_name,
               _acp_system_metadata.ingestTime AS ingest_time
            FROM events_dataset
            WHERE timestamp >= current_date - 90
      ) AS a
      ORDER BY userIdentity, timestamp ASC
   ) AS b
   WHERE b.ingest_time >= @from_batch_ingestion_time;

   -- Update the checkpoint_log table
   INSERT INTO checkpoint_log
   SELECT
      'data_feed' process_name,
      'SUCCESSFUL' process_status,
      cast(@to_batch_ingestion_time AS string) last_batch_ingestion_time,
      cast(@last_updated_timestamp AS TIMESTAMP) process_timestamp;
END
$$;

Let us dive into the query in detail:

  1. Disable Dropping System Columns

set drop_system_columns=false;

This command disables the automatic removal of system columns. It ensures that columns like metadata (_acp_system_metadata.ingestTime) are retained and can be referenced later in the query.

  2. Initialize Variables

SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;

This statement initializes the variable @last_updated_timestamp with the current timestamp. This timestamp is later used to record when the batch process is completed.

3. Get the Last Processed Batch Ingestion Time

SET @from_batch_ingestion_time = SELECT coalesce(last_batch_ingestion_time, 'HEAD')
FROM checkpoint_log a
JOIN (
      SELECT MAX(process_timestamp) AS process_timestamp
      FROM checkpoint_log
      WHERE process_name = 'data_feed'
      AND process_status = 'SUCCESSFUL'
) b
ON a.process_timestamp = b.process_timestamp;

This block determines the time of the last successful batch ingestion by:

  • Looking at the checkpoint_log table for entries where process_name is data_feed and process_status is SUCCESSFUL.

  • The coalesce function ensures that if there's no previous ingestion time (first run), it uses the default value 'HEAD'.

  • The MAX(process_timestamp) fetches the most recent batch ingestion time.

  4. Get the Last Batch Ingestion Time

SET @to_batch_ingestion_time = SELECT MAX(_acp_system_metadata.ingestTime)
FROM events_dataset;

This fetches the maximum ingestion time (_acp_system_metadata.ingestTime) from the events_dataset to determine when the most recent batch of data was ingested.

  5. Sessionize the Data and Insert it into data_feed

INSERT INTO data_feed
SELECT *
FROM (
   SELECT
         userIdentity,
         timestamp,
         SESS_TIMEOUT(timestamp, 60 * 30) OVER (
            PARTITION BY userIdentity
            ORDER BY timestamp
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ) AS session_data,
         page_name,
         ingest_time
   FROM (
         SELECT
            userIdentity,
            timestamp,
            web.webPageDetails.name AS page_name,
            _acp_system_metadata.ingestTime AS ingest_time
         FROM events_dataset
         WHERE timestamp >= current_date - 90
   ) AS a
   ORDER BY userIdentity, timestamp ASC
) AS b
WHERE b.ingest_time >= @from_batch_ingestion_time;

This section performs sessionization and data insertion:

  1. Inner Query (Alias: a):

    1. Extracts relevant fields like userIdentity, timestamp, page_name, and ingest_time from events_dataset.

    2. Filters records to only include those within the past 90 days (timestamp >= current_date - 90).

  2. Sessionization (SESS_TIMEOUT):

    1. Uses the SESS_TIMEOUT function to create session boundaries with a 30-minute (1800 seconds) timeout.

    2. The OVER clause applies sessionization logic by partitioning the data by userIdentity and ordering by timestamp.

    3. Each row is assigned a session identifier based on user activity within a 30-minute window of inactivity.

  3. Outer Query (Alias: b):

    1. Selects and orders data based on userIdentity and timestamp.

    2. Filters the result set to only include data that has been ingested since the last batch ingestion (b.ingest_time >= @from_batch_ingestion_time).

  4. Insert:

    1. Inserts the sessionized data into the data_feed table.

  6. Update the Checkpoint Log

INSERT INTO checkpoint_log
SELECT
   'data_feed' process_name,
   'SUCCESSFUL' process_status,
   cast(@to_batch_ingestion_time AS string) last_batch_ingestion_time,
   cast(@last_updated_timestamp AS TIMESTAMP) process_timestamp;

This inserts a new entry into the checkpoint_log table with:

  • process_name: 'data_feed'.

  • process_status: 'SUCCESSFUL', indicating that the batch was successfully processed.

  • last_batch_ingestion_time: The most recent ingestion time (from @to_batch_ingestion_time).

  • process_timestamp: The timestamp when the process was completed (from @last_updated_timestamp).
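
Both the sessionization block above and the attribution block below assume that a checkpoint_log table already exists. A minimal sketch of one way to create an empty table with a matching schema (the column names follow the INSERT above; adjust them to whatever your checkpoint INSERT writes):

-- One-time setup: an empty checkpoint table whose columns match the INSERT above.
-- The WHERE false clause creates the schema without inserting any rows.
CREATE TABLE checkpoint_log AS
SELECT
    cast(NULL AS string)    AS process_name,
    cast(NULL AS string)    AS process_status,
    cast(NULL AS string)    AS last_batch_ingestion_time,
    cast(NULL AS timestamp) AS process_timestamp
WHERE false;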

Attribution

Attribution functions can be used to assign credit to different touchpoints in a user's journey based on predefined rules (e.g., last touch attribution).

Sample Code

$$ BEGIN
 SET drop_system_columns=false;

-- Initialize variables
 SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;

-- Get the last processed batch ingestion time 1718755872325
 SET @from_batch_ingestion_time =
     SELECT coalesce(last_snapshot_id, 'HEAD')
     FROM checkpoint_log a
     JOIN (
         SELECT MAX(process_timestamp) AS process_timestamp
         FROM checkpoint_log
         WHERE process_name = 'data_feed'
         AND process_status = 'SUCCESSFUL'
     ) b
     ON a.process_timestamp = b.process_timestamp;

 -- Get the last batch ingestion time 1718758687865
 SET @to_batch_ingestion_time =
     SELECT MAX(_acp_system_metadata.ingestTime)
     FROM demo_data_trey_mcintyre_midvalues;

 -- Sessionize the data and insert into new_sessionized_data
 INSERT INTO new_sessionized_data
 SELECT *
 FROM (
     SELECT
         _id,
         timestamp,
         struct(User_Identity,
         cast(SESS_TIMEOUT(timestamp, 60 * 30) OVER (
             PARTITION BY User_Identity
             ORDER BY timestamp
             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ) as string) AS SessionData,
         to_timestamp(from_unixtime(ingest_time/1000, 'yyyy-MM-dd HH:mm:ss')) AS IngestTime,
         PageName,
         first_url,
         first_channel_type
           ) as _demosystem5
     FROM (
         SELECT
             _id,
             ENDUSERIDS._EXPERIENCE.MCID.ID as User_Identity,
             timestamp,
             web.webPageDetails.name AS PageName,
            attribution_first_touch(timestamp, '', web.webReferrer.url) OVER (PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value AS first_url,
            attribution_first_touch(timestamp, '',channel.typeAtSource) OVER (PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID ORDER BY timestamp ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).value AS first_channel_type,
             _acp_system_metadata.ingestTime AS ingest_time
         FROM demo_data_trey_mcintyre_midvalues
         WHERE timestamp >= current_date - 90
     )
     ORDER BY User_Identity, timestamp ASC
 )
 WHERE _demosystem5.IngestTime >= to_timestamp(from_unixtime(@from_batch_ingestion_time/1000, 'yyyy-MM-dd HH:mm:ss'));

-- Update the checkpoint_log table
INSERT INTO checkpoint_log
SELECT
   'data_feed' as process_name,
   'SUCCESSFUL' as process_status,
   cast(@to_batch_ingestion_time AS string) as last_snapshot_id,
   cast(@last_updated_timestamp AS timestamp) as process_timestamp;

END
$$;

This query builds on the sessionization logic by adding attribution functions that capture the first touchpoint URL and channel type for each user.

Attribution Logic

In the second SQL query, two attribution functions are used:

attribution_first_touch(timestamp, '', web.webReferrer.url) OVER (
    PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID 
    ORDER BY timestamp ASC 
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
).value AS first_url,
attribution_first_touch(timestamp, '', channel.typeAtSource) OVER (
    PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID 
    ORDER BY timestamp ASC 
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
).value AS first_channel_type,

These functions introduce first-touch attribution, which differs from the simple sessionization logic in the first query. The above query uses the attribution_first_touch() Data Distiller function to identify the first event (or touchpoint) for each session or user, capturing the first URL visited and the first channel type used during a session.
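
Data Distiller also provides last-touch attribution. Assuming attribution_last_touch accepts the same arguments as attribution_first_touch, the equivalent last-touch expression would be a sketch like this:

-- Last-touch counterpart of the first-touch expression above (sketch).
attribution_last_touch(timestamp, '', web.webReferrer.url) OVER (
    PARTITION BY ENDUSERIDS._EXPERIENCE.MCID.ID
    ORDER BY timestamp ASC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
).value AS last_url

Swapping the first-touch expressions for their last-touch counterparts (or combining both) lets the same batch pipeline support the different attribution models discussed earlier.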
