ACT 100: Dataset Activation with Data Distiller
Shipping your datasets to distant destinations for maximizing enterprise ROI
Introduction
In today's data-driven enterprises, activating datasets from a Customer Data Platform (CDP) plays a critical role in maximizing the value of AI/ML models, enterprise reporting, and Customer 360 initiatives. Dataset activation enables AI/ML algorithms to predict customer behavior, delivering highly personalized interactions across channels. In enterprise reporting, activated data provides real-time insights for performance tracking. For Customer 360, it unifies customer profiles, giving businesses a comprehensive view of their customers, ultimately driving better decision-making, precise targeting, and improved customer experiences across the organization.
Overview
Data Distiller offers a variety of cloud storage options accessible through the Destination UI:
Cloud Storage Destinations (File-Based): Accessible via Destination UI, supporting 6 cloud storage options:
Amazon S3
Google Cloud Storage
Azure Data Lake Storage Gen 2
Azure Blob Storage
Data Landing Zone
FTP
Batch Export Options:
Incremental and first-time full export
Export frequencies: 3, 6, 8, 12, 24 hours
Output Formats: JSON, Parquet. CSV is not supported.
Data Job Export Limits:
Event data: Datasets conforming to the Experience Event schema (which have an _id and a timestamp) can be exported for a maximum of 365 days. You can work around this by using a Profile Record schema.
Volume: 10 billion records across all datasets in a single job
DULE Enforcement: Derived datasets must be labeled manually in Data Distiller to ensure compliance.
Data Distiller Derived Datasets vs. Raw Dataset Export
Data Distiller provides additional lake storage with a flexible and generous data retention policy, ensuring that large volumes of data can be stored and accessed over extended periods to meet enterprise requirements (check license entitlements for details). It converts raw datasets into optimized, derived datasets tailored for enterprise use cases like reporting, AI/ML model training, and business insights. This transformation ensures the data is structured, relevant, and analysis-ready, eliminating the need for complex processing and simplifying the extraction of actionable insights.
Ensure that the batch export schedule is set to run after the completion of Data Distiller jobs. Currently, there is no functionality to trigger a batch export immediately after a Data Distiller job finishes, so careful scheduling is required to prevent overlap or incomplete data export.
AI/ML Training Use Cases
Exporting a derived dataset, also referred to as a feature dataset in an AI/ML context, offers significant benefits compared to working with raw data, particularly in scenarios involving data analysis, reporting, or model training. Derived datasets consist of pre-processed, structured, and often enriched information that is ready for immediate use. This structured nature provides several critical advantages:
Pre-Processed and Ready for Use: Derived datasets have undergone pre-processing to clean, transform, and enhance the raw data. This involves steps such as data normalization, outlier removal, handling missing values, and applying relevant transformations. By performing these tasks ahead of time, the dataset is ready for analysis or AI/ML model training without requiring additional preparation. This significantly reduces the time and effort needed for data cleaning and preprocessing, allowing teams to focus directly on extracting insights or building models.
Feature Engineering: One of the key components of a derived dataset is the inclusion of engineered features. These features are specifically designed to capture important insights, trends, or patterns that may not be apparent in the raw data. For example, features could include customer behavior patterns, time-based aggregates (like rolling averages), or calculated metrics (like customer lifetime value). By incorporating these meaningful features, derived datasets eliminate the need for analysts or data scientists to manually engineer features from raw data, thereby streamlining the analytical process.
Reduced Processing Time: Since the heavy lifting of data transformation has already been done, using a derived dataset greatly reduces the processing time for queries, model training, or reports. Raw data often requires multiple rounds of cleaning, joining, and transforming before it can be used effectively, which can be resource-intensive. Derived datasets provide all of the necessary transformations in advance, allowing business users and data scientists to bypass these steps and focus on the final analysis or model optimization.
Consistency Across Analyses: Derived datasets ensure that all users are working with the same set of pre-calculated features and metrics, promoting consistency across different analyses and reports. By exporting a dataset that includes standard features and attributes, organizations can avoid discrepancies that often arise when different teams calculate metrics or derive features independently from raw data. This consistency not only reduces errors but also enhances collaboration by ensuring everyone is working with the same version of the data.
Improved Performance for AI/ML Models: In machine learning workflows, derived datasets often lead to better model performance. This is because the features included in the dataset have been carefully engineered to highlight relevant patterns and relationships that are crucial for model training. Pre-processed data is cleaner, more relevant, and typically optimized for specific use cases. By providing models with high-quality features, organizations can improve prediction accuracy, reduce training time, and streamline hyperparameter tuning.
Cleaner and More Relevant Data: Derived datasets are typically cleaner and more relevant to specific business problems. Raw data may contain irrelevant information, missing values, or noise that can skew results. Derived datasets, on the other hand, focus on key attributes and features that have been filtered and processed for accuracy and relevance. This results in datasets that are more aligned with business objectives, providing decision-makers with higher-quality information for driving insights and making decisions.
Streamlined Decision-Making for Business Users: By delivering datasets that are pre-processed and enriched with meaningful features, business users can more easily extract insights without requiring in-depth technical knowledge of data processing. The simplified structure and curated features of a derived dataset allow for faster and more accurate decision-making, whether the data is used for creating dashboards, running reports, or feeding predictive models. This enables business teams to act quickly on data-driven insights without having to navigate the complexities of raw data transformation.
Enterprise Reporting
In enterprise reporting, exporting a derived dataset offers significant advantages over working with raw data. Derived datasets are the result of pre-processed data that integrates pre-calculated facts and meaningful attributes from a well-structured data model, such as a star schema. This structure, which combines fact tables (like sales, revenue, or transaction data) with enriched lookup tables (such as customer demographics or product categories), provides several key benefits:
Simplified Data Structure: Derived datasets come pre-joined and pre-aggregated, meaning that the complex relationships between fact tables and dimension tables have already been resolved. This eliminates the need for additional joins or transformations during query time, reducing the complexity of data retrieval for reporting. Users and analysts can immediately work with the data without needing to understand its underlying relational structure, leading to faster time to insight.
Enhanced Performance: Because the dataset is already enriched and pre-calculated, query execution is significantly faster. Raw data often requires multiple joins and real-time transformations, which can be time-consuming, especially with large datasets. By exporting a derived dataset that includes pre-aggregated metrics (such as total sales, revenue, or customer segments), enterprises can ensure that reporting dashboards, queries, and analytics tools perform optimally, even under heavy workloads or high concurrency.
Consistency and Accuracy: Exporting derived datasets ensures that the same business logic and calculation methods are applied consistently across all use cases. Whether generating dashboards, building reports, or performing ad hoc analyses, the data remains consistent because the underlying facts and metrics have been calculated and validated ahead of time. This reduces the risk of discrepancies or inconsistencies that can arise when multiple teams perform their own calculations on raw data.
Pre-Integrated Lookups for Richer Insights: Derived datasets can also include enriched lookups, such as customer demographics, product categories, and other contextual attributes. These lookup tables are already linked to fact tables, providing a richer, more meaningful view of the data. For example, sales data is not only presented as raw numbers but can also be segmented by customer age, location, or product type, which enables more granular and insightful analysis without requiring additional processing steps.
Improved Dashboard Creation and Decision-Making: With pre-processed data that includes both metrics and contextual information, creating dashboards and performing real-time analytics becomes more straightforward. Decision-makers can rely on the fact that the data is immediately usable, accurate, and up-to-date, allowing them to focus on interpreting insights rather than preparing or cleaning data. This helps accelerate decision-making processes and ensures that the insights derived are trustworthy and actionable.
Reduced Operational Overhead: Exporting derived datasets reduces the operational burden on data teams. By doing the heavy lifting of data transformation and enrichment upfront, enterprises can minimize the number of transformations required during reporting. This leads to fewer mistakes, reduces the need for frequent reprocessing, and frees up resources to focus on more strategic tasks like data governance or advanced analytics.
Adobe Analytics Batch Data Feed
Adobe Analytics data that has been imported into the Adobe Experience Platform (AEP) Data Lake can be further processed through Data Distiller and then exported in batches for more granular analysis. This processing involves several key steps that refine the raw data to provide more meaningful insights.
Sessionization: One of the core processing steps is sessionization, which groups user activities into defined sessions. This can be achieved through a window function or a specialized Data Distiller function that segments interactions into time-bound sessions. For example, all user activities within a 30-minute window can be grouped as one session. Sessionization is crucial for understanding user journeys, behavior within defined periods, and the continuity of interactions.
Attribution Functions: After sessionizing the data, attribution functions are applied. These functions help assign credit for specific conversions or events to the appropriate marketing channels, touchpoints, or user actions. By applying attribution models (such as first-touch, last-touch, or multi-touch attribution), businesses can understand which marketing efforts led to conversions or significant customer actions.
Deep Insights into Behavior and Attribution Patterns: Processing the data through sessionization and attribution enables businesses to gain a deeper understanding of customer behavior and how different channels, campaigns, or touchpoints contribute to desired outcomes (such as purchases, sign-ups, or other conversions). This detailed insight helps to uncover trends and patterns that might be missed with raw, unprocessed data.
Batch Export for Further Analysis: Once the data has been refined through sessionization and attribution, it can be exported in batches. The batch export allows businesses to perform additional analysis, reporting, or integration with other systems. This refined data is now enriched with session-based insights and attribution details, making it more actionable for decision-making and performance tracking.
You can see these ideas in action in this special note here.
Special Export Formats for Audiences
There are limitations in Profile or Audience activation exports within Adobe Real-Time Customer Data Platform regarding the structure of the output segment. Output segments are required to follow the essential structure of identity and attributes, mirroring what is present in the Real-Time Customer Profile. Any other custom audience formatting use cases fall under the domain of Data Distiller activation.
In certain cases, you may need to export audiences in a special format as required by a destination. These formats may be unique to the destination’s data integration needs and cannot be handled by the standard Adobe Experience Platform (AEP) Destination Framework.
In such scenarios, an audience format serves as a contract between AEP and the destination. This contract defines the structure and rules for how the dataset (audience) should be exported. Essentially, these formats represent custom ways of structuring audiences that are necessary for some destinations. While audiences are typically handled as part of AEP’s Data Distiller Derived Datasets, there are special cases where the export format of an audience becomes a more tailored requirement.
Key Benefits of Using the Destination Framework:
Audience Definition Integration: The framework enables the integration of audience definitions within the Real-Time Customer Profile, ensuring that audience segmentation aligns with the required format for destinations.
Data Landing Zone Destination
The Data Landing Zone Source is a staging area on the source side where external data sources can push their data, effectively mirroring the AEP data lake but outside the governance boundary. Each sandbox has its own Source Data Landing Zone, with datasets having a 7-day time-to-live before deletion. Similarly, on the destination side, there is a Data Landing Zone Destination where data can be picked up by external systems. This setup allows you to verify dataset exports and even segment data, making it a fast and reliable method for confirming what data is being exported, which we'll utilize in our tutorial.
Prerequisites
We will create a Developer Project and use Python to access credentials for the Data Landing Zone. After that, we’ll use Azure Storage Explorer to retrieve and examine the exported data.
Access Data Landing Zone Destination Credentials
Set up the Developer Project based on the instructions in this section
Generate the Access Token in Python based on the instructions in this section
Access the Data Landing Zone Destination credentials by executing the following code:
If you want to get the Data Landing Zone Source credentials, simply replace the url in the above code as follows:
Set up Azure Storage Explorer
Download the Azure Storage Explorer based on the instructions in this section
Set up Azure Storage Explorer by following the pictures in sequence:
Upload the Data Distiller Derived Dataset
We are going to use the derived dataset that we created in the following tutorial:
The CSV file is generated from the RFM_MODEL_SEGMENT view:
If you have not completed the tutorial, then follow the steps here to upload the CSV file. Name the dataset RFM_MODEL_SEGMENT.
It looks like this:
Set up the Activation
Navigate to Connections->Destinations->Catalog->Cloud Storage->Data Landing Zone. Click Activate.
Choose Datasets and Configure Destinations
Configure the destination with the following parameters:
Datatype: Choose Datasets
Name: DLZ_Data_Distiller
Description: Be descriptive or just use DLZ_Data_Distiller
Folder Path:
File Type: JSON
Compressed Format: GZIP. Gzip (GNU zip) is a popular file compression and decompression tool used to reduce the size of files, making them easier to store and transmit. You can use any unzip facility in the destination system to retrieve the raw contents.
Include the Manifest file. Details about Manifest files for debugging are here.
Turn on all the alerts
Choose the Data Export Marketing Action. A more detailed discussion is in the DULE section below.
Click on the Destination Flow created:
Click on Export Datasets
Choose RFM_MODEL_SEGMENT dataset to export
Configure the Batch Schedule
Frequency: Change it from the default Daily setting to Hourly.
Pacing: Choose 3 hours
Scheduled start time: It will automatically select the closest available time for you—please do not modify it. Keep in mind, all times are in UTC.
Date: This will automatically be set to today's date.
Click Next
Incremental Export: Keep in mind that the data export is processed incrementally, with the first batch job uploading the complete file.
Adobe Experience Platform schedule times are always set in UTC (Coordinated Universal Time) which has tradeoffs. UTC has these advantages:
Global Consistency: UTC provides a single, consistent reference point for time across all regions. This eliminates confusion when dealing with your users operating in different time zones.
Simplified Scheduling: Having a unified time standard simplifies scheduling, particularly for global teams, as you avoid needing to adjust for daylight saving time or other regional time changes.
Accurate Execution: Since UTC is not affected by time zone shifts, setting schedules in UTC ensures that processes, like data ingestion or activation, run accurately and consistently.
Easier Debugging: Using a single time zone for all scheduled events makes tracking, logging, and debugging system events much simpler, as all timestamps are directly comparable.
Disadvantages of using UTC include the need for time zone conversions, potential confusion for non-technical users, manual adjustments for Daylight Saving Time, and a higher risk of human error in scheduling.
By executing the following command in Data Distiller, your local time will be converted to UTC, giving you a clear idea of when the schedule will run:
SELECT from_unixtime(unix_timestamp()) AS utc_time;
The above query converts the current Unix timestamp to UTC, which is not affected by daylight saving time. UTC remains constant throughout the year, so this query will always return the time in UTC regardless of local time zone changes.
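If you want to check a specific local schedule time rather than the current moment, a small variation using Spark SQL's to_utc_timestamp function can do the conversion; the timestamp and time zone below are just placeholders:

```sql
-- Convert a hypothetical 9:00 AM Pacific start time to its UTC equivalent
SELECT to_utc_timestamp('2024-03-15 09:00:00', 'America/Los_Angeles') AS utc_schedule_time;
```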
Click Finish to complete the setup.
Monitor the Destination Flow
Click on Destinations->Browse->DLZ_Data_Distiller flow
You should see the following status:
Download the Data from Azure Storage Explorer
If the data export confirmation in the Adobe Experience Platform (AEP) UI is successful but the data doesn't appear in Azure Storage Explorer, try refreshing your session first. If the data still isn't visible, attempt to reconnect to your Azure Storage account. If issues persist, simply relaunch the Azure Storage Explorer application to resolve any session problems and display the newly exported data.
Your Storage Explorer UI should look like this:
3. Navigate down into the folders:
4. Download the files locally by selecting the files and clicking Download:
5. Manifest file looks like this:
To open the other file on a Mac, simply double-click it or unzip it if it's compressed. This should result in a JSON file:
Manifest Files
The manifest file will look like the following:
The manifest file for the destination export provides key information about the data export process, which is highly useful for the end user in several ways:
Audit & Monitoring
flowRunId: This is a unique internal identifier for the export process or flow within the Adobe Experience Platform. It allows you to track and trace a specific export job. In case of issues or questions dealing with Adobe Support, the flowRunId can be used by them to find logs, retry the export, or analyze the performance of the flow.
scheduledTime: This field shows when the export was scheduled to occur (2023-08-18T01:00:00Z in this case). This is useful for auditing purposes, verifying that the export happened at the correct time, or ensuring the scheduling of exports aligns with your needs (e.g., daily or hourly exports).
Data Integrity & Validation
exportResults:
sinkPath: This is the destination path where the exported data has been stored. This helps the user quickly locate the data for further processing or analysis.
name: This is the name of the exported file. It often embeds details, such as the export time, that help the user identify the data contents.
size: This specifies the size of the exported file. Knowing the file size helps the user understand the volume of data being exported and can be useful for managing storage costs, transfer speeds, or estimating the data load. If a file size is unexpectedly small or large, you might want to investigate further to ensure no data was lost or duplicated.
DULE: Data Export Marketing Action
DULE (Data Usage Labeling and Enforcement) is a data governance system in Adobe Experience Platform (AEP) that enables you to assign specific usage labels to datasets or individual fields within a dataset. You can create and apply rules, known as policies, which link these usage labels to actions—usually allowing or disallowing certain uses. Most prebuilt policies focus on controlling access to audiences or datasets, either restricting the entire audience or dataset, or specific fields within them.
Let us explore the Data Export Marketing Action in depth
Browse to Privacy->Policies->Marketing Actions->Data Export
The Data Export action (authored by Adobe) involves exporting data to any location or destination outside of Adobe products and services. Examples include downloading data to your local machine, copying data from the screen, scheduling data delivery to an external location, Customer Journey Analytics scheduled projects, downloading reports, using the Reporting API, and similar activities.
Whether this action will allow the data to be processed depends on its association with specific labels, as defined by a governing policy that determines whether the action is approved or restricted. Therefore, the marketing action itself has no inherent meaning other than enforcing a rule (in this case, an export) or not.
Browse to Privacy->Policies->Labels->Data Export
The C2 contract label specifies that any marketing action associated with this usage label in a policy will result in the export of data being disallowed. This ensures that data governed by the C2 label cannot be exported for marketing purposes under any circumstances.
The C2 label is more restrictive than the C1 label, which only permits export in aggregated, anonymous form. Browse through these labels and they give you a sense of the kinds of governance policies you can impose on the data you have.
Click on Privacy->Policies->Browse->3rd Party Export Restriction
It’s clear that the Data Export Marketing Action (under the Associated Marketing Action column) has been preconfigured by Adobe to automatically prevent the export of any dataset containing fields marked with the C2 contract label (under the Associated Labels column). This ensures that fields tagged with the C2 label are restricted from being exported, in compliance with contractual and governance rules. In other words, the policy here is defined by the association between the usage labels and the marketing action.
Browse to Datasets->Browse->RFM_MODEL_SEGMENT dataset. Click on Manage Data & Access Labels
This brings us to the Data Governance tab. Click on the pencil icon
Choose C2 contract label.
You will notice that all the fields in this dataset are now associated with the C2 contract label, meaning none of the fields can be exported to a third-party destination if the marketing action is enabled for that flow. However, a different dataset with the same schema could still be exported, as the labeling is applied at the dataset level. This allows for dataset-specific control, giving you the flexibility to manage export permissions on a per-dataset basis.
If you want to restrict the export of certain fields across all datasets that share the same schema, you can apply data usage labels at the field level rather than the dataset level. By labeling specific fields (such as those containing sensitive or personal information) with restrictive labels like the C2 contract label, these fields will be blocked from being exported across any dataset using the same schema. Click on Schemas. Turn on Show adhoc schemas. Search for rfm. Click on the ad hoc schema
The Create Dataset from CSV workflow generates both a dataset and a schema dynamically. Since there is no predefined knowledge about whether the schema represents an attribute schema (such as XDM Individual Profile), an event schema (such as XDM Experience Event), or any other standard schema, the resulting schema is referred to as an ad hoc schema. This is because the system does not automatically categorize the dataset under any specific schema type that would typically allow it to be used in predefined workflows or processes.
Click on the schema and then click on Labels. You will see a screen that looks like this. Here you can choose an individual field and apply a C2 contract label to it. Click on the pencil icon
Choose the C2 Contract Label and click Save
You will see the labels applied
In general, all datasets that use this schema will have this field blocked from being exported.
A Note on Adobe Analytics Data Feeds
Please read this tutorial on extracting the fields:
Please read the following tutorial on extracting data from nested structures like arrays and maps, such as the identities from an identityMap:
Hit-Level Data and Identification in Adobe Experience Platform
In Adobe Experience Platform, hit-level data, traditionally collected by Adobe Analytics, is stored as timestamped event data. This section provides guidance on how to map specific Adobe Analytics Data Feed columns to XDM fields in the Experience Platform. It also shows how hits, visits, and visitors are identified using XDM fields.
Hit Identification
In the Experience Platform, each "hit" represents an event triggered by a user action, such as a page view or link click, and is identified by a combination of hitid_high and hitid_low. These fields are essential for tracking each unique event or interaction.
Visit and Visitor Identification
Visits and visitors are identified using various identity fields in the Experience Platform. The combination of visid_high and visid_low forms a unique identifier for each visit. Additionally, customer-specific visitor IDs (e.g., cust_visid) and geolocation data are stored in the identityMap structure.
Commerce and Event Tracking
Commerce events such as purchases, product views, and checkouts are critical for e-commerce use cases. These events are represented in XDM fields such as commerce.purchases, and custom events can be tracked using _experience.analytics fields.
Important: Post-Processing Columns
Adobe Analytics uses columns prefixed with post_ to represent data after processing. However, in the Experience Platform, there is no concept of post-processing fields for datasets collected through the Experience Platform Edge Network (Web SDK, Mobile SDK, Server API). Consequently, both pre- and post-processed data feed columns map to the same XDM field.
For example, both page_url and post_page_url map to web.webPageDetails.URL.
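For instance, a quick way to confirm this mapping is to query the XDM field directly. The dataset name below is a placeholder for your Analytics-sourced dataset:

```sql
-- Both page_url and post_page_url surface as the single XDM field web.webPageDetails.URL
SELECT
    timestamp,
    web.webPageDetails.URL AS page_url,
    identityMap
FROM analytics_events_dataset
LIMIT 10;
```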
Post Processing of Data
Performing transformations like sessionization, attribution, and deduplication in your queries requires leveraging Data Distiller functions.
Sessionization
Sessionization is used to group individual hits into logical sessions based on user interactions within a given time frame. The key ideas that we shall use are the following:
Sessionization: The SESS_TIMEOUT function groups user events into sessions based on a timeout period (30 minutes in this case). A new session is started if no activity occurs within the timeout window.
Ingestion Time Tracking: The script tracks the start and end times of batch ingestion. It uses this information to process only new data and update the checkpoint logs for future reference.
Checkpoint Logs: This process logs the status of each batch in checkpoint_log, making it easy to track the state of data processing.
Anonymous Block: The statements are wrapped in a single anonymous block so that the variable assignments, the sessionized insert, and the checkpoint update run in sequence as one Data Distiller job.
Sample code for Sessionization
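Below is a minimal sketch of what such an anonymous block could look like. The table names (events_dataset, data_feed, checkpoint_log), the column names (userIdentity, page_name), and the shape of the checkpoint_log table are assumptions drawn from the walkthrough that follows, so treat this as a sketch to adapt to your own schema rather than a drop-in script.

```sql
$$ BEGIN

    -- Keep system columns such as _acp_system_metadata so they can be referenced below
    set drop_system_columns = false;

    -- Timestamp used later to record when this run completed
    SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;

    -- Ingestion time of the last successful run; 'HEAD' on the very first run
    SET @from_batch_ingestion_time = SELECT coalesce(
        (SELECT MAX(process_timestamp)
         FROM checkpoint_log
         WHERE process_name = 'data_feed'
           AND process_status = 'SUCCESSFUL'),
        'HEAD');

    -- Most recent ingestion time present in the source dataset
    SET @to_batch_ingestion_time = SELECT MAX(_acp_system_metadata.ingestTime)
        FROM events_dataset;

    -- Sessionize the new events and insert them into data_feed
    INSERT INTO data_feed
    SELECT
        b.userIdentity,
        b.timestamp,
        b.page_name,
        b.session
    FROM (
        SELECT
            a.userIdentity,
            a.timestamp,
            a.page_name,
            a.ingest_time,
            SESS_TIMEOUT(a.timestamp, 60 * 30)
                OVER (PARTITION BY a.userIdentity
                      ORDER BY a.timestamp
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session
        FROM (
            SELECT
                userIdentity,
                timestamp,
                page_name,
                _acp_system_metadata.ingestTime AS ingest_time
            FROM events_dataset
            WHERE timestamp >= current_date - 90
        ) AS a
    ) AS b
    WHERE b.ingest_time >= @from_batch_ingestion_time
    ORDER BY b.userIdentity, b.timestamp;

    -- Record the successful run so the next execution only picks up newer batches
    INSERT INTO checkpoint_log
    SELECT
        'data_feed'              AS process_name,
        'SUCCESSFUL'             AS process_status,
        @to_batch_ingestion_time AS last_batch_ingestion_time,
        @last_updated_timestamp  AS process_timestamp;

END $$;
```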
Let us dive into the query in detail:
Disable Dropping System Columns
This command disables the automatic removal of system columns. It ensures that metadata columns like _acp_system_metadata.ingestTime are retained and can be referenced later in the query.
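From the sketch above, the corresponding statement is:

```sql
-- Retain system columns such as _acp_system_metadata.ingestTime for use later in the query
set drop_system_columns = false;
```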
Initialize Variables
This statement initializes the variable @last_updated_timestamp with the current timestamp. This timestamp is later used to record when the batch process is completed.
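From the sketch above:

```sql
-- Capture the time this run started; written to checkpoint_log at the end of the block
SET @last_updated_timestamp = SELECT CURRENT_TIMESTAMP;
```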
Get the Last Processed Batch Ingestion Time
This block determines the time of the last successful batch ingestion by:
Looking at the checkpoint_log table for entries where process_name is data_feed and process_status is SUCCESSFUL.
The coalesce function ensures that if there is no previous ingestion time (first run), it uses the default value 'HEAD'.
The MAX(process_timestamp) fetches the most recent batch ingestion time.
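In the sketch above, this step looks like the following (the checkpoint_log column names are assumed):

```sql
-- Fall back to 'HEAD' when checkpoint_log has no successful data_feed entry yet
SET @from_batch_ingestion_time = SELECT coalesce(
    (SELECT MAX(process_timestamp)
     FROM checkpoint_log
     WHERE process_name = 'data_feed'
       AND process_status = 'SUCCESSFUL'),
    'HEAD');
```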
Get the Last Batch Ingestion Time
This fetches the maximum ingestion time (_acp_system_metadata.ingestTime) from the events_dataset to determine when the most recent batch of data was ingested.
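From the sketch above, again assuming the events_dataset name:

```sql
-- The newest ingestion time in the source dataset marks the end of this run's window
SET @to_batch_ingestion_time = SELECT MAX(_acp_system_metadata.ingestTime)
    FROM events_dataset;
```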
Sessionize the Data and Insert it into data_feed
This section performs sessionization and data insertion:
Inner Query (Alias: a):
Extracts relevant fields like userIdentity, timestamp, page_name, and ingest_time from events_dataset.
Filters records to only include those within the past 90 days (timestamp >= current_date - 90).
Sessionization (SESS_TIMEOUT):
Uses the SESS_TIMEOUT function to create session boundaries with a 30-minute (1800-second) timeout.
The OVER clause applies sessionization logic by partitioning the data by userIdentity and ordering by timestamp.
Each row is assigned a session identifier based on user activity within a 30-minute window of inactivity.
Outer Query (Alias: b):
Selects and orders data based on userIdentity and timestamp.
Filters the result set to only include data that has been ingested since the last batch ingestion (b.ingest_time >= @from_batch_ingestion_time).
Insert:
Inserts the sessionized data into the data_feed table.
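The heart of this step in the sketch above is the SESS_TIMEOUT window call, shown here as a standalone query for clarity; the 1800-second timeout is written as 60 * 30:

```sql
-- Assign each event a session identifier; a new session starts after 30 minutes of inactivity
SELECT
    userIdentity,
    timestamp,
    page_name,
    SESS_TIMEOUT(timestamp, 60 * 30)
        OVER (PARTITION BY userIdentity
              ORDER BY timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS session
FROM events_dataset
WHERE timestamp >= current_date - 90;
```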
Update the Checkpoint Log
This inserts a new entry into the checkpoint_log table with:
process_name: 'data_feed'.
process_status: 'SUCCESSFUL', indicating that the batch was successfully processed.
last_batch_ingestion_time: The most recent ingestion time (from @to_batch_ingestion_time).
process_timestamp: The timestamp when the process was completed (from @last_updated_timestamp).
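From the sketch above:

```sql
-- Log the run so the next execution only picks up batches ingested after this one
INSERT INTO checkpoint_log
SELECT
    'data_feed'              AS process_name,
    'SUCCESSFUL'             AS process_status,
    @to_batch_ingestion_time AS last_batch_ingestion_time,
    @last_updated_timestamp  AS process_timestamp;
```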
Attribution
Attribution functions can be used to assign credit to different touchpoints in a user's journey based on predefined rules (e.g., last touch attribution).
Sample Code
This attribution query adds attribution logic to capture the first touch point for URLs and channel types.
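Here is a minimal sketch of what such a query could look like, reusing the assumed events_dataset from the sessionization example. The page_url and channel_type columns, and the .value field access on the struct that ATTRIBUTION_FIRST_TOUCH returns, are also assumptions:

```sql
-- First-touch attribution: capture each user's first URL and first channel type
SELECT
    userIdentity,
    timestamp,
    first_touch_url.value     AS first_url,
    first_touch_channel.value AS first_channel
FROM (
    SELECT
        userIdentity,
        timestamp,
        ATTRIBUTION_FIRST_TOUCH(timestamp, 'page_url', page_url)
            OVER (PARTITION BY userIdentity
                  ORDER BY timestamp
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS first_touch_url,
        ATTRIBUTION_FIRST_TOUCH(timestamp, 'channel_type', channel_type)
            OVER (PARTITION BY userIdentity
                  ORDER BY timestamp
                  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS first_touch_channel
    FROM events_dataset
) AS attributed;
```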
Attribution Logic
In this second SQL query, the attribution_first_touch() Data Distiller function is applied twice, introducing first-touch attribution, which differs from the simple sessionization logic in the first query. The function identifies the first event (or touchpoint) for each session or user, capturing the first URL visited and the first channel type used during the session.