DDA 200: Build Data Distiller Audiences on Data Lake Using SQL

Unleash the full potential of your data with Data Distiller—where advanced audience creation meets real-time insights, scalability, and unmatched personalization.

Overview

The Adobe Experience Platform (AEP) Data Lake is a comprehensive data hub that brings together datasets from a wide array of Adobe applications, including Adobe Analytics, Adobe Campaign, Adobe Audience Manager, Marketo, and Adobe Commerce. When you combine this wealth of data with that from Adobe Real-Time Customer Data Platform, Adobe Journey Optimizer, Customer Journey Analytics, Adobe Mix Modeler, Marketo Measure, and Adobe GenStudio, you have the entire Adobe ecosystem at your fingertips—empowering you to personalize at scale like never before. These diverse datasets provide a rich foundation for creating highly targeted and dynamic segments, helping you better understand and engage your customers.

Beyond raw customer data, each of these systems generates system datasets that contain critical information about personalization and customer interactions. Whether it's journey insights from Adobe Journey Optimizer or behavioral data from Adobe Analytics, the AEP Data Lake allows you to access all of this in one place. This unified data source enables you to build Data Distiller Audiences, allowing you to craft audiences based on highly granular customer behaviors and interactions across all channels. With all this data readily available, you are not just building audiences—you are orchestrating personalized, omnichannel experiences that meet customers wherever they are, powered by the full strength of Adobe’s ecosystem.

Benefits of Data Distiller Audiences

Data Distiller stands as the go-to solution for modern audience authoring, offering unmatched flexibility, scalability, and analytical power. When it comes to building personalized, data-driven marketing strategies, Data Distiller brings numerous benefits that make it superior to other audience creation tools.

SQL Power

At the core of Data Distiller's advantage is its ability to leverage SQL, a universally recognized language for database marketing. SQL’s expressiveness provides flexibility and control, allowing you to create detailed audience segments with precision. Unlike restrictive point-and-click tools, SQL gives you the freedom to write complex queries that target specific behaviors, characteristics, or actions. Whether you're segmenting based on purchase behavior, demographics, or engagement history, SQL enables you to fine-tune audience definitions, making it easier to engage the right customers at the right time.

True Behavior-Based Audiences

Modern audience engines often struggle to process large volumes of raw event data efficiently. Data Distiller eliminates this bottleneck. With its ability to seamlessly handle massive amounts of event data and deeply nested structures, Data Distiller goes beyond the capabilities of most campaign tools. This allows you to create highly nuanced audiences based on complex behaviors that would otherwise be difficult to define. Data Distiller also provides powerful data transformation capabilities, making it possible to extract, manipulate, and process data in ways that other systems cannot. Custom joins across datasets let you combine diverse sources of identity information, resulting in a level of audience precision and customization that is unparalleled.

Unmatched Scale

One of the most impressive aspects of Data Distiller is its ability to operate at scale. Whether you're working with millions or billions of records, Data Distiller can handle these datasets effortlessly. This scalability enables the creation of high-value audiences that can drive sophisticated, personalized marketing campaigns without sacrificing performance or speed. Large-scale personalization initiatives, which once seemed daunting, can now be executed efficiently, giving brands the ability to target and engage vast customer bases with ease.

Real-Time or Batch Personalization

You can seamlessly activate audiences in real time or in batch across Adobe's Real-Time Customer Data Platform destinations and Adobe Journey Optimizer. This flexibility allows businesses to shift between batch-based audience creation for long-term insights and real-time activation for immediate, contextually relevant customer engagement. In addition to real-time activations, Data Distiller Audiences can also be enriched with personalization attributes and activated to batch or file-based destinations.

Advanced Audience Orchestration

Data Distiller is more than just an audience segmentation tool; it offers a flexible audience orchestration system. With the ability to create conditional branching and mix-and-match criteria, you can build complex audience structures based on multiple conditions. This allows for the orchestration of highly intricate customer experiences, taking into account diverse data points like behavior, identity, and engagement history. By combining these conditions, you can seamlessly manage customer journeys across touchpoints, delivering truly personalized experiences.

Leverage Audiences from Real-Time Customer Profile

Another key advantage of Data Distiller is its ability to integrate with Profile Snapshot datasets. You can mix and match audience memberships from profiles natively authored in Real-Time Customer Profile, along with identity compositions. This enhances audience precision by merging identity data with real-time audience insights, resulting in more accurate segmentation and better personalization. This flexibility allows for the seamless combination of historical data with real-time insights, unlocking richer and more actionable audience segments.

Cross-Platform Integration with External Audiences

Data Distiller isn’t limited to the audiences on AEP Data Lake alone. You can also integrate it with external audiences created by Adobe Audience Manager and Federated Audience Composition to extend your segmentation strategies. This cross-audience capability allows you to blend SQL-based audiences with those created in external platforms, offering even greater flexibility in personalization and reach. By leveraging multiple segmentation strategies, you can ensure that your audiences are fully optimized for different campaigns and marketing goals.

Advanced Analytics and Insights with Data Distiller

Data Distiller does more than just audience segmentation; it opens up a world of advanced analytics. With its ability to generate deep insights, statistical models, and hypercubes, Data Distiller provides marketers with the tools they need for advanced audience analysis. You can process vast amounts of data from multiple sources to uncover key trends, build multidimensional views of customer segments, and extract actionable insights. Whether you need to explore patterns in customer behavior, run advanced statistical models, or dive deep into the data, Data Distiller is equipped to fuel your marketing strategies with powerful insights.

Superior Privacy and Governance

One of the standout advantages of Data Distiller Audiences built on the AEP Data Lake is the enhanced governance and privacy controls. Since the AEP Data Lake is natively integrated with Adobe Experience Platform's Trust Framework, it benefits from robust data governance, ensuring that every dataset adheres to the strictest compliance, security, and privacy standards. This level of control is critical when dealing with customer data, especially in an environment where regulations like GDPR, CCPA, and other data protection laws are paramount.

Prerequisites

You will need this to upload the test data:

PREP 500: Ingesting CSV Data into Adobe Experience Platform

In some situations, such as retrieving external audience information, you may need to work with complex data structures like maps and with Profile Snapshot datasets. Read this tutorial to familiarize yourself with them:

IDR 200: Extracting Identity Graph from Profile Attribute Snapshot Data with Data Distiller

We will be using the following data to create segments:

email_campaign_dataset.csv (1 MB)

You should also familiarize yourself with some basic statistics, as we take a different, statistics-driven approach to understanding the data in this tutorial:

STATSML 400: Data Distiller Basic Statistics Functions

Retail Case Study: Optimizing Email Marketing Campaigns with Audience Segmentation and A/B Testing

A retail brand is running a series of email marketing campaigns for its Spring Sale, Holiday Offers, and New Arrivals. The marketing team wants to:

Decide on Audience Strategy: They need to determine a segmentation strategy: Should they prioritize loyalty models to guide their campaigns, or shift focus back to engagement? Segment the audience based on engagement or loyalty by creating targeted messaging. For example, craft specific messages for those who have opened emails but haven’t clicked (warm leads) versus those who haven’t engaged at all (cold leads).
[Not in this tutorial] A/B Test Subject Lines: Compare two email subject lines for the same campaign to see which one drives more engagement (open and click rates).
Monitor and Reduce Email Bounces: Track and reduce email bounces by analyzing hard and soft bounces to refine the email list and improve targeting.
Personalized Marketing: Use engagement metrics (like open and click counts) to create personalized follow-up campaigns, offering exclusive deals or reminders based on customer interaction behavior.

In our scenario, we need to send audiences based on campaign performance within the Adobe ecosystem to a third-party activation system. While this approach is not ideal, it is a necessity driven by our architecture, a practice that is common in many organizations. The objective is to leverage system datasets in the Adobe Experience Platform to create new audiences that can be sent to other systems in batch, without significantly increasing the data volume in the Real-Time Customer Profile.

In this case study, the data is simulated to reflect the performance metrics typically seen in campaign tools like Adobe Journey Optimizer or Adobe Campaign. The test data has been simplified into a canonical form, allowing us to focus more on audience design and activation rather than data transformation. For a comprehensive guide on how to utilize similar data from Adobe Journey Optimizer, refer to the tutorial available here.

Expected Outcomes

Higher Engagement: By tracking open and click rates, the marketing team can focus on the most effective content, leading to higher engagement and ultimately increased sales.
Improved Targeting: Customer segmentation based on interaction helps in tailoring future messages, leading to better personalization and increased likelihood of conversion.
[Not in this tutorial] Optimized Content: A/B testing results will provide insights into what content or subject lines resonate most with the audience, enabling the brand to optimize its messaging.
Reduced Bounce Rates: Understanding bounce types (hard or soft) will allow the marketing team to clean up the email list, ensuring better deliverability and engagement metrics.

Focus of this Tutorial

There are separate tutorials that will cover topics on advanced audience orchestration (for optimized content) and activation:

[DRAFT] DDA 202: Data Distiller Audience Orchestration
ACT 100: Dataset Activation with Data Distiller

Ingest Test Data

Follow the steps outlined in PREP 500 to ingest the CSV file listed above.

High-Level Summary of Data Distiller Audience Commands

CREATE AUDIENCE

CREATE AUDIENCE highly_engaged_audience
WITH (primary_identity = email, identity_namespace = Email)
AS (
    SELECT 
        customer_id, 
        email, 
        campaign_name, 
        open_count, 
        click_count,
        CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
    FROM email_campaign_dataset_20241001_050033_012
    WHERE open_count > 0 AND click_count > 0
);

INSERT INTO

INSERT INTO highly_engaged_audience
   (SELECT 
        customer_id, 
        email, 
        campaign_name, 
        open_count + 1 AS open_count, 
        click_count + 2 AS click_count, 
        CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
    FROM email_campaign_dataset_20241001_050033_012
    WHERE open_count > 0 AND click_count > 0);

DROP AUDIENCE

DROP AUDIENCE highly_engaged_audience;

Exploratory Analysis to Define Audience Strategy

Campaign Analysis

Campaign Performance

Before you run the queries, double-check that you have the right dataset. Navigate to Datasets->Browse to locate the dataset, click on it, and retrieve the table name. This is covered in the tutorial here. Be aware that the platform may append extra characters to the table name, as in the table name used in the query below.

SELECT 
    campaign_name,
    COUNT(*) AS total_emails_sent,
    SUM(open_count) AS total_opens,
    SUM(click_count) AS total_clicks,
    ROUND(SUM(open_count) / COUNT(*), 2) AS open_rate,
    ROUND(SUM(click_count) / SUM(open_count), 2) AS click_through_rate
FROM email_campaign_dataset_20241001_050033_012
GROUP BY campaign_name
ORDER BY open_rate DESC;

The result will be:

As an analyst, it's essential to document or summarize the insights derived from the data. In the example above, you're interpreting raw data to draw meaningful conclusions. As you move through this section, notice how advanced statistical analysis can make it easier to describe the data and support your observations. Even if you're not deeply involved in statistics, having that analysis to back up your findings is always valuable.

Flash Discounts has the highest open rate (5.17). Because the query divides total opens by emails sent, this figure is the average number of opens per email, which is why it exceeds 1. It suggests that the email subject or content was particularly engaging. However, the click-through rate is on the lower end compared to others. This suggests that while many customers opened the email, fewer were motivated to click, possibly indicating that the content or call-to-action inside the email could be optimized.
Holiday Offers performs well overall, with both a strong open rate (5.08) and a fairly high click-through rate (0.49). This suggests that the campaign is well-targeted, and the messaging inside the email effectively encourages customer engagement. This campaign is performing relatively well and might serve as a benchmark for future campaigns.
New Arrivals has a similar open rate to Holiday Offers, at 5.01, and matches its click-through rate at 0.49. This suggests that the customers are equally interested in this content, but again, while opens are strong, there’s still room for improving the conversion rate from clicks to action.
Spring Sale has the lowest open rate (4.98) among the four campaigns, though the difference is small. However, it compensates with the highest click-through rate (0.50), suggesting that while fewer people opened the email, those who did were highly engaged and more likely to click. This indicates that the content was persuasive for the subset of users who opened the email, but perhaps the subject line could be improved to increase opens.
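To make the two ratios concrete, the aggregation in the campaign performance query can be reproduced on a handful of rows. This is a minimal sketch in Python rather than Data Distiller SQL, so that it can be run locally. The sample rows and the function name campaign_metrics are invented for illustration and are not taken from the tutorial dataset. The guard against campaigns with zero opens is an addition that the SQL query does not have.

```python
from collections import defaultdict

# Hypothetical rows: (campaign_name, open_count, click_count).
SAMPLE_ROWS = [
    ("Spring Sale", 4, 2),
    ("Spring Sale", 6, 3),
    ("Flash Discounts", 5, 1),
    ("Flash Discounts", 0, 0),
]


def campaign_metrics(rows):
    """Mirror the GROUP BY campaign_name aggregation on in-memory rows."""
    totals = defaultdict(lambda: {"sent": 0, "opens": 0, "clicks": 0})
    for campaign, opens, clicks in rows:
        totals[campaign]["sent"] += 1
        totals[campaign]["opens"] += opens
        totals[campaign]["clicks"] += clicks

    metrics = {}
    for campaign, t in totals.items():
        # Opens per email sent; this can exceed 1 when emails are opened repeatedly.
        open_rate = round(t["opens"] / t["sent"], 2)
        # Clicks per open; None when nothing was opened (assumed guard, not in the SQL).
        ctr = round(t["clicks"] / t["opens"], 2) if t["opens"] else None
        metrics[campaign] = {"open_rate": open_rate, "click_through_rate": ctr}
    return metrics


METRICS = campaign_metrics(SAMPLE_ROWS)
```

In this toy data each Spring Sale email is opened five times on average, which shows why an "open rate" defined this way reads as opens per email rather than as a percentage.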

Best Performing Email Subject Lines

Identify which email subject lines drive the highest engagement.

SELECT 
    email_subject,
    campaign_name,
    COUNT(*) AS total_emails_sent,
    SUM(open_count) AS total_opens,
    SUM(click_count) AS total_clicks,
    ROUND(SUM(open_count) / COUNT(*), 2) AS open_rate,
    ROUND(SUM(click_count) / SUM(open_count), 2) AS click_through_rate
FROM email_campaign_dataset_20241001_050033_012
GROUP BY email_subject, campaign_name
ORDER BY click_through_rate DESC;

This query allows you to compare how different email subject lines perform across campaigns in terms of engagement.

The results will look like this:

Most Popular Subject Lines by Campaign

You'll need to use a window function to identify the top result within each campaign, and then select the highest-ranking element from each partition.

SELECT 
    campaign_name,
    email_subject,
    total_opens
FROM (
    SELECT 
        campaign_name,
        email_subject,
        SUM(open_count) AS total_opens,
        ROW_NUMBER() OVER (PARTITION BY campaign_name ORDER BY SUM(open_count) DESC) AS rank
    FROM email_campaign_dataset_20241001_050033_012
    GROUP BY campaign_name, email_subject
) AS ranked_subjects
WHERE rank = 1;

The result looks like:

The most popular subject lines across the campaigns indicate that customers respond strongly to messages that emphasize urgency and discounts. For the Flash Discounts campaign, the top subject line, "Special Offer Just for You!", suggests that personalized and exclusive offers resonate well with the audience, driving 3,394 opens. Similarly, Holiday Offers and New Arrivals both found success with the subject line "Don’t Miss Out on Big Discounts," which highlights that customers are motivated by the prospect of large savings. The Spring Sale campaign's most successful subject line, "Limited Time Deal!", reinforces the effectiveness of urgency in encouraging engagement. Overall, these subject lines show that messaging focused on exclusive deals and time-sensitive offers performs well across various campaigns.
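The "top row per partition" pattern used in the query can be illustrated outside SQL as well. The following Python sketch assumes the rows have already been aggregated to one total per campaign and subject line. The open totals and the function name top_subject_per_campaign are hypothetical; only the subject-line strings come from the results discussed above.

```python
# Hypothetical pre-aggregated rows: (campaign_name, email_subject, total_opens).
SUBJECT_TOTALS = [
    ("Spring Sale", "Limited Time Deal!", 120),
    ("Spring Sale", "Special Offer Just for You!", 95),
    ("Holiday Offers", "Limited Time Deal!", 80),
    ("Holiday Offers", "Don't Miss Out on Big Discounts", 110),
]


def top_subject_per_campaign(rows):
    """Keep the row that a descending ROW_NUMBER() would rank 1 in each campaign."""
    best = {}
    for campaign, subject, opens in rows:
        # Strictly greater: on ties the first row seen wins, which is an arbitrary
        # choice, much like ROW_NUMBER() without a tie-breaking ORDER BY column.
        if campaign not in best or opens > best[campaign][1]:
            best[campaign] = (subject, opens)
    return best


TOP_SUBJECTS = top_subject_per_campaign(SUBJECT_TOTALS)
```

If ties between subject lines matter for your analysis, adding a second ordering column (for example the subject line itself) makes the result deterministic.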

Data Distiller Statistics: Basic Analysis

Descriptive Statistics

Descriptive statistics give you a summary of the central tendency and dispersion of your key metrics (e.g., open count, click count, purchase value, etc.).

SELECT 
    ROUND(AVG(open_count), 2) AS avg_open_count,
    ROUND(STDDEV(open_count), 2) AS stddev_open_count,
    
    ROUND(AVG(click_count), 2) AS avg_click_count,
    ROUND(STDDEV(click_count), 2) AS stddev_click_count,
    
    ROUND(AVG(avg_purchase_value), 2) AS avg_purchase_value,
    ROUND(STDDEV(avg_purchase_value), 2) AS stddev_purchase_value,
    
    ROUND(AVG(purchase_frequency), 2) AS avg_purchase_frequency,
    ROUND(STDDEV(purchase_frequency), 2) AS stddev_purchase_frequency,
    
    ROUND(MIN(open_count), 2) AS min_open_count,
    ROUND(MAX(open_count), 2) AS max_open_count,
    
    ROUND(MIN(click_count), 2) AS min_click_count,
    ROUND(MAX(click_count), 2) AS max_click_count
FROM email_campaign_dataset_20241001_050033_012;


The results show that, on average, customers open emails about 5 times and click on links around 2.46 times. Purchase behavior shows an average purchase value of $254.78, with customers making approximately 25 purchases on average. However, there's noticeable variability, especially in purchase value (standard deviation of $140.88) and purchase frequency (standard deviation of 14.58), indicating that customer spending and purchase habits differ widely. Some customers don't engage with emails at all (0 opens or clicks), while others engage frequently, with a maximum of 9 opens and clicks.

Customer Purchase and Engagement Correlation

Identify whether customer loyalty and purchase behavior correlate with email engagement (opens and clicks).

SELECT 
    customer_loyalty_score,
    AVG(open_count) AS avg_open_count,
    AVG(click_count) AS avg_click_count,
    AVG(avg_purchase_value) AS avg_purchase_value,
    AVG(purchase_frequency) AS avg_purchase_frequency
FROM email_campaign_dataset_20241001_050033_012
GROUP BY customer_loyalty_score
ORDER BY customer_loyalty_score DESC;

This query helps identify trends between customer engagement in email campaigns and their purchase behavior, providing insights into which customer segments are most valuable.

Let us now compute the correlation between loyalty score and the other metrics:

SELECT 
    'Open Count' AS metric,
    CORR(customer_loyalty_score, open_count) AS correlation
FROM email_campaign_dataset_20241001_050033_012

UNION ALL

SELECT 
    'Click Count' AS metric,
    CORR(customer_loyalty_score, click_count) AS correlation
FROM email_campaign_dataset_20241001_050033_012

UNION ALL

SELECT 
    'Purchase Value' AS metric,
    CORR(customer_loyalty_score, avg_purchase_value) AS correlation
FROM email_campaign_dataset_20241001_050033_012

UNION ALL

SELECT 
    'Purchase Frequency' AS metric,
    CORR(customer_loyalty_score, purchase_frequency) AS correlation
FROM email_campaign_dataset_20241001_050033_012;

The results indicate that there is no meaningful linear correlation between customer loyalty score and the engagement and purchase metrics: open count, click count, purchase value, and purchase frequency. The Pearson correlation coefficients for all these variables are close to zero, suggesting that higher loyalty scores do not predict increased email engagement or purchase behavior. Whether customers have a high or low loyalty score does not appear to impact how often they open or click emails, or how much and how frequently they make purchases. This means that in this dataset, loyalty score is not a strong indicator of customer actions, and other factors may need to be explored to better target and engage customers.

The CORR() function in SQL calculates the Pearson correlation coefficient between two numeric columns. It measures the linear relationship between two variables, producing a value between -1 and 1:

1: A perfect positive correlation, meaning as one variable increases, the other also increases.
0: No correlation, meaning there's no linear relationship between the two variables.
-1: A perfect negative correlation, meaning as one variable increases, the other decreases.
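The three reference values above can be checked with a small hand-written implementation of the Pearson coefficient. This is a minimal Python sketch of the textbook formula, not the internals of CORR(). The function name pearson_corr and the sample sequences are invented for illustration.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical loyalty scores paired with three made-up metrics.
LOYALTY = [1, 2, 3, 4, 5]
PERFECT_POSITIVE = pearson_corr(LOYALTY, [2, 4, 6, 8, 10])
PERFECT_NEGATIVE = pearson_corr(LOYALTY, [10, 8, 6, 4, 2])
NO_LINEAR = pearson_corr(LOYALTY, [3, 1, 4, 1, 3])
```

The third sequence is symmetric around the middle loyalty score, so its deviations cancel and the coefficient comes out at zero up to floating-point error, even though the values are clearly not constant.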

Data Distiller Statistics: Advanced Analysis

SELECT 
    -- Median calculations
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY open_count) AS median_open_count,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY click_count) AS median_click_count,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY avg_purchase_value) AS median_purchase_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY purchase_frequency) AS median_purchase_frequency,

    -- Mode calculations for open_count
    (SELECT open_count
     FROM email_campaign_dataset_20241001_050033_012
     GROUP BY open_count
     ORDER BY COUNT(*) DESC
     LIMIT 1) AS mode_open_count,

    -- Mode calculations for click_count
    (SELECT click_count
     FROM email_campaign_dataset_20241001_050033_012
     GROUP BY click_count
     ORDER BY COUNT(*) DESC
     LIMIT 1) AS mode_click_count,

    -- Mode calculations for avg_purchase_value
    (SELECT avg_purchase_value
     FROM email_campaign_dataset_20241001_050033_012
     GROUP BY avg_purchase_value
     ORDER BY COUNT(*) DESC
     LIMIT 1) AS mode_purchase_value,

    -- Mode calculations for purchase_frequency
    (SELECT purchase_frequency
     FROM email_campaign_dataset_20241001_050033_012
     GROUP BY purchase_frequency
     ORDER BY COUNT(*) DESC
     LIMIT 1) AS mode_purchase_frequency,

    -- Kurtosis calculations
    KURTOSIS(open_count) AS kurtosis_open_count,
    KURTOSIS(click_count) AS kurtosis_click_count,
    KURTOSIS(avg_purchase_value) AS kurtosis_purchase_value,
    KURTOSIS(purchase_frequency) AS kurtosis_purchase_frequency,

    -- Skewness calculations
    SKEWNESS(open_count) AS skewness_open_count,
    SKEWNESS(click_count) AS skewness_click_count,
    SKEWNESS(avg_purchase_value) AS skewness_purchase_value,
    SKEWNESS(purchase_frequency) AS skewness_purchase_frequency

FROM email_campaign_dataset_20241001_050033_012;

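Observe how the median and mode are computed in Data Distiller: the median uses PERCENTILE_CONT(0.5), while the mode is derived with a subquery that groups by the value, orders by COUNT(*) in descending order, and keeps the first row.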

Median represents the middle value in a dataset and helps to provide a clear picture of the typical values, unaffected by extreme outliers. In this case, the median open count is 5, indicating that at least half of the customers opened emails 5 times or fewer, and at least half opened them 5 times or more. Similarly, the median click count is 4, so a typical customer clicked on email links about 4 times. For purchase behavior, the median average purchase value is 275.0, meaning that about half of the customers made average purchases of $275 or less, while the other half made average purchases of $275 or more. Finally, the median purchase frequency is 25, which matches the mean of approximately 25 purchases and is consistent with a roughly symmetric distribution.

Mode reflects the most frequently occurring value within a dataset, providing insights into the most common behaviors. In this analysis, the mode for the open count is 5, meaning that 5 opens was the single most common value, not that a majority of customers opened emails 5 times. Interestingly, the mode for click count is 0, suggesting that the most common behavior was not to click on any links at all. The mode for average purchase value is 275.0, which indicates that $275 was the most frequent average purchase value among customers. The most frequent purchase frequency is 46, showing that 46 purchases was the most common count in the dataset.

For some metrics in our dataset, the median and mode happen to coincide, such as open count (median 5, mode 5) and average purchase value (median 275, mode 275). Even when the values match, median and mode describe different characteristics of the dataset:

Median represents the middle point of the data distribution, meaning half the values are below and half are above.
Mode represents the most frequently occurring value in the dataset, which shows the most common behavior.
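The difference between the two measures is easiest to see on a small sample where they disagree. The sketch below uses Python's standard statistics module. The click counts are hypothetical and are not drawn from the tutorial dataset.

```python
import statistics

# Hypothetical click counts for ten customers.
CLICK_COUNTS = [0, 0, 0, 0, 1, 3, 4, 5, 6, 9]

# With an even number of values, the median is the mean of the two middle
# values (1 and 3 here), the same interpolation PERCENTILE_CONT(0.5) performs.
MEDIAN_CLICKS = statistics.median(CLICK_COUNTS)

# The mode is the single most frequent value.
MODE_CLICKS = statistics.mode(CLICK_COUNTS)
```

Here the most common behavior is zero clicks, yet the middle of the distribution sits at two clicks, so reporting only one of the two measures would hide part of the picture.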

Skewness measures the asymmetry of a data distribution, indicating whether the values tend to cluster more on one side of the mean. In this dataset, the skewness for open count is -0.53, showing a slight negative skew: most customers sit at the higher open counts, with a longer tail toward low open counts, but the skewness isn’t extreme. For click count, the skewness is 0.96, which indicates a positive skew, where most customers have lower click counts, but a few customers click significantly more often. The skewness for average purchase value is very close to zero (0.03), suggesting a nearly symmetrical distribution, with purchase values evenly spread around the mean. Similarly, the purchase frequency has a skewness of -0.009, also indicating a nearly symmetrical distribution, meaning customers’ purchase frequencies are balanced on both sides of the average, with no significant bias toward higher or lower frequencies.

Kurtosis is a measure that describes how much of the data is concentrated in the tails of a distribution, indicating how extreme the values are. In this dataset, the negative kurtosis values for open count (-4.252), purchase value (-4.193), and purchase frequency (-4.217) suggest that the distributions are flatter than a normal bell-shaped curve, meaning there are fewer extreme behaviors, such as very high or very low values, and most customers’ behaviors are closer to the average. The click count, with a kurtosis of 0.073, is close to a normal distribution, indicating a more typical spread of values with a balanced number of extremes. Overall, the negative kurtosis values reflect that most customers behave moderately, and extreme behaviors, such as very high engagement or large purchase amounts, are less common.
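Both shape measures can be written in terms of central moments. The Python sketch below uses the common population (moment-based) definitions, with kurtosis reported as excess kurtosis so that a normal distribution scores 0. Whether SKEWNESS() and KURTOSIS() in Data Distiller apply exactly the same bias convention is an assumption here. The helper names and sample lists are invented for illustration.

```python
def central_moment(values, order):
    """Average of (value - mean) ** order over the whole sample."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical samples: one symmetric, one with a long right tail.
SYMMETRIC = [1, 2, 3, 4, 5]
RIGHT_TAILED = [0, 0, 0, 1, 9]
```

The symmetric sample has zero skewness and a negative excess kurtosis (a flat shape), while the right-tailed sample has positive skewness, matching the way the signs are read in the interpretation above.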

Approach on Audience Strategy

Based on the above analysis, options for us to explore are:

Rather than relying on loyalty score, segment customers based on their past engagement metrics (e.g., high openers vs. low openers, frequent purchasers vs. occasional buyers). This is the approach we will take in this tutorial.
[Not in this tutorial] Perform A/B testing to experiment with subject lines across various customer segments. The most popular subject lines across the campaigns indicate that customers respond strongly to messages that emphasize urgency and discounts.
Focus on personalization by using customer purchase history, browsing behavior, or other real-time data. Personalized offers and messaging can improve engagement and conversions more effectively than broad targeting based on loyalty score. An example of this is doing RFM or FRE modeling as shown in this tutorial here.
[Not in this tutorial] Consider if timing affects customer engagement. For example, sending campaigns based on previous purchase dates or behavioral signals (e.g., abandoned carts) may improve response rates more than relying on static scores like loyalty.

Identify Highly Engaged Customers

Create a list of customers who have opened and clicked on emails, marking them as highly engaged.

SELECT 
    customer_id,
    email,
    campaign_name,
    open_count,
    click_count,
    CASE 
        WHEN open_count > 0 AND click_count > 0 THEN 'Highly Engaged'
        WHEN open_count > 0 AND click_count = 0 THEN 'Opened, No Click'
        ELSE 'No Engagement'
    END AS engagement_level
FROM email_campaign_dataset_20241001_050033_012
ORDER BY engagement_level;

This query segments customers into "Highly Engaged," "Opened, No Click," and "No Engagement."

Audience Size Estimation

It is recommended to perform audience estimation and debugging before creating Data Distiller Audiences.

Let us execute the following audience estimation query:

SELECT 
    engagement_level,
    COUNT(DISTINCT customer_id) AS audience_size
FROM (
    SELECT 
        customer_id,
        email,
        campaign_name,
        open_count,
        click_count,
        CASE 
            WHEN open_count > 0 AND click_count > 0 THEN 'Highly Engaged'
            WHEN open_count > 0 AND click_count = 0 THEN 'Opened, No Click'
            ELSE 'No Engagement'
        END AS engagement_level
    FROM email_campaign_dataset_20241001_050033_012
) AS engagement_data
GROUP BY engagement_level
ORDER BY audience_size DESC;
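The branching and distinct counting in the estimation query can be mirrored on a few in-memory rows to sanity-check the logic before running it on the platform. This is a Python sketch under stated assumptions: the customer IDs, counts, and function names are hypothetical, and one customer deliberately appears twice to show why the query counts distinct customer_id values.

```python
from collections import defaultdict

# Hypothetical rows: (customer_id, open_count, click_count).
ENGAGEMENT_ROWS = [
    ("C001", 5, 2),
    ("C001", 3, 1),  # same customer in a second campaign
    ("C002", 4, 0),
    ("C003", 0, 0),
    ("C004", 7, 3),
    ("C005", 2, 1),
]


def engagement_level(open_count, click_count):
    """Same branching as the CASE expression in the query above."""
    if open_count > 0 and click_count > 0:
        return "Highly Engaged"
    if open_count > 0 and click_count == 0:
        return "Opened, No Click"
    return "No Engagement"


def audience_sizes(rows):
    """Equivalent of COUNT(DISTINCT customer_id) per engagement level."""
    members = defaultdict(set)
    for customer_id, opens, clicks in rows:
        members[engagement_level(opens, clicks)].add(customer_id)
    return {level: len(ids) for level, ids in members.items()}


SIZES = audience_sizes(ENGAGEMENT_ROWS)
```

Customer C001 has two rows but contributes only once to the Highly Engaged count, which is the behavior to expect from the distinct count in the SQL.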

Create a Data Distiller Audience

Before executing the query below, keep in mind that this will use the Batch Compute Engine in Data Distiller. It will create a new dataset and update an existing one marked for Real-Time Customer Profile. Batch jobs typically take about 5 minutes for the cluster to spin up and down, so expect the query to take around 10 minutes to complete.

It’s recommended to use SQL queries with limits while prototyping your audience queries to avoid timeouts. Once your audience queries are finalized, you can use the Data Distiller Orchestration Anonymous Block feature to wrap all audience creation steps into a single statement, significantly saving time. The CREATE AUDIENCE command works similarly to CREATE TABLE, but it can also write segment membership and enriched attributes to external audiences in Real-Time Customer Profile.

We will now create the highly engaged customer audience, which will include customers who have both opened and clicked emails.

CREATE AUDIENCE highly_engaged_audience
WITH (primary_identity = email, identity_namespace = Email)
AS (
    SELECT 
        customer_id, 
        email, 
        campaign_name, 
        open_count, 
        click_count,
        CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
    FROM email_campaign_dataset_20241001_050033_012
    WHERE open_count > 0 AND click_count > 0
);

Let’s take a closer look at the columns we are selecting and their purpose:

customer_id: This is likely a CRM-style identifier, essential for activating systems across the enterprise. It is associated with additional information like the customer's first and last name, which can be pulled from a central system.
email: This serves as the primary identity in the Adobe Experience Platform, used for personalization and identity stitching. This will also be reused in the target system for email campaigns.
campaign_name: Represents the name of the campaign running in Adobe Journey Optimizer. This can provide descriptive feedback to the target activation system.
open_count/click_count: These are key engagement metrics that can be leveraged for further segmentation in the downstream email campaign system.
event_timestamp: This field is considered a best practice because it helps track when the audience record was materialized across different parts of the platform. As you progress through this tutorial, you'll understand the importance of this choice.

Observe the syntax above:

WITH (primary_identity=email, identity_namespace=Email)

There are no quotes around the column name (email) that we are using for the primary identity or around the identity namespace (Email).

Keep in mind that the primary identity is used by Real-Time Customer Profile to partition and store data without affecting the identity graph. Data Distiller Audiences do not impact the identity graph, allowing you to create ad hoc audiences without worrying about altering the behavior of existing audiences or personalization created in Real-Time Customer Profile.

Additionally, the derived attributes we’ve added, such as customer_id, campaign_name, open_count, and click_count, are not available for segmentation within Real-Time Customer Profile. These attributes are used exclusively for personalization or activation in the supported destinations, which are batch or file-based.

Remember, Data Distiller audiences are evaluated by Profile only at the time of creation. Since the associated derived attributes for these imported audiences are non-durable and not stored in the Profile store, the audience will only be updated if it is manually refreshed.

Data Distiller Audiences (much like any other external audience) only support flat, relational tables for audience creation, meaning nested data structures like arrays, maps, or other types of nesting are not allowed—the audience must remain flat. If you're working with a nested dataset, you can extract the necessary fields to create a flat table by using the SELECT statement with dot notation, which accesses specific fields within nested structures.

Additionally, when activating a Data Distiller Audience, regardless of the export format (CSV, JSON, Parquet), the data will always maintain a flat structure.
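The flattening step described above can be sketched as follows. The dataset name nested_email_events and its struct columns identity and engagement are hypothetical, introduced only to illustrate dot notation; substitute the names from your own nested schema.

```sql
-- Hypothetical nested dataset: nested_email_events, with struct columns
-- "identity" and "engagement". These names are illustrative only.
-- Dot notation pulls each nested field up into a flat, top-level column.
SELECT
    identity.customer_id    AS customer_id,
    identity.email          AS email,
    engagement.open_count   AS open_count,
    engagement.click_count  AS click_count
FROM nested_email_events
WHERE engagement.open_count > 0 AND engagement.click_count > 0;
```

Because every selected column is aliased to a top-level name, the result is a flat table that could serve as the body of a CREATE AUDIENCE statement.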

If your destination requires custom audience nested schemas, you'll need to take the dataset activation route. For use cases that involve identity graphs or profile data from Real-Time Customer Profile, the best approach is to combine Profile snapshot data with these datasets to create metadata-rich datasets tailored for specific destination needs.

What is an Identity Namespace?

Note that the identity namespace refers to the type of identity, allowing you to differentiate between, for example, an identity based on email and one based on a cookie. This distinction helps you define which types of identities you're willing to accept and provides insights into the omnichannel nature of your customers' identities.

To retrieve the Identity Namespace for email, navigate to Identities->Browse and search for email. The Type is the identity namespace that we need.

If you navigate to Datasets->Browse->highly_engaged_audience, you will see that a dataset is created for this audience. Also, observe that it is not enabled for Profile (the radio button for Profile is disabled).

Real-Time Customer Profile and Data Distiller Audiences

During the import process for a Data Distiller audience, you need to specify which column corresponds to the Primary Identity, such as an email address, ECID, or a custom identity specific to your organization. The data associated with this Primary Identity is the only information linked directly to the profile. If no existing profiles match the data in the Primary Identity column, a new profile is created. However, this new profile remains isolated, with no associated attributes or experience events.

The remaining data in the Data Distiller audience is considered payload attributes. These attributes can be used for personalization and enrichment during activation but are not directly attached to the profile itself. Instead, they are stored in the data lake.

While a Data Distiller audience can be referenced when building new audiences with the Segment Builder, individual profile attributes within the audience cannot be utilized independently.

Data Distiller Audience Datasets

When you create a Data Distiller audience, a dataset is created and appears in the dataset inventory. The dataset's name will match the name of the Data Distiller audience you created.

Let us execute the query:

SELECT * FROM highly_engaged_audience;

The structure of the dataset mirrors the SELECT query. As you add records to this dataset, you will see them appended. The results are:

The purpose of this dataset is to provide derived attributes (customer_id, campaign_name, open_count, click_count, and event_timestamp) that will be used for personalization or activation in a destination. During activation, the destination retrieves the necessary values from this dataset.

Audience Portal Dataset

If this dataset wasn’t marked for Profile, how did the data get into the Profile?

The secret is the Audience Portal Dataset for UPS Ingestion, a dataset whose table name is audience_portal_dataset_for_ups_ingestion.

Let us preview the dataset:

SELECT * FROM audience_portal_dataset_for_ups_ingestion;

The results will be the following:

Each record from our audience that is ingested into Real-Time Customer Profile appears as a row in this dataset. Note that it does not contain the audience's derived attributes (customer_id, campaign_name, open_count, and click_count).

Execute the following query with the to_json command:

SELECT to_json(segmentMembership) FROM audience_portal_dataset_for_ups_ingestion;

segmentMembership

The segment ID (DDA -> 943aa6a9-d43a-4399-8d6a-782eeb4524f4) uniquely identifies this as a Data Distiller Audience (DDA), with 943aa6a9-d43a-4399-8d6a-782eeb4524f4 as the audience ID for the highly_engaged_audience that we created. Other external audiences carry a different acronym, such as AAM for Adobe Audience Manager.
The status "realized" indicates that the profile has qualified for the audience, and the timestamp reflects the most recent qualification time. If the profile qualifies again on the following day, a new record will be added with an updated timestamp.
lastQualificationTime tells you when that profile qualified for the segment. If you were to publish the same profile into the audience again, a new entry would be created in this dataset with an updated lastQualificationTime.

identityMap

This is the primary identity listing for the profile in that audience.

Note that you need to apply two filters to extract the members of an external audience from the audience_portal_dataset_for_ups_ingestion dataset: audience ID and realized timestamp. To learn how to extract this data from maps, you can read the tutorial here.
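The two filters can be sketched as follows. This is a sketch under assumptions rather than a confirmed query: it assumes segmentMembership is a map keyed first by the namespace acronym (DDA) and then by the audience ID, that each entry exposes the status and lastQualificationTime fields described above, and that Spark-style bracket access on maps is available in your Data Distiller session. The timestamp cutoff is a placeholder value.

```sql
-- Assumption: segmentMembership is a map of namespace -> audience ID -> membership struct.
-- The audience ID is the one shown for highly_engaged_audience in this tutorial.
SELECT
    to_json(identityMap) AS identity_map,
    segmentMembership['DDA']['943aa6a9-d43a-4399-8d6a-782eeb4524f4'].lastQualificationTime
        AS last_qualification_time
FROM audience_portal_dataset_for_ups_ingestion
-- Filter 1: this audience, in the realized state.
WHERE segmentMembership['DDA']['943aa6a9-d43a-4399-8d6a-782eeb4524f4'].status = 'realized'
  -- Filter 2: placeholder cutoff on the qualification timestamp.
  AND segmentMembership['DDA']['943aa6a9-d43a-4399-8d6a-782eeb4524f4'].lastQualificationTime
        >= TIMESTAMP '2024-10-01 00:00:00';
```

If bracket access is not accepted in your environment, the map-handling tutorial referenced above is the place to look for the supported alternative.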

Exploring Audience Portal in Adobe Real-Time Customer Data Platform

Let us now navigate to the audiences screen:

If you are struggling to find Data Distiller audiences, you can click on the Filter icon and then choose the Data Distiller option:

Let us go ahead and click on the audience:

Also, observe the audience size which directly corresponds to 6.47K records.

Keep in mind that Data Distiller audiences, like any other external audience, follow the default merge policy, which can affect entitlements. Profiles that came in from the dataset and do not already exist in Profile will increase the number of profiles you are consuming. From a Profile storage standpoint, if you are not using these audiences, you should delete them via the DROP command.

If you scroll down to the sample profiles at the bottom and select one, you will see:

If we click on the attributes, we won't find any of the derived attributes that were present in the highly_engaged_audience dataset in the AEP Data Lake.

If we click on Audience membership:

Observe the Audience ID and the Last Modified timestamp (local time), which is 12:15 PM PDT, equivalent to 7:15 PM UTC. It is close to the UTC timestamp (7:22 PM) on the records in the Audience Portal Dataset for UPS Ingestion dataset. This means that the ingestion into Real-Time Customer Profile started at 7:15 PM UTC and the realized status was marked about 7 minutes later. If we check the highly_engaged_audience dataset, we see that it was created and updated around the same time.

Data Distiller Audiences are published to the Real-Time Customer Profile in batches and become available quickly once the Data Distiller job is complete. Typical latency can vary from minutes to about an hour depending on the data ingestion load on the Real-Time Customer Profile system. The low latency is ideal for ad hoc batch segmentation, especially since the current support for batch segmentation in Real-Time Customer Profile has a 24-hour processing window.

When managing audiences in Data Distiller, you can delete or DROP an audience as long as it isn't actively referenced or utilized in another audience, destination, or Adobe Journey Optimizer. Ensuring the audience isn't linked elsewhere prevents any unintended disruptions to workflows or segmentations that might rely on that audience. If an audience is still in use, it should first be unlinked or replaced in those areas before deletion can proceed.

Adding or Updating a Data Distiller Audience

Before we begin, it's important to note that all data added to datasets in the AEP Data Lake is append-only. Any new data, whether it's a fresh row or an update, will be added as new rows to the dataset.

A day passes, and let's assume that every customer in the highly engaged audience has both opened and clicked once more. To simulate this scenario, we will insert the same batch of individuals who re-qualified today but with a different timestamp.

INSERT INTO highly_engaged_audience
   (SELECT 
        customer_id, 
        email, 
        campaign_name, 
        open_count + 1 AS open_count, 
        click_count + 2 AS click_count, 
        CAST(CURRENT_TIMESTAMP AS timestamp) AS event_timestamp
    FROM email_campaign_dataset_20241001_050033_012
    WHERE open_count > 0 AND click_count > 0);

Keep in mind that there's no need to specify a keyword like AUDIENCE or indicate which column is the primary identity. The assumption is that you'll follow the defined schema. If you don’t, it could result in incorrect data being assigned to the primary identity field.

Let us now count the records in the audience dataset:

SELECT COUNT(*) FROM highly_engaged_audience;

You will see a doubling of the number of records:

Let us access the duplicate records:

SELECT * FROM highly_engaged_audience
ORDER BY customer_id;

Observe that the customer_id value CUST00003 has two records:
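Rather than scanning the sorted output by eye, the duplicates can also be listed directly. The following is a minimal sketch using standard GROUP BY and HAVING; the output aliases are illustrative names, not part of the audience schema.

```sql
-- List every customer_id that appears more than once in the audience dataset,
-- along with the earliest and latest materialization timestamps.
SELECT
    customer_id,
    COUNT(*)             AS record_count,
    MIN(event_timestamp) AS first_materialized,
    MAX(event_timestamp) AS last_materialized
FROM highly_engaged_audience
GROUP BY customer_id
HAVING COUNT(*) > 1
ORDER BY customer_id;
```

After the single INSERT INTO above, each re-qualified customer would be expected to show a record_count of 2.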

If you look up the audience by going to Audiences->Browse->highly_engaged_audience, you will see something like this:

Don't wait for the segment ingestion status to update. Instead, click on the sample profiles, and you'll see that the update may have already been applied.

The INSERT INTO workflow is the one to use when this audience is referenced by another audience, a destination, or Adobe Journey Optimizer, since you cannot drop or delete an audience that has these downstream dependencies.

When inserting the same profile through multiple INSERT INTO statements, such as on a scheduled basis across different days, there are two key considerations. First, the lastQualificationTime for that profile will be continuously updated, and the profile will be activated again if it's linked to a destination. Second, if you're using personalization attributes in a file-based destination, all attributes associated with that same profile over all time will be exported, rather than just the most recent version. This is the typical behavior for external audiences in Adobe Experience Platform. In such cases, you will need to use the event_timestamp attribute that we used in the audience to choose the most recent value of the attribute within the destination system.
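The most-recent-value selection can be sketched in SQL as shown here. This is an illustration of the logic only: the destination system may use a different syntax, and the sketch assumes window functions are available wherever it is run. The alias recency_rank is an illustrative name.

```sql
-- Keep only the latest record per email, using event_timestamp as the tiebreaker.
SELECT
    customer_id,
    email,
    campaign_name,
    open_count,
    click_count,
    event_timestamp
FROM (
    SELECT
        a.*,
        -- Rank each profile's records from newest (1) to oldest.
        ROW_NUMBER() OVER (PARTITION BY email ORDER BY event_timestamp DESC) AS recency_rank
    FROM highly_engaged_audience a
) ranked
WHERE recency_rank = 1;
```

Partitioning by email matches the primary identity chosen for this audience; partitioning by customer_id would work equally well if that is the key used downstream.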

Building an Edge Audience with a Data Distiller Audience as the Foundation

In this section, we will use the highly_engaged_audience as the foundation for real-time personalization at the edge. When members of this audience visit the website and authenticate via email, we can identify them through the identity graph as part of this audience, enabling us to deliver personalized experiences instantly using tools like Adobe Target or Offer Decisioning. However, it's important to note that members of this audience who do not have an associated cookie ID on the edge will not be eligible for personalization.

Additionally, every edge evaluation triggers both a streaming evaluation on the Hub and a batch evaluation, which serves as a reconciliation process to ensure consistency in segment definitions for the next day. This means that edge audiences are not only available in real-time for activation on the edge but are also synced as streaming audiences for activation on the Hub. The Hub projects identity graph and segment memberships of profiles it has encountered on the edge, maintaining alignment between the edge and central systems for seamless personalization experiences across channels.

We need to make sure that our Default Merge Policy is edge-enabled before we can create an edge segment. Navigate to Profiles->Merge Policies and select the default merge policy. Remember that Data Distiller audiences are associated with the default merge policy only.

Turn the slider to the right to turn on the Active on Edge Merge Policy.

Continue navigating through the next few screens until you reach the final one. Click Finish.

Navigate to Audiences->Create Audience

Choose Build Rule. Data Distiller Audiences are not supported in the Compose Audience option.

Compose Audience accepts CSV files (up to 1 GB) as external audiences, similar to the Upload CSV workflow mentioned earlier. You can also use third-party tools like DBVisualizer to run your Data Distiller SQL queries, download the entire audience as a CSV, and then manually upload it as a CSV file.

The Audience Builder pane consists of an Attributes pane and an Events pane. Navigate to Fields→ Experience Platform → highly_engaged_audience, and then drag and drop it onto the Attributes pane.

Click on the Convert to Rules icon, and you'll notice that it is grayed out. Typically, it would have given you access to all the rules including those on the Profile attributes. This indicates that Data Distiller audience attributes cannot be used for rule creation in the Rule Builder. This is expected, as mentioned earlier, because the audience does not add attributes to the Profile's attribute set. In fact, when we inspected the Profile earlier, we observed that these attributes were not present.

If you need to use attributes, you'll need to utilize Data Distiller Derived Attributes, which will populate the Real-Time Customer Profile with the necessary attributes in a separate dataset. You can read about it in detail here.

To create a real-time segment, such as an edge or streaming segment, we need an event trigger. Navigate to Fields → Events, scroll down to the Any Event option, and drag it into the Events pane in the Rule Builder. Next, adjust the time-based clause to In the last 2 hours by selecting the appropriate settings from the dropdown menus.

Name the audience as Highly Engaged Audience - Edge and choose the Evaluation Method as Edge.

Click the side button to open the Evaluation Eligibility popup.

Real-time audiences (edge or streaming) should rely on precomputed attributes and simple conditions on events within short timespans. If the timespan exceeds 24 hours, the segment will no longer qualify as an edge audience, and anything beyond 7 days makes it impossible to evaluate as a streaming segment. While batch evaluation is the easiest method, batch segments cannot be triggered by events and are only evaluated every 24 hours. This limitation makes real-time engagement challenging. By leveraging Data Distiller-derived attributes or Data Distiller Audiences, you gain far more flexibility in creating real-time audiences. Remember, Adobe Journey segments can be either streaming or batch, depending on the requirements.

Close the popup and click Publish.

Dropping or Deleting an Audience

Dropping an audience is similar to the DROP TABLE syntax:

DROP AUDIENCE highly_engaged_audience;

A Data Distiller Audience cannot be deleted if it is being used in another audience, a destination, or Adobe Journey Optimizer, similar to audiences defined within the Real-Time Customer Profile. In such cases, you will receive a message similar to what appears when attempting the same action from the UI:

11:47:06 AM > Query failed in 4.332 seconds.
11:47:06 AM > ErrorCode: 58000 Internal System Error [Failed to delete audience due to: {"requestId":"456aa723-8b42-4fe1-94ec-16131b8ccc0a","errors":{"403":[{"code":"UPAPI-113432-403","message":"The audience cannot be deleted, as it has dependents in PES."}]},"type":"https://ns.adobe.com/aep/errors/UPAPI-113432-403","title":"The audience cannot be deleted, as it has dependents in PES.","status":403}]

In our case, we have created an edge segment in the previous section that uses this. You will need to go to Audiences->Browse and delete Highly Engaged Audience - Edge. Additionally, if you are using this audience in any destination, you will need to remove it from that destination.

You can remove an audience from a destination by navigating to the Destination Account, choosing the audience, selecting Activation Data, and then choosing Remove Audience, as shown in the figure below:

Automatic Audience Drop in Real-Time Customer Profile & Override Option

Segment membership has a 30-day TTL. Any segment membership whose lastQualificationTime is more than 30 days old will be cleaned up.
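As a rough way to anticipate which members are approaching that cleanup, the audience dataset can be checked for identities whose latest record is older than 30 days. This sketch treats event_timestamp as an approximation of lastQualificationTime, which is an assumption (the two were shown to be close, not identical, earlier in this tutorial), and it assumes the Spark-style date_sub function is available in your session.

```sql
-- Identities whose most recent materialization is more than 30 days old.
-- Assumption: event_timestamp approximates lastQualificationTime in Profile.
SELECT
    email,
    MAX(event_timestamp) AS last_materialized
FROM highly_engaged_audience
GROUP BY email
HAVING MAX(event_timestamp) < date_sub(CURRENT_DATE, 30);
```

Identities returned by this query are candidates for re-insertion if they should remain in the audience.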

Data Distiller Audiences in Profile Snapshot Datasets

Every day, a snapshot of Profile attributes for each merge policy is exported to the AEP Data Lake. These system datasets are hidden by default but can be accessed by toggling them on in the data catalog.

These datasets contain information about profile attributes, the identity map, and segment membership as they existed at the time of the snapshot. The examples below show how to explore this data to better understand identity composition and even create advanced segment overlaps. Data Distiller Audiences are also included in these snapshots, providing the opportunity to build even more complex audiences that span multiple datasets and can be evaluated in batch within the Real-Time Customer Profile.

Data Distiller Audiences are always associated with the default merge policy.

First, you will need to find the default merge policy and find the profile snapshot dataset corresponding to that merge policy. More details are here.

Execute the following query:

SELECT * FROM adwh_dim_merge_policies;

The result will be:

Copy the dataset name and paste it into the search bar.

Click the dataset and navigate to the right panel to copy the table name.

Execute the following query:

SELECT * FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903;

The results will be the following. Scroll to the right and you will see the DDA segment memberships:
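To avoid scrolling through every attribute column, a narrower projection can surface the memberships directly. This is a minimal sketch that assumes the snapshot dataset exposes identityMap and segmentMembership columns under those names, as the description of these datasets suggests; the LIMIT of 10 is an arbitrary sample size.

```sql
-- Inspect identities and segment memberships for a handful of snapshot rows.
-- Assumption: the snapshot exposes identityMap and segmentMembership columns.
SELECT
    to_json(identityMap)       AS identity_map,
    to_json(segmentMembership) AS segment_membership
FROM profile_attribute_cd03f195_66b5_4a62_a6f9_c606e951c903
LIMIT 10;
```

In the JSON output, Data Distiller Audience memberships would be expected under the DDA key, mirroring what was seen in the Audience Portal dataset.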

Notice once again that the Data Distiller Audience-derived attributes are absent, as expected. However, they are present in the Data Distiller Audience dataset, which Destinations will use to populate its enriched attributes. You can also check the Audience Overview page:

Appendix: Audiences on Data Lake vs. Composable Audiences

For any Customer Data Platform (CDP), the choice between access to raw data versus pre-defined audiences depends on the use cases, flexibility, and control requirements of the CDP. Each approach has its advantages:

Data Lake Approach: Access to Raw Data

Flexibility and Granularity: Raw data provides full granularity and allows more flexibility for creating custom segments, running complex analytics, and deriving unique insights.
Advanced Personalization: With raw data, you can build personalized customer profiles and define complex audience rules tailored to specific marketing strategies.
Historical Data Analysis: Access to raw data enables richer historical analysis, trend forecasting, and a deeper understanding of customer behavior across touchpoints.
Dynamic Audience Creation: Teams can build audiences dynamically, experiment, and adjust targeting based on evolving marketing goals.
Cost and Complexity: Processing raw data often requires more storage, processing power, and expertise to manage data pipelines, making it potentially more resource-intensive.

Composable Approach: Access to Audiences Only

Simplicity and Speed: Using pre-defined audiences can streamline activation by reducing the need for heavy data processing. Teams can focus on deploying campaigns quickly.
Less Overhead: Relying on audience data reduces the burden on storage, processing, and potentially compliance, as only specific attributes of segmented audiences are used.
Efficiency in Activation: Direct access to audience segments supports rapid campaign activation, especially if audiences are already aligned with common use cases.
Limitations in Customization: Pre-defined audiences limit flexibility. You’re restricted to existing segment definitions, which may not cover all desired targeting needs.
Dependence on CDP's Audience Quality: The effectiveness depends on the granularity and accuracy of the pre-built audiences provided to the CDP.

Tradeoff?

Use Cases for Raw Data: Ideal when the CDP use cases involve complex segmentation, personalized content, AI/ML model training, or require insights beyond standard marketing segments.
Use Cases for Audience Access: Best for rapid deployment, simpler activation use cases, or environments where high customization is unnecessary.

Optimal Approach

Most CDP implementations use a hybrid model: access to both raw data and pre-defined audiences, enabling the flexibility to leverage existing audience data for straightforward campaigns and access raw data for more sophisticated needs. This balance is often most effective, supporting both quick activations and deep, data-driven personalization.
