ETL 200: Chaining of Data Distiller Jobs

Unleash the power of seamless insights with Data Distiller’s chained queries—connect your data, step by step, to drive better decisions

Overview

The goal of this case study is to process incremental processing on a dataset to create a new derived dataset.

Why Chain Data Distiller Jobs?

Chaining Data Distiller SQL jobs in marketing workflows can be extremely useful for managing sequential processes where each step depends on the output of the previous one. Most high value Data Distiller use cases end up using chaining of some form or the other.

Here are some examples of current use of Data Distiller for a wide ranging set of use cases:

Audience Segmentation:

First Job: A SQL job extracts and segments customers based on behavior (e.g., browsing history, purchase frequency, or demographic data).
Second Job: Another job enriches these segments with external data (e.g., cost of living, product preferences, or past purchase history).
Third Job: A job further enriches the segments by adding real-time engagement metrics (e.g., recent interactions like clicks, views, or cart additions).
Fourth Job: The next job generates personalized content (e.g., product recommendations, targeted offers) based on enriched segments.
Fifth Job: A final job structures the personalized datasets for campaign automation tools (e.g., email systems, ad platforms).

New Feature Alert: Data Distiller can create SQL audiences from AEP Data Lake that can be published as External Audiences in Real-Time Customer Profile for activation.

Adobe Journey Optimizer Performance Reporting

First Job: The first job collects raw engagement data (e.g., email opens, clicks, or social media interactions) from various marketing channels.
Second Job: A second job calculates key metrics such as click-through rates (CTR), conversion rates, and ROI for each campaign.
Third Job: The final job aggregates these metrics into daily, weekly, or monthly reports and sends the insights to a BI tool or dashboard..

Customer Journey Touchpoint Mapping

First Job: A SQL job pulls data on customer interactions across touchpoints (e.g., website visits, email engagement, and social media clicks).
Second Job: A second job sequences these interactions in chronological order to map each customer’s journey over time.
Third Job: Another job enriches the data by associating interactions with specific campaigns, offers, or promotions the customer encountered.
Fourth Job: A job groups interactions by channel (e.g., social media, email, website) to analyze the effectiveness of each channel on customer engagement.
Fifth Job: This job generates insights about customer behavior patterns (e.g., when they tend to convert or drop off) and flags high-value customers for retargeting.
Sixth Job: Another job calculates the time spent at each stage of the customer journey (e.g., from first interaction to purchase) to identify bottlenecks or areas for improvement.
Seventh Job: The final job outputs a comprehensive customer journey report, which helps marketers fine-tune messaging and timing across different channels for optimal engagement.

Lead Scoring Automation in Adobe B2B CDP and/or AJO B2B

First Job: A SQL job collects lead behavior data (e.g., content downloads, webinar attendance, or email responses) from multiple sources.
Second Job: A second job cleans and standardizes the data to ensure consistent formatting and structure for accurate scoring.
Third Job: The next job assigns scores to each lead based on predefined criteria (e.g., activity levels, engagement frequency, or demographic fit).
Fourth Job: A job segments leads based on their scores into categories such as "hot leads," "warm leads," or "cold leads," facilitating targeted follow-ups.
Fifth Job: This job enriches the lead data with additional insights, such as firmographic data or lead readiness indicators (e.g., industry, company size, or budget).
Sixth Job: Another job updates the CRM or marketing automation platform with the latest lead scores, triggering personalized follow-up actions and workflows.
Seventh Job: The final job generates a lead scoring performance report, tracking metrics like conversion rates and lead quality to refine and improve scoring criteria over time.

Product Recommendations in Adobe Commerce

First Job: A SQL job captures and processes customer interaction data, such as product views or add-to-cart actions.
Second Job: The next job identifies relevant product recommendations based on this behavior using algorithms or predefined business rules.
Third Job: The final job sends these product recommendations to an email marketing system or personalization engine for delivery to the customer.

Real-Time Customer Data Platform Activation: Ad Spend Optimization

First Job: A SQL job pulls data from various advertising platforms (e.g., Google Ads, Facebook Ads) about spend, impressions, and conversions for different campaigns.
Second Job: A second job standardizes and normalizes the data from different platforms to ensure consistency across metrics (e.g., converting currencies, time zones, or impression formats).
Third Job: This job calculates key performance indicators (KPIs) such as cost per acquisition (CPA), return on ad spend (ROAS), and conversion rate for each campaign.
Fourth Job: A job aggregates the KPIs by channel (e.g., Google Ads vs. Facebook Ads) to provide a comprehensive view of performance at both the channel and campaign levels.
Fifth Job: Another job compares these KPIs across channels and campaigns, identifying top-performing campaigns and those underperforming based on the defined thresholds (e.g., ROAS or CPA benchmarks).
Sixth Job: This job identifies campaigns with significant variations over time (e.g., sudden spikes in cost or drops in conversion rates) and flags them for deeper analysis.
Seventh Job: A job suggests budget reallocation by reallocating spend from underperforming campaigns to high-performing campaigns or channels based on the calculated KPIs.
Eighth Job: The next job forecasts future performance and ROI for the reallocated budget using predictive analytics based on past campaign performance trends.
Ninth Job: This job sends the budget reallocation suggestions to the marketing platform or ad management tool for implementation, ensuring real-time adjustments.
Tenth Job: The final job generates a performance report that tracks the effectiveness of the reallocation decisions, highlighting any improvements in ROAS, CPA, and overall campaign performance.

Most ad spend reporting in the industry relies on custom-built solutions to collect data from various platforms. FunnelIO is a prime example of a product that offers this capability out of the box, providing connectors that cover a wide range of systems.

Standard Attribution Analysis

First Job: A SQL job collects data from various touchpoints (e.g., paid ads, email campaigns, social media) where customers interact with the brand, including impressions, clicks, and conversions.
Second Job: A second job links these interactions to individual customer journeys, identifying which touchpoints contributed to each conversion (e.g., first-click, last-click, or multi-touch).
Third Job: A job assigns a basic attribution model (e.g., first-click, last-click, linear) to measure the contribution of each touchpoint towards the conversion.
Fourth Job: This job enriches the attribution model by incorporating customer demographic data and behavior to better understand how different customer segments respond to various channels.
Fifth Job: A job calculates key metrics for each touchpoint and channel, such as conversion rate, time-to-conversion, and cost per conversion, allowing for a detailed breakdown of performance.
Sixth Job: This job applies multi-touch attribution models (e.g., time decay, U-shaped, W-shaped) to give weight to each interaction in the customer journey based on its influence on the final conversion.
Seventh Job: A job aggregates attribution results by channel, campaign, and customer segment to identify which touchpoints are driving the most valuable conversions.
Eighth Job: This job compares attribution models (e.g., first-click vs. linear vs. time decay) to evaluate which model gives the most accurate representation of customer behavior and conversion paths.
Ninth Job: A job suggests optimization strategies for future campaigns by identifying underperforming channels and reallocating budget towards high-performing touchpoints based on the chosen attribution model.
Tenth Job: The final job generates an attribution performance report that tracks each channel's contribution to conversions over time, helping marketing teams optimize campaigns for better ROI.

Data Distiller includes built-in functions for first-touch and last-touch attribution. You can further customize these (time decay, linear, U-shaped, W-shaped, non-linear, weighted) using Window functions to suit your specific needs.

Media Mix Modeling

First Job: A SQL job pulls historical data on marketing spend and performance across different channels (e.g., TV, radio, digital, print) including impressions, clicks, and conversions.
Second Job: A second job standardizes the data by normalizing spend, reach, and engagement metrics across channels to create a unified dataset for analysis.
Third Job: A job calculates the contribution of each channel to overall sales or conversions using statistical methods like regression analysis, which allows for the identification of relationships between media spend and outcomes.
Fourth Job: This job enriches the model by incorporating external factors such as seasonality, economic conditions, or competitive activity, to adjust for their impact on marketing effectiveness.
Fifth Job: A job applies time-series analysis to examine how media spend over time influences sales trends and how different channels may have long-term or short-term effects.
Sixth Job: This job calculates diminishing returns for each channel, identifying the point where additional spend yields less incremental benefit, helping to optimize budget allocation.
Seventh Job: A job assigns weight to each media channel based on its effectiveness, creating a model that can forecast the likely outcomes of different budget scenarios (e.g., increasing TV ad spend vs. digital).
Eighth Job: This job runs simulations to test different media mix scenarios, forecasting outcomes such as expected sales growth or ROI for various spend allocations across channels.
Ninth Job: A job suggests an optimized media mix, reallocating budgets to high-performing channels and reducing spend on channels with lower returns, based on the model's output.
Tenth Job: The final job generates a media mix performance report, showing how changes in media spending influence sales or conversions, and provides recommendations for future marketing strategies based on the analysis.

New Feature Alert: New Statistical Models such as regression analysis are available in Data Distiller.

Media Mix Modeling faces similar challenges to those encountered in collecting data from various campaign reporting sources. First, the definitions and interpretations of metrics differ significantly across systems. Second, when standardizing these metrics and dimensions, certain assumptions must inevitably be made. Lastly, the granularity of data is often inconsistent or insufficient across these platforms.

Machine Learning Feature Engineering

First Job: A SQL job collects raw customer data (e.g., purchase history, website interactions, and demographics).
Second Job: Another job creates Recency, Frequency, Monetary (RFM) features based on customer transactions to quantify customer engagement.
Third Job: A job computes average session duration and product views per session, transforming raw website data into features that capture customer browsing behavior.
Fourth Job: This job generates time-based features, such as time since the last purchase and frequency of interactions over the last 90 days.
Fifth Job: Another job enriches the feature set by calculating discount sensitivity—whether a customer purchases more frequently when discounts are offered.
Sixth Job: The job then applies clustering algorithms (e.g., k-means) to group customers into segments like “high-value” or “at-risk” based on their features.
Seventh Job: A job normalizes and scales the features to ensure they are ready for model training.
Eighth Job: The next job performs feature selection, identifying the most predictive features for churn modeling.
Ninth Job: A job updates the dataset with new interaction data, allowing the features to be incrementally updated for real-time predictions.
Tenth Job: A final job exports the engineered feature set for training machine learning models, such as predicting customer churn or recommending products.

Today there is no integration between the Destination Scheduler and Data Distiller Anonymous Block. For Dataset Activation, read this tutorial.

Clean Room Data Collaboration through a Third-Party Identity Provider

First Job (Company A's Environment): A SQL job within Data Distiller collects and anonymizes Company A’s customer data (e.g., purchase history, demographic information) from internal systems, ensuring all PII (Personally Identifiable Information) is removed using hashing or tokenization techniques.
Second Job (Company B's Environment): A SQL job within Data Distiller collects and anonymizes complementary data from Company B’s dataset (e.g., external browsing behavior or interests), ensuring all data adheres to privacy standards by applying similar anonymization techniques.
Third Job: Each of Company A and Company B uploads their respective anonymized datasets through Data Distiller’s dataset activation feature to the third-party identity provider (IDP), enabling secure matching and analysis within the clean room environment.
Fourth Job: The third-party IDP runs a Data Distiller job to match customer records from both datasets using the anonymized identifiers (e.g., hashed email addresses), identifying shared customers between the two datasets.
Fifth Job: A SQL job within the IDP’s clean room combines Data Distiller’s anonymized internal data (e.g., purchase history from Company A) with Company B’s anonymized data (e.g., browsing behavior) to create a shared dataset of overlapping customers.
Sixth Job: Another Data Distiller job enriches the shared dataset by adding third-party external data (e.g., demographic or geographic information) for additional insights.
Seventh Job: A job runs privacy-preserving computations using methods like differential privacy, where noise is added to the data to protect individual identities. This ensures that insights on customer behaviors (e.g., purchase trends, engagement patterns) are generated without revealing personal information. The noise addition process ensures that individual data points remain indistinguishable, even in aggregated results, ensuring compliance with privacy regulations such as GDPR and CCPA.
Eighth Job: The clean room generates aggregated marketing insights from the combined dataset, such as cross-company customer behavior patterns and conversion rates.
Ninth Job: Another Data Distiller job runs predictive analytics to identify high-value customer segments or behaviors, helping both Company A and Company B optimize their marketing strategies.
Tenth Job: A final Data Distiller job outputs anonymized, aggregated reports for both companies, providing actionable insights (e.g., channel attribution, cross-platform behaviors) without compromising customer privacy.

There are a variety of cleanroom technologies available, including LiveRamp's Safe Haven, Infosum, Snowflake Clean Room, AWS Cleanrooms, ADH, and Merkle Merkury. If you're working with one of these vendors, you can skip steps 4 through 10. However, if you're a vendor planning to implement this as a custom solution using Data Distiller where you control the IP of the algorithms and the reporting, the steps outlined above are the key ones to consider.

What is a Snapshot?

Whenever new data is materialized onto the AEP Data Lake—whether through ingestion, upload, or a Data Distiller job—a new batch is created. If you examine the dataset, you'll notice it has multiple batch IDs linked to it. However, batches can often be too granular, requiring a higher level of abstraction. This is where the concept of a snapshot comes in—a snapshot represents a collection of new batches grouped together and assigned a snapshot ID. The reason multiple batches can end up in a single snapshot is that if the data volume is large and exceeds the internal maximum threshold for a batch, it will be split into additional batches. Data Distiller can read and process these snapshots, enabling incremental processing and making it a core capability for managing updates efficiently. But first, let us learn how to create these snapshots efficiently.

Getting Started

Our goal is to simulate a fictional stock price for the first 3 months of next year.

You will need to access the Data Distiller Query Pro Mode Editor or use your own favorite editor:

Navigate to Queries->Overview->Create Query

Sequential Execution Challenges

Let us say that we generate a randomized stock price for the first 3 months of 2025 with the stock price beetween $30 and $60.

SELECT 
  date_add('2025-01-01', seq.i) AS date,
  CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM 
  (SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date;

Do not execute the code below but observe the pattern for creating an empty dataset. We create an empty table by creating a contradiction with the WHERE condition falsified.

CREATE TABLE stock_price_table AS
SELECT
  CAST(NULL AS DATE) AS date,
  CAST(NULL AS DECIMAL(5, 2)) AS stock_price
WHERE FALSE;

Do not execute the code below but observe the pattern for January 2025:

INSERT INTO stock_price_table
SELECT 
  date_add('2025-01-01', seq.i) AS date,
  CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM 
  (SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date;

Do not execute the code below but observe the pattern for the month of February 2025:

INSERT INTO stock_price_table
(SELECT 
  date_add('2025-01-01', seq.i) AS date,
  CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM 
  (SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date);

Do not execute the code below but observe the pattern for March 2025:

INSERT INTO stock_price_table
SELECT 
  date_add('2025-03-01', seq.i) AS date,
  CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM 
  (SELECT explode(sequence(0, 30)) AS i) seq  -- March has 31 days
ORDER BY date;

If you were to run each of the above queries individually, the process would be very time-consuming because both the CREATE TABLE AS and INSERT INTO statements write data to the data lake. This triggers the batch processing service in Data Distiller, which starts a cluster, runs the job, and then shuts the cluster down. This cycle of spinning the cluster up and down for each query will cause unnecessary delays, as you'll be waiting for both the startup and shutdown phases with every execution. On the average, spin-up and spin-down of the cluster takes about 5 minutes each. Since we have 4 queries, this would take atleast 40 minutes.

Anonymous Block

An Anonymous Block in Data Distiller refers to a block of SQL code that is executed without being explicitly named or stored in the database for future reuse. It typically includes procedural logic such as control-flow statements, variable declarations, and exception handling, all enclosed within a BEGIN...END block. The great thing about an anonymous block is that it runs all the SQL code within a single cluster session, eliminating the need to repeatedly spin up and down multiple clusters. This helps save both time and compute resources.

Observe the syntax for BEGIN and END. There are two $ signs that are placed above BEGIN and after END. Every block of SQL code has a semicolon to separate them out.

$$ 
BEGIN
    CREATE TABLE TABLE_A AS SELECT * FROM TABLE_1;
    CREATE TABLE TABLE_B AS SELECT * FROM TABLE_2;
    EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'TABLE 2 ERROR';
END
$$;

Let us dissect the above query:

BEGIN ... END Block: The BEGIN and END block groups a series of statements that need to be executed as a single unit
EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'ERROR'
This block handles any errors that occur during the execution of the BEGIN ... END block:
- EXCEPTION is used to define error-handling logic. Syntax errors are captured at compile time but the EXCEPTION errors are to do with the data or the tables themselves.
- WHEN OTHER THEN catches any error or exception that happens in the preceding statements.
- SET @ret = SELECT 'ERROR' assigns the value 'ERROR' to the variable @ret, signaling that an error occurred during the execution.
- Keep in mind that any variables declared within an Anonymous Block exist only for the duration of that block's execution. However, the @ret variable in the example above is unique because it's used in the EXCEPTION handling clause, allowing it to persist beyond the session.
- If an EXCEPTION in any of the chained queries is met, the query execution stops.

Do not attempt to use SELECT queries within a BEGIN...END block expecting interactive results to stream to your editor. Although the code will execute, no results will be streamed and you will encounter errors. You can still declare variables, use conditions, and handle exceptions, but these features are intended for use within the context of a Data Distiller job, such as creating and deleting datasets, including temporary tables.

Remember, that Anonymous blocks are primarily used for procedural logic (e.g., variable assignments, loops, error handling, DML operations) and do not support interactive result streaming.

The query below is expected to take about 20-30 minutes to complete, with around 10 minutes spent on spinning up and down resources, and an additional 10-20 minutes writing the data to the data lake. Keep in mind that data mastering might be delayed by other processes writing to the data lake.

Do not execute the query just yet, as you'll end up waiting a long time for it to finish. Instead, you can comment out the BEGIN END block and change TABLE to TEMP TABLE to bypass the batch processing engine and run the query in ad hoc mode. TEMP TABLES are cached for the session. Once you've verified the results, you can then execute the full query.

Ideally, you should schedule this query to run in the background, as your time is valuable, and it's essential to use the most efficient query techniques for deployment.

$$
BEGIN
  
--Drop the table if it exists
DROP TABLE IF EXISTS stock_price_table;

--Create an empty dataset via a contradiction
CREATE TABLE stock_price_table AS
SELECT
  CAST(NULL AS DATE) AS date,
  CAST(NULL AS DECIMAL(5, 2)) AS stock_price
WHERE FALSE;

-- Insert for January 2025
INSERT INTO stock_price_table
SELECT 
  date_add('2025-01-01', seq.i) AS date,
  CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM 
  (SELECT explode(sequence(0, 30)) AS i) seq
ORDER BY date;

--Insert data for February 2025
INSERT INTO stock_price_table
SELECT 
  date_add('2025-02-01', seq.i) AS date,
  CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM 
  (SELECT explode(sequence(0, 27)) AS i) seq  -- February has 28 days in 2025
ORDER BY date;

--Insert data for March 2025
INSERT INTO stock_price_table
SELECT 
  date_add('2025-03-01', seq.i) AS date,
  CAST(30 + (RAND() * 30) AS DECIMAL(5, 2)) AS stock_price
FROM 
  (SELECT explode(sequence(0, 30)) AS i) seq  -- March has 31 days
ORDER BY date;

END
$$;

Let us verify the results of the query:

SELECT * FROM stock_price_table
ORDER BY date;

If you are using DBVisualizer, you have to use the backslash to make the code work:

--/

$$ BEGIN

CREATE TABLE table_1 AS SELECT * FROM TABLE_1;

EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'ERROR';

END

$$;

Show all the `SNAPSHOTS` in a Dataset

A snapshot ID is a checkpoint marker, represented as a Long-type number, applied to a data lake table each time new data is written. The SNAPSHOT clause is used in conjunction with the table relation it is associated with.

Let us first try and see all the snapshots that are there in the table:

SELECT history_meta('stock_price_table')

The results will look like this in the Data Distiller Query Pro Editor. There are 5 snapshot IDs, the first one is just the creation of an empty dataset. Each INSERT INTO led to a new snapshot: January data is in Snapshot ID=2, February data is in Snapshot ID=3, and March data is in Snapshot ID=4.

Remember that history_meta will only give you the rolling 7 days worth of snapshot data. If you want to retain the history, you will need to create a Data Distiller job to insert this data periodically into a new table.

snapshot_generation: This indicates the generation or version of the snapshot. Each time data is written or updated, a new snapshot is created with an incremented generation number.
made_current_at: This column represents the timestamp of when the snapshot was made current, showing when that particular snapshot was applied or written to the table.
snapshot_id: This is the unique identifier for each snapshot. It's typically a Long-type number used to refer to specific snapshots of the data.
parent_id: This field shows the parent snapshot ID, which means the snapshot from which the current snapshot evolved. It reflects the relationship between snapshots where one might have been derived or evolved from another.
is_current_ancestor: This is a Boolean column indicating whether this snapshot is an ancestor of the current snapshot. If true, it means that this snapshot is part of the lineage leading up to the most recent snapshot.
is_current: This Boolean flag indicates whether this snapshot is the most current one. If true, it marks the latest version of the table as of that snapshot.
output_record_count: This shows the number of records (rows) in the snapshot when it was created.
output_byte_size: This represents the size of the snapshot in bytes, indicating how much data was stored in that snapshot.

Note that snapshot_ids will be monotonic i.e. always increasing but they will not be sequential (0, 1, 2, 3, 4) as they are generated and used by other datasets as well. They could well look like (0, 1, 2, 32, 43).

Analyze `SNAPSHOT` Data

Keep in mind that summing the output_byte_size column provides a good approximation of the total dataset size, though it doesn't include metadata. The same approach applies to counting the total number of records in the dataset. Additionally, you can compute the richness of the records for each snapshot by dividing the size of the snapshot by the number of records in that snapshot

CREATE TEMP TABLE stock_meta_table AS SELECT history_meta('stock_price_table');

SELECT * FROM stock_meta_table;

It is recommended to create a TEMP TABLE instead of a permanent table, as materializing the dataset can take several minutes. Keep in mind that the history_meta function only provides the last 7 days of snapshot data, which is sufficient for most use cases like incremental processing. If you need to persist all snapshot information beyond this period, you will need to set up a Data Distiller job to read new snapshots and regularly persist them to a table in the data lake.

The number of records across all snapshots logged in the last 7 days is:

SELECT SUM(output_record_count) FROM stock_meta_table;

The approximate size of this dataset in GB based on the record sizes in the snapshots is:

SELECT SUM(output_data_size) / 1073741824 AS total_size_gb
FROM stock_meta_table;

Execute `SNAPSHOT` Clause-Based Queries

`SELECT` Data from a `SNAPSHOT` `SINCE` a Start `SNAPSHOT` ID

SELECT * 
FROM stock_price_table 
SNAPSHOT SINCE 2  -- Replace '2' with your desired start_snapshot_id
ORDER BY date;

This query retrieves data from the snapshotstarting from the snapshot with ID 2, with all ddates in February inclusive.

SNAPSHOT with a AS OF excludes the snapshot in its clause but includes all others before it.

`SELECT`Data from `AS OF Snapshot ID`

SELECT * 
FROM stock_price_table 
SNAPSHOT AS OF 3  -- Replace '3' with your desired snapshot_id
ORDER BY date;

This query retrieves data as it existed at the time of snapshot ID 3.This will show the data for both January and February, all dates inclusive.

SNAPSHOT with a AS OF excludes the snapshot in its clause but includes all others before it.

`SELECT` Data Between Two `SNAPSHOT` IDs

SELECT * 
FROM stock_price_table 
SNAPSHOT BETWEEN 2 AND 4  -- Replace '1' and '4' with your desired start and end snapshot IDs
ORDER BY date;

This retrieves data changes that occurred between snapshot IDs 2 and 4. This will get you all the results for February and March. The starting Snapshot ID=2 is excluded but all the other snapshot IDs 3 and 4 are included.

SNAPSHOT with a BETWEEN clause will always include the first snapshot but include the last one.

`SELECT` Data between the Most Recent `SNAPSHOT` (HEAD) and a specific `SNAPSHOT`

SELECT * 
FROM stock_price_table 
SNAPSHOT BETWEEN 'HEAD' AND 2  -- Replace '2' with your start_snapshot_id
ORDER BY date;

HEAD in the SNAPSHOT clause represents the earliest SNAPSHOT ID i.e. 0. This retrieves data between the earliest snapshot (HEAD) which is the month of January and will exclude SNAPSHOT ID=2.

`SELECT` Data Between a Specific `SNAPSHOT` and the Earliest `SNAPSHOT` (TAIL)

SELECT * 
FROM stock_price_table 
SNAPSHOT BETWEEN 2 AND 'TAIL'  -- Replace '3' with your end_snapshot_id
ORDER BY date;

This retrieves data between snapshot ID= 2 and the very last snapshot (TAIL i.e. 4) which will be excluded. You will only see the months of February and March.

Trapping Errors via Exception Handling

In our sequential chaining of SQL queries within the Anonymous Block, there's a significant flaw: what if a syntax error causes a data insertion to fail, but the next block contains a DROP command? As it stands, the Anonymous Block will continue executing each SQL block, regardless of whether the previous ones succeeded or failed. This is problematic because a small error could trigger a domino effect, potentially causing further damage to the system. To avoid this, we need a way to stop execution when an error occurs and trap the error for debugging purposes.

Let us first execute a query that has a syntax error 'ASA'. You should see the error in an instant. EXCEPTION handling did not kick in:

$$ 
BEGIN
    DROP TABLE IF EXISTS TABLE_A;
    CREATE TABLE TABLE_A ASA SELECT * FROM stock_price_table;
    EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'ERROR';
END
$$;

Remember that:

EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'ERROR'

This block handles any errors that occur during the execution of the BEGIN ... END block:

EXCEPTION is used to define error-handling logic. Syntax errors are captured at compile time but the EXCEPTION errors are to do with the data or the tables themselves.
WHEN OTHER THEN catches any error or exception that happens in the preceding statements.
SET @ret = SELECT 'ERROR' assigns the value 'ERROR' to the variable @ret, signaling that an error occurred during the execution.
Keep in mind that any variables declared within an Anonymous Block exist only for the duration of that block's execution. However, the @ret variable in the example above is unique because it's used in the EXCEPTION handling clause, allowing it to persist beyond the session.
If an EXCEPTION in any of the chained queries is met, the query execution stops.

Error should look like:

8:41:13 PM > Query failed in 0.484 seconds.
8:41:13 PM > ErrorCode: 42601 queryId: 3690c93f-270e-4b72-8605-94003b131cc3 Syntax error encountered. Reason: [line 2:26: mismatched input 'ASA' expecting {'.', '(', ';', 'COMMENT', 'WITH'}]

Let us execute the query trying to select a column that does not exist:

$$ 
BEGIN
    DROP TABLE IF EXISTS TABLE_A;
    CREATE TABLE TABLE_A AS SELECT a FROM stock_price_table;
    EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'ERROR';
END
$$;

The job will start executing and even declare success because the outer Anonymous Block code executed successfully but if you go into the Queries->Log, you will see after some search:

The problem with searching in Query->Log is that all of the queries inside the Anonymous Block have been disaggregated and logged separately. If we want to see all of the queries and their status, we need to take a different approach.
Navigate to Queries->Scheduled Queries and locate your failed query:

Click on the query and you should see the query run within the Anonymous Block listed in the left panel

You will see the status in the left panel per query. You will see the Overview that lists the entire query:

Click Query 1:

Scheduling of Anonymous Block

Copy paste the following query in the Data Distiller Query Pro Mode Editor. All that this query does is to drop the table and recreate it.

$$ 
BEGIN
    DROP TABLE IF EXISTS TABLE_A;
    CREATE TABLE TABLE_A AS SELECT a FROM stock_price_table;
    EXCEPTION WHEN OTHER THEN SET @ret = SELECT 'ERROR';
END
$$;

Name the template by giving it a name: Anonymous_test

Click on Save and Close.
Launch the template again from the Templates pane.
You should see the following:

Click on Add Schedule
Data Distiller Scheduler screen looks like the following:

Here are the parameters of the scheduler:

Frequency: Hourly, Daily, Weekly, Monthly, Yearly.
Every: When the schedule is supposed to execute. For example, if you choose weekly option, you can choose which of the week you want this schedule to run.
Scheduled Start Time: Specified in UTC which can be extracted using the code:
SELECT from_unixtime(unix_timestamp()) AS utc_time;
Query Quarantine: Stops the schedule from wasting your resources if it fails for 10 times in a row.
Standard Alerts are available except for Query Run Delay where an alert is sent out if the time taken by the query as it is running exceeds the Delay Time you have sent. So if a query is executing and it goes past the 150th minute, an alert will be sent. The query will still continue to execute until it succeeds or fails.

If you want anything custom such as frequency like every 15 minutes, you can use the Data Distiller APIs

Click Save.

Last updated 9 months ago

Overview

Why Chain Data Distiller Jobs?

Audience Segmentation:

Adobe Journey Optimizer Performance Reporting

Customer Journey Touchpoint Mapping

Lead Scoring Automation in Adobe B2B CDP and/or AJO B2B

Product Recommendations in Adobe Commerce

Real-Time Customer Data Platform Activation: Ad Spend Optimization

Standard Attribution Analysis

Media Mix Modeling

Machine Learning Feature Engineering

Clean Room Data Collaboration through a Third-Party Identity Provider

What is a Snapshot?

Getting Started

Sequential Execution Challenges

Anonymous Block

Show all the SNAPSHOTS in a Dataset

Analyze SNAPSHOT Data

Execute SNAPSHOT Clause-Based Queries

SELECT Data from a SNAPSHOT SINCE a Start SNAPSHOT ID

SELECTData from AS OF Snapshot ID

SELECT Data Between Two SNAPSHOT IDs

SELECT Data between the Most Recent SNAPSHOT (HEAD) and a specific SNAPSHOT

SELECT Data Between a Specific SNAPSHOT and the Earliest SNAPSHOT (TAIL)

Trapping Errors via Exception Handling

Scheduling of Anonymous Block

Show all the `SNAPSHOTS` in a Dataset

Analyze `SNAPSHOT` Data

Execute `SNAPSHOT` Clause-Based Queries

`SELECT` Data from a `SNAPSHOT` `SINCE` a Start `SNAPSHOT` ID

`SELECT`Data from `AS OF Snapshot ID`

`SELECT` Data Between Two `SNAPSHOT` IDs

`SELECT` Data between the Most Recent `SNAPSHOT` (HEAD) and a specific `SNAPSHOT`

`SELECT` Data Between a Specific `SNAPSHOT` and the Earliest `SNAPSHOT` (TAIL)