STATSML 800: Turbocharging Insights with Data Distiller: A Hypercube Approach to Big Data Analytics

Turning Big Data into Big Insights with Speed, Precision, and Scalability


Prerequisites

Ingest the following CSV files, user_interactions.csv and new_user_interactions.csv, by following the steps in PREP 500: Ingesting CSV Data into Adobe Experience Platform.

Case Study Overview

E-commerce platforms generate an overwhelming amount of interaction data daily, capturing every click, view, and purchase. Understanding user behavior across product categories is essential for tailoring promotions, uncovering preferences, and ultimately driving revenue. However, the sheer scale of this data creates significant challenges, particularly when attempting to process it efficiently for actionable insights.

Data Distiller offers a solution by simplifying big data analysis through its use of hypercubes and sketches. These advanced tools enable efficient aggregation and processing of data, drastically reducing computational overhead. In this case study, we leverage Data Distiller to achieve three key objectives: counting distinct users per category, analyzing user behavior across multiple categories, and merging data efficiently without the need to reprocess historical datasets.

Analyzing e-commerce data requires addressing fundamental questions: How many unique users interacted with each category? What patterns emerge in cross-category behaviors? And how can insights be delivered without repeatedly recalculating metrics? Traditional systems fall short in this regard, often requiring the re-reading of raw data and recalculating metrics, which is both time-intensive and resource-heavy.

By utilizing hypercubes, Data Distiller overcomes these inefficiencies. It employs probabilistic data structures, such as sketches, to create compact, efficient summaries of datasets. This approach not only accelerates processing but also ensures scalability, allowing organizations to focus on driving insights and delivering value to their users.

Understanding the Dataset

The dataset represents simulated user interactions on an e-commerce platform, capturing a broad range of activity from 100 unique users over the course of November 2024. Each user, identified by a unique user_id (e.g., U1, U2), engages with the platform through multiple actions, interacting with products across various categories. These categories, including Electronics, Apparel, Home Goods, Books, and Beauty, reflect common e-commerce offerings and provide a foundation for analyzing user preferences and behaviors.

Each interaction is tied to a specific product_id (e.g., P101, P102), enabling detailed tracking of user-product engagements. The interaction_time field, recorded as a timestamp, offers insights into when these interactions occur, revealing temporal patterns such as peak shopping hours or specific dates of increased activity. The dataset spans the entire month, providing a comprehensive view of user activity over time.

User actions are categorized into three interaction_types: view, purchase, and cart_add. These types represent the customer journey, from initial product exploration to the decision to buy or save an item for later. By capturing these diverse actions, the dataset enables a deeper understanding of customer intent and conversion rates across different product categories.

This rich dataset is ideal for exploring questions such as: How many unique users interacted with each category? Which products or categories drive the most purchases? Are there patterns in user behavior across different times of the day or days of the week? It provides a solid foundation for analytics, segmentation, and predictive modeling, making it a valuable resource for developing strategies to enhance customer engagement and drive revenue.

The schema looks like the following:

  • user_id: Unique identifier for users.

  • product_id: Identifier for products.

  • category: Product category.

  • interaction_time: Timestamp of interaction.

  • interaction_type: Type of interaction (e.g., view, purchase).

HyperLogLog Sketches: The Key to Scalable and Efficient Big Data Insights on Unique Counts

Cardinality-based insights are critical for understanding true audience reach, optimizing resource allocation, and driving personalized user engagement. However, deriving these insights from traditional methods can be prohibitively expensive in terms of both computation and storage. This is where HyperLogLog (HLL) sketches come into play, revolutionizing how businesses calculate cardinality by offering a fast, scalable, and cost-efficient solution.

Traditional methods for computing cardinality involve storing and processing raw data to identify unique elements. For example, counting unique users interacting with our e-commerce platform over multiple campaigns requires repeatedly scanning through massive datasets, de-duplicating entries, and aggregating results. This approach demands substantial computational power and storage resources, which scale poorly as datasets grow. As a result, businesses face escalating infrastructure costs, slower query execution times, and significant delays in delivering actionable insights.

Additionally, traditional systems struggle with real-time analytics. To answer a simple question like, "How many unique users engaged with Campaign A over the last 30 days?" businesses must process historical data alongside new interactions, often leading to inefficiencies and delays.

HyperLogLog (HLL) sketches are a probabilistic data structure designed to estimate the cardinality, or the number of unique elements, within a dataset. Unlike traditional methods that store and process every individual element to compute distinct counts, HLL sketches use a compact representation that drastically reduces memory requirements. They achieve this efficiency by using hash functions to map dataset elements to binary values and then analyzing the patterns of trailing zeroes in these hashed values. The longer the sequence of trailing zeroes, the rarer the element, which provides a statistical basis for estimating the overall cardinality. The resulting HLL sketch is a small, fixed-size data object that can represent billions of unique items with a high degree of accuracy.
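As a rough intuition for why this works (a simplified back-of-the-envelope argument, not the exact HLL estimator): for a well-behaved hash function, each additional trailing zero bit is half as likely, so

P(a hashed value ends in at least r zero bits) = 2^(-r)

Observing a maximum run of r trailing zeroes therefore suggests on the order of 2^r distinct elements. HLL splits the hashed values across many buckets, keeps this statistic per bucket, and averages across buckets to reduce the variance of the estimate.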

One of the key benefits of HLL sketches is their remarkable efficiency in handling large-scale datasets. Because the size of the sketch remains constant regardless of the dataset's size, they are highly scalable and suitable for big data applications. This efficiency makes them particularly valuable for systems that need to process streaming data or perform real-time analytics, as they can quickly update the sketch with minimal computational overhead.

Another significant advantage of HLL sketches is their ability to support operations like merging. By combining two or more sketches, it is possible to estimate the unique count of a union of datasets without accessing the original data. This property is incredibly useful in distributed systems where data is processed in parallel across multiple nodes. HLL sketches enable these systems to efficiently consolidate results and provide global insights with minimal communication overhead.

Use Case: Campaign Uniques Across Date Ranges

In marketing, one of the fundamental metrics is understanding how many unique users engage with a campaign over specific date ranges. Traditional methods of calculating unique users require processing raw data repeatedly, which becomes computationally expensive and slow as data scales. HyperLogLog (HLL) sketches provide a solution by offering compact and efficient cardinality estimation.

For example, consider a scenario where a campaign spans multiple weeks, and the goal is to understand unique user engagement week-by-week or across the entire campaign period. By leveraging HLL sketches, a sketch is created for each week's user interactions. These sketches, which represent the unique users for each week, can be stored and later merged to estimate the total number of unique users for the entire campaign without requiring access to the original data. This capability is particularly valuable for real-time reporting, as it eliminates the need to reprocess historical data whenever new information is added.

Furthermore, HLL sketches can be used to compare user engagement across date ranges. For instance, you might want to see how many users who interacted with the campaign in the first week returned in subsequent weeks. This overlap analysis becomes seamless with sketches, as you can compute intersections and unions of sketches across different periods to reveal trends, retention rates, and campaign effectiveness. These insights allow marketers to fine-tune their strategies, optimize engagement, and measure campaign ROI efficiently.
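A minimal sketch of the week-by-week pattern described above, assuming a hypothetical campaign_interactions dataset with user_id and interaction_time columns and using the Data Distiller sketch functions introduced later in this tutorial:

-- Build one sketch of unique users per campaign week (campaign_interactions is a hypothetical dataset)
CREATE TABLE weekly_campaign_sketches AS
SELECT 
    DATE_TRUNC('week', interaction_time) AS campaign_week,
    hll_build_agg(user_id, 10) AS weekly_user_sketch
FROM 
    campaign_interactions
GROUP BY 
    DATE_TRUNC('week', interaction_time);

-- Estimate unique users for each week
SELECT 
    campaign_week,
    hll_estimate(weekly_user_sketch) AS weekly_uniques
FROM 
    weekly_campaign_sketches;

-- Merge the weekly sketches to estimate uniques across the entire campaign
SELECT 
    hll_merge_count_agg(weekly_user_sketch) AS campaign_uniques
FROM 
    weekly_campaign_sketches;

New weeks can simply be appended as new rows of sketches; the campaign-level estimate is always one merge away, without rescanning earlier weeks.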

Use Case: Microsegments Along Various Dimensions

Segmentation is critical in personalized marketing, where campaigns are tailored to specific subsets of users based on their characteristics or behaviors. Microsegmentation takes this concept further, dividing users into highly granular groups based on multiple dimensions such as location, product preferences, device type, and interaction type. Calculating metrics like unique users for these microsegments can quickly become unmanageable as the number of dimensions and their combinations increase.

HyperLogLog sketches enable efficient microsegmentation by allowing unique counts to be computed along multiple dimensions without recalculating from raw data. For example, an e-commerce platform might create HLL sketches for users who viewed products, added them to the cart, or made a purchase, segmented by categories like "Electronics," "Apparel," or "Books." These sketches can then be further segmented by other dimensions such as geographical regions or device types. Marketers can instantly estimate the number of unique users in any segment or combination of segments without additional processing.

In practice, this allows businesses to identify high-value microsegments, such as users in a specific region who frequently purchase a particular product category. Additionally, HLL sketches can help track microsegment growth over time or analyze overlaps between segments, such as users who interact with multiple categories. By unlocking insights at this granular level, businesses can deliver hyper-targeted campaigns, enhance user experiences, and maximize conversion rates while maintaining scalability and efficiency in their data operations.
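A minimal sketch of this pattern against the tutorial's user_interactions dataset (a region or device_type dimension would be added to the GROUP BY in the same way if your data carries those columns):

-- Build sketches at the finest grain of interest: category x interaction_type
CREATE TABLE microsegment_sketches AS
SELECT 
    category,
    interaction_type,
    hll_build_agg(user_id, 10) AS user_sketch
FROM 
    user_interactions
GROUP BY 
    category,
    interaction_type;

-- Unique purchasers per category, read straight from the sketches
SELECT 
    category,
    hll_estimate(user_sketch) AS unique_purchasers
FROM 
    microsegment_sketches
WHERE 
    interaction_type = 'purchase';

-- Roll the microsegments back up to unique users per category without rereading raw data
SELECT 
    category,
    hll_merge_count_agg(user_sketch) AS unique_users
FROM 
    microsegment_sketches
GROUP BY 
    category;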

Use Case: Understanding True Audience Reach

In marketing, knowing the total number of unique users engaging with a campaign provides a clear picture of its actual reach. Without cardinality, repeated interactions from the same users might inflate metrics, leading to an overestimation of success. By accurately measuring unique engagements, businesses can assess the effectiveness of their campaigns, allocate resources more effectively, and ensure they are reaching the intended audience.

For instance, a campaign may generate 1 million clicks, but if only 100,000 unique users are responsible for those clicks, it indicates a concentration of activity among a small audience. This insight might prompt marketers to expand their targeting strategies to reach a broader demographic.
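As a small illustration of the gap between raw activity and true reach, the query below (a sketch using the hll_build_agg and hll_estimate functions introduced in the sections that follow) contrasts total interactions with the estimated number of unique users behind them:

-- Raw activity versus true reach on the tutorial's user_interactions dataset
SELECT 
    COUNT(*) AS total_interactions,
    hll_estimate(hll_build_agg(user_id, 10)) AS estimated_unique_users
FROM 
    user_interactions;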

Create HyperLogLog (HLL) Sketches

To calculate distinct users for each category, we'll aggregate interactions using the hll_build_agg function. This function creates a compact sketch for estimating unique users.

CREATE TABLE category_sketches AS
SELECT 
    category,
    hll_build_agg(user_id, 10) AS user_sketch
FROM 
    user_interactions
GROUP BY 
    category;
    
SELECT * FROM category_sketches;

This SQL query creates a new table named category_sketches to store compact representations of unique user interactions with different product categories. It groups the data from the existing user_interactions table by the category column and applies the hll_build_agg function to the user_id column within each category. Additionally, the query specifies a parameter for the hll_build_agg function, which defines the precision of the HyperLogLog (HLL) sketch by setting the number of buckets used in the estimation. The HLL sketch, a probabilistic data structure, efficiently estimates the number of unique users (cardinality) in each category without storing or scanning all individual user IDs.

The resulting table, category_sketches, contains two columns: category, which identifies the product category, and user_sketch, which holds the HLL sketch for that category, configured with the specified precision level. By adjusting the parameter, the query balances accuracy and memory efficiency, making it adaptable for different use cases. This approach reduces data size and enables scalable, cost-effective cardinality calculations for insights such as audience reach or engagement patterns across categories.

Creation of the HLL sketch column in the table looks like the following in DBVisualizer:

This is what the result looks like after executing a SELECT query on the resulting dataset:

In this query result, the column labeled USER_SKETCH contains HyperLogLog (HLL) sketches, which are compact probabilistic representations of the unique users interacting within each category. These sketches are generated by the hll_build_agg function applied to the user_id column during the query.

Each sketch encodes the distinct user IDs for the corresponding CATEGORY (e.g., "Home Goods," "Apparel"). The encoded string in the USER_SKETCH column is not raw data but a fixed-size structure that estimates the cardinality (number of unique user IDs) efficiently. This enables large-scale datasets to be summarized in a memory-efficient manner, as the size of each sketch remains small regardless of the number of users in the category.

These sketches can be used in subsequent queries to quickly calculate the estimated unique user counts (hll_estimate), combine sketches from different categories (hll_merge_agg), or analyze overlaps between categories. This approach avoids repeatedly processing raw data, reducing computational cost and time while maintaining accuracy for decision-making.

All Data Distiller SQL queries for creation, merging, and estimating unique counts are fully functional across both the Data Lake and the Data Distiller Warehouse, also known as the Accelerated Store.

At present, sketch columns are immutable and cannot be updated after creation. However, future updates are expected to introduce functionality that allows for updating existing sketch columns. This enhancement will enable more effective handling of scenarios such as missed processing runs or late-arriving data, ensuring greater flexibility and accuracy in data management workflows.

Sometimes, you want to build a single HLL sketch that combines multiple unique identifiers from the same dataset. For example:

  • In a multi-channel marketing context, you might want to track a user's unique interactions across email, app, and web by combining email_id, app_user_id, and web_cookie_id into a single sketch.

  • In Adobe Real-Time Customer Data Platform, users have multiple identifiers; combining these into a single sketch ensures accurate cardinality estimation across different data sources.

If our dataset includes email_id, app_user_id, and web_cookie_id instead of a guaranteed user_id, you can use the COALESCE function to ensure that at least one non-null identifier is used for generating the HLL sketch:

CREATE TABLE category_sketches AS
SELECT 
    category,
    hll_build_agg(COALESCE(email_id, app_user_id, web_cookie_id), 10) AS user_sketch
FROM 
    user_interactions
GROUP BY 
    category;

Configuration Parameters in HLL Sketches

If you look at the code for hll_build_agg above, you will observe that it has a configuration parameter of 10. If you do not specify this value, the default value of 12 is chosen.

hll_build_agg(user_id, 10) 

The configuration parameter specifies the log-base-2 of the number of buckets (K) used in the HLL sketch. Buckets are the internal data structures within the sketch used to estimate cardinality. Increasing the parameter increases the number of buckets, improving the precision of the cardinality estimate but also requiring more memory to store the sketch. The total number of buckets is calculated as K = 2^parameter, so a parameter value of 10 yields 2^10 = 1,024 buckets.

The valid range of the parameter is from 4 to 12:

  • Minimum Value: 4 (16 buckets, low precision, very memory efficient). Lower values are sufficient for exploratory analysis.

  • Maximum Value: 12 (4096 buckets, high precision, higher memory usage). A high value may be required for highly sensitive financial or compliance reporting.

Confidence Intervals in HLL

In HLL sketches, the confidence interval is the range within which the true cardinality is expected to fall, given the estimated value. The width of this range shrinks as K, the number of buckets, grows; more precisely, it is inversely proportional to the square root of K. In simpler terms:

  • As K increases, the confidence interval becomes narrower, meaning the estimate is more precise.

  • A smaller K results in a wider confidence interval, meaning the estimate is less precise but requires less memory.

The confidence interval for HLL is typically expressed in terms of the relative error of the estimate: Relative Error ≈ 1.04 / √K.

Implications of K for Confidence Intervals:

  1. Higher value of K (e.g. parameter value of 12):

    • K=4096 implies that the relative error is 0.016 (or 1.6%).

    • The estimate will have a tight confidence interval, making it highly reliable.

    • This configuration is useful for scenarios requiring high precision, such as compliance reporting or sensitive financial analytics.

  2. Lower value of K (e.g. parameter value of 10):

    • K=1024 implies that the relative error increases to 0.032 (or 3.2%).

    • The confidence interval is slightly wider, making the estimate less precise but still sufficient for general analytics.

    • This setup is memory-efficient and suitable for exploratory or real-time analytics where speed is prioritized over absolute precision.
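As a worked illustration of these numbers (rounded), suppose an estimate comes back as 10,000 distinct users with a parameter value of 10:

K = 2^10 = 1,024
Relative Error ≈ 1.04 / √1024 = 1.04 / 32 ≈ 3.2%
Expected range ≈ 10,000 ± 320, i.e. roughly 9,680 to 10,320 true distinct users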

Estimate Distinct User Counts

The hll_estimate function calculates the estimated number of unique users for each category.

SELECT 
    category,
    hll_estimate(user_sketch) AS distinct_users
FROM 
    category_sketches;

The result is:

If we had executed the above query in the old-fashioned way:

SELECT 
    category,
    COUNT(DISTINCT user_id) AS distinct_users
FROM 
    user_interactions
GROUP BY 
    category;

The results are nearly identical because the dataset is small; the efficiency advantage of sketches becomes significant as the scale of the dataset increases.
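If you want to inspect the approximation error directly, one way (a sketch assuming both the category_sketches table and the raw user_interactions dataset are available) is to place the two results side by side:

-- Compare the HLL estimate with the exact distinct count per category
SELECT 
    e.category,
    e.exact_users,
    hll_estimate(s.user_sketch) AS estimated_users
FROM 
    (SELECT category, COUNT(DISTINCT user_id) AS exact_users
     FROM user_interactions
     GROUP BY category) e
JOIN 
    category_sketches s
ON 
    e.category = s.category;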

Merge Sketches for Cross-Dimensional Analysis

Our use case is to calculate the total unique users across all categories. Instead of recomputing the distinct counts from raw data, we can use a merge function like hll_merge_agg, which deduplicates the unique IDs across each of these dimensions efficiently.

This query is specifically designed to merge the HyperLogLog (HLL) sketches from all the categories (e.g., "Home Goods," "Apparel," "Books," "Beauty," and "Electronics") into a single, compact sketch. This merged sketch represents the estimated total unique users across all categories combined, ensuring that users appearing in multiple categories are only counted once.

To analyze behavior across categories, hll_merge_agg allows us to combine individual category-level sketches into a single sketch that maintains cardinality estimates without requiring access to the raw data. This approach is computationally efficient and scalable, making it ideal for handling large datasets or performing cross-category audience analysis.

SELECT 
    hll_merge_agg(user_sketch) AS merged_sketch
FROM 
    category_sketches;

The result looks like the following:

Estimate Overall Distinct Users

Our use case is to calculate the total number of distinct users across all categories while directly deriving the final estimated count. Instead of merging sketches and performing an additional estimation step, we can use the hll_merge_count_agg function, which not only combines the HyperLogLog (HLL) sketches from each category but also calculates the estimated total number of unique users in one step.

This query efficiently aggregates the HLL sketches from all categories (e.g., "Home Goods," "Apparel," "Books," "Beauty," and "Electronics"), deduplicating unique IDs across these categories and directly returning the estimated count of distinct users. By using hll_merge_count_agg, we streamline the process of combining category-level sketches while avoiding overcounting users who interacted with multiple categories.

The function simplifies cross-category analysis by eliminating the need for a separate hll_estimate step after merging. This makes it ideal for scenarios where the primary objective is to retrieve the final count of unique users across all dimensions with minimal processing overhead, ensuring accuracy and scalability for large datasets.

SELECT 
    hll_merge_count_agg(user_sketch) AS total_distinct_users
FROM 
    category_sketches;

The result looks like the following:

Two Approaches and Their Tradeoffs

Approach 1: hll_merge_agg + hll_estimate

SELECT 
    hll_estimate(
        hll_merge_agg(user_sketch)
    ) AS total_distinct_users
FROM 
    category_sketches;

The result will be:

When to use this approach:

  • This approach is more flexible because the merged sketch can be reused for additional operations (e.g., further aggregations, intersections, or unions with other sketches) beyond just estimating the cardinality.

  • It is ideal if you need both the merged sketch for downstream use and the estimated count.

Approach 2: hll_merge_count_agg

SELECT 
    hll_merge_count_agg(user_sketch) AS total_distinct_users
FROM 
    category_sketches;

The result will be:

When to use this approach:

  • This approach is more streamlined and efficient when the goal is solely to get the final estimated count of distinct users.

  • It avoids creating an intermediate merged sketch, saving processing time and memory if the merged sketch is not needed for further analysis.

The Tradeoffs

  • Flexibility: The hll_merge_agg + hll_estimate approach provides an intermediate sketch (merged_sketch) that can be reused, offering more flexibility for additional operations. In contrast, hll_merge_count_agg is a one-step solution that calculates the count without producing a reusable sketch.

  • Efficiency: If your goal is just the final distinct count, hll_merge_count_agg is more efficient because it combines merging and estimation in a single operation.

  • Reusability: If further operations (e.g., intersections, unions, or additional merges) are needed with the combined data, hll_merge_agg is preferred because it generates a reusable merged sketch.

Both approaches yield the same estimated result when the goal is only to calculate the total number of distinct users. However, hll_merge_agg is more versatile, while hll_merge_count_agg is optimized for simplicity and efficiency when no further operations are required. Your choice depends on whether you need the intermediate sketch for additional analysis.
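As an illustration of the reusability point (a sketch; overall_user_sketch is a hypothetical table name), the merged sketch from Approach 1 can be persisted once and then estimated from, or merged again, without going back to the category-level sketches:

-- Persist the merged sketch once for downstream reuse
CREATE TABLE overall_user_sketch AS
SELECT 
    hll_merge_agg(user_sketch) AS merged_sketch
FROM 
    category_sketches;

-- Estimate from the stored sketch whenever needed
SELECT 
    hll_estimate(merged_sketch) AS total_distinct_users
FROM 
    overall_user_sketch;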

Incremental Additions to the Dataset

As previously mentioned, sketch columns are immutable and cannot be modified after they are created. However, new rows containing sketch columns can be added, and aggregations can be performed on these rows to incorporate the new data into the analysis.

-- Insert new rows 
INSERT INTO category_sketches
SELECT 
    category,
    hll_build_agg(user_id, 10) AS user_sketch
FROM 
    new_interactions
GROUP BY 
    category;

-- Examine the dataset
SELECT * FROM category_sketches;

-- Now execute the merge
SELECT 
    category,
    hll_merge_count_agg(user_sketch) AS updated_distinct_users
FROM 
    category_sketches
GROUP BY 
    category;

The SELECT query will show multiple rows:

The aggregate count query shows the following; ensure that you use the GROUP BY clause, since you have multiple rows with the same category name.

Ensure that the configuration parameter for bucketing (the precision parameter that determines K) remains consistent across all INSERT and CREATE queries. This is crucial because the merge and aggregation functions require all sketches to have the same number of buckets in order to work correctly. Inconsistent bucketing configurations will result in errors during these operations.

Best Practice with Incremental Additions

To effectively manage and track data updates when creating new rows with aggregates, it's important to include a timestamp column that records the day of processing. This timestamp ensures that each new block of data can be tied to its processing date, enabling better traceability, data auditing, and incremental updates. By recording the processing date, you can differentiate between historical and newly added data, making it easier to debug, analyze trends, and optimize queries. This approach is especially useful in scenarios where data arrives in batches or where late-arriving data needs to be incorporated incrementally.

You will need to rewrite the queries as follows and execute them block by block:

-- Create the empty dataset first
CREATE TABLE category_sketches AS
SELECT
  CAST(NULL AS STRING) AS category,
  CAST(NULL AS STRING) AS user_sketch,
  CAST(NULL AS TIMESTAMP) AS processing_date
WHERE FALSE;

-- Insert backfill data with a processing timestamp
INSERT INTO category_sketches
SELECT 
    category,
    hll_build_agg(user_id, 10) AS user_sketch,
    CAST(NOW() AS TIMESTAMP) AS processing_date
FROM 
    user_interactions
GROUP BY 
    category;

-- Insert new rows with a processing timestamp
INSERT INTO category_sketches
SELECT 
    category,
    hll_build_agg(user_id, 10) AS user_sketch,
    CAST(NOW() AS TIMESTAMP) AS processing_date
FROM 
    new_user_interactions
GROUP BY 
    category;
    
-- Examine the dataset
SELECT * FROM category_sketches;

-- Now execute the merge
SELECT 
    category,
    hll_merge_count_agg(user_sketch) AS updated_distinct_users
FROM 
    category_sketches
GROUP BY 
    category;

The results of the SELECT will be:

The aggregation will yield the same result:
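With the processing_date column in place, the estimate can also be scoped to a particular processing window. The query below is a sketch; the cutoff date is a hypothetical value:

-- Estimate unique users per category contributed by runs on or after a given processing date
SELECT 
    category,
    hll_merge_count_agg(user_sketch) AS distinct_users_since_cutoff
FROM 
    category_sketches
WHERE 
    processing_date >= CAST('2024-11-15' AS TIMESTAMP)
GROUP BY 
    category;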
