STATSML 800: Turbocharging Insights with Data Distiller: A Hypercube Approach to Big Data Analytics
Turning Big Data into Big Insights with Speed, Precision, and Scalability
Ingest the following CSV files by following the steps in the tutorial below.
E-commerce platforms generate an overwhelming amount of interaction data daily, capturing every click, view, and purchase. Understanding user behavior across product categories is essential for tailoring promotions, uncovering preferences, and ultimately driving revenue. However, the sheer scale of this data creates significant challenges, particularly when attempting to process it efficiently for actionable insights.
Data Distiller offers a solution by simplifying big data analysis through its use of hypercubes and sketches. These advanced tools enable efficient aggregation and processing of data, drastically reducing computational overhead. In this case study, we leverage Data Distiller to achieve three key objectives: counting distinct users per category, analyzing user behavior across multiple categories, and merging data efficiently without the need to reprocess historical datasets.
Analyzing e-commerce data requires addressing fundamental questions: How many unique users interacted with each category? What patterns emerge in cross-category behaviors? And how can insights be delivered without repeatedly recalculating metrics? Traditional systems fall short here, often re-reading raw data and recalculating metrics from scratch, which is both time-intensive and resource-heavy.
By utilizing hypercubes, Data Distiller overcomes these inefficiencies. It employs probabilistic data structures, such as sketches, to create compact, efficient summaries of datasets. This approach not only accelerates processing but also ensures scalability, allowing organizations to focus on driving insights and delivering value to their users.
The dataset represents simulated user interactions on an e-commerce platform, capturing a broad range of activity from 100 unique users over the course of November 2024. Each user, identified by a unique user_id (e.g., U1, U2), engages with the platform through multiple actions, interacting with products across various categories. These categories, including Electronics, Apparel, Home Goods, Books, and Beauty, reflect common e-commerce offerings and provide a foundation for analyzing user preferences and behaviors.
Each interaction is tied to a specific product_id (e.g., P101, P102), enabling detailed tracking of user-product engagements. The interaction_time field, recorded as a timestamp, offers insights into when these interactions occur, revealing temporal patterns such as peak shopping hours or specific dates of increased activity. The dataset spans the entire month, providing a comprehensive view of user activity over time.
User actions are categorized into three interaction_type values: view, purchase, and cart_add. These types represent the customer journey, from initial product exploration to the decision to buy or save an item for later. By capturing these diverse actions, the dataset enables a deeper understanding of customer intent and conversion rates across different product categories.
This rich dataset is ideal for exploring questions such as: How many unique users interacted with each category? Which products or categories drive the most purchases? Are there patterns in user behavior across different times of the day or days of the week? It provides a solid foundation for analytics, segmentation, and predictive modeling, making it a valuable resource for developing strategies to enhance customer engagement and drive revenue.
The schema looks like the following (an illustrative table definition appears after the list):
user_id: Unique identifier for users.
product_id: Identifier for products.
category: Product category.
interaction_time: Timestamp of interaction.
interaction_type: Type of interaction (e.g., view, purchase).
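A minimal table definition consistent with this schema is sketched below; the DDL and data types are assumptions for illustration, since the actual table is created by ingesting the CSV files.

```sql
-- Illustrative DDL; types are assumed, the real table comes from the CSV ingestion
CREATE TABLE user_interactions (
    user_id          STRING,     -- e.g., U1, U2
    product_id       STRING,     -- e.g., P101, P102
    category         STRING,     -- e.g., Electronics, Apparel, Home Goods
    interaction_time TIMESTAMP,  -- when the interaction occurred
    interaction_type STRING      -- view, purchase, or cart_add
);
```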
Cardinality-based insights are critical for understanding true audience reach, optimizing resource allocation, and driving personalized user engagement. However, deriving these insights from traditional methods can be prohibitively expensive in terms of both computation and storage. This is where HyperLogLog (HLL) sketches come into play, revolutionizing how businesses calculate cardinality by offering a fast, scalable, and cost-efficient solution.
Traditional methods for computing cardinality involve storing and processing raw data to identify unique elements. For example, counting unique users interacting with our e-commerce platform over multiple campaigns requires repeatedly scanning through massive datasets, de-duplicating entries, and aggregating results. This approach demands substantial computational power and storage resources, which scale poorly as datasets grow. As a result, businesses face escalating infrastructure costs, slower query execution times, and significant delays in delivering actionable insights.
Additionally, traditional systems struggle with real-time analytics. To answer a simple question like, "How many unique users engaged with Campaign A over the last 30 days?" businesses must process historical data alongside new interactions, often leading to inefficiencies and delays.
HyperLogLog (HLL) sketches are a probabilistic data structure designed to estimate the cardinality, or the number of unique elements, within a dataset. Unlike traditional methods that store and process every individual element to compute distinct counts, HLL sketches use a compact representation that drastically reduces memory requirements. They achieve this efficiency by using hash functions to map dataset elements to binary values and then analyzing the patterns of trailing zeroes in these hashed values. The longer the sequence of trailing zeroes, the rarer the element, which provides a statistical basis for estimating the overall cardinality. The resulting HLL sketch is a small, fixed-size data object that can represent billions of unique items with a high degree of accuracy.
One of the key benefits of HLL sketches is their remarkable efficiency in handling large-scale datasets. Because the size of the sketch remains constant regardless of the dataset's size, they are highly scalable and suitable for big data applications. This efficiency makes them particularly valuable for systems that need to process streaming data or perform real-time analytics, as they can quickly update the sketch with minimal computational overhead.
Another significant advantage of HLL sketches is their ability to support operations like merging. By combining two or more sketches, it is possible to estimate the unique count of a union of datasets without accessing the original data. This property is incredibly useful in distributed systems where data is processed in parallel across multiple nodes. HLL sketches enable these systems to efficiently consolidate results and provide global insights with minimal communication overhead.
In marketing, one of the fundamental metrics is understanding how many unique users engage with a campaign over specific date ranges. Traditional methods of calculating unique users require processing raw data repeatedly, which becomes computationally expensive and slow as data scales. HyperLogLog (HLL) sketches provide a solution by offering compact and efficient cardinality estimation.
For example, consider a scenario where a campaign spans multiple weeks, and the goal is to understand unique user engagement week-by-week or across the entire campaign period. By leveraging HLL sketches, a sketch is created for each week's user interactions. These sketches, which represent the unique users for each week, can be stored and later merged to estimate the total number of unique users for the entire campaign without requiring access to the original data. This capability is particularly valuable for real-time reporting, as it eliminates the need to reprocess historical data whenever new information is added.
Furthermore, HLL sketches can be used to compare user engagement across date ranges. For instance, you might want to see how many users who interacted with the campaign in the first week returned in subsequent weeks. This overlap analysis becomes seamless with sketches, as you can compute intersections and unions of sketches across different periods to reveal trends, retention rates, and campaign effectiveness. These insights allow marketers to fine-tune their strategies, optimize engagement, and measure campaign ROI efficiently.
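As a rough sketch of this pattern, the queries below build one sketch per campaign week and then merge them. The table name weekly_campaign_sketches and the precision parameter of 10 are assumptions for illustration.

```sql
-- Build one HLL sketch of unique users per campaign week
CREATE TABLE weekly_campaign_sketches AS
SELECT
    DATE_TRUNC('week', interaction_time) AS campaign_week,
    hll_build_agg(user_id, 10) AS user_sketch
FROM user_interactions
GROUP BY DATE_TRUNC('week', interaction_time);

-- Merge the weekly sketches to estimate unique users across the whole campaign
SELECT hll_merge_count_agg(user_sketch) AS campaign_unique_users
FROM weekly_campaign_sketches;
```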
Segmentation is critical in personalized marketing, where campaigns are tailored to specific subsets of users based on their characteristics or behaviors. Microsegmentation takes this concept further, dividing users into highly granular groups based on multiple dimensions such as location, product preferences, device type, and interaction type. Calculating metrics like unique users for these microsegments can quickly become unmanageable as the number of dimensions and their combinations increase.
HyperLogLog sketches enable efficient microsegmentation by allowing unique counts to be computed along multiple dimensions without recalculating from raw data. For example, an e-commerce platform might create HLL sketches for users who viewed products, added them to the cart, or made a purchase, segmented by categories like "Electronics," "Apparel," or "Books." These sketches can then be further segmented by other dimensions such as geographical regions or device types. Marketers can instantly estimate the number of unique users in any segment or combination of segments without additional processing.
In practice, this allows businesses to identify high-value microsegments, such as users in a specific region who frequently purchase a particular product category. Additionally, HLL sketches can help track microsegment growth over time or analyze overlaps between segments, such as users who interact with multiple categories. By unlocking insights at this granular level, businesses can deliver hyper-targeted campaigns, enhance user experiences, and maximize conversion rates while maintaining scalability and efficiency in their data operations.
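A hypothetical sketch of this idea is shown below; it assumes the interactions table also carries region and device_type columns, which are not part of the sample schema above.

```sql
-- Hypothetical microsegment sketches; region and device_type are assumed columns
CREATE TABLE microsegment_sketches AS
SELECT
    category,
    region,
    device_type,
    interaction_type,
    hll_build_agg(user_id, 10) AS user_sketch
FROM user_interactions
GROUP BY category, region, device_type, interaction_type;

-- Roll up any combination of segments without touching raw data,
-- e.g., unique purchasers of Electronics by region
SELECT
    region,
    hll_merge_count_agg(user_sketch) AS unique_users
FROM microsegment_sketches
WHERE category = 'Electronics' AND interaction_type = 'purchase'
GROUP BY region;
```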
In marketing, knowing the total number of unique users engaging with a campaign provides a clear picture of its actual reach. Without cardinality, repeated interactions from the same users might inflate metrics, leading to an overestimation of success. By accurately measuring unique engagements, businesses can assess the effectiveness of their campaigns, allocate resources more effectively, and ensure they are reaching the intended audience.
For instance, a campaign may generate 1 million clicks, but if only 100,000 unique users are responsible for those clicks, it indicates a concentration of activity among a small audience. This insight might prompt marketers to expand their targeting strategies to reach a broader demographic.
To calculate distinct users for each category, we'll aggregate interactions using the hll_build_agg function. This function creates a compact sketch for estimating unique users.
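A minimal form of this query, using a precision parameter of 10 (discussed further below), looks like the following:

```sql
-- Build one HLL sketch of unique user_ids per product category
CREATE TABLE category_sketches AS
SELECT
    category,
    hll_build_agg(user_id, 10) AS user_sketch
FROM user_interactions
GROUP BY category;
```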
This SQL query creates a new table named category_sketches to store compact representations of unique user interactions with different product categories. It groups the data from the existing user_interactions table by the category column and applies the hll_build_agg function to the user_id column within each category. Additionally, the query specifies a parameter for the hll_build_agg function, which defines the precision of the HyperLogLog (HLL) sketch by setting the number of buckets used in the estimation. The HLL sketch, a probabilistic data structure, efficiently estimates the number of unique users (cardinality) in each category without storing or scanning all individual user IDs.
The resulting table, category_sketches, contains two columns: category, which identifies the product category, and user_sketch, which holds the HLL sketch for that category, configured with the specified precision level. By adjusting the parameter, the query balances accuracy and memory efficiency, making it adaptable for different use cases. This approach reduces data size and enables scalable, cost-effective cardinality calculations for insights such as audience reach or engagement patterns across categories.
Creation of the HLL sketch column in the table looks like the following in DBVisualizer:
This is what the result looks like after executing a SELECT query on the resulting dataset.
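A minimal form of that query, shown here for reference:

```sql
SELECT category, user_sketch
FROM category_sketches;
```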
In this query result, the column labeled USER_SKETCH contains HyperLogLog (HLL) sketches, which are compact probabilistic representations of the unique users interacting within each category. These sketches are generated by the hll_build_agg function applied to the user_id column during the query.
Each sketch encodes the distinct user IDs for the corresponding CATEGORY (e.g., "Home Goods," "Apparel"). The encoded string in the USER_SKETCH column is not raw data but a fixed-size structure that estimates the cardinality (number of unique user IDs) efficiently. This enables large-scale datasets to be summarized in a memory-efficient manner, as the size of each sketch remains small regardless of the number of users in the category.
These sketches can be used in subsequent queries to quickly calculate the estimated unique user counts (hll_estimate), combine sketches from different categories (hll_merge_agg), or analyze overlaps between categories. This approach avoids repeatedly processing raw data, reducing computational cost and time while maintaining accuracy for decision-making.
All Data Distiller SQL queries for creation, merging, and estimating unique counts are fully functional across both the Data Lake and the Data Distiller Warehouse, also known as the Accelerated Store.
At present, sketch columns are immutable and cannot be updated after creation. However, future updates are expected to introduce functionality that allows for updating existing sketch columns. This enhancement will enable more effective handling of scenarios such as missed processing runs or late-arriving data, ensuring greater flexibility and accuracy in data management workflows.
Sometimes, you want to build a single HLL sketch that combines multiple unique identifiers from the same dataset. For example:
In a multi-channel marketing context, you might want to track a user's unique interactions across email, app, and web by combining email_id, app_user_id, and web_cookie_id into a single sketch.
In Adobe Real-Time Customer Data Platform, users have multiple identifiers; combining these into a single sketch ensures accurate cardinality estimation across different data sources.
If our dataset includes email_id, app_user_id, and web_cookie_id instead of a guaranteed user_id, you can use the COALESCE function to ensure that at least one non-null identifier is used for generating the HLL sketch.
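A sketch of such a query, reusing the hypothetical identifier columns described above:

```sql
-- Hypothetical identifier columns; COALESCE picks the first non-null ID per row
CREATE TABLE category_sketches AS
SELECT
    category,
    hll_build_agg(COALESCE(email_id, app_user_id, web_cookie_id), 10) AS user_sketch
FROM user_interactions
GROUP BY category;
```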
If you look at the code for hll_build_agg above, you will observe that it has a configuration parameter of 10. If you do not specify this value, the default value of 12 is chosen.
The configuration parameter specifies the log-base-2 of the number of buckets (K) used in the HLL sketch. Buckets are the internal data structures within the sketch used to estimate cardinality. Increasing the parameter increases the number of buckets, improving the precision of the cardinality estimate but also requiring more memory to store the sketch. The total number of buckets is calculated as K = 2^parameter; for example, a parameter of 10 yields 2^10 = 1024 buckets, and the default of 12 yields 4096.
The valid range of the parameter is from 4 to 12:
Minimum value: 4 (16 buckets, low precision, very memory efficient). Lower values are sufficient for exploratory analysis.
Maximum value: 12 (4096 buckets, high precision, higher memory usage). A high value may be required for highly sensitive financial or compliance reporting.
In HLL sketches, the confidence interval is the range within which the true cardinality is expected to fall, given the estimated value. The size of this range is inversely proportional to K, the number of buckets. In simpler terms:
As K increases, the confidence interval becomes narrower, meaning the estimate is more precise.
A smaller K results in a wider confidence interval, meaning the estimate is less precise but requires less memory.
The confidence interval for HLL typically follows a standard format, with a relative standard error of approximately 1.04 / √K.
Implications of K for Confidence Intervals:
Higher value of K (e.g. parameter value of 12):
K=4096 implies that the relative error is 0.016 (or 1.6%).
The estimate will have a tight confidence interval, making it highly reliable.
This configuration is useful for scenarios requiring high precision, such as compliance reporting or sensitive financial analytics.
Lower value of K (e.g. parameter value of 10):
K=1024 implies that the relative error increases to 0.032 (or 3.2%).
The confidence interval is slightly wider, making the estimate less precise but still sufficient for general analytics.
This setup is memory-efficient and suitable for exploratory or real-time analytics where speed is prioritized over absolute precision.
The hll_estimate function calculates the estimated number of unique users for each category.
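A minimal form of that query (the column alias is illustrative):

```sql
SELECT
    category,
    hll_estimate(user_sketch) AS unique_users
FROM category_sketches;
```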
The result is:
If we had executed the above query in the old-fashioned way, we would compute an exact distinct count directly from the raw data.
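A sketch of that exact-count query, with no sketches involved:

```sql
-- Exact distinct count; requires scanning and de-duplicating the raw data each time
SELECT
    category,
    COUNT(DISTINCT user_id) AS unique_users
FROM user_interactions
GROUP BY category;
```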
The results are nearly identical due to the smaller data size, highlighting how sketches become significantly more efficient as the scale of the dataset increases.
Our use case is to calculate the total unique users across all categories. Instead of recomputing the distinct counts from raw data, we can use a merge function like hll_merge_agg, which deduplicates the unique IDs across each of these dimensions efficiently.
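A minimal form of that merge (the output alias merged_sketch is illustrative):

```sql
-- Merge the per-category sketches into a single sketch covering all categories
SELECT hll_merge_agg(user_sketch) AS merged_sketch
FROM category_sketches;
```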
This query is specifically designed to merge the HyperLogLog (HLL) sketches from all the categories (e.g., "Home Goods," "Apparel," "Books," "Beauty," and "Electronics") into a single, compact sketch. This merged sketch represents the estimated total unique users across all categories combined, ensuring that users appearing in multiple categories are only counted once.
To analyze behavior across categories, hll_merge_agg allows us to combine individual category-level sketches into a single sketch that maintains cardinality estimates without requiring access to the raw data. This approach is computationally efficient and scalable, making it ideal for handling large datasets or performing cross-category audience analysis.
The result looks like the following:
Our use case is to calculate the total number of distinct users across all categories while directly deriving the final estimated count. Instead of merging sketches and performing an additional estimation step, we can use the hll_merge_count_agg function, which not only combines the HyperLogLog (HLL) sketches from each category but also calculates the estimated total number of unique users in one step.
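A minimal form of that query (the output alias is illustrative):

```sql
-- Merge all category sketches and return the estimated distinct-user count in one step
SELECT hll_merge_count_agg(user_sketch) AS total_unique_users
FROM category_sketches;
```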
This query efficiently aggregates the HLL sketches from all categories (e.g., "Home Goods," "Apparel," "Books," "Beauty," and "Electronics"), deduplicating unique IDs across these categories and directly returning the estimated count of distinct users. By using hll_merge_count_agg, we streamline the process of combining category-level sketches while avoiding overcounting users who interacted with multiple categories.
The function simplifies cross-category analysis by eliminating the need for a separate hll_estimate step after merging. This makes it ideal for scenarios where the primary objective is to retrieve the final count of unique users across all dimensions with minimal processing overhead, ensuring accuracy and scalability for large datasets.
The result looks like the following:
hll_merge_agg + hll_estimate
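A sketch of the two-step approach, written here with a CTE for clarity (names are illustrative):

```sql
-- Step 1: merge the per-category sketches; Step 2: estimate the merged cardinality
WITH merged AS (
    SELECT hll_merge_agg(user_sketch) AS merged_sketch
    FROM category_sketches
)
SELECT hll_estimate(merged_sketch) AS total_unique_users
FROM merged;
```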
The result will be:
When to use this approach:
This approach is more flexible because the merged sketch can be reused for additional operations (e.g., further aggregations, intersections, or unions with other sketches) beyond just estimating the cardinality.
It is ideal if you need both the merged sketch for downstream use and the estimated count.
hll_merge_count_agg
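And the one-step equivalent, for comparison:

```sql
-- Merge and estimate in a single operation; no intermediate sketch is produced
SELECT hll_merge_count_agg(user_sketch) AS total_unique_users
FROM category_sketches;
```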
The result will be:
When to use this approach:
This approach is more streamlined and efficient when the goal is solely to get the final estimated count of distinct users.
It avoids creating an intermediate merged sketch, saving processing time and memory if the merged sketch is not needed for further analysis.
Flexibility: The hll_merge_agg + hll_estimate approach provides an intermediate sketch (merged_sketch) that can be reused, offering more flexibility for additional operations. In contrast, hll_merge_count_agg is a one-step solution that calculates the count without producing a reusable sketch.
Efficiency: If your goal is just the final distinct count, hll_merge_count_agg is more efficient because it combines merging and estimation in a single operation.
Reusability: If further operations (e.g., intersections, unions, or additional merges) are needed with the combined data, hll_merge_agg is preferred because it generates a reusable merged sketch.
Both approaches yield the same estimated result when the goal is only to calculate the total number of distinct users. However, hll_merge_agg is more versatile, while hll_merge_count_agg is optimized for simplicity and efficiency when no further operations are required. Your choice depends on whether you need the intermediate sketch for additional analysis.
As previously mentioned, sketch columns are immutable and cannot be modified after they are created. However, new rows containing sketch columns can be added, and aggregations can be performed on these rows to incorporate the new data into the analysis.
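For example, new sketch rows for a later batch of interactions could be appended as follows; the source table name user_interactions_new_batch is an assumption for illustration.

```sql
-- Append new sketch rows built from a later batch of interactions;
-- note the same precision parameter (10) as the original CREATE
INSERT INTO category_sketches
SELECT
    category,
    hll_build_agg(user_id, 10) AS user_sketch
FROM user_interactions_new_batch
GROUP BY category;
```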
The SELECT query will show multiple rows:
The aggregate count query shows the following; ensure that you use the GROUP BY clause, since you now have multiple rows with the same category name.
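A sketch of that aggregation:

```sql
-- Merge the multiple sketch rows per category and estimate distinct users
SELECT
    category,
    hll_merge_count_agg(user_sketch) AS unique_users
FROM category_sketches
GROUP BY category;
```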
Ensure that the bucketing configuration parameter (and therefore K, the number of buckets) remains consistent across all INSERT and CREATE queries. This is crucial because the merge and aggregation functions require all sketches to have the same number of buckets in order to work correctly. Inconsistent bucketing configurations will result in errors during these operations.
To effectively manage and track data updates when creating new rows with aggregates, it's important to include a timestamp column that records the day of processing. This timestamp ensures that each new block of data can be tied to its processing date, enabling better traceability, data auditing, and incremental updates. By recording the processing date, you can differentiate between historical and newly added data, making it easier to debug, analyze trends, and optimize queries. This approach is especially useful in scenarios where data arrives in batches or where late-arriving data needs to be incorporated incrementally.
You will need to rewrite the query the following way and execute it block by block:
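(In the sketch below, the table names category_sketches_dated and user_interactions_new_batch are illustrative assumptions.)

```sql
-- Block 1: initial build, with each row tagged with its processing date
CREATE TABLE category_sketches_dated AS
SELECT
    category,
    hll_build_agg(user_id, 10) AS user_sketch,
    CURRENT_DATE AS processing_date
FROM user_interactions
GROUP BY category;

-- Block 2: subsequent runs append new sketch rows tagged with their own processing date
INSERT INTO category_sketches_dated
SELECT
    category,
    hll_build_agg(user_id, 10) AS user_sketch,
    CURRENT_DATE AS processing_date
FROM user_interactions_new_batch
GROUP BY category;
```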
The results of the SELECT will be:
The aggregation will yield the same result: