STATSML 800: Turbocharging Insights with Data Distiller: A Hypercube Approach to Big Data Analytics
Turning Big Data into Big Insights with Speed, Precision, and Scalability
Ingest the following CSV files by following the steps in the tutorial below.
E-commerce platforms generate an overwhelming amount of interaction data daily, capturing every click, view, and purchase. Understanding user behavior across product categories is essential for tailoring promotions, uncovering preferences, and ultimately driving revenue. However, the sheer scale of this data creates significant challenges, particularly when attempting to process it efficiently for actionable insights.
Data Distiller offers a solution by simplifying big data analysis through its use of hypercubes and sketches. These advanced tools enable efficient aggregation and processing of data, drastically reducing computational overhead. In this case study, we leverage Data Distiller to achieve three key objectives: counting distinct users per category, analyzing user behavior across multiple categories, and merging data efficiently without the need to reprocess historical datasets.
Analyzing e-commerce data requires addressing fundamental questions: How many unique users interacted with each category? What patterns emerge in cross-category behaviors? And how can insights be delivered without repeatedly recalculating metrics? Traditional systems fall short here, often re-reading raw data and recalculating metrics from scratch, which is both time-intensive and resource-heavy.
By utilizing hypercubes, Data Distiller overcomes these inefficiencies. It employs probabilistic data structures, such as sketches, to create compact, efficient summaries of datasets. This approach not only accelerates processing but also ensures scalability, allowing organizations to focus on driving insights and delivering value to their users.
The dataset represents simulated user interactions on an e-commerce platform, capturing a broad range of activity from 100 unique users over the course of November 2024. Each user, identified by a unique user_id (e.g., U1, U2), engages with the platform through multiple actions, interacting with products across various categories. These categories, including Electronics, Apparel, Home Goods, Books, and Beauty, reflect common e-commerce offerings and provide a foundation for analyzing user preferences and behaviors.
Each interaction is tied to a specific product_id (e.g., P101, P102), enabling detailed tracking of user-product engagements. The interaction_time field, recorded as a timestamp, offers insights into when these interactions occur, revealing temporal patterns such as peak shopping hours or specific dates of increased activity. The dataset spans the entire month, providing a comprehensive view of user activity over time.
User actions are categorized into three interaction_type values: view, purchase, and cart_add. These types represent the customer journey, from initial product exploration to the decision to buy or save an item for later. By capturing these diverse actions, the dataset enables a deeper understanding of customer intent and conversion rates across different product categories.
This rich dataset is ideal for exploring questions such as: How many unique users interacted with each category? Which products or categories drive the most purchases? Are there patterns in user behavior across different times of the day or days of the week? It provides a solid foundation for analytics, segmentation, and predictive modeling, making it a valuable resource for developing strategies to enhance customer engagement and drive revenue.
The schema looks like the following (an illustrative table definition appears after the list):
user_id: Unique identifier for users.
product_id: Identifier for products.
category: Product category.
interaction_time: Timestamp of interaction.
interaction_type: Type of interaction (e.g., view, purchase).
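A minimal table definition consistent with this schema is sketched below; the DDL and data types are assumptions for illustration, since the actual table is created by ingesting the CSV files.

```sql
-- Illustrative DDL; types are assumed, the real table comes from the CSV ingestion
CREATE TABLE user_interactions (
    user_id          STRING,     -- e.g., U1, U2
    product_id       STRING,     -- e.g., P101, P102
    category         STRING,     -- e.g., Electronics, Apparel, Home Goods
    interaction_time TIMESTAMP,  -- when the interaction occurred
    interaction_type STRING      -- view, purchase, or cart_add
);
```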
Cardinality-based insights are critical for understanding true audience reach, optimizing resource allocation, and driving personalized user engagement. However, deriving these insights from traditional methods can be prohibitively expensive in terms of both computation and storage. This is where HyperLogLog (HLL) sketches come into play, revolutionizing how businesses calculate cardinality by offering a fast, scalable, and cost-efficient solution.
Traditional methods for computing cardinality involve storing and processing raw data to identify unique elements. For example, counting unique users interacting with our e-commerce platform over multiple campaigns requires repeatedly scanning through massive datasets, de-duplicating entries, and aggregating results. This approach demands substantial computational power and storage resources, which scale poorly as datasets grow. As a result, businesses face escalating infrastructure costs, slower query execution times, and significant delays in delivering actionable insights.
Additionally, traditional systems struggle with real-time analytics. To answer a simple question like, "How many unique users engaged with Campaign A over the last 30 days?" businesses must process historical data alongside new interactions, often leading to inefficiencies and delays.
HyperLogLog (HLL) sketches are a probabilistic data structure designed to estimate the cardinality, or the number of unique elements, within a dataset. Unlike traditional methods that store and process every individual element to compute distinct counts, HLL sketches use a compact representation that drastically reduces memory requirements. They achieve this efficiency by using hash functions to map dataset elements to binary values and then analyzing the patterns of trailing zeroes in these hashed values. The longer the sequence of trailing zeroes, the rarer the element, which provides a statistical basis for estimating the overall cardinality. The resulting HLL sketch is a small, fixed-size data object that can represent billions of unique items with a high degree of accuracy.
One of the key benefits of HLL sketches is their remarkable efficiency in handling large-scale datasets. Because the size of the sketch remains constant regardless of the dataset's size, they are highly scalable and suitable for big data applications. This efficiency makes them particularly valuable for systems that need to process streaming data or perform real-time analytics, as they can quickly update the sketch with minimal computational overhead.
Another significant advantage of HLL sketches is their ability to support operations like merging. By combining two or more sketches, it is possible to estimate the unique count of a union of datasets without accessing the original data. This property is incredibly useful in distributed systems where data is processed in parallel across multiple nodes. HLL sketches enable these systems to efficiently consolidate results and provide global insights with minimal communication overhead.
In marketing, one of the fundamental metrics is understanding how many unique users engage with a campaign over specific date ranges. Traditional methods of calculating unique users require processing raw data repeatedly, which becomes computationally expensive and slow as data scales. HyperLogLog (HLL) sketches provide a solution by offering compact and efficient cardinality estimation.
For example, consider a scenario where a campaign spans multiple weeks, and the goal is to understand unique user engagement week-by-week or across the entire campaign period. By leveraging HLL sketches, a sketch is created for each week's user interactions. These sketches, which represent the unique users for each week, can be stored and later merged to estimate the total number of unique users for the entire campaign without requiring access to the original data. This capability is particularly valuable for real-time reporting, as it eliminates the need to reprocess historical data whenever new information is added.
Furthermore, HLL sketches can be used to compare user engagement across date ranges. For instance, you might want to see how many users who interacted with the campaign in the first week returned in subsequent weeks. This overlap analysis becomes seamless with sketches, as you can compute intersections and unions of sketches across different periods to reveal trends, retention rates, and campaign effectiveness. These insights allow marketers to fine-tune their strategies, optimize engagement, and measure campaign ROI efficiently.
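As a rough sketch of this pattern, the queries below build one sketch per campaign week and then merge them. The table name weekly_campaign_sketches and the precision parameter of 10 are assumptions for illustration.

```sql
-- Build one HLL sketch of unique users per campaign week
CREATE TABLE weekly_campaign_sketches AS
SELECT
    DATE_TRUNC('week', interaction_time) AS campaign_week,
    hll_build_agg(user_id, 10) AS user_sketch
FROM user_interactions
GROUP BY DATE_TRUNC('week', interaction_time);

-- Merge the weekly sketches to estimate unique users across the whole campaign
SELECT hll_merge_count_agg(user_sketch) AS campaign_unique_users
FROM weekly_campaign_sketches;
```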
Segmentation is critical in personalized marketing, where campaigns are tailored to specific subsets of users based on their characteristics or behaviors. Microsegmentation takes this concept further, dividing users into highly granular groups based on multiple dimensions such as location, product preferences, device type, and interaction type. Calculating metrics like unique users for these microsegments can quickly become unmanageable as the number of dimensions and their combinations increase.
HyperLogLog sketches enable efficient microsegmentation by allowing unique counts to be computed along multiple dimensions without recalculating from raw data. For example, an e-commerce platform might create HLL sketches for users who viewed products, added them to the cart, or made a purchase, segmented by categories like "Electronics," "Apparel," or "Books." These sketches can then be further segmented by other dimensions such as geographical regions or device types. Marketers can instantly estimate the number of unique users in any segment or combination of segments without additional processing.
In practice, this allows businesses to identify high-value microsegments, such as users in a specific region who frequently purchase a particular product category. Additionally, HLL sketches can help track microsegment growth over time or analyze overlaps between segments, such as users who interact with multiple categories. By unlocking insights at this granular level, businesses can deliver hyper-targeted campaigns, enhance user experiences, and maximize conversion rates while maintaining scalability and efficiency in their data operations.
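A hypothetical sketch of this idea is shown below; it assumes the interactions table also carries region and device_type columns, which are not part of the sample schema above.

```sql
-- Hypothetical microsegment sketches; region and device_type are assumed columns
CREATE TABLE microsegment_sketches AS
SELECT
    category,
    region,
    device_type,
    interaction_type,
    hll_build_agg(user_id, 10) AS user_sketch
FROM user_interactions
GROUP BY category, region, device_type, interaction_type;

-- Roll up any combination of segments without touching raw data,
-- e.g., unique purchasers of Electronics by region
SELECT
    region,
    hll_merge_count_agg(user_sketch) AS unique_users
FROM microsegment_sketches
WHERE category = 'Electronics' AND interaction_type = 'purchase'
GROUP BY region;
```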
In marketing, knowing the total number of unique users engaging with a campaign provides a clear picture of its actual reach. Without cardinality, repeated interactions from the same users might inflate metrics, leading to an overestimation of success. By accurately measuring unique engagements, businesses can assess the effectiveness of their campaigns, allocate resources more effectively, and ensure they are reaching the intended audience.
For instance, a campaign may generate 1 million clicks, but if only 100,000 unique users are responsible for those clicks, it indicates a concentration of activity among a small audience. This insight might prompt marketers to expand their targeting strategies to reach a broader demographic.
To calculate distinct users for each category, we'll aggregate interactions using the hll_build_agg function. This function creates a compact sketch for estimating unique users.
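A minimal form of this query, using a precision parameter of 10 (discussed further below), looks like the following:

```sql
-- Build one HLL sketch of unique user_ids per product category
CREATE TABLE category_sketches AS
SELECT
    category,
    hll_build_agg(user_id, 10) AS user_sketch
FROM user_interactions
GROUP BY category;
```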
This SQL query creates a new table named category_sketches to store compact representations of unique user interactions with different product categories. It groups the data from the existing user_interactions table by the category column and applies the hll_build_agg function to the user_id column within each category. Additionally, the query specifies a parameter for the hll_build_agg function, which defines the precision of the HyperLogLog (HLL) sketch by setting the number of buckets used in the estimation. The HLL sketch, a probabilistic data structure, efficiently estimates the number of unique users (cardinality) in each category without storing or scanning all individual user IDs.
The resulting table, category_sketches, contains two columns: category, which identifies the product category, and user_sketch, which holds the HLL sketch for that category, configured with the specified precision level. By adjusting the parameter, the query balances accuracy and memory efficiency, making it adaptable for different use cases. This approach reduces data size and enables scalable, cost-effective cardinality calculations for insights such as audience reach or engagement patterns across categories.
Creation of the HLL sketch column in the table looks like the following in DBVisualizer:
This is what the result looks like after executing a SELECT query on the resulting dataset.
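A minimal form of that query, shown here for reference:

```sql
SELECT category, user_sketch
FROM category_sketches;
```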
In this query result, the column labeled USER_SKETCH contains HyperLogLog (HLL) sketches, which are compact probabilistic representations of the unique users interacting within each category. These sketches are generated by the hll_build_agg function applied to the user_id column during the query.
Each sketch encodes the distinct user IDs for the corresponding CATEGORY (e.g., "Home Goods," "Apparel"). The encoded string in the USER_SKETCH column is not raw data but a fixed-size structure that estimates the cardinality (number of unique user IDs) efficiently. This enables large-scale datasets to be summarized in a memory-efficient manner, as the size of each sketch remains small regardless of the number of users in the category.
These sketches can be used in subsequent queries to quickly calculate the estimated unique user counts (hll_estimate), combine sketches from different categories (hll_merge_agg), or analyze overlaps between categories. This approach avoids repeatedly processing raw data, reducing computational cost and time while maintaining accuracy for decision-making.
All Data Distiller SQL queries for creation, merging, and estimating unique counts are fully functional across both the Data Lake and the Data Distiller Warehouse, also known as the Accelerated Store.
At present, sketch columns are immutable and cannot be updated after creation. However, future updates are expected to introduce functionality that allows for updating existing sketch columns. This enhancement will enable more effective handling of scenarios such as missed processing runs or late-arriving data, ensuring greater flexibility and accuracy in data management workflows.
Sometimes, you want to build a single HLL sketch that combines multiple unique identifiers from the same dataset. For example:
In a multi-channel marketing context, you might want to track a user's unique interactions across email, app, and web by combining email_id, app_user_id, and web_cookie_id into a single sketch.
In Adobe Real-Time Customer Data Platform, users have multiple identifiers; combining these into a single sketch ensures accurate cardinality estimation across different data sources.
If our dataset includes email_id, app_user_id, and web_cookie_id instead of a guaranteed user_id, you can use the COALESCE function to ensure that at least one non-null identifier is used for generating the HLL sketch.
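A sketch of such a query, reusing the hypothetical identifier columns described above:

```sql
-- Hypothetical identifier columns; COALESCE picks the first non-null ID per row
CREATE TABLE category_sketches AS
SELECT
    category,
    hll_build_agg(COALESCE(email_id, app_user_id, web_cookie_id), 10) AS user_sketch
FROM user_interactions
GROUP BY category;
```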
If you look at the code for hll_build_agg above, you will observe that it has a configuration parameter of 10. If you do not specify this value, the default value of 12 is chosen.
The configuration parameter specifies the log-base-2 of the number of buckets (K) used in the HLL sketch. Buckets are the internal data structures within the sketch used to estimate cardinality. Increasing the parameter increases the number of buckets, improving the precision of the cardinality estimate but also requiring more memory to store the sketch. The total number of buckets is calculated as K = 2^parameter; for example, a parameter of 10 yields 2^10 = 1024 buckets, and the default of 12 yields 4096.
The valid range of the parameter is from 4 to 12:
Minimum value: 4 (16 buckets, low precision, very memory efficient). Lower values are sufficient for exploratory analysis.
Maximum value: 12 (4096 buckets, high precision, higher memory usage). A high value may be required for highly sensitive financial or compliance reporting.
In HLL sketches, the confidence interval is the range within which the true cardinality is expected to fall, given the estimated value. The size of this range is inversely proportional to K, the number of buckets. In simpler terms:
As K increases, the confidence interval becomes narrower, meaning the estimate is more precise.
A smaller K results in a wider confidence interval, meaning the estimate is less precise but requires less memory.
The confidence interval for HLL typically follows a standard format, with a relative standard error of approximately 1.04 / √K.
Implications of K for Confidence Intervals:
Higher value of K (e.g. parameter value of 12):
K=4096 implies that the relative error is 0.016 (or 1.6%).
The estimate will have a tight confidence interval, making it highly reliable.
This configuration is useful for scenarios requiring high precision, such as compliance reporting or sensitive financial analytics.
Lower value of K (e.g. parameter value of 10):
K=1024 implies that the relative error increases to 0.032 (or 3.2%).
The confidence interval is slightly wider, making the estimate less precise but still sufficient for general analytics.
This setup is memory-efficient and suitable for exploratory or real-time analytics where speed is prioritized over absolute precision.
The hll_estimate function calculates the estimated number of unique users for each category.
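A minimal form of that query (the column alias is illustrative):

```sql
SELECT
    category,
    hll_estimate(user_sketch) AS unique_users
FROM category_sketches;
```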
The result is:
If we had executed the above query in the old-fashioned way, we would compute an exact distinct count directly from the raw data.
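A sketch of that exact-count query, with no sketches involved:

```sql
-- Exact distinct count; requires scanning and de-duplicating the raw data each time
SELECT
    category,
    COUNT(DISTINCT user_id) AS unique_users
FROM user_interactions
GROUP BY category;
```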
The results are nearly identical due to the smaller data size, highlighting how sketches become significantly more efficient as the scale of the dataset increases.
Our use case is to calculate the total unique users across all categories. Instead of recomputing the distinct counts from raw data, we can use a merge function like hll_merge_agg, which deduplicates the unique IDs across each of these dimensions efficiently.
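A minimal form of that merge (the output alias merged_sketch is illustrative):

```sql
-- Merge the per-category sketches into a single sketch covering all categories
SELECT hll_merge_agg(user_sketch) AS merged_sketch
FROM category_sketches;
```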
This query is specifically designed to merge the HyperLogLog (HLL) sketches from all the categories (e.g., "Home Goods," "Apparel," "Books," "Beauty," and "Electronics") into a single, compact sketch. This merged sketch represents the estimated total unique users across all categories combined, ensuring that users appearing in multiple categories are only counted once.
To analyze behavior across categories, hll_merge_agg allows us to combine individual category-level sketches into a single sketch that maintains cardinality estimates without requiring access to the raw data. This approach is computationally efficient and scalable, making it ideal for handling large datasets or performing cross-category audience analysis.
The result looks like the following:
Our use case is to calculate the total number of distinct users across all categories while directly deriving the final estimated count. Instead of merging sketches and performing an additional estimation step, we can use the hll_merge_count_agg function, which not only combines the HyperLogLog (HLL) sketches from each category but also calculates the estimated total number of unique users in one step.
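A minimal form of that query (the output alias is illustrative):

```sql
-- Merge all category sketches and return the estimated distinct-user count in one step
SELECT hll_merge_count_agg(user_sketch) AS total_unique_users
FROM category_sketches;
```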
This query efficiently aggregates the HLL sketches from all categories (e.g., "Home Goods," "Apparel," "Books," "Beauty," and "Electronics"), deduplicating unique IDs across these categories and directly returning the estimated count of distinct users. By using hll_merge_count_agg, we streamline the process of combining category-level sketches while avoiding overcounting users who interacted with multiple categories.
The function simplifies cross-category analysis by eliminating the need for a separate hll_estimate step after merging. This makes it ideal for scenarios where the primary objective is to retrieve the final count of unique users across all dimensions with minimal processing overhead, ensuring accuracy and scalability for large datasets.
The result looks like the following:
hll_merge_agg + hll_estimate
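A sketch of the two-step approach, written here with a CTE for clarity (names are illustrative):

```sql
-- Step 1: merge the per-category sketches; Step 2: estimate the merged cardinality
WITH merged AS (
    SELECT hll_merge_agg(user_sketch) AS merged_sketch
    FROM category_sketches
)
SELECT hll_estimate(merged_sketch) AS total_unique_users
FROM merged;
```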
The result will be:
When to use this approach:
This approach is more flexible because the merged sketch can be reused for additional operations (e.g., further aggregations, intersections, or unions with other sketches) beyond just estimating the cardinality.
It is ideal if you need both the merged sketch for downstream use and the estimated count.
hll_merge_count_agg
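And the one-step equivalent, for comparison:

```sql
-- Merge and estimate in a single operation; no intermediate sketch is produced
SELECT hll_merge_count_agg(user_sketch) AS total_unique_users
FROM category_sketches;
```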
The result will be:
When to use this approach:
This approach is more streamlined and efficient when the goal is solely to get the final estimated count of distinct users.
It avoids creating an intermediate merged sketch, saving processing time and memory if the merged sketch is not needed for further analysis.
Flexibility: The hll_merge_agg + hll_estimate approach provides an intermediate sketch (merged_sketch) that can be reused, offering more flexibility for additional operations. In contrast, hll_merge_count_agg is a one-step solution that calculates the count without producing a reusable sketch.
Efficiency: If your goal is just the final distinct count, hll_merge_count_agg is more efficient because it combines merging and estimation in a single operation.
Reusability: If further operations (e.g., intersections, unions, or additional merges) are needed with the combined data, hll_merge_agg is preferred because it generates a reusable merged sketch.
Both approaches yield the same estimated result when the goal is only to calculate the total number of distinct users. However, hll_merge_agg is more versatile, while hll_merge_count_agg is optimized for simplicity and efficiency when no further operations are required. Your choice depends on whether you need the intermediate sketch for additional analysis.
As previously mentioned, sketch columns are immutable and cannot be modified after they are created. However, new rows containing sketch columns can be added, and aggregations can be performed on these rows to incorporate the new data into the analysis.
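For example, new sketch rows for a later batch of interactions could be appended as follows; the source table name user_interactions_new_batch is an assumption for illustration.

```sql
-- Append new sketch rows built from a later batch of interactions;
-- note the same precision parameter (10) as the original CREATE
INSERT INTO category_sketches
SELECT
    category,
    hll_build_agg(user_id, 10) AS user_sketch
FROM user_interactions_new_batch
GROUP BY category;
```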
The SELECT query will show multiple rows:
The aggregate count query shows the following; ensure that you use the GROUP BY clause, since you now have multiple rows with the same category name.
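A sketch of that aggregation:

```sql
-- Merge the multiple sketch rows per category and estimate distinct users
SELECT
    category,
    hll_merge_count_agg(user_sketch) AS unique_users
FROM category_sketches
GROUP BY category;
```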
Ensure that the bucketing configuration parameter (and therefore K, the number of buckets) remains consistent across all INSERT and CREATE queries. This is crucial because the merge and aggregation functions require all sketches to have the same number of buckets in order to work correctly. Inconsistent bucketing configurations will result in errors during these operations.
To effectively manage and track data updates when creating new rows with aggregates, it's important to include a timestamp column that records the day of processing. This timestamp ensures that each new block of data can be tied to its processing date, enabling better traceability, data auditing, and incremental updates. By recording the processing date, you can differentiate between historical and newly added data, making it easier to debug, analyze trends, and optimize queries. This approach is especially useful in scenarios where data arrives in batches or where late-arriving data needs to be incorporated incrementally.
You will need to rewrite the query the following way and execute it block by block:
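(In the sketch below, the table names category_sketches_dated and user_interactions_new_batch are illustrative assumptions.)

```sql
-- Block 1: initial build, with each row tagged with its processing date
CREATE TABLE category_sketches_dated AS
SELECT
    category,
    hll_build_agg(user_id, 10) AS user_sketch,
    CURRENT_DATE AS processing_date
FROM user_interactions
GROUP BY category;

-- Block 2: subsequent runs append new sketch rows tagged with their own processing date
INSERT INTO category_sketches_dated
SELECT
    category,
    hll_build_agg(user_id, 10) AS user_sketch,
    CURRENT_DATE AS processing_date
FROM user_interactions_new_batch
GROUP BY category;
```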
The results of the SELECT will be:
The aggregation will yield the same result: