STATSML 400: Data Distiller Basic Statistics Functions
Unlock the Power of Data: Master Every Key Statistical Function in Data Distiller
You need to ingest the CSV file below using the following tutorial:
To demonstrate the use of statistical functions in a marketing domain dataset, let's generate a dataset representing customer transactions and campaign performance. The dataset includes information about customer purchases, campaign engagement, and customer demographics. Then, I'll provide SQL examples for each statistical function along with a suitable use case.
When you ingest this dataset, make sure you name it marketing_campaign_data.
The dataset is related to a marketing campaign and contains the following columns:
customer_id: A unique identifier for each customer.
campaign_id: The identifier for the marketing campaign the customer participated in.
purchase_amount: The amount of money the customer spent during the campaign.
engagement_score: A score indicating the level of customer engagement in the campaign.
age: The age of the customer.
clv (Customer Lifetime Value): An estimated value of the customer's future spending.
(avg)
"Average" typically refers to the mean, which is the sum of all values in a dataset divided by the number of values. It represents a measure of central tendency, indicating the central point of a dataset. The mean provides a summary of the data by finding a single value that represents the overall level of all observations.
To calculate the mean, you add up all the data points and divide by the total number of points. For example, if you have a dataset of five numbers: 4, 8, 6, 5, and 7, the mean would be (4+8+6+5+7)/5=6.
The mean is useful for understanding the overall trend of numerical data, but it can be sensitive to outliers, which are values significantly higher or lower than the others. Unlike the median (middle value) or the mode (most frequent value), the mean takes into account all data points when summarizing the dataset.
Let us calculate the average purchase amount to assess the overall customer spend.
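One way to write this query against the dataset described above (the alias is illustrative):

```sql
-- Overall average spend per transaction across all campaigns
SELECT AVG(purchase_amount) AS avg_purchase_amount
FROM marketing_campaign_data;
```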
(sum)
Let us calculate the total customer lifetime value (CLV) for all customers engaged in a specific campaign.
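A minimal sketch; the campaign identifier CAMP_001 is a hypothetical placeholder for whatever campaign_id values exist in your ingested data:

```sql
-- Total customer lifetime value for one campaign
SELECT SUM(clv) AS total_clv
FROM marketing_campaign_data
WHERE campaign_id = 'CAMP_001';  -- hypothetical campaign identifier
```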
(min, max)
Let us identify the minimum and maximum customer engagement scores for a campaign to gauge campaign effectiveness.
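A sketch that computes both bounds per campaign:

```sql
-- Lowest and highest engagement score observed in each campaign
SELECT
  campaign_id,
  MIN(engagement_score) AS min_engagement_score,
  MAX(engagement_score) AS max_engagement_score
FROM marketing_campaign_data
GROUP BY campaign_id;
```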
(stddev/stddev_pop/stddev_samp)
In statistics, "standard deviation" is a measure of the dispersion or spread of a set of values around the mean (average). It indicates how much individual data points deviate, on average, from the mean. A low standard deviation means that the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread out over a wider range.
Standard deviation is calculated as the square root of the variance, which is the average of the squared deviations from the mean. It is commonly used in various fields to assess the variability or consistency of data. Unlike the mean (central value) and the median (middle value), the standard deviation focuses on the extent of variation or dispersion in the dataset.
Note that the stddev function is an alias for stddev_samp. It calculates the sample standard deviation, using N−1 as the divisor (where N is the total number of data points). This adjustment is known as Bessel's correction, and it accounts for the bias in estimating the population standard deviation from a sample. stddev_pop computes the population standard deviation. It uses N as the divisor, treating the data as the entire population.
The stddev/stddev_samp function is computed as:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$
Let us measure the variability in customer age to assess the diversity of your customer base:
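A minimal sketch (stddev is the sample standard deviation, the same as stddev_samp):

```sql
-- Spread of customer ages around the mean age
SELECT STDDEV(age) AS stddev_age
FROM marketing_campaign_data;
```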
The above sample computation is very useful when you need to construct a confidence interval for the mean of a dataset. To build one, you combine the sample standard deviation with a few other statistical elements:
Sample Mean: The average of the sample data, $\bar{x}$.
Standard Error of the Mean (SE): Calculated as $SE = \frac{s}{\sqrt{N}}$, where $s$ is the sample standard deviation.
Critical Value (z-score or t-score): Depends on the desired confidence level (e.g., 1.96 for 95% confidence if using the normal distribution).
The confidence interval is then calculated as:

$$\bar{x} \pm z \times SE$$

This gives the range in which the true population mean is likely to fall with the specified level of confidence.
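As an illustration, a 95% confidence interval for the mean age can be assembled directly in SQL from these pieces; the z-value of 1.96 assumes the normal distribution mentioned above:

```sql
-- 95% confidence interval for the mean customer age
SELECT
  AVG(age) AS mean_age,
  AVG(age) - 1.96 * STDDEV_SAMP(age) / SQRT(COUNT(age)) AS ci_lower,
  AVG(age) + 1.96 * STDDEV_SAMP(age) / SQRT(COUNT(age)) AS ci_upper
FROM marketing_campaign_data;
```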
(stddev_pop)
Whether a dataset is considered a population or a sample depends on the context and the scope of the analysis. When the dataset includes all possible observations relevant to the study, it is considered a population. For example, if you have the entire customer base of a company and want to analyze their spending habits, that dataset would be treated as the population. In this case, the population standard deviation (stddev_pop) is used because the data represents the entire group, and no adjustments are necessary.
On the other hand, a dataset is considered a sample when it is a subset of the population, meant to represent a larger group. For instance, if you survey 1,000 customers out of a larger group of 100,000 to understand general customer preferences, this dataset would be considered a sample. In such cases, the sample standard deviation (stddev_samp) is used because an adjustment is needed to account for the fact that the data is only a subset. This adjustment, known as Bessel's correction, compensates for potential bias when estimating the population characteristics from the sample.
The distinction between population and sample also depends on the context in which the data was collected. If the data was gathered through a survey, experiment, or sampling process, it is generally treated as a sample. Additionally, if the goal of the analysis is to make inferences about a larger group beyond the dataset itself, it should be considered a sample. Even if the dataset is large, it may still be a sample if it does not cover all possible observations. Conversely, small datasets can be populations if they include every relevant case. In practice, data is most often treated as a sample, as it is rare to have data for the entire population.
The formula is:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
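A sketch that computes the population and sample versions side by side so the two can be compared:

```sql
-- Population vs. sample standard deviation of customer age
SELECT
  STDDEV_POP(age)  AS population_stddev_age,
  STDDEV_SAMP(age) AS sample_stddev_age
FROM marketing_campaign_data;
```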
For age, the two results are very close: the dataset has enough data points that the sample and population standard deviations differ only in the least significant digits.
(variance/var_pop/var_samp)
The same principles apply to variance as they do to standard deviation, since variance is simply the square of the standard deviation. variance is the same as var_samp. The formulas and assumptions remain the same as previously explained. In most cases, you will be using variance (or var_samp).
Our use case will be to determine the variance in customer engagement scores to see how consistently customers interact with campaigns.
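A minimal sketch (variance is the sample variance, the same as var_samp):

```sql
-- Consistency of engagement scores across customers
SELECT VARIANCE(engagement_score) AS engagement_score_variance
FROM marketing_campaign_data;
```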
(median)
"Median" refers to the middle value of a dataset when the numbers are arranged in ascending or descending order. It represents the point at which half of the data falls below and half falls above. If the dataset has an odd number of observations, the median is the middle value. If the dataset has an even number of observations, the median is the average of the two middle values.
The median is particularly useful for numerical data, especially when the data is skewed or contains outliers, as it is less affected by extreme values than the mean (average). In contrast to the mode (most frequent value) and the mean, the median provides a measure of central tendency that indicates the midpoint of the dataset.
Let us calculate the median purchase amount to understand the central spending tendency of customers.
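A sketch using the median aggregate discussed in this section:

```sql
-- Central spending tendency, robust to outliers
SELECT MEDIAN(purchase_amount) AS median_purchase_amount
FROM marketing_campaign_data;
```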
(mod)
"Mod" typically refers to the mode, which is the value that appears most frequently in a dataset. It represents the data point or category that occurs with the highest frequency. The mode can be used for both numerical and categorical data. For example, in a dataset of people's favorite ice cream flavors, the mode would be the flavor that the largest number of people prefer. In a numerical dataset, it would be the number that appears most often.
In contrast to the mean (average) and median (middle value), the mode focuses on the most common value in the dataset.
Distribute customers evenly into 3 random marketing groups for campaign analysis.
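One way to read this use case is with the modulo operation; a sketch assuming customer_id is numeric (if it is a string, hash it to a number first):

```sql
-- Assign each customer to group 0, 1, or 2 based on the remainder of customer_id / 3
SELECT
  customer_id,
  MOD(customer_id, 3) AS marketing_group
FROM marketing_campaign_data;
```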
(corr)
Correlation measures the strength and direction of a linear relationship between two variables. It is expressed as a correlation coefficient, denoted by r, which ranges from -1 to 1. A correlation coefficient close to 1 indicates a strong positive linear relationship, meaning that as one variable increases, the other tends to increase as well. Conversely, a correlation coefficient close to -1 suggests a strong negative linear relationship, where an increase in one variable corresponds to a decrease in the other.
When the correlation coefficient is close to 0, it indicates little to no linear relationship between the variables; changes in one variable do not reliably predict changes in the other. Correlation does not imply causation, meaning that even if two variables are correlated, it does not necessarily mean that one variable causes the other to change. Correlation is useful for identifying relationships in data, and it is commonly used in fields like finance, psychology, and social sciences to uncover patterns and make predictions based on observed trends.
Let us check if there is a correlation between customer age and their engagement score with campaigns.
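A minimal sketch:

```sql
-- Linear relationship between age and engagement
SELECT CORR(age, engagement_score) AS age_engagement_correlation
FROM marketing_campaign_data;
```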
For our use case, the result of approximately r = 0.0067 falls into the "no correlation" category, indicating that there is essentially no linear relationship between age and engagement score in our dataset.
Correlation is more commonly used than covariance because it standardizes the relationship between variables, making comparisons easier. However, covariance is a key component in the calculation of correlation and provides valuable directional insights into how two variables move together.
(covar_pop/covar_samp)
Covariance measures the degree to which two variables change together. It indicates the direction of the linear relationship between the variables. If the covariance is positive, it means that as one variable increases, the other tends to increase as well, indicating a positive relationship. Conversely, a negative covariance suggests that as one variable increases, the other tends to decrease, indicating an inverse relationship.
The magnitude of the covariance value indicates the strength of the relationship; however, unlike correlation, it does not provide a standardized measure. This means that the actual value of covariance can be difficult to interpret because it depends on the scale of the variables. Covariance is used in various fields, such as finance, where it helps in understanding how different assets move together, which is useful for portfolio diversification. While it indicates the direction of a relationship, it does not measure the strength or causality between the variables.
Covariance becomes a correlation when it is standardized. The correlation coefficient is essentially a scaled version of covariance, which adjusts for the variability (standard deviation) of each variable, making it a unitless measure. This allows for a direct comparison of relationships regardless of the original scales of the variables.
By dividing the covariance by the product of the standard deviations of X and Y, you normalize the value, bringing it into the range of -1 to 1:

$$r = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
Let us compute the covariance between purchase amount and engagement score to see if higher engagement leads to higher spending.
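A sketch using the sample covariance:

```sql
-- Do purchase amount and engagement score move together?
SELECT COVAR_SAMP(purchase_amount, engagement_score) AS purchase_engagement_covariance
FROM marketing_campaign_data;
```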
Just like the standard deviation functions, there are population and sample versions, with the following formulas:

Population Covariance:

$$\mathrm{cov}_{pop}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N}$$

Sample Covariance:

$$\mathrm{cov}_{samp}(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

In each case, the respective mean of X and of Y is subtracted from every value before the products are summed.
Let us calculate the population covariance between customer age and lifetime value (CLV) to understand if older customers tend to have higher value.
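A minimal sketch:

```sql
-- Population covariance between age and customer lifetime value
SELECT COVAR_POP(age, clv) AS age_clv_covariance
FROM marketing_campaign_data;
```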
Remember:
The magnitude of covariance is influenced by the units of the variables, so the absolute value is not directly indicative of the strength of the relationship.
Unlike correlation, covariance is not standardized, meaning it is not constrained within a fixed range (such as -1 to 1), making direct comparisons across datasets less meaningful without normalization.
(skewness)
Skewness measures the asymmetry of a dataset's distribution. It indicates whether the data points are spread more towards one side of the mean, resulting in a non-symmetric shape. Skewness can be positive, negative, or zero, depending on the direction of the asymmetry:
Positive Skewness (Right-Skewed): When skewness is greater than zero, the distribution has a long tail on the right side. This means that there are more values concentrated on the left, with a few larger values stretching the distribution to the right.
Negative Skewness (Left-Skewed): When skewness is less than zero, the distribution has a long tail on the left side. In this case, more values are concentrated on the right, with a few smaller values stretching the distribution to the left.
Zero Skewness (Symmetrical Distribution): When skewness is approximately zero, the distribution is symmetric, with data points evenly distributed on both sides of the mean. A perfectly symmetric distribution, such as a normal distribution, has zero skewness.
Skewness helps to identify the extent and direction of deviation from a normal distribution, and it is useful for understanding the nature of the data, particularly in fields like finance, economics, and quality control.
Let us determine if the distribution of purchase amounts is skewed towards lower or higher values.
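A minimal sketch of the query behind the result discussed next:

```sql
-- Asymmetry of the purchase amount distribution
SELECT SKEWNESS(purchase_amount) AS purchase_amount_skewness
FROM marketing_campaign_data;
```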
The result of the skewness calculation for purchase_amount is approximately -0.00015. This value is very close to zero, which indicates that the distribution of purchase_amount is nearly symmetric.
(kurtosis)
Kurtosis measures the "tailedness" or the sharpness of the peak of a dataset's distribution. It indicates how much of the data is concentrated in the tails and the peak compared to a normal distribution. Kurtosis helps to understand the distribution's shape, particularly the presence of outliers.
Let us assess the "peakedness" of customer engagement scores to understand if most scores are concentrated around the mean.
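A minimal sketch:

```sql
-- Tailedness of the engagement score distribution
SELECT KURTOSIS(engagement_score) AS engagement_score_kurtosis
FROM marketing_campaign_data;
```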
The result of the kurtosis calculation for engagement_score is approximately -1.1989. This negative value indicates that the distribution is platykurtic: the engagement_score distribution has fewer extreme values (outliers) than a normal distribution, and the data points are more spread out across the range, with less concentration around the peak.
In a normal distribution, the data is symmetrically spread, with most values clustering around the mean, and the frequency of extreme values decreases as you move away from the mean. When a distribution has no significant excess in outliers, it means that the occurrence of data points far from the center is what you would expect based on a normal distribution, with no additional concentration of extreme values in the tails.
(count)
Let us count the number of customers engaged in each marketing campaign to understand campaign reach.
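A sketch counting customers per campaign:

```sql
-- Campaign reach: number of customer records per campaign
SELECT
  campaign_id,
  COUNT(customer_id) AS customers_engaged
FROM marketing_campaign_data
GROUP BY campaign_id;
```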
(count_if)
Let us count how many customers have spent more than $200 in each campaign to identify high spenders.
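A sketch using the $200 threshold from the use case:

```sql
-- High spenders (> $200) per campaign
SELECT
  campaign_id,
  COUNT_IF(purchase_amount > 200) AS high_spenders
FROM marketing_campaign_data
GROUP BY campaign_id;
```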
(approx_count_distinct)
The approx_count_distinct function offers significant advantages over a traditional count(distinct ...) query, especially when working with large datasets. It employs algorithms like HyperLogLog to estimate the number of distinct values, providing a high degree of accuracy while being much faster and more efficient than an exact distinct count. This speed is achieved because approx_count_distinct does not need to store and sort all unique values, making it particularly useful in big data environments where datasets may be too large to fit into memory. Additionally, the function consumes less memory by using probabilistic methods, enabling distinct counting on massive datasets without overwhelming system resources. As a result, approx_count_distinct scales well with increasing data size, making it an ideal choice for distributed computing platforms where performance and scalability are critical.
Let us estimate the number of unique customers engaged with a specific marketing campaign.
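A sketch; CAMP_001 is again a hypothetical campaign identifier:

```sql
-- Estimated number of unique customers in one campaign
SELECT APPROX_COUNT_DISTINCT(customer_id) AS approx_unique_customers
FROM marketing_campaign_data
WHERE campaign_id = 'CAMP_001';  -- hypothetical campaign identifier
```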
(rand/random)
The rand() function is a mathematical function used to generate a random floating-point number between 0 (inclusive) and 1 (exclusive) from a uniform distribution. Each time rand() is called, it produces a different pseudo-random number, simulating randomness. However, because it is based on an algorithm rather than true randomness, the sequence of numbers generated is actually deterministic if the initial starting point (seed) is known.
Suppose you want to randomly assign customers to different marketing test groups for A/B testing.
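A sketch of one way to do this with rand(); the 50/50 split and group labels are illustrative:

```sql
-- Randomly assign each customer to an A/B test group
SELECT
  customer_id,
  CASE WHEN RAND() < 0.5 THEN 'Group A' ELSE 'Group B' END AS test_group
FROM marketing_campaign_data;
```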
In this example, customers are assigned randomly to Group A or Group B based on the random value generated by rand().
If you want to use a seed for predictability, try this:
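A sketch; the seed value 42 is arbitrary:

```sql
-- Seeded random assignment so the split can be regenerated
SELECT
  customer_id,
  CASE WHEN RAND(42) < 0.5 THEN 'Group A' ELSE 'Group B' END AS test_group
FROM marketing_campaign_data;
```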
random() is the same as rand(). Both functions generate random numbers uniformly distributed between 0 (inclusive) and 1 (exclusive). They are interchangeable and serve the same purpose for creating random values in this range.
(randn)
The randn() function generates random numbers following a normal (Gaussian) distribution, with a mean of 0 and a standard deviation of 1. Unlike rand(), the values produced by randn() are not limited to a specific range and can be any real number, though most values will fall within three standard deviations of the mean. This function is particularly useful for modeling data that follows a bell-curve shape, where most observations cluster around the central value, such as natural phenomena, measurement errors, or financial returns.
Let us simulate customer engagement scores or create noise in the data to make the model more robust for training.
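A sketch; the scaling factor of 5 is an arbitrary choice for how much noise to inject:

```sql
-- Add normally distributed noise to the engagement scores
SELECT
  customer_id,
  engagement_score,
  engagement_score + RANDN() * 5 AS noisy_engagement_score
FROM marketing_campaign_data;
```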
In this case, the randn() function adds normally distributed noise to the customer engagement scores, simulating potential fluctuations in real-world data.
If you want to use a seed for predictability:
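A sketch; again, 42 is just an arbitrary seed:

```sql
-- Seeded Gaussian noise so the simulation can be regenerated
SELECT
  customer_id,
  engagement_score + RANDN(42) * 5 AS noisy_engagement_score
FROM marketing_campaign_data;
```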
(rank)
Let us rank customers by their purchase amount within each campaign to identify top spenders.
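A sketch matching the description that follows:

```sql
-- Rank customers by spend within each campaign (1 = top spender)
SELECT
  customer_id,
  campaign_id,
  purchase_amount,
  RANK() OVER (
    PARTITION BY campaign_id
    ORDER BY purchase_amount DESC
  ) AS purchase_rank
FROM marketing_campaign_data;
```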
The query retrieves data from the marketing_campaign_data table, selecting the customer_id, campaign_id, and purchase_amount columns, along with a calculated rank. The rank() function is used to assign a ranking to each row within each campaign_id group (using PARTITION BY campaign_id). The rows are ordered by purchase_amount in descending order (ORDER BY purchase_amount DESC), meaning the highest purchase_amount within each campaign gets a rank of 1, the second highest gets a rank of 2, and so on. This approach allows for ranking customers based on their purchase amounts within each specific campaign, enabling comparisons and analysis of customer spending behavior across different marketing campaigns.
(first)
Find the first customer by engagement score in each campaign to track early adopters.
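One possible reading, sketched with first() as a window function; it interprets "first by engagement score" as the highest-scoring customer per campaign, so adjust the ordering to your own definition:

```sql
-- First (highest-engagement) customer per campaign
SELECT DISTINCT
  campaign_id,
  FIRST(customer_id) OVER (
    PARTITION BY campaign_id
    ORDER BY engagement_score DESC
  ) AS first_customer
FROM marketing_campaign_data;
```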
(last)
Identify the last customer to make a purchase in each campaign to track lagging engagement.
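A sketch with last() over a window; the dataset has no timestamp column, so purchase_amount ordering stands in for purchase order here (with a real purchase timestamp you would order by that instead):

```sql
-- Last customer per campaign under the chosen ordering
SELECT DISTINCT
  campaign_id,
  LAST(customer_id) OVER (
    PARTITION BY campaign_id
    ORDER BY purchase_amount
    -- explicit frame so LAST() sees the whole partition, not just rows up to the current one
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_customer
FROM marketing_campaign_data;
```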
(percent_rank)
Calculate the percent rank of customers based on their purchase amount within each campaign to categorize customer spending.
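A minimal sketch:

```sql
-- Relative spending position (0 to 1) of each customer within their campaign
SELECT
  customer_id,
  campaign_id,
  purchase_amount,
  PERCENT_RANK() OVER (
    PARTITION BY campaign_id
    ORDER BY purchase_amount
  ) AS spend_percent_rank
FROM marketing_campaign_data;
```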
(percentile or percentile_approx)
A percentile is a measure that indicates the value below which a given percentage of observations in a dataset falls. For example, the 25th percentile is the value below which 25% of the data points lie, while the 90th percentile is the value below which 90% of the data points fall. Percentiles help in understanding the distribution of data by dividing it into 100 equal parts.
Percentiles are commonly used in data analysis to assess the relative standing of individual observations within a dataset. They are particularly useful for identifying outliers, comparing different data sets, or summarizing large amounts of data. In educational testing, for example, if a student's score is in the 85th percentile, it means they scored higher than 85% of the other students. Percentiles provide a way to interpret data in terms of rank and position rather than exact values.
Both percentile and percentile_approx compute approximate percentiles, and using them in a query provides a significant performance advantage, especially when working with large datasets. Unlike exact percentile calculations, both estimate the percentile using algorithms that avoid sorting all of the data. This approach results in faster execution and lower memory usage, making it highly suitable for big data environments where datasets can be massive. The functions also scale efficiently, allowing them to handle very large datasets seamlessly. Although they provide an approximate value rather than an exact percentile, the trade-off is often worthwhile for the speed and resource efficiency they offer.
Let us calculate the 90th percentile of customer engagement scores to identify top-performing customers who are highly engaged with a marketing campaign.
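A sketch assuming the aggregate form percentile(column, fraction):

```sql
-- Engagement score below which 90% of customers fall
SELECT PERCENTILE(engagement_score, 0.90) AS p90_engagement_score
FROM marketing_campaign_data;
```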
(percentile_approx)
Let us calculate the approximate 90th percentile of customer CLV to understand high-value customer thresholds.
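A minimal sketch:

```sql
-- Approximate CLV threshold for the top 10% of customers
SELECT PERCENTILE_APPROX(clv, 0.90) AS approx_p90_clv
FROM marketing_campaign_data;
```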
(percentile_cont)
A continuous percentile is a measure used to determine the value below which a certain percentage of the data falls, based on a continuous interpolation of the data points. In cases where the specified percentile does not correspond exactly to a data point in the dataset, the continuous percentile calculates an interpolated value between the two nearest data points. This provides a more precise estimate of the percentile, especially when dealing with small datasets or when the data distribution is not uniform.
For example, if the 75th percentile falls between two data points, the continuous percentile will estimate a value that represents a weighted average between these points, rather than just picking the closest one. This approach gives a more accurate representation of the distribution, as it takes into account the relative positions of data points rather than simply using discrete ranks. Continuous percentiles are often used in statistical analysis to better understand the distribution of data, especially in situations where the exact percentile may lie between observed values.
The continuous percentile function calculates the exact percentile value by interpolating between the two nearest data points if the specified percentile falls between them. It gives a precise answer by determining a value that may not be in the original dataset but represents a point within the ordered range. This function is used when an exact, interpolated percentile value is needed.
Let us calculate the 75th percentile of Customer Lifetime Value (CLV) to understand the top 25% most valuable customers.
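A sketch assuming the WITHIN GROUP (ORDER BY ...) form:

```sql
-- Interpolated 75th percentile of customer lifetime value
SELECT PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY clv) AS p75_clv
FROM marketing_campaign_data;
```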
(percentile_disc)
A discrete percentile is a measure used to determine the value below which a specified percentage of the data falls, based on actual data points in the dataset. In contrast to a continuous percentile, which interpolates between data points, a discrete percentile selects the closest actual data value that corresponds to the given percentile rank.
For example, if you want to find the 75th percentile in a discrete approach, the function will choose the value at or just above the rank where 75% of the data points lie, without performing any interpolation. This means that the output will always be one of the actual values from the dataset, making it a straightforward representation of the distribution based on the observed data. Discrete percentiles are useful when the goal is to work with specific data values rather than estimated positions, such as in ranking scenarios or when dealing with ordinal data where interpolation might not be meaningful.
The discrete percentile function calculates the exact percentile value based on the actual data points, without any interpolation. It selects the closest actual data value corresponding to the specified percentile, ensuring that the result is one of the observed values in the dataset. This function is suitable for cases where only actual data values are meaningful, such as ordinal data.
Let us calculate the 90th percentile of engagement scores to find the actual score that separates the top 10% of most engaged customers.
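A sketch assuming the same WITHIN GROUP form:

```sql
-- Actual observed engagement score separating the top 10% of customers
SELECT PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY engagement_score) AS p90_engagement_score
FROM marketing_campaign_data;
```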
(histogram_numeric)
Create a histogram of customer purchase amounts to analyze spending patterns.
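A sketch using the five buckets referenced below:

```sql
-- Distribution of purchase amounts across 5 buckets (returns an array of {x, y} pairs)
SELECT HISTOGRAM_NUMERIC(purchase_amount, 5) AS purchase_amount_histogram
FROM marketing_campaign_data;
```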
This function would return the distribution of customer purchases across 5 buckets, which can be used to create visualizations or perform further analysis.
In the histogram data returned by the query, the x and y values represent the following:

x (Bucket Range): The midpoint or representative value of each bucket in the histogram. In this case, the purchase amounts have been divided into five buckets, so each x value represents the center of a range of purchase amounts.

y (Frequency): The number of occurrences (or count) of purchase amounts that fall within each corresponding bucket. This tells you how many purchase transactions fall within the range represented by the x value.

So, each data point in the JSON array indicates how many purchases (y) are within a specific range of amounts centered around x. Together, these values create a histogram showing the distribution of purchase amounts across five intervals.

In the above algorithm, the buckets are determined based on the distribution of the data, not just by evenly dividing the range of values. This means that if certain ranges of purchase amounts have more data points, the bucket widths may be adjusted to capture the distribution more accurately, resulting in non-equidistant x values.