STATSML 602: Techniques for Bot Detection in Data Distiller
Turn clicks into insights: Discover how SQL can reveal bot behavior
Download the following datasets:
Ingest them by following the tutorial for each:
Make sure you have read:
Bots are software applications designed to perform automated tasks over the internet, often at a high frequency and with minimal human intervention. They can be used for a variety of purposes, ranging from beneficial tasks like indexing websites for search engines to malicious activities such as spamming, scraping content, or launching denial-of-service attacks. Bots are typically programmed to mimic human behavior and can be controlled remotely, allowing them to interact with websites, applications, and services just like a human user would, albeit at a much faster and more repetitive pace.
Bots are implemented using scripts or programs that automate specific actions, often through APIs (Application Programming Interfaces) or web automation frameworks like Selenium. Developers use programming languages such as Python, JavaScript, or Java to write bot scripts that simulate clicks, form submissions, or page requests. For complex tasks, bots may incorporate machine learning algorithms to enhance their ability to mimic human-like interactions, avoiding detection by bot-filtering systems. Bot networks, or "botnets," are collections of bots controlled from a central server, enabling large-scale automated activity. While bots are essential for applications like search engines and customer service chatbots, their misuse necessitates robust detection and filtering mechanisms to protect the integrity of online platforms and data.
Bots often produce high-frequency, repetitive actions, while normal users generally produce fewer actions at irregular intervals.
Bot filtering is essential to ensure the integrity and quality of web traffic data. Bots, or non-human interactions, can inflate metrics like page views, clicks, and sessions, leading to inaccurate analytics and poor decision-making. In Adobe Experience Platform, bot filtering can be implemented using SQL within the Query Service, enabling automated detection and filtering of bot-like activity from clickstream data.
Allowing bot activity to infiltrate the Real-Time Customer Data Platform (CDP) or Customer Journey Analytics can significantly degrade the quality and reliability of insights. Bots can generate large volumes of fake interactions, diluting the data used to segment audiences, personalize experiences, and trigger automated actions. This contamination can lead to inaccurate customer profiles, where bots are mistakenly treated as real customers, impacting everything from marketing spend to product recommendations.
Moreover, inflated metrics from bot traffic can lead to incorrect entitlement calculations, potentially resulting in over-licensing issues, which affects cost efficiency. In environments where businesses are charged based on active users or usage volume, bot-induced data can escalate costs, consuming resources allocated for real customers. Overall, bot contamination in a CDP undermines the platform's ability to deliver accurate, actionable insights, compromising the effectiveness of customer engagement strategies and reducing return on investment in marketing and analytics platforms.
However, keeping a copy of bot data on the data lake can be beneficial for several reasons. First, retaining bot data enables teams to continuously refine and improve bot-detection algorithms. By analyzing historical bot behavior, data scientists and engineers can identify evolving patterns and adapt filtering rules, which can enhance future bot filtering and maintain data integrity in real-time analytics environments. Additionally, bot data can serve as a valuable training dataset for machine learning models, which can distinguish between bot and human behavior more accurately over time. For security and compliance teams, archived bot data can provide insights into potential malicious activities, allowing for faster responses to threats and better protection measures. Storing bot data on the data lake also supports compliance, enabling organizations to audit and track how they manage non-human interactions if required. Therefore, while it’s important to filter bot data from production datasets to maintain accurate customer insights, keeping an archived copy on the data lake provides value across analytics, security, and compliance domains.
Bot filtering, anomaly detection, and fraud detection share the common goal of identifying unusual patterns in data, but each serves a distinct purpose. Bot filtering focuses on distinguishing and removing non-human, automated interactions from datasets to ensure that analytics accurately reflect real user behavior. Anomaly detection is a broader process aimed at identifying any unusual or unexpected data points or trends, which may indicate system issues, data errors, or emerging trends. Fraud detection is a specialized type of anomaly detection, specifically designed to identify suspicious and potentially harmful behaviors, such as fraudulent transactions or malicious activities, by detecting complex patterns that are often subtle and well-hidden. While bot filtering primarily relies on rules and thresholds to detect high-frequency, repetitive behaviors typical of bots, anomaly and fraud detection increasingly leverage machine learning models and sophisticated pattern recognition techniques to uncover irregularities. Each method is essential in maintaining data integrity, safeguarding against threats, and enabling more reliable insights across various domains.
A decision tree is a supervised machine learning algorithm used for classification and regression tasks. It operates by recursively splitting data into subsets based on the feature values that provide the best separation. Each internal node represents a decision on a feature, each branch represents the outcome of the decision, and each leaf node represents a final class label or prediction. The algorithm aims to find the most informative features to split the data, maximizing the purity (homogeneity) of the resulting subsets. Popular metrics for these splits include Gini impurity, entropy, and information gain.
Key Characteristics of Decision Trees:
Simple and Intuitive: Easy to visualize and interpret.
Handles Nonlinear Data: Captures complex relationships between features and labels without requiring feature scaling.
Rule-Based: The hierarchical structure maps directly to logical rules, making them interpretable for domain-specific tasks.
Bot detection typically involves identifying patterns of behavior that distinguish bots from real users. Decision trees are well-suited for this task for several reasons:
Ability to Handle Mixed Data: Bot detection often involves both numerical features (e.g., counts of actions per interval) and categorical features (e.g., action types). Decision trees can natively handle both types of data without requiring feature transformations.
Explainability: A decision tree provides clear, rule-based decisions that can be interpreted easily. For example, a rule like "If actions in 1 minute > 60 AND actions in 30 minutes < 500, then it's a bot" aligns with how bots exhibit distinct patterns in clickstream data.
Effective Feature Selection: In bot detection, not all features are equally important. Decision trees prioritize the most informative features, such as the frequency and intensity of actions. This makes them efficient for identifying bots based on behavioral thresholds.
Handles Nonlinear Relationships: Bots often exhibit nonlinear patterns in their behavior, such as a sudden spike in activity over a short interval. Decision trees can effectively model such relationships, unlike linear models that assume a straight-line relationship.
Adaptability to Imbalanced Data: While imbalanced data is a challenge for most algorithms, decision trees can mitigate this by prioritizing splits that maximize purity (e.g., separating bots from non-bots).
Suitability for Rule-Based Domains: In contexts like bot detection, domain experts often have predefined rules or thresholds. Decision trees align naturally with such rule-based systems, allowing experts to validate or refine the model.
For a dataset with features like:
count_1_min: Actions in 1-minute intervals.
count_5_mins: Actions in 5-minute intervals.
count_30_mins: Actions in 30-minute intervals.
A decision tree might generate rules like:
If count_1_min > 60 and count_5_mins > 200 → Bot.
If count_1_min < 20 and count_30_mins > 700 → Bot.
Else → Non-Bot.
Such thresholds are highly interpretable and directly actionable, making decision trees an ideal choice for detecting anomalous bot-like behavior in user activity logs.
The feature strategy for bot detection involves aggregating click activity across different time intervals to capture patterns indicative of non-human behavior. Specifically, the data is grouped and counted over one-minute, five-minute, and thirty-minute intervals, which helps identify high-frequency click patterns over both short and extended durations. In this approach, users whose click counts exceed interval-specific thresholds (60 clicks in one minute, 300 clicks in five minutes, and 1800 clicks in 30 minutes) are flagged as potential bots. By structuring the data this way, we can detect bursts of activity that exceed typical human behavior, regardless of the interval length. The results are stored in a nested dataframe format, with each user's activity count grouped by timestamp, user ID, and webpage name, providing a rich dataset for training and evaluating machine learning models. This multi-interval aggregation captures nuanced bot activity patterns that a single static threshold would miss, making bot detection more accurate and adaptable.
First, we’ll write a simple query to identify all IDs that have generated 50 or more events within a 60-second interval, or one minute.
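The exact query depends on how your data was ingested; here is a minimal sketch, assuming a clickstream table named clickstream_events with an id column (user identifier) and a timestamp column:

```sql
-- Count events per user within each 1-minute bucket and keep the users
-- whose count reaches 50 in any single bucket.
SELECT
    id,
    DATE_TRUNC('MINUTE', timestamp) AS minute_interval,
    COUNT(*) AS event_count
FROM clickstream_events
GROUP BY id, DATE_TRUNC('MINUTE', timestamp)
HAVING COUNT(*) >= 50
ORDER BY event_count DESC;
```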
The results will be:
If you have ingested Adobe Analytics data as in the tutorial here, the above query would be very similar to what you would execute. Here is the query that you would have run:
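A hedged sketch of that variant; the dataset name (adobe_analytics_events) and the identity path below are assumptions that depend on your Analytics XDM schema:

```sql
-- Same 1-minute aggregation against an ingested Adobe Analytics dataset.
-- endUserIDs._experience.mcid.id is one common identity path; substitute
-- the identity field your schema actually uses.
SELECT
    endUserIDs._experience.mcid.id AS id,
    DATE_TRUNC('MINUTE', timestamp) AS minute_interval,
    COUNT(*) AS event_count
FROM adobe_analytics_events
GROUP BY 1, 2
HAVING COUNT(*) >= 50
ORDER BY event_count DESC;
```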
The result would be:
The 1-minute, 5-minute, and 30-minute count features provide valuable insights into short-term, mid-term, and longer-term activity patterns, which are useful for identifying bot-like behavior. Bots often exhibit high-frequency actions in short periods, while genuine users are likely to have lower and more varied activity over time. However, these time-based counts alone might not fully capture the nuances of bot behavior. Here are some additional features that could enhance the model's ability to detect bots:
Unique Action Types per Interval: Count the unique actions (e.g., clicks, page views, add-to-cart) performed in each interval. Bots may perform repetitive actions, so a low number of unique actions per interval could be a strong bot indicator.
Average Time Between Actions: Calculate the average time gap between consecutive actions for each user. Bots tend to have very consistent or minimal time gaps between actions, while human users have more variability.
Standard Deviation of Action Counts Across Intervals: Instead of just using the maximum counts, analyze the standard deviation of action counts within each interval type (1-minute, 5-minute, 30-minute). Low variability may indicate bot behavior, as bots often have more uniform activity patterns.
Session Duration: Measure the time between the first and last action within a session. Bots may have unusually long or short sessions compared to typical user sessions.
Action Sequence Patterns: Look for specific sequences of actions, like "pageView -> addToCart -> purchase" or repetitive patterns (e.g., repeated "click" actions). Certain sequences or repetitions can be strong indicators of scripted bot behavior.
Frequency of Rare Actions: Identify rare actions (e.g., "logout" or "purchase") and check if the frequency of these actions is unusually high. Bots might disproportionately use or avoid certain actions that are less frequent among typical users.
Clickstream Entropy: Calculate entropy on the sequence of actions for each user. High entropy (more randomness) could indicate a human user, while low entropy (predictable patterns) might suggest automated behavior. A sketch of this computation follows this list.
Time of Day Patterns: Track actions by time of day. Bots might operate at times when human activity is typically lower, such as very late at night or early morning.
Location or IP Address: If the dataset includes location or IP data, unusual patterns like multiple user IDs with the same IP or multiple sessions from the same location could be signs of bot activity.
Number of Sessions per User: If available, the number of separate sessions per user within a day or week could indicate bots, as bots might operate continuously or have unusually high session counts.
Integrating these features into the model could improve its ability to distinguish bots from genuine users by adding context around activity patterns, user behavior, and usage variations. They would also help address any blind spots in the current model, especially where bot behavior is more complex than just high frequency within short time intervals.
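To make one of these concrete, here is a sketch of the clickstream-entropy feature mentioned above, assuming clickstream_events also carries an action column naming each event type:

```sql
-- Shannon entropy of each user's action distribution; low entropy
-- (highly repetitive actions) is a possible bot signal.
WITH action_probs AS (
    SELECT
        id,
        action,
        COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY id) AS p
    FROM clickstream_events
    GROUP BY id, action
)
SELECT
    id,
    -SUM(p * LOG2(p)) AS action_entropy
FROM action_probs
GROUP BY id
ORDER BY action_entropy ASC;
```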
Let us use a combination of patterns and thresholds across the three different time intervals (count_1_min, count_5_mins, and count_30_mins). Here are the complex rules we will implement:
Burst Pattern: A bot-like burst pattern that has high activity over shorter intervals and moderate activity over longer intervals.
Sustained High Activity: Bots that sustain high activity across all intervals.
Short-Term Peaks with Long-Term Low Activity: Bots that peak within short intervals but have lower overall long-term activity, indicating possible bursty or periodic automation.
Short and Medium Bursts with Occasional High Long-Term Activity: Users with moderate short- and medium-term activity but extreme spikes over longer intervals, which could indicate periodic scripted automation.
Fluctuating Activity: Bots that exhibit very high activity in one interval but comparatively low activity in others. This can capture erratic or adaptive bots.
Regular Intervals with Low Intensity: Bots that perform fewer actions but consistently over set intervals, indicating periodic scraping or data polling.
Continuous Background Activity: Bots that run continuously but without peaks in short bursts, which might indicate a less aggressive but consistent bot process.
Now let us create the feature set:
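The query below is a sketch under the same clickstream_events assumption: it computes each user's peak count per interval size and labels rows with simplified versions of the rules above. All thresholds are illustrative placeholders:

```sql
-- Peak activity per user at 1-, 5-, and 30-minute granularity, labeled
-- with simplified burst / sustained / background-activity rules.
WITH per_1 AS (
    SELECT id, FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS bucket, COUNT(*) AS cnt
    FROM clickstream_events GROUP BY 1, 2
),
per_5 AS (
    SELECT id, FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS bucket, COUNT(*) AS cnt
    FROM clickstream_events GROUP BY 1, 2
),
per_30 AS (
    SELECT id, FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS bucket, COUNT(*) AS cnt
    FROM clickstream_events GROUP BY 1, 2
),
peaks AS (
    SELECT
        p1.id,
        MAX(p1.cnt)  AS max_count_1_min,
        MAX(p5.cnt)  AS max_count_5_mins,
        MAX(p30.cnt) AS max_count_30_mins
    FROM per_1 p1
    JOIN per_5 p5 ON p1.id = p5.id
    JOIN per_30 p30 ON p1.id = p30.id
    GROUP BY p1.id
)
SELECT
    id,
    max_count_1_min,
    max_count_5_mins,
    max_count_30_mins,
    CASE
        -- Burst pattern: intense short-term, moderate long-term activity
        WHEN max_count_1_min > 60 AND max_count_30_mins < 500 THEN 1
        -- Sustained high activity across all intervals
        WHEN max_count_1_min > 60 AND max_count_5_mins > 300
             AND max_count_30_mins > 1800 THEN 1
        -- Continuous background activity: no short bursts, high long-term volume
        WHEN max_count_1_min < 20 AND max_count_30_mins > 700 THEN 1
        ELSE 0
    END AS isBot
FROM peaks;
```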
This produces the result:
The three time-based aggregation features used in this bot detection query (max_count_1_min, max_count_5_mins, and max_count_30_mins) each serve a unique purpose in capturing different patterns of potential bot behavior:
1-Minute Count (max_count_1_min): This feature reflects the highest count of actions a user performs within any single 1-minute interval. High action counts in this short timeframe often indicate rapid, automated interactions that exceed typical human behavior. Bots that operate in quick bursts will tend to show elevated values here, helping to detect sudden spikes in activity.
5-Minute Count (max_count_5_mins): This feature captures mid-term activity by aggregating user actions over a 5-minute period. Bots may not always maintain extreme activity levels in short intervals, but they may show persistent, above-average activity across mid-term intervals. The max_count_5_mins feature helps detect bots that modulate their activity, slowing down slightly to mimic human behavior but still maintaining an overall high rate of interaction compared to genuine users.
30-Minute Count (max_count_30_mins): The 30-minute interval allows for detecting long-term activity patterns. Bots, especially those performing continuous or background tasks, may exhibit sustained interaction levels over longer periods. This feature helps to identify scripts or automated processes that maintain a steady, high frequency of activity over time, which would be uncommon for human users.
Each of these features—1-minute, 5-minute, and 30-minute action counts—provides a view into distinct time-based behavioral patterns that help distinguish bots from human users. By combining these features and applying complex detection rules, the model can capture a wider variety of bot-like behaviors, from rapid bursts to prolonged engagement, making it more robust against different types of automated interactions.
To compute the ratio of bots to non-bots in the above result, you can use a simple SQL query that calculates the count of bots and non-bots, then computes their ratio. Here’s how to do it:
Count Bots and Non-Bots: Use a CASE statement to classify each user as a bot or non-bot based on the isBot flag.
Calculate the Ratio: Use the bot and non-bot counts to calculate the bot-to-non-bot ratio.
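A sketch of that query, assuming the labeled feature set was saved as a table named bot_features:

```sql
-- Count each class via CASE, then divide to get the bot-to-non-bot ratio.
SELECT
    SUM(CASE WHEN isBot = 1 THEN 1 ELSE 0 END) AS bot_count,
    SUM(CASE WHEN isBot = 0 THEN 1 ELSE 0 END) AS non_bot_count,
    CAST(SUM(CASE WHEN isBot = 1 THEN 1 ELSE 0 END) AS DOUBLE)
        / SUM(CASE WHEN isBot = 0 THEN 1 ELSE 0 END) AS bot_to_non_bot_ratio
FROM bot_features;
```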
The result will be:
In bot detection, the distribution of bots versus non-bots in the dataset plays a critical role in the model’s effectiveness. If the dataset is imbalanced, as above, where non-bot data far outweighs bot data, the model may struggle to recognize bot-like behavior accurately, leading to a bias toward labeling most activity as non-bot. Conversely, a balanced dataset, where both bots and non-bots are equally represented, can help the model learn the distinct patterns of bot behavior more effectively.
In real-world data, bots typically represent a small fraction of total interactions, resulting in an imbalanced dataset. This imbalance can lead to several challenges:
Bias Toward Non-Bot Predictions: The model may default to labeling most users as non-bots, as it has far more examples of non-bot behavior. This can result in a high number of false negatives, where bots are misclassified as non-bots.
Misleading Metrics: Accuracy alone can be misleading in an imbalanced dataset. For instance, if bots make up only 5% of the data, a model could achieve 95% accuracy by predicting "non-bot" every time. This accuracy doesn’t reflect the model's ability to actually detect bots.
Reduced Sensitivity for Bots: Imbalance reduces the model's exposure to bot patterns, making it harder to achieve strong recall for bot detection. In this context, recall is crucial, as we want the model to correctly identify as many bots as possible.
To address imbalanced data in bot detection, various strategies can be employed:
Resampling: Increasing the representation of bot data by oversampling bots or undersampling non-bots can help balance the dataset.
Synthetic Data Generation: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) can be used to create synthetic examples of bot behavior, enriching the model’s understanding of bot patterns.
In an ideal setting, having a balanced dataset with equal representation of bots and non-bots enables the model to recognize both classes well. This balance helps the model capture both bot and non-bot behavior accurately, leading to better performance across precision, recall, and overall accuracy. However, achieving a balanced dataset in bot detection can be challenging due to the naturally low prevalence of bots in most datasets.
For our bot detection use case, balancing the dataset or addressing the imbalance is essential to improve the model's recall and precision in identifying bot behavior. Without handling imbalance, the model may fail to detect bots effectively, resulting in contaminated data insights that impact customer segmentation, personalization, and analytics. By using techniques to balance or adjust for the imbalance in bot and non-bot data, the model becomes better equipped to accurately classify bot activity, thus enhancing data quality and ensuring more reliable insights for business decisions.
A decision tree learns boundaries from training data that represent various patterns of bot versus non-bot activity. Unlike a strict threshold rule, the tree can accommodate complex patterns and combinations of high/low activity across different time intervals that are more predictive of bot behavior.
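A sketch of the training statement, using Data Distiller's CREATE MODEL with the TRANSFORM clause described below. The model name, the bot_features source table, and the intermediate names imputed_five_minutes and imputed_thirty_minutes are assumptions; exact transformer signatures may differ by release:

```sql
CREATE MODEL bot_detection_model
TRANSFORM (
    string_imputer(id, 'unknown')                AS imputed_id,       -- fill missing ids
    string_indexer(imputed_id)                   AS si_id,            -- id -> numeric index
    numeric_imputer(max_count_1_min, 'mean')     AS imputed_one_minute,
    numeric_imputer(max_count_5_mins, 'mode')    AS imputed_five_minutes,
    numeric_imputer(max_count_30_mins, 'mean')   AS imputed_thirty_minutes,
    quantile_discretizer(imputed_five_minutes)   AS buckets_five,     -- quantile buckets
    quantile_discretizer(imputed_thirty_minutes) AS buckets_thirty,
    vector_assembler(array(si_id, imputed_one_minute, buckets_five,
                           buckets_thirty))      AS features,         -- single feature vector
    min_max_scaler(features)                     AS scaled_features   -- normalize to [0, 1]
)
OPTIONS (MODEL_TYPE = 'decision_tree_classifier', LABEL = 'isBot')
AS SELECT * FROM bot_features;
```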
The result will be:
The SQL TRANSFORM clause enables streamlined feature engineering and preprocessing for machine learning.
numeric_imputer
The numeric_imputer transformer handles missing values in numerical features, ensuring that no data points are lost due to null values. By imputing missing values, this step maintains data integrity and ensures robust model training.
Example:
max_count_1_min is imputed using the mean value of the column.
max_count_5_mins is imputed using the mode (most frequent value).
max_count_30_mins is imputed using the mean.
string_imputer
The string_imputer replaces missing values in categorical features with a default value, such as 'unknown', to ensure consistency in the dataset. This step avoids dropping records due to missing categories, a common occurrence in user identifiers or other text-based features.
Example: id (user identifier) is imputed with 'unknown'.
string_indexer
The string_indexer encodes categorical features into numeric indices, making them compatible with machine learning algorithms. While decision trees can conceptually split on categories, most implementations expect numeric inputs, so this encoding step is required.
Example: The imputed id feature is converted into a numeric index as si_id.
quantile_discretizer
The quantile_discretizer converts continuous numerical features into discrete buckets based on quantiles. This allows the model to better capture non-linear patterns and handle a wider range of value distributions in the data.
Example:
max_count_5_mins is discretized into buckets (buckets_five).
max_count_30_mins is discretized into buckets (buckets_thirty).
vector_assembler
The vector_assembler combines all preprocessed features, including encoded categorical features and imputed/discretized numerical features, into a single feature vector. This unified representation is used as input for the decision tree model.
Example: The transformer combines si_id, imputed_one_minute, buckets_five, and buckets_thirty into a single vector called features.
min_max_scaler
The min_max_scaler scales the combined feature vector to a normalized range, typically 0 to 1. This standardization ensures that all features contribute equally to the model training process, avoiding bias caused by differing feature scales.
Example: The features vector is transformed into scaled_features to enhance model performance.
These feature transformers work together to preprocess the raw data into a structured and normalized format suitable for training a Decision Tree Classifier. By effectively handling both categorical and numerical features, these transformations improve model accuracy and interpretability, making them an essential step in the pipeline for detecting bot activity.
When evaluating this model, the primary goal is to test its ability to classify users as bots or non-bots based on their activity patterns. Specifically, check whether the model correctly predicts the isBot label (1 for bots, 0 for non-bots) based on the time-based aggregation features. You’re looking for the model to generalize well, meaning it should identify bot-like behavior in new, unseen data, not just replicate rules.
Overfitting is common when working with synthetic data, especially in scenarios where the data generation process is simplified and highly structured. In synthetic datasets, patterns can often be overly consistent or lack the nuanced variability found in real-world data. For instance, if synthetic data strictly follows fixed rules or thresholds without incorporating randomness or exceptions, the model can easily "memorize" these patterns, resulting in high accuracy on the synthetic data but poor generalization on real data.
This overfitting happens because machine learning models are sensitive to the underlying distribution of the training data. When synthetic data doesn’t capture the full diversity of real-world behaviors, models may learn to recognize only the specific patterns present in the training set, rather than generalize to similar yet slightly different patterns. In the context of bot detection, synthetic data might include very clear thresholds for bot-like behavior (such as high click counts in short intervals), which may not represent the subtleties of real bot or human interactions online.
To mitigate this, introducing noise, variability, and probabilistic elements into the synthetic dataset can help mimic the diversity of real-world data, reducing the likelihood of overfitting and making the model evaluation metrics more realistic. By adding controlled randomness and probabilistic labeling, we create a training and testing environment that encourages the model to generalize rather than memorize specific rules.
Let us evaluate the model against test data:
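A sketch of the evaluation call, assuming a held-out table bot_features_test with the same columns as the training set and model version 1:

```sql
-- model_evaluate scores the trained model against labeled test data.
SELECT *
FROM model_evaluate(bot_detection_model, 1,
                    SELECT * FROM bot_features_test);
```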
The result will be:
This perfect score is almost certainly an artifact of the synthetic nature of our test data rather than evidence of a flawless model.
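Scoring new traffic uses the companion model_predict function in the same way; bot_features_inference is an assumed inference table:

```sql
-- model_predict returns the input rows with prediction columns appended.
SELECT *
FROM model_predict(bot_detection_model, 1,
                   SELECT * FROM bot_features_inference);
```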
The result will be:
The rawPrediction and probability columns are NULL by design and will be enhanced in the future.
There are numerous instances of bot mislabeling throughout. When we evaluate the model (just change model_predict to model_evaluate in the SQL code above) on this dataset, the results will reflect the following:
The evaluation results here indicate a relatively low area under the ROC curve (AUC-ROC) of 0.47, with an accuracy of 0.586, precision of approximately 0.76, and recall of 0.586. These values suggest that the model has some capability to identify bots but lacks robustness and generalization.
The imbalanced bot-to-non-bot ratio in the training data, at 26 bots to 774 non-bots, is likely a significant factor contributing to this outcome. In cases where the dataset is highly skewed towards one class, like non-bots, models tend to struggle to learn effective patterns to identify the minority class—in this case, bots. As a result:
AUC-ROC being close to 0.5 suggests the model's classification performance is close to random, which is typical when a model is trained on imbalanced data.
Precision at 0.76 shows that when the model predicts a bot, it’s correct 76% of the time. This might reflect that the model is somewhat conservative in predicting bots, potentially due to the overwhelming majority of non-bots in the training data.
Recall of 0.586 indicates that the model only captures about 58.6% of actual bots, likely missing many due to insufficient learning from the minority class.
To improve performance, especially for recall, it might be necessary to either oversample the bot instances or undersample the non-bots in the training data.
SMOTE (Synthetic Minority Oversampling Technique) is a widely used method in machine learning to address the problem of imbalanced datasets. In imbalanced datasets, one class (often the minority class) has significantly fewer examples than the other class (majority class). This imbalance can lead to biased models that perform poorly on the minority class, as the model tends to favor the majority class.
SMOTE generates synthetic samples for the minority class by interpolating between existing data points. Instead of merely duplicating existing data, SMOTE creates new samples along the line segments joining neighboring minority class examples in feature space. This approach enhances the model's ability to generalize by introducing variability and richness to the minority class.
SMOTE is inherently a geometric algorithm that operates in high-dimensional feature space. Its core steps involve:
Identifying nearest neighbors: For each minority class sample, find k-nearest neighbors in feature space.
Generating synthetic samples: Randomly interpolate between the original sample and one of its neighbors.
These steps pose significant challenges in SQL, which is optimized for relational data processing and not for complex geometric operations. Specific difficulties include:
Nearest Neighbor Calculations: SQL does not natively support efficient operations like distance computations (e.g., Euclidean distance) required to identify neighbors.
Interpolation in High Dimensions: Generating synthetic samples requires linear algebra operations, which are not inherently supported in SQL.
Scalability: SMOTE's complexity increases with the dimensionality of the data and the size of the minority class. Implementing these operations in SQL can result in performance bottlenecks.
Although exact SMOTE is challenging in SQL, an approximation can be effective for certain types of data, especially when:
Features are structured: If the dataset has well-defined features with clear bounds (e.g., counts or categories), random noise-based interpolation can mimic SMOTE's synthetic generation.
Minority class is clearly defined: By focusing on generating variations of minority samples using domain-specific rules, we can approximate synthetic oversampling.
Use case involves low-dimensional data: In cases where the feature space is low-dimensional (e.g., 3-5 features), simpler interpolation techniques can achieve similar results.
An SQL-based approximation typically involves:
Duplicating minority samples: This ensures the minority class is represented adequately in the training data.
Adding controlled random noise: Slight variations in the feature values simulate interpolation while remaining computationally feasible in SQL.
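Putting the two steps together, here is a sketch of such an approximation, assuming the bot_features table from earlier; the 10x replication factor and noise ranges are illustrative:

```sql
-- Replicate each minority (bot) row ten times, jitter the counts with
-- small random noise, then union with the original data.
WITH minority AS (
    SELECT * FROM bot_features WHERE isBot = 1
),
replicated AS (
    SELECT m.*
    FROM minority m
    CROSS JOIN (SELECT explode(sequence(1, 10)) AS copy_id)
)
SELECT
    id,
    GREATEST(0, CAST(max_count_1_min   + (RAND() - 0.5) * 10  AS INT)) AS max_count_1_min,
    GREATEST(0, CAST(max_count_5_mins  + (RAND() - 0.5) * 50  AS INT)) AS max_count_5_mins,
    GREATEST(0, CAST(max_count_30_mins + (RAND() - 0.5) * 200 AS INT)) AS max_count_30_mins,
    isBot
FROM replicated
UNION ALL
SELECT id, max_count_1_min, max_count_5_mins, max_count_30_mins, isBot
FROM bot_features;
```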
The result of the SELECT query above is:
Execute the following to train the model on the feature dataset we generated above:
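A sketch of that training statement, assuming the balanced rows above were written to a table named bot_features_balanced and reusing the same TRANSFORM clause as before:

```sql
CREATE MODEL bot_detection_model_v2
TRANSFORM (
    string_imputer(id, 'unknown')                AS imputed_id,
    string_indexer(imputed_id)                   AS si_id,
    numeric_imputer(max_count_1_min, 'mean')     AS imputed_one_minute,
    numeric_imputer(max_count_5_mins, 'mode')    AS imputed_five_minutes,
    numeric_imputer(max_count_30_mins, 'mean')   AS imputed_thirty_minutes,
    quantile_discretizer(imputed_five_minutes)   AS buckets_five,
    quantile_discretizer(imputed_thirty_minutes) AS buckets_thirty,
    vector_assembler(array(si_id, imputed_one_minute, buckets_five,
                           buckets_thirty))      AS features,
    min_max_scaler(features)                     AS scaled_features
)
OPTIONS (MODEL_TYPE = 'decision_tree_classifier', LABEL = 'isBot')
AS SELECT * FROM bot_features_balanced;
```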
Now if we do an evaluate on the inference data:
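Following the earlier pattern, this is a sketch of that evaluation against the assumed inference table:

```sql
SELECT *
FROM model_evaluate(bot_detection_model_v2, 1,
                    SELECT * FROM bot_features_inference);
```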