STATSML 602: Techniques for Bot Detection in Data Distiller
Turn clicks into insights: Discover how SQL can reveal bot behavior
Download the following datasets:
Ingest them by following the tutorial for each:
Make sure you have read:
Bots are software applications designed to perform automated tasks over the internet, often at a high frequency and with minimal human intervention. They can be used for a variety of purposes, ranging from beneficial tasks like indexing websites for search engines to malicious activities such as spamming, scraping content, or launching denial-of-service attacks. Bots are typically programmed to mimic human behavior and can be controlled remotely, allowing them to interact with websites, applications, and services just like a human user would, albeit at a much faster and more repetitive pace.
Bots are implemented using scripts or programs that automate specific actions, often through APIs (Application Programming Interfaces) or web automation frameworks like Selenium. Developers use programming languages such as Python, JavaScript, or Java to write bot scripts that simulate clicks, form submissions, or page requests. For complex tasks, bots may incorporate machine learning algorithms to enhance their ability to mimic human-like interactions, avoiding detection by bot-filtering systems. Bot networks, or "botnets," are collections of bots controlled from a central server, enabling large-scale automated activity. While bots are essential for applications like search engines and customer service chatbots, their misuse necessitates robust detection and filtering mechanisms to protect the integrity of online platforms and data.
Bots often produce high-frequency, repetitive actions, while normal users generally produce fewer actions at irregular intervals.
Bot filtering is essential to ensure the integrity and quality of web traffic data. Bots, or non-human interactions, can inflate metrics like page views, clicks, and sessions, leading to inaccurate analytics and poor decision-making. In Adobe Experience Platform, bot filtering can be implemented using SQL within the Query Service, enabling automated detection and filtering of bot-like activity from clickstream data.
Allowing bot activity to infiltrate the Real-Time Customer Data Platform (CDP) or Customer Journey Analytics can significantly degrade the quality and reliability of insights. Bots can generate large volumes of fake interactions, diluting the data used to segment audiences, personalize experiences, and trigger automated actions. This contamination can lead to inaccurate customer profiles, where bots are mistakenly treated as real customers, impacting everything from marketing spend to product recommendations.
Moreover, inflated metrics from bot traffic can lead to incorrect entitlement calculations, potentially resulting in over-licensing issues, which affects cost efficiency. In environments where businesses are charged based on active users or usage volume, bot-induced data can escalate costs, consuming resources allocated for real customers. Overall, bot contamination in a CDP undermines the platform's ability to deliver accurate, actionable insights, compromising the effectiveness of customer engagement strategies and reducing return on investment in marketing and analytics platforms.
However, keeping a copy of bot data on the data lake can be beneficial for several reasons. First, retaining bot data enables teams to continuously refine and improve bot-detection algorithms. By analyzing historical bot behavior, data scientists and engineers can identify evolving patterns and adapt filtering rules, which can enhance future bot filtering and maintain data integrity in real-time analytics environments. Additionally, bot data can serve as a valuable training dataset for machine learning models, which can distinguish between bot and human behavior more accurately over time. For security and compliance teams, archived bot data can provide insights into potential malicious activities, allowing for faster responses to threats and better protection measures. Storing bot data on the data lake also supports compliance, enabling organizations to audit and track how they manage non-human interactions if required. Therefore, while it’s important to filter bot data from production datasets to maintain accurate customer insights, keeping an archived copy on the data lake provides value across analytics, security, and compliance domains.
Bot filtering, anomaly detection, and fraud detection share the common goal of identifying unusual patterns in data, but each serves a distinct purpose. Bot filtering focuses on distinguishing and removing non-human, automated interactions from datasets to ensure that analytics accurately reflect real user behavior. Anomaly detection is a broader process aimed at identifying any unusual or unexpected data points or trends, which may indicate system issues, data errors, or emerging trends. Fraud detection is a specialized type of anomaly detection, specifically designed to identify suspicious and potentially harmful behaviors, such as fraudulent transactions or malicious activities, by detecting complex patterns that are often subtle and well-hidden. While bot filtering primarily relies on rules and thresholds to detect high-frequency, repetitive behaviors typical of bots, anomaly and fraud detection increasingly leverage machine learning models and sophisticated pattern recognition techniques to uncover irregularities. Each method is essential in maintaining data integrity, safeguarding against threats, and enabling more reliable insights across various domains.
A decision tree is a supervised machine learning algorithm used for classification and regression tasks. It operates by recursively splitting data into subsets based on the feature values that provide the best separation. Each internal node represents a decision on a feature, each branch represents the outcome of the decision, and each leaf node represents a final class label or prediction. The algorithm aims to find the most informative features to split the data, maximizing the purity (homogeneity) of the resulting subsets. Popular metrics for these splits include Gini impurity, entropy, and information gain.
Key Characteristics of Decision Trees:
Simple and Intuitive: Easy to visualize and interpret.
Handles Nonlinear Data: Captures complex relationships between features and labels without requiring feature scaling.
Rule-Based: The hierarchical structure maps directly to logical rules, making them interpretable for domain-specific tasks.
Bot detection typically involves identifying patterns of behavior that distinguish bots from real users. Decision trees are well-suited for this task for several reasons:
Ability to Handle Mixed Data: Bot detection often involves both numerical features (e.g., counts of actions per interval) and categorical features (e.g., action types). Decision trees can natively handle both types of data without requiring feature transformations.
Explainability: A decision tree provides clear, rule-based decisions that can be interpreted easily. For example, a rule like "If actions in 1 minute > 60 AND actions in 30 minutes < 500, then it's a bot" aligns with how bots exhibit distinct patterns in clickstream data.
Effective Feature Selection: In bot detection, not all features are equally important. Decision trees prioritize the most informative features, such as the frequency and intensity of actions. This makes them efficient for identifying bots based on behavioral thresholds.
Handles Nonlinear Relationships: Bots often exhibit nonlinear patterns in their behavior, such as a sudden spike in activity over a short interval. Decision trees can effectively model such relationships, unlike linear models that assume a straight-line relationship.
Adaptability to Imbalanced Data: While imbalanced data is a challenge for most algorithms, decision trees can mitigate this by prioritizing splits that maximize purity (e.g., separating bots from non-bots).
Suitability for Rule-Based Domains: In contexts like bot detection, domain experts often have predefined rules or thresholds. Decision trees align naturally with such rule-based systems, allowing experts to validate or refine the model.
For a dataset with features like:
count_1_min: Actions in 1-minute intervals.
count_5_mins: Actions in 5-minute intervals.
count_30_mins: Actions in 30-minute intervals.
A decision tree might generate rules like:
If count_1_min > 60 and count_5_mins > 200 → Bot.
If count_1_min < 20 and count_30_mins > 700 → Bot.
Else → Non-Bot.
Such thresholds are highly interpretable and directly actionable, making decision trees an ideal choice for detecting anomalous bot-like behavior in user activity logs.
The feature strategy for bot detection involves aggregating click activity across different time intervals to capture patterns indicative of non-human behavior. Specifically, the data is grouped and counted over one-minute, five-minute, and thirty-minute intervals, which helps identify high-frequency click patterns over both short and extended durations. In this approach, users whose click counts exceed interval-specific thresholds (60 clicks in one minute, 300 clicks in five minutes, and 1800 clicks in 30 minutes) are flagged as potential bots. By structuring the data this way, we can detect bursts of activity that exceed typical human behavior, regardless of the interval length. The results are stored in a nested dataframe format, with each user's activity count grouped by timestamp, user ID, and webpage name, providing a rich dataset for training and evaluating machine learning models. This multi-interval aggregation captures nuanced bot activity patterns that a single static threshold would miss, making bot detection more accurate and adaptable.
First, we’ll write a simple query to identify all IDs that have generated 50 or more events within a 60-second interval, or one minute.
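The exact query depends on how your data was ingested; here is a minimal sketch, assuming a clickstream table named clickstream_events with an id column (user identifier) and a timestamp column:

```sql
-- Count events per user within each 1-minute bucket and keep the users
-- whose count reaches 50 in any single bucket.
SELECT
    id,
    DATE_TRUNC('MINUTE', timestamp) AS minute_interval,
    COUNT(*) AS event_count
FROM clickstream_events
GROUP BY id, DATE_TRUNC('MINUTE', timestamp)
HAVING COUNT(*) >= 50
ORDER BY event_count DESC;
```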
The results will be:
If you have ingested Adobe Analytics data as in the tutorial here, the above query would be very similar to what you would execute. Here is the query that you would have run:
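A hedged sketch of that variant; the dataset name (adobe_analytics_events) and the identity path below are assumptions that depend on your Analytics XDM schema:

```sql
-- Same 1-minute aggregation against an ingested Adobe Analytics dataset.
-- endUserIDs._experience.mcid.id is one common identity path; substitute
-- the identity field your schema actually uses.
SELECT
    endUserIDs._experience.mcid.id AS id,
    DATE_TRUNC('MINUTE', timestamp) AS minute_interval,
    COUNT(*) AS event_count
FROM adobe_analytics_events
GROUP BY 1, 2
HAVING COUNT(*) >= 50
ORDER BY event_count DESC;
```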
The result would be:
The 1-minute, 5-minute, and 30-minute count features provide valuable insights into short-term, mid-term, and longer-term activity patterns, which are useful for identifying bot-like behavior. Bots often exhibit high-frequency actions in short periods, while genuine users are likely to have lower and more varied activity over time. However, these time-based counts alone might not fully capture the nuances of bot behavior. Here are some additional features that could enhance the model's ability to detect bots:
Unique Action Types per Interval: Count the unique actions (e.g., clicks, page views, add-to-cart) performed in each interval. Bots may perform repetitive actions, so a low number of unique actions per interval could be a strong bot indicator.
Average Time Between Actions: Calculate the average time gap between consecutive actions for each user. Bots tend to have very consistent or minimal time gaps between actions, while human users have more variability.
Standard Deviation of Action Counts Across Intervals: Instead of just using the maximum counts, analyze the standard deviation of action counts within each interval type (1-minute, 5-minute, 30-minute). Low variability may indicate bot behavior, as bots often have more uniform activity patterns.
Session Duration: Measure the time between the first and last action within a session. Bots may have unusually long or short sessions compared to typical user sessions.
Action Sequence Patterns: Look for specific sequences of actions, like "pageView -> addToCart -> purchase" or repetitive patterns (e.g., repeated "click" actions). Certain sequences or repetitions can be strong indicators of scripted bot behavior.
Frequency of Rare Actions: Identify rare actions (e.g., "logout" or "purchase") and check if the frequency of these actions is unusually high. Bots might disproportionately use or avoid certain actions that are less frequent among typical users.
Clickstream Entropy: Calculate entropy on the sequence of actions for each user. High entropy (more randomness) could indicate a human user, while low entropy (predictable patterns) might suggest automated behavior. A sketch of this computation follows this list.
Time of Day Patterns: Track actions by time of day. Bots might operate at times when human activity is typically lower, such as very late at night or early morning.
Location or IP Address: If the dataset includes location or IP data, unusual patterns like multiple user IDs with the same IP or multiple sessions from the same location could be signs of bot activity.
Number of Sessions per User: If available, the number of separate sessions per user within a day or week could indicate bots, as bots might operate continuously or have unusually high session counts.
Integrating these features into the model could improve its ability to distinguish bots from genuine users by adding context around activity patterns, user behavior, and usage variations. They would also help address any blind spots in the current model, especially where bot behavior is more complex than just high frequency within short time intervals.
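To make one of these concrete, here is a sketch of the clickstream-entropy feature mentioned above, assuming clickstream_events also carries an action column naming each event type:

```sql
-- Shannon entropy of each user's action distribution; low entropy
-- (highly repetitive actions) is a possible bot signal.
WITH action_probs AS (
    SELECT
        id,
        action,
        COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY id) AS p
    FROM clickstream_events
    GROUP BY id, action
)
SELECT
    id,
    -SUM(p * LOG2(p)) AS action_entropy
FROM action_probs
GROUP BY id
ORDER BY action_entropy ASC;
```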
Let us use a combination of patterns and thresholds across the three different time intervals (count_1_min, count_5_mins, and count_30_mins). Here are the complex rules we will implement:
Burst Pattern: A bot-like burst pattern that has high activity over shorter intervals and moderate activity over longer intervals.
Sustained High Activity: Bots that sustain high activity across all intervals.
Short-Term Peaks with Long-Term Low Activity: Bots that peak within short intervals but have lower overall long-term activity, indicating possible bursty or periodic automation.
Short and Medium Bursts with Occasional High Long-Term Activity: Users with moderate short- and medium-term activity but extreme spikes over longer intervals, which could indicate periodic scripted automation.
Fluctuating Activity: Bots that exhibit very high activity in one interval but comparatively low activity in others. This can capture erratic or adaptive bots.
Regular Intervals with Low Intensity: Bots that perform fewer actions but consistently over set intervals, indicating periodic scraping or data polling.
Continuous Background Activity: Bots that run continuously but without peaks in short bursts, which might indicate a less aggressive but consistent bot process.
Now let us create the feature set:
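The query below is a sketch under the same clickstream_events assumption: it computes each user's peak count per interval size and labels rows with simplified versions of the rules above. All thresholds are illustrative placeholders:

```sql
-- Peak activity per user at 1-, 5-, and 30-minute granularity, labeled
-- with simplified burst / sustained / background-activity rules.
WITH per_1 AS (
    SELECT id, FLOOR(UNIX_TIMESTAMP(timestamp) / 60) AS bucket, COUNT(*) AS cnt
    FROM clickstream_events GROUP BY 1, 2
),
per_5 AS (
    SELECT id, FLOOR(UNIX_TIMESTAMP(timestamp) / 300) AS bucket, COUNT(*) AS cnt
    FROM clickstream_events GROUP BY 1, 2
),
per_30 AS (
    SELECT id, FLOOR(UNIX_TIMESTAMP(timestamp) / 1800) AS bucket, COUNT(*) AS cnt
    FROM clickstream_events GROUP BY 1, 2
),
peaks AS (
    SELECT
        p1.id,
        MAX(p1.cnt)  AS max_count_1_min,
        MAX(p5.cnt)  AS max_count_5_mins,
        MAX(p30.cnt) AS max_count_30_mins
    FROM per_1 p1
    JOIN per_5 p5 ON p1.id = p5.id
    JOIN per_30 p30 ON p1.id = p30.id
    GROUP BY p1.id
)
SELECT
    id,
    max_count_1_min,
    max_count_5_mins,
    max_count_30_mins,
    CASE
        -- Burst pattern: intense short-term, moderate long-term activity
        WHEN max_count_1_min > 60 AND max_count_30_mins < 500 THEN 1
        -- Sustained high activity across all intervals
        WHEN max_count_1_min > 60 AND max_count_5_mins > 300
             AND max_count_30_mins > 1800 THEN 1
        -- Continuous background activity: no short bursts, high long-term volume
        WHEN max_count_1_min < 20 AND max_count_30_mins > 700 THEN 1
        ELSE 0
    END AS isBot
FROM peaks;
```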
This produces the result:
The three time-based aggregation features used in this bot detection query (max_count_1_min, max_count_5_mins, and max_count_30_mins) each serve a unique purpose in capturing different patterns of potential bot behavior:
1-Minute Count (max_count_1_min): This feature reflects the highest count of actions a user performs within any single 1-minute interval. High action counts in this short timeframe often indicate rapid, automated interactions that exceed typical human behavior. Bots that operate in quick bursts will tend to show elevated values here, helping to detect sudden spikes in activity.
5-Minute Count (max_count_5_mins): This feature captures mid-term activity by aggregating user actions over a 5-minute period. Bots may not always maintain extreme activity levels in short intervals, but they may show persistent, above-average activity across mid-term intervals. The max_count_5_mins feature helps detect bots that modulate their activity, slowing down slightly to mimic human behavior but still maintaining an overall high rate of interaction compared to genuine users.
30-Minute Count (max_count_30_mins): The 30-minute interval allows for detecting long-term activity patterns. Bots, especially those performing continuous or background tasks, may exhibit sustained interaction levels over longer periods. This feature helps to identify scripts or automated processes that maintain a steady, high frequency of activity over time, which would be uncommon for human users.
Each of these features—1-minute, 5-minute, and 30-minute action counts—provides a view into distinct time-based behavioral patterns that help distinguish bots from human users. By combining these features and applying complex detection rules, the model can capture a wider variety of bot-like behaviors, from rapid bursts to prolonged engagement, making it more robust against different types of automated interactions.
To compute the ratio of bots to non-bots in the above result, you can use a simple SQL query that calculates the count of bots and non-bots, then computes their ratio. Here’s how to do it:
Count Bots and Non-Bots: Use a CASE statement to classify each user as a bot or non-bot based on the isBot flag.
Calculate the Ratio: Use the bot and non-bot counts to calculate the bot-to-non-bot ratio.
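A sketch of that query, assuming the labeled feature set was saved as a table named bot_features:

```sql
-- Count each class via CASE, then divide to get the bot-to-non-bot ratio.
SELECT
    SUM(CASE WHEN isBot = 1 THEN 1 ELSE 0 END) AS bot_count,
    SUM(CASE WHEN isBot = 0 THEN 1 ELSE 0 END) AS non_bot_count,
    CAST(SUM(CASE WHEN isBot = 1 THEN 1 ELSE 0 END) AS DOUBLE)
        / SUM(CASE WHEN isBot = 0 THEN 1 ELSE 0 END) AS bot_to_non_bot_ratio
FROM bot_features;
```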
The result will be:
In bot detection, the distribution of bots versus non-bots in the dataset plays a critical role in the model’s effectiveness. If the dataset is imbalanced, as above, where non-bot data far outweighs bot data, the model may struggle to recognize bot-like behavior accurately, leading to a bias toward labeling most activity as non-bot. Conversely, a balanced dataset, where both bots and non-bots are equally represented, can help the model learn the distinct patterns of bot behavior more effectively.
In real-world data, bots typically represent a small fraction of total interactions, resulting in an imbalanced dataset. This imbalance can lead to several challenges:
Bias Toward Non-Bot Predictions: The model may default to labeling most users as non-bots, as it has far more examples of non-bot behavior. This can result in a high number of false negatives, where bots are misclassified as non-bots.
Misleading Metrics: Accuracy alone can be misleading in an imbalanced dataset. For instance, if bots make up only 5% of the data, a model could achieve 95% accuracy by predicting "non-bot" every time. This accuracy doesn’t reflect the model's ability to actually detect bots.
Reduced Sensitivity for Bots: Imbalance reduces the model's exposure to bot patterns, making it harder to achieve strong recall for bot detection. In this context, recall is crucial, as we want the model to correctly identify as many bots as possible.
To address imbalanced data in bot detection, various strategies can be employed:
Resampling: Increasing the representation of bot data by oversampling bots or undersampling non-bots can help balance the dataset.
Synthetic Data Generation: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) can be used to create synthetic examples of bot behavior, enriching the model’s understanding of bot patterns.
In an ideal setting, having a balanced dataset with equal representation of bots and non-bots enables the model to recognize both classes well. This balance helps the model capture both bot and non-bot behavior accurately, leading to better performance across precision, recall, and overall accuracy. However, achieving a balanced dataset in bot detection can be challenging due to the naturally low prevalence of bots in most datasets.
For our bot detection use case, balancing the dataset or addressing the imbalance is essential to improve the model's recall and precision in identifying bot behavior. Without handling imbalance, the model may fail to detect bots effectively, resulting in contaminated data insights that impact customer segmentation, personalization, and analytics. By using techniques to balance or adjust for the imbalance in bot and non-bot data, the model becomes better equipped to accurately classify bot activity, thus enhancing data quality and ensuring more reliable insights for business decisions.
A decision tree learns boundaries from training data that represent various patterns of bot versus non-bot activity. Unlike a strict threshold rule, the tree can accommodate complex patterns and combinations of high/low activity across different time intervals that are more predictive of bot behavior.
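A sketch of the training statement, using Data Distiller's CREATE MODEL with the TRANSFORM clause described below. The model name, the bot_features source table, and the intermediate names imputed_five_minutes and imputed_thirty_minutes are assumptions; exact transformer signatures may differ by release:

```sql
CREATE MODEL bot_detection_model
TRANSFORM (
    string_imputer(id, 'unknown')                AS imputed_id,       -- fill missing ids
    string_indexer(imputed_id)                   AS si_id,            -- id -> numeric index
    numeric_imputer(max_count_1_min, 'mean')     AS imputed_one_minute,
    numeric_imputer(max_count_5_mins, 'mode')    AS imputed_five_minutes,
    numeric_imputer(max_count_30_mins, 'mean')   AS imputed_thirty_minutes,
    quantile_discretizer(imputed_five_minutes)   AS buckets_five,     -- quantile buckets
    quantile_discretizer(imputed_thirty_minutes) AS buckets_thirty,
    vector_assembler(array(si_id, imputed_one_minute, buckets_five,
                           buckets_thirty))      AS features,         -- single feature vector
    min_max_scaler(features)                     AS scaled_features   -- normalize to [0, 1]
)
OPTIONS (MODEL_TYPE = 'decision_tree_classifier', LABEL = 'isBot')
AS SELECT * FROM bot_features;
```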
The result will be:
The SQL TRANSFORM clause enables streamlined feature engineering and preprocessing for machine learning.
numeric_imputer
The numeric_imputer transformer handles missing values in numerical features, ensuring that no data points are lost due to null values. By imputing missing values, this step maintains data integrity and ensures robust model training.
Example:
max_count_1_min is imputed using the mean value of the column.
max_count_5_mins is imputed using the mode (most frequent value).
max_count_30_mins is imputed using the mean.
string_imputer
The string_imputer replaces missing values in categorical features with a default value, such as 'unknown', to ensure consistency in the dataset. This step avoids dropping records due to missing categories, a common occurrence in user identifiers or other text-based features.
Example: id (user identifier) is imputed with 'unknown'.
string_indexer
The string_indexer encodes categorical features into numeric indices, making them compatible with machine learning algorithms. While decision trees can conceptually split on categories, most implementations expect numeric inputs, so this encoding step is required.
Example: The imputed id feature is converted into a numeric index as si_id.
quantile_discretizer
The quantile_discretizer converts continuous numerical features into discrete buckets based on quantiles. This allows the model to better capture non-linear patterns and handle a wider range of value distributions in the data.
Example:
max_count_5_mins is discretized into buckets (buckets_five).
max_count_30_mins is discretized into buckets (buckets_thirty).
vector_assembler
The vector_assembler combines all preprocessed features, including encoded categorical features and imputed/discretized numerical features, into a single feature vector. This unified representation is used as input for the decision tree model.
Example: The transformer combines si_id, imputed_one_minute, buckets_five, and buckets_thirty into a single vector called features.
min_max_scaler
The min_max_scaler scales the combined feature vector to a normalized range, typically 0 to 1. This standardization ensures that all features contribute equally to the model training process, avoiding bias caused by differing feature scales.
Example: The features vector is transformed into scaled_features to enhance model performance.
These feature transformers work together to preprocess the raw data into a structured and normalized format suitable for training a Decision Tree Classifier. By effectively handling both categorical and numerical features, these transformations improve model accuracy and interpretability, making them an essential step in the pipeline for detecting bot activity.
When evaluating this model, the primary goal is to test its ability to classify users as bots or non-bots based on their activity patterns. Specifically, check whether the model correctly predicts the isBot label (1 for bots, 0 for non-bots) based on the time-based aggregation features. You’re looking for the model to generalize well, meaning it should identify bot-like behavior in new, unseen data, not just replicate rules.
Overfitting is common when working with synthetic data, especially in scenarios where the data generation process is simplified and highly structured. In synthetic datasets, patterns can often be overly consistent or lack the nuanced variability found in real-world data. For instance, if synthetic data strictly follows fixed rules or thresholds without incorporating randomness or exceptions, the model can easily "memorize" these patterns, resulting in high accuracy on the synthetic data but poor generalization on real data.
This overfitting happens because machine learning models are sensitive to the underlying distribution of the training data. When synthetic data doesn’t capture the full diversity of real-world behaviors, models may learn to recognize only the specific patterns present in the training set, rather than generalize to similar yet slightly different patterns. In the context of bot detection, synthetic data might include very clear thresholds for bot-like behavior (such as high click counts in short intervals), which may not represent the subtleties of real bot or human interactions online.
To mitigate this, introducing noise, variability, and probabilistic elements into the synthetic dataset can help mimic the diversity of real-world data, reducing the likelihood of overfitting and making the model evaluation metrics more realistic. By adding controlled randomness and probabilistic labeling, we create a training and testing environment that encourages the model to generalize rather than memorize specific rules.
Let us evaluate the model against test data:
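A sketch of the evaluation call, assuming a held-out table bot_features_test with the same columns as the training set and model version 1:

```sql
-- model_evaluate scores the trained model against labeled test data.
SELECT *
FROM model_evaluate(bot_detection_model, 1,
                    SELECT * FROM bot_features_test);
```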
The result will be:
This perfect score is almost certainly an artifact of the synthetic nature of our test data rather than evidence of a flawless model.
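Scoring new traffic uses the companion model_predict function in the same way; bot_features_inference is an assumed inference table:

```sql
-- model_predict returns the input rows with prediction columns appended.
SELECT *
FROM model_predict(bot_detection_model, 1,
                   SELECT * FROM bot_features_inference);
```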
The result will be:
The rawPrediction and probability columns are NULL by design and will be enhanced in the future.
There are numerous instances of bot mislabeling throughout. When we evaluate the model (just change model_predict to model_evaluate in the SQL code above) on this dataset, the results will reflect the following:
The evaluation results here indicate a relatively low area under the ROC curve (AUC-ROC) of 0.47, with an accuracy of 0.586, precision of approximately 0.76, and recall of 0.586. These values suggest that the model has some capability to identify bots but lacks robustness and generalization.
The imbalanced bot-to-non-bot ratio in the training data, at 26 bots to 774 non-bots, is likely a significant factor contributing to this outcome. In cases where the dataset is highly skewed towards one class, like non-bots, models tend to struggle to learn effective patterns to identify the minority class—in this case, bots. As a result:
AUC-ROC being close to 0.5 suggests the model's classification performance is close to random, which is typical when a model is trained on imbalanced data.
Precision at 0.76 shows that when the model predicts a bot, it’s correct 76% of the time. This might reflect that the model is somewhat conservative in predicting bots, potentially due to the overwhelming majority of non-bots in the training data.
Recall of 0.586 indicates that the model only captures about 58.6% of actual bots, likely missing many due to insufficient learning from the minority class.
To improve performance, especially for recall, it might be necessary to either oversample the bot instances or undersample the non-bots in the training data.
SMOTE (Synthetic Minority Oversampling Technique) is a widely used method in machine learning to address the problem of imbalanced datasets. In imbalanced datasets, one class (often the minority class) has significantly fewer examples than the other class (majority class). This imbalance can lead to biased models that perform poorly on the minority class, as the model tends to favor the majority class.
SMOTE generates synthetic samples for the minority class by interpolating between existing data points. Instead of merely duplicating existing data, SMOTE creates new samples along the line segments joining neighboring minority class examples in feature space. This approach enhances the model's ability to generalize by introducing variability and richness to the minority class.
SMOTE is inherently a geometric algorithm that operates in high-dimensional feature space. Its core steps involve:
Identifying nearest neighbors: For each minority class sample, find k-nearest neighbors in feature space.
Generating synthetic samples: Randomly interpolate between the original sample and one of its neighbors.
These steps pose significant challenges in SQL, which is optimized for relational data processing and not for complex geometric operations. Specific difficulties include:
Nearest Neighbor Calculations: SQL does not natively support efficient operations like distance computations (e.g., Euclidean distance) required to identify neighbors.
Interpolation in High Dimensions: Generating synthetic samples requires linear algebra operations, which are not inherently supported in SQL.
Scalability: SMOTE's complexity increases with the dimensionality of the data and the size of the minority class. Implementing these operations in SQL can result in performance bottlenecks.
Although exact SMOTE is challenging in SQL, an approximation can be effective for certain types of data, especially when:
Features are structured: If the dataset has well-defined features with clear bounds (e.g., counts or categories), random noise-based interpolation can mimic SMOTE's synthetic generation.
Minority class is clearly defined: By focusing on generating variations of minority samples using domain-specific rules, we can approximate synthetic oversampling.
Use case involves low-dimensional data: In cases where the feature space is low-dimensional (e.g., 3-5 features), simpler interpolation techniques can achieve similar results.
An SQL-based approximation typically involves:
Duplicating minority samples: This ensures the minority class is represented adequately in the training data.
Adding controlled random noise: Slight variations in the feature values simulate interpolation while remaining computationally feasible in SQL.
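Putting the two steps together, here is a sketch of such an approximation, assuming the bot_features table from earlier; the 10x replication factor and noise ranges are illustrative:

```sql
-- Replicate each minority (bot) row ten times, jitter the counts with
-- small random noise, then union with the original data.
WITH minority AS (
    SELECT * FROM bot_features WHERE isBot = 1
),
replicated AS (
    SELECT m.*
    FROM minority m
    CROSS JOIN (SELECT explode(sequence(1, 10)) AS copy_id)
)
SELECT
    id,
    GREATEST(0, CAST(max_count_1_min   + (RAND() - 0.5) * 10  AS INT)) AS max_count_1_min,
    GREATEST(0, CAST(max_count_5_mins  + (RAND() - 0.5) * 50  AS INT)) AS max_count_5_mins,
    GREATEST(0, CAST(max_count_30_mins + (RAND() - 0.5) * 200 AS INT)) AS max_count_30_mins,
    isBot
FROM replicated
UNION ALL
SELECT id, max_count_1_min, max_count_5_mins, max_count_30_mins, isBot
FROM bot_features;
```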
The result of the SELECT query above is:
Execute the following to train the model on the feature dataset we generated above:
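A sketch of that training statement, assuming the balanced rows above were written to a table named bot_features_balanced and reusing the same TRANSFORM clause as before:

```sql
CREATE MODEL bot_detection_model_v2
TRANSFORM (
    string_imputer(id, 'unknown')                AS imputed_id,
    string_indexer(imputed_id)                   AS si_id,
    numeric_imputer(max_count_1_min, 'mean')     AS imputed_one_minute,
    numeric_imputer(max_count_5_mins, 'mode')    AS imputed_five_minutes,
    numeric_imputer(max_count_30_mins, 'mean')   AS imputed_thirty_minutes,
    quantile_discretizer(imputed_five_minutes)   AS buckets_five,
    quantile_discretizer(imputed_thirty_minutes) AS buckets_thirty,
    vector_assembler(array(si_id, imputed_one_minute, buckets_five,
                           buckets_thirty))      AS features,
    min_max_scaler(features)                     AS scaled_features
)
OPTIONS (MODEL_TYPE = 'decision_tree_classifier', LABEL = 'isBot')
AS SELECT * FROM bot_features_balanced;
```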
Now if we do an evaluate on the inference data:
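Following the earlier pattern, this is a sketch of that evaluation against the assumed inference table:

```sql
SELECT *
FROM model_evaluate(bot_detection_model_v2, 1,
                    SELECT * FROM bot_features_inference);
```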