STATSML 300: AI & Machine Learning: Basic Concepts for Data Distiller Users
Unlock the power of AI and machine learning in this course—equipping you with the basic concepts to make a real-world impact
In the past, machine learning (ML) and artificial intelligence (AI) were primarily in the realm of data scientists, who specialized in building predictive models, tuning algorithms, and applying statistical techniques. Data engineers played a critical role in preparing, managing, and ensuring the quality of data for these models, but the actual development of AI systems was seen as the domain of data science. However, with recent breakthroughs in deep learning and the emergence of Large Language Models (LLMs), the landscape has shifted, offering data engineers a unique opportunity to dive into machine learning and AI with minimal barriers.
The Shift in Roles: From Data Science to Data Engineering
Traditional machine learning relied heavily on data scientists for tasks like feature engineering, model selection, and hyperparameter tuning. Data engineers provided the foundational support by managing data infrastructure, pipelines, and cleaning datasets, but the bulk of AI work was considered highly specialized and out of reach for most engineers.
As deep learning becomes more prevalent, the demand for robust, scalable data systems has never been higher. LLMs, in particular, need vast datasets to perform at their best, and this is where data engineers can truly excel. The skills that data engineers already possess—building scalable data pipelines, managing large datasets, and ensuring data integrity—are now at the forefront of what makes AI and machine learning successful.
With traditional machine learning becoming more automated, data engineers are no longer confined to backend roles. They are now stepping into the realm of machine learning and AI, playing an active role in deploying models and ensuring that they can process real-world data in real-time.
Here’s how you can now foray into AI and machine learning:
LLMs and Deep Learning Require Data Expertise: LLMs like GPTs rely heavily on data volume and quality. As a data engineer, your expertise in data processing, ETL pipelines, and data warehousing is critical to making sure these models have the resources they need to function optimally. You can now contribute directly to model performance by ensuring clean, well-structured data flows into these AI systems.
AI Systems are Data-Hungry: Deep learning models thrive on large datasets, and data pipelines are needed to continually feed these models. Data engineers, with their experience managing large-scale data pipelines, are perfectly positioned to take on roles previously occupied by data scientists. By optimizing these pipelines, you’re directly contributing to the model’s effectiveness, making you an essential part of the AI development cycle.
Traditional ML Becoming Accessible: As machine learning has matured, many of the previously complex tasks (such as feature engineering and model tuning) have become more accessible through automated platforms. These platforms allow data engineers to apply ML models to their pipelines without needing to dive into the deep statistical theory behind the algorithms. Now, engineers can build models and predictions using tools that were once exclusive to data scientists, lowering the entry barrier into AI.
Data-Centric AI: Modern AI systems are becoming more data-centric, meaning that the quality, diversity, and volume of the data often matter more than the complexity of the model itself. Data engineers are the ones with the expertise to ensure that AI models get the best data possible. In this data-centric era, you’re no longer simply feeding data to models—you’re shaping the quality of the insights they produce.
Who is This Concept Course For?
If you're a Data Distiller user exploring the world of data, this course is for you. As you work through the material, you'll find that many of the concepts are universal, and you’ve likely already encountered them in some form. The goal is to break down the jargon and technical terms you’ll come across, helping you see through complex arguments and understand them clearly.
If you ever find yourself confused or lost, always return to your data. Ask the simple questions, and remember that data doesn’t lie. Don’t be swayed by flashy results or "intelligent" models without first understanding what the algorithm or data is really doing. Be skeptical of claims of "intelligence." Embrace the insights these models offer, but learn how to manage and evaluate them. Know what they do well and when they fall short. In the end, be the one in control of the tools, not the other way around.
Author's Preface
This chapter draws from my research in robotics, computer vision and neuroscience, where I observed the convergence of mathematics and various branches of engineering, shaping the evolution of machine learning. Traditionally, many machine learning algorithms are direct applications of statistical methods, relying on mathematical principles to perform tasks like classification, regression, and clustering. These methods have served as foundational tools for analyzing data and making predictions. However, with the advent of modern computation, deep learning has emerged as a transformative force, pushing the boundaries of what machine learning and AI can achieve. Unlike traditional algorithms that depend on predefined statistical models, deep learning uses neural networks to automatically learn and uncover intricate patterns in massive datasets. This intersection of computational power, mathematics, and engineering has paved the way for groundbreaking advances, particularly in fields like computer vision and natural language processing, making deep learning the new frontier in machine learning and AI.
There was a time when terms like machine learning, artificial intelligence, and machine vision were confined to academic circles, seen as esoteric concepts understood only by a select few. Neural networks, once popular, eventually fell out of favor, largely due to the technical challenges that arose. While the lack of computational power to simulate large neural networks was a significant hurdle, there were also deep theoretical limitations. By the late 1970s, interest in the field had waned. Today, however, deep learning is a hot topic, fueled by affordable computing and the rise of big data.
Regardless of the trends that come and go, the core principles of how we learn—and how we expect machines, robots, or agents (whatever we choose to call them) to learn—remain fundamentally similar. In many respects, what we create in these systems is a reflection of how we, as biological beings, are wired.
The purpose of these notes is to equip you with the knowledge to ask insightful questions about machine learning that go beyond just the math or the algorithms. Don’t be swayed by those who throw complex equations at you or present flashy results. If you ask the right questions and seek clear explanations, you’ll be able to discern whether what’s being presented is legitimate or not. And that’s important—it saves you both time and money.
Learning Machines
A learning machine, broadly defined, is any device whose actions are influenced by past experiences. One subclass of learning machines learns to discover hidden relationships in data, which also means these machines can be trained to recognize patterns.
Brains are also learning machines that condition, parse, and store data. Neural networks are non-linear dynamical systems with many degrees of freedom that can be used to solve computational problems. The mathematical foundations for learning machines were laid down by a group of researchers in the 1940s and 1950s. There are related concepts of pattern, pattern classification, discriminant functions, and decision surfaces.
Machine Learning (ML) vs. Artificial Intelligence (AI)
Many people tend to conflate terms like "AI/ML" without recognizing the key differences. However, "learning" and "intelligence" are not the same. You might learn less than others but still demonstrate greater intelligence when tackling a decision-making problem.
Setting aside the technicalities, it's important to note that the fields have evolved differently. Think of machine learning as focusing on specialized, functional expertise, while artificial intelligence aims for broader generalization across multiple domains.
Machine Learning (ML) | Artificial Intelligence (AI) |
---|---|
Learn from Data: Learn patterns and make predictions from data. It involves training models on large datasets to improve performance on specific use cases. Learning from data is a fundamental aspect of ML. | Build an Intelligent System: These systems can be rule-based or symbolic, or they can use a knowledge base to reason about a given context. They may not always involve learning from data; some AI systems are rule-driven and do not adapt or learn from new information. You will see this in robotics, where certain aspects are entirely rule-based while individual functions run decentralized, ML-based algorithms. |
Functional Learning: You could design machine learning algorithms for different tasks. | System-Level Decisions: Based on what you can learn from what you perceive, make decisions, e.g., "Should I take the chance and attack?" |
Generative Language Models are fundamentally machine learning models. However, they have learned such an extensive range of patterns that they now display emergent capabilities that resemble aspects of human intelligence.
What is the Essence of Learning?
Learning is not just confined to machines.
The fundamental goal of any learning task is to be able to generalize, i.e., to apply principles and concepts learned from examples to a new problem. This is pretty much in line with our educational experience in school or college.
In the most general form, this means that you should be able to generate a solution `y` for any new problem `x` thrown at you. `f` denotes the algorithm that you would use, based on the examples that have been taught to you, i.e., pairs of `(x1, y1)`, `(x2, y2)`, etc.
`x1` and `x2` could themselves be vectors of values (latitude and longitude, for example). We may choose only one of these values in our modeling because we find that it is a "feature" that seems to have a lot of influence on the output. We call the values we choose "features" and the vectors "feature vectors". They are no different from the attributes or dimensions that you encounter in the relational world, except that you are being smart about how many you actually need.
Remember that `f` is an algorithm or a technique that you are using to establish the relationship between the inputs `x` and outputs `y`. It is not that you are discovering the actual mathematical relationship between the two in the real world. Your function may be an approximation of that real-world function because you observed data that was perhaps confined to a smaller subset. Your ability to establish relationships is only a function of the data you have. Plus, you will also need to make some assumptions about the nature of the data `x` and the output `y` so that you can solve the problem in a cost-effective way. Hence, if you do not state your assumptions, you are fooling yourself.
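As a rough illustration of learning `f` from example pairs, here is a minimal Python sketch. The data is made up for this example, the choice of a straight-line model is an explicit modeling assumption, and the names `examples` and `fit_line` are hypothetical. It keeps only one component of each feature vector and fits `f` by least squares.

```python
# Toy examples (made up for illustration): each input is a feature vector
# (x_a, x_b), and each output y was generated as y = 3 * x_a + 2.
# The second component x_b is irrelevant to the output.
examples = [
    ((1.0, 7.0), 5.0),
    ((2.0, 1.0), 8.0),
    ((3.0, 4.0), 11.0),
    ((4.0, 9.0), 14.0),
    ((5.0, 2.0), 17.0),
]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (assumes a linear model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Feature selection: keep only the first component of each feature vector.
xs = [vector[0] for vector, _ in examples]
ys = [y for _, y in examples]
slope, intercept = fit_line(xs, ys)


def f(x):
    """The learned approximation of the relationship between x and y."""
    return slope * x + intercept


print(f(10.0))  # 32.0 for this noise-free toy data
```

The fitted `f` is only as good as the stated assumption (linearity) and the range of `x` values that were observed. Nothing in the five examples guarantees that the same relationship holds far outside that range.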
Some comments can be made about the way you are taught these examples:
Quality of the training set: These examples should have variety so that you are able to learn the key concept from multiple points of view.
Reducing Redundant Information: Imagine someone giving you a textbook with paragraphs that are repeated multiple times throughout the text. This repetition can slow you down or, worse, waste your time, which is the one resource that you have. You would take a marker and just cross these duplicate paragraphs out. You have just reduced the size of the data within a chapter without losing the essence of the chapter. You will hear machine learning folks talk about "dimensionality reduction," which is similar to this idea.
Overfitting: You could spend all your time mastering specific examples to the point where you ace every test based on them. Your accuracy with these examples becomes exceptional—almost like you've memorized them. In machine learning terms, this is called "overfitting"—when you've learned the examples so well that you've essentially memorized them. The downside is that your ability to generalize and handle new, unseen questions (those outside the "syllabus") becomes compromised.
Underfitting: However, the test may not include those exact examples. There will be some problems similar to what you were taught, and you'll likely do well on those. But there will also be problems that are quite different, requiring you to apply what you've learned in new ways. Your success with these unfamiliar challenges depends on how much practice you've had with similar types of problems. In machine learning terms, if you haven’t learned the examples well enough (you can generalize but struggle with accuracy), this is known as "underfitting."
Cost: You have a finite time to finish your preparation for the test. Hence to maximize your chances of a high score, you need a strategy.
Test: Your learning strategy depends on the nature of the test:
If most questions are similar to what you were taught, it makes sense to focus on revising those examples. This is why rule-based or rote-learning strategies are popular with many students—they work well when the test closely mirrors the material.
However, if the majority of questions are different from what you’ve learned, you're better off practicing a wider variety of problems to improve your ability to adapt to new situations.
In reality, finding the right balance between these strategies is key. In many ways, you're making a bet on the future—there’s no certainty about what will happen, but you must take calculated risks. Life, like learning, is about navigating uncertainty.
Feedback: Suppose you have the chance to retake the test. It would be incredibly helpful to analyze the pattern of questions that were asked, understand which ones you answered correctly (and why), and identify those you got wrong (and the reasons behind it).
Overfitting is commonly associated with low bias and high variance, which indicate how well a model captures the complexity of the task. Low bias means the model makes minimal errors by accurately reflecting the complexity of the data without oversimplifying. However, as the model becomes more complex, it also starts learning irrelevant details or "noise," which increases variance. Variance refers to the model's sensitivity to changes in the training data, making it less capable of generalizing to new, unseen data.
Underfitting is typically associated with high bias and low variance, which indicate how poorly a model captures the complexity of the task. High bias means the model makes significant errors because it oversimplifies the data, failing to learn important patterns. In this case, the model is too simplistic and cannot capture the underlying structure of the data. While the model has low variance, meaning it is not sensitive to fluctuations in the training data, it struggles to perform well even on the training set, let alone on new, unseen data. This results in poor overall performance and lack of generalization.
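The contrast between overfitting and underfitting can be sketched with a small Python example. This is an illustration under stated assumptions, not a method from this course: it swaps in a k-nearest-neighbor average as the model, and the data (a line `y = 2x` with alternating +1 / -1 "noise") is made up. With `k=1` the model memorizes the training set, with `k=20` it always predicts the global average, and `k=2` sits in between.

```python
def knn_predict(train, x, k):
    """Average the y values of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model on the given (x, y) pairs."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


# Training set (made up): y = 2x plus alternating +1 / -1 "noise".
train = [(float(i), 2.0 * i + (1.0 if i % 2 == 0 else -1.0)) for i in range(20)]
# Test set: unseen x values that follow the noise-free relationship y = 2x.
test = [(i + 0.25, 2.0 * (i + 0.25)) for i in range(19)]

for k in (1, 2, 20):
    print(f"k={k:2d}  train MSE={mse(train, train, k):8.3f}  "
          f"test MSE={mse(train, test, k):8.3f}")
```

On this toy data, `k=1` has zero training error because every training point is its own nearest neighbor, yet it does worse on the test set than `k=2` because it also reproduces the noise (low bias, high variance). `k=20` ignores `x` entirely and does poorly on both sets (high bias, low variance).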
Reinforcement learning is a concept associated with psychologists like B.F. Skinner. Reinforcement learning focuses on how behaviors are learned and modified through the consequences that follow those behaviors, such as rewards and punishments. That is why having a "reward" manifest via some form of appreciation encourages us to "adapt" and hence learn.
Self-motivation is very powerful when you are searching for "training examples" to "learn" new concepts. Robots cannot be self-directed in the way we are, and this quality is what leads us to come up with newer and faster ways of doing things.
Machines and Organisms: Overview of Neural Networks
We often hear about deep learning because these neural networks can have many layers with billions of parameters. But what are these neurons, and how do they manage to approximate real-world relationships so well? The answer lies in how we humans do it. By mimicking how our brains work, we bring that ability into the machines and agents we build.
The building block of all the wiring in our brain is the neuron (or nerve cell). A neuron can either fire (transmit a signal, represented as 1) or remain inactive (transmit a 0). At their core, both neurons and electronic logic gates exhibit binary behavior—they either produce an output or don’t, depending on certain conditions or thresholds. By connecting these neurons in a network, we can create any logical system.
The key takeaway here is that we can use simple, non-linear units (neurons or gates) to form multiple layers of complex behavior. The challenge is in figuring out the structure, pathways, and priorities needed to generate the desired outputs.
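To make the neuron-as-logic-gate idea concrete, here is a minimal Python sketch of a threshold unit. The weights and thresholds are hand-picked for illustration rather than learned, and the function names are hypothetical. A single unit can act as AND, OR, or NOT, while XOR needs two layers of such units.

```python
def neuron(inputs, weights, threshold):
    """Fire (1) when the weighted sum of inputs reaches the threshold, else 0."""
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


def and_gate(a, b):
    return neuron([a, b], [1, 1], threshold=2)


def or_gate(a, b):
    return neuron([a, b], [1, 1], threshold=1)


def not_gate(a):
    return neuron([a], [-1], threshold=0)


def xor_gate(a, b):
    # Two layers: (a OR b) AND NOT (a AND b).
    return and_gate(or_gate(a, b), not_gate(and_gate(a, b)))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_gate(a, b))
```

The structure (which units feed which) and the priorities (weights and thresholds) are exactly what had to be figured out by hand here. Learning is the process of finding them from data instead.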
As long as we can represent input and output data in binary form, it’s possible to wire up circuits in the brain that map inputs to outputs. Learning is the process by which the brain builds that mapping. It’s often said, "If I can think it, I can learn it without physically doing it." While this is partially true—because you can mentally prepare the brain for learning—you still need physical interaction and stimuli to make the learning truly effective.
Did you know that in basic mathematics, the Weierstrass approximation theorem states that any continuous function defined on a closed interval [a, b] can be approximated as closely as desired by a polynomial function? Continuous functions are everywhere in the physical world.
Interestingly, neural networks can approximate polynomial functions too. In fact, neural networks are universal function approximators, meaning they can approximate a wide range of functions, including polynomials. The key to this capability lies in their architecture and the activation functions used within the network.
That’s quite remarkable—it sheds light on how we are able to learn and understand physics-based functions, which are often continuous by nature. It’s also important to note that this ability to learn implies we can do so within a finite amount of time.
Here is what we would need to consider:
Architecture: To approximate polynomial functions, you can use a feedforward neural network with one or more hidden layers. The number of neurons in each hidden layer and the overall depth of the network can be adjusted depending on the complexity of the polynomial you're trying to approximate.
Activation Functions: Common activation functions like the sigmoid can be applied in the hidden layers. These activation functions introduce non-linearity, enabling the network to model more complex relationships and behaviors.
Training: The network is trained using a dataset of input-output pairs, where the inputs are the values of the independent variable (e.g., x) and the outputs are the corresponding values of the polynomial function (e.g., f(x)). During training, the neural network adjusts its weights and biases to minimize the difference between its predicted outputs and the actual values in the dataset.
Performance Evaluation with Loss Functions: A typical loss function for regression tasks, such as approximating a polynomial, is mean squared error (MSE). MSE measures the average squared difference between the predicted and actual values, and its square root is analogous to the Root Mean Square (RMS) values used in electrical engineering. Just as RMS summarizes the power of a signal regardless of the sign of its values, MSE summarizes the overall error regardless of whether individual predictions are too high or too low. Because the errors are squared, large errors (outliers) weigh more heavily than small ones.
Optimization: Gradient-based optimization algorithms are employed to minimize the loss function and iteratively update the network’s parameters, improving the model’s performance with each iteration.
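The five considerations above can be sketched together in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production recipe: the target polynomial `x ** 2`, the four sigmoid hidden units, the learning rate, and the epoch count are arbitrary choices made for this example, and all names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(net, x):
    """Return the network output and hidden activations for a scalar input x."""
    hidden = [sigmoid(w * x + b) for w, b in zip(net["w1"], net["b1"])]
    output = sum(v * h for v, h in zip(net["w2"], hidden)) + net["b2"]
    return output, hidden


def mse(net, xs, ys):
    """Loss function: mean squared error over the dataset."""
    return sum((forward(net, x)[0] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, hidden_units=4, lr=0.05, epochs=2000, seed=0):
    # Architecture: 1 input -> hidden_units sigmoid neurons -> 1 linear output.
    rng = random.Random(seed)
    net = {
        "w1": [rng.uniform(-1, 1) for _ in range(hidden_units)],
        "b1": [rng.uniform(-1, 1) for _ in range(hidden_units)],
        "w2": [rng.uniform(-1, 1) for _ in range(hidden_units)],
        "b2": 0.0,
    }
    initial_loss = mse(net, xs, ys)
    n = len(xs)
    for _ in range(epochs):
        g_w1 = [0.0] * hidden_units
        g_b1 = [0.0] * hidden_units
        g_w2 = [0.0] * hidden_units
        g_b2 = 0.0
        # Accumulate the gradient of the MSE over the full dataset.
        for x, y in zip(xs, ys):
            y_hat, hidden = forward(net, x)
            d_out = 2.0 * (y_hat - y) / n  # d(MSE) / d(output)
            g_b2 += d_out
            for j, h in enumerate(hidden):
                g_w2[j] += d_out * h
                d_pre = d_out * net["w2"][j] * h * (1.0 - h)  # through the sigmoid
                g_w1[j] += d_pre * x
                g_b1[j] += d_pre
        # Optimization: one gradient descent step on every weight and bias.
        for j in range(hidden_units):
            net["w1"][j] -= lr * g_w1[j]
            net["b1"][j] -= lr * g_b1[j]
            net["w2"][j] -= lr * g_w2[j]
        net["b2"] -= lr * g_b2
    return net, initial_loss, mse(net, xs, ys)


# Training data: input-output pairs sampled from the polynomial f(x) = x^2.
xs = [i / 10.0 for i in range(-10, 11)]
ys = [x ** 2 for x in xs]
net, initial_loss, final_loss = train(xs, ys)
print(f"MSE before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

How closely the network ends up tracking the polynomial depends on the number of hidden units, the learning rate, and the number of epochs, which is the tuning trade-off discussed in the text.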
Do I Need More Neurons? The number of neurons in the human brain is relatively consistent across individuals. However, what can differ significantly among individuals is the efficiency and organization of the networks those neurons form. Highly intelligent people often exhibit more efficient and optimized neural connections, allowing for quicker and more effective processing of information. It's not about having more neurons, but rather how those neurons are connected and function.
Our cognitive abilities are shaped by our experiences. In fact, evolutionary biologists suggest that human intelligence evolved in response to the physical body's interactions with the environment. For instance, the ability to reach out and pluck a fruit—a seemingly simple task—requires a level of intelligence not found in many other species. This interaction between physical tasks and mental development highlights how intelligence may have evolved as a practical adaptation to our surroundings.
Modeling Human Memory: One way to think about human memory is as an implicit "lookup table." However, unlike a computer, the human brain doesn’t have dedicated "storage cells" to hold individual bits of information. Instead, we rely on neurons, and the information is encoded within neural networks. These "lookups" are essential components of the knowledge we use to navigate the real world. But building true intelligence requires much more than just machine learning.
Structural Layout of Neural Networks
One of the key findings about the brain is that what we perceive as different forms of "intelligence" has to do with the size and number of neurons, connections, and layers. Each of us is "intelligent" in a different way because our brains have figured out efficient ways of organizing the neurons and the flow of information. In fact, "insight" or "aha moments" are thought to be the conscious moments when we discover these pathways.
Many of our abilities—such as perception, movement (which we begin learning as toddlers), language, thought, and memory—are made possible by the interconnected network of both serial and parallel processors in distinct regions of the brain, each responsible for specific functions. If one area of the brain is damaged, you are not entirely impaired; the brain has the remarkable ability to reorganize its processing units to recover lost functions. However, this reorganization takes time and requires training, which is closely tied to motivation and encouragement.
As you add neural layers from the input to the output, you are creating higher-level abstractions. Each stage can be thought of as the input to the next stage. The last stage prior to the final output should have the highest abstraction. Layers add abstraction and refinement to the learning.
In the human brain, without training or inferencing, these links can weaken over time, and you forget what you have learned. This may not be a bad thing: as we all know, overcoming a bad experience often comes from giving it time, engaging in different activities, or even changing our environment.
Creativity and Gaps in Learning: One of the greatest challenges for renowned guitarists like David Gilmour, or even entire bands, is making each song and album sound distinct. The music is often tied to a specific period in the band’s history, and the songs tend to shape the guitar playing accordingly. Breaking free from what the band has already produced is extremely difficult. This likely explains why Pink Floyd took extended breaks between albums—they needed time to "reset" and allow their creative links to fade. In contrast, a band like U2, later in their career, began releasing albums almost annually, but the music started to blend together, leading to diminishing returns. Creativity thrives on long breaks, and it’s challenging to continuously generate fresh, groundbreaking ideas without giving the mind time to recharge.
Marketer-Machine Learning Analogies
If we consider how marketers learn, we can find many parallels with machine learning. While the fundamental process of learning is similar, the mechanisms used by humans are more adaptable due to evolution.
Currently, most machine learning algorithms require human involvement at various stages, such as data labeling, model selection, and evaluation. However, in the future, it's possible that these algorithms could bootstrap themselves, autonomously optimizing their learning processes through techniques like self-supervised learning or reinforcement learning, potentially reducing or eliminating the need for human intervention.
The learning techniques in the table below are not only relevant to marketing but also commonly applied in education, training, and personal development for acquiring new skills. The table lists common learning methods used by marketers and their corresponding machine learning techniques:
Technique | Human Learning | Machine Learning |
---|---|---|
Participative Learning | In marketing, participative learning involves engaging with customer insights actively, using feedback sessions, brainstorming, and A/B testing to refine strategies. Discussions and debates within teams lead to creative solutions and improvements in campaigns. | Active Learning refers to algorithms identifying the most valuable customer segments or touchpoints to focus on for better engagement, using minimal data points for maximum insight. These models prioritize which customer data to further analyze, improving targeting efficiency. |
Visual/Auditory/Lab Learning | Marketers absorb information through visual aids like customer journey maps, heatmaps, and campaign performance charts. Listening to customer feedback through interviews, surveys, and social media also drives understanding. Hands-on learning comes from testing campaign strategies in the market. | Computer Vision/NLP: Machine learning models process customer data from various media—analyzing images (e.g., product photos), speech (e.g., call center data), and text (e.g., reviews) to extract insights on customer preferences, sentiment, and behavior for optimized marketing strategies. |
Mnemonic Devices | Marketers use mnemonics or frameworks like the "4 Ps of Marketing" (Product, Price, Place, Promotion) to organize their strategies and recall best practices. These frameworks help marketers maintain consistency and effectiveness across campaigns. | Feature Engineering: In machine learning, marketers work with engineers to create features that make data more actionable. For example, segmenting customers based on engagement patterns or purchasing behavior allows for more targeted marketing efforts and better customer insights. |
Mind Mapping | Marketers use mind mapping to visually organize campaign ideas, segment customers, and brainstorm strategies for product launches or promotions. | Topic Modeling (LDA): Machine learning can analyze large sets of customer text data (e.g., social media or reviews) to identify key topics and themes, providing insights that help marketers organize strategies and content. |
Flashcards | Marketers use key statistics or insights (like flashcards) to quickly memorize important facts about customer segments, brand values, or trends that can be applied in campaigns. | Labeling: In machine learning, labeling data points allows models to understand what they represent (e.g., identifying customer behaviors or preferences), helping models learn effectively from labeled data. |
Collaborative Learning | In marketing, collaboration between teams (e.g., creative and analytics) helps solve problems, share insights, and improve strategies through diverse perspectives. | Ensemble Learning: Just as teams work together, ensemble learning combines multiple models to improve predictions in marketing, such as combining models for customer segmentation, churn prediction, or ad targeting for better performance. |
Gamification | In marketing, gamification involves using game-like elements (e.g., points, competitions, or rewards) in customer engagement strategies to increase motivation and interaction. Loyalty programs or contests are examples. | Reinforcement Learning: This can be used to optimize customer engagement by rewarding behaviors that lead to desired actions (e.g., purchases or loyalty). Models learn by experimenting with various tactics and adjusting based on results. |
Critical Thinking | Marketers analyze and evaluate data from campaigns to make informed decisions. They synthesize customer insights, feedback, and market trends to refine strategies and solve problems. | This is not ML but mostly AI. Expert Systems: AI systems in marketing can simulate expert problem-solving by analyzing customer data and making decisions on the best campaign strategies, content, or offers for specific audiences. |
Problem-Based | Marketers learn by addressing real-world challenges such as optimizing campaigns or solving customer pain points. This hands-on learning helps marketers enhance problem-solving skills. | Training Dataset: Machine learning models in marketing are trained on real-world customer data, learning from past behaviors to predict future actions and help optimize marketing strategies. |
Socratic Questioning | In marketing, asking open-ended questions can stimulate creative thinking and drive deeper exploration of customer needs, leading to more innovative solutions. | Data Exploration: In machine learning, exploring customer data involves asking open-ended questions to uncover trends, patterns, or insights that can guide strategic decisions and future campaigns. |
Feedback and Assessment | Marketing strategies are constantly refined based on customer feedback and assessment of campaign performance. This process helps identify areas for improvement. | Evaluation Metrics: In machine learning, you would assess model performance using metrics such as accuracy, recall, and precision. Feedback from the model's performance helps refine marketing campaigns. |
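The evaluation metrics named in the last row of the table can be computed directly from predicted and actual labels. Below is a minimal Python sketch. The churn labels are made up for illustration (1 = churned, 0 = stayed), and the function names are hypothetical rather than taken from any particular library.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Share of all predictions that were correct."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def precision(y_true, y_pred):
    """Of the customers predicted to churn, the share that actually churned."""
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """Of the customers who actually churned, the share the model caught."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up labels: 4 customers churned, 6 stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))   # 0.7
print(precision(y_true, y_pred))  # 2 of 3 predicted churners were real
print(recall(y_true, y_pred))     # 0.5
```

Here the model looks reasonable on accuracy but misses half of the customers who actually churned, which is why the choice of metric should follow from the use case.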
How do I Implement a Machine Learning (ML) Algorithm?
These are essentially the key steps in this process, and some of them are iterative:
Use Case Analysis: Truth be told, if you do not understand the goal of the algorithm you are implementing and what exactly you are looking for, you will design something that no one understands and that will not be useful.
Budget: You should ask questions about the budget. Remember that every prediction you make is a cost that has to be allocated to some business process. Budget will decide the complexity of the model, the size of the data, and how frequently you want to train.
Data Exploration: You need to understand the volume and variety of the data. More importantly, you have to get a feel for the structure of the data. Analysis is an absolute requirement - so improve your SQL and Python skills if you have not.
Feature Engineering: You will have to make choices on the attributes you care about and the dimensionality of your data. You will also need to represent the data in a different topological space as required by the algorithm or the nature of the data itself.
Test Drive Multiple ML Algorithms: You will need to sample and test multiple ML algorithms. You may need to start architecting a strategy as to whether you want to mix and match them.
Evaluation of performance metrics becomes absolutely critical at this step. It is standard practice now to split the data into training and test data.
Parameter Tuning: Your algorithm will be using parameters and you want to be able to tune them so that you get the best performance or get the desired behavior for your use case for the same performance.
Modeling Assumptions: Every algorithm you choose assumes certain facts are true about the input/output data and the model. You need to be able to articulate what those assumptions are.
Explainability: At this point, you should be able to articulate what the algorithm is doing, where it fails, and what it needs in order to become better.
Train and Optimize the Models in Production: Choose a state-of-the-art production system to do the training.
Model Inferencing: Use the trained model on the live data to make predictions.
Retrain the Model: Remember the heart of the algorithm is the data. You are always looking for data to train on that has the most impact on the performance in the future. The decision to retrain on the full data or the new data depends on many factors:
Drifts: The structure of the data or even the relationship between the features and the output may change over time. You will see this as a degradation in the model inferencing. If this happens, I would give more weight to the new data and decide when I want to retrain.
Volume of Data: Sometimes the volume of the data is so large that it may not be worth the spend to retrain on all of it. In that case, I may want to retrain on a sample instead.
Regulatory: Sometimes regulatory requirements such as data retention laws will require you to retrain. Here I do not have a choice.
Nature of the Algorithm: If the algorithm makes predictions by relying on interrelationships between data (in time and space), then you do not have a choice but to retrain.
Reality Check: Steps 3, 4, and 5 (Data Exploration, Feature Engineering, and Test Drive Multiple ML Algorithms) will take up about 80% of your time. It's often joked in machine learning circles that data scientists spend most of their time on BI and dashboards, leaving little room for "real" data science. The reality is that many algorithms and deployment tools are readily available off-the-shelf. What truly distinguishes you is your ability to apply these tools effectively, starting with a deep understanding of the data and developing a strong algorithmic strategy.
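As a minimal sketch of the "test drive" and evaluation steps above, the following splits a dataset into training and test data and compares two candidate models on the held-out test set. The data is synthetic and the helper names (train_test_split, fit_mean_baseline, fit_linear) are made up for illustration; a real project would more likely use an established ML library.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_mean_baseline(train):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_linear(train):
    """Candidate 2: one-feature least-squares line y = a * x + b."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def mse(model, rows):
    """Mean squared error of a fitted model on a list of (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Synthetic data (an assumption for illustration): y is roughly 3x + 5 plus noise.
noise = random.Random(42)
data = [(x, 3.0 * x + 5.0 + noise.gauss(0, 1.0)) for x in range(40)]

train, test = train_test_split(data)
candidates = {"mean_baseline": fit_mean_baseline, "linear": fit_linear}

# Fit each candidate on the training data only, then score it on the test data.
test_mse = {name: mse(fit(train), test) for name, fit in candidates.items()}
best = min(test_mse, key=test_mse.get)
print(test_mse, "best:", best)
```

Scoring on data the model never saw during fitting is what makes the comparison meaningful; scoring on the training data would favor whichever model memorizes best.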
Key Machine Learning Algorithms
Remember that any form of prediction is ultimately inductive reasoning, meaning that we are trying to generalize or predict based on specific observations or data/examples. Generalization is what we are looking for. If we cannot induce the prediction, then our learning has failed. Deductive reasoning goes the other way around: you start with the first principles, rules, and assumptions and build your way to the conclusion (which could be a generalization). It is very akin to proving a theorem in geometry.
A Comment on Supervision: You’ll often come across terms like supervised and unsupervised learning. Supervised learning means that the training data comes with labels, typically supplied by a human (like you or me). In contrast, unsupervised learning occurs when the algorithm identifies patterns or groupings on its own, without labels, as seen in clustering. Many techniques are actually semi-supervised, blending elements of both supervised and unsupervised approaches.
Most machine learning algorithms typically focus on either predicting numerical values (regression) or classifying data into categories (classification). However, some algorithms are designed for tasks like clustering, dimensionality reduction, or generating new data (generative models).
If you read machine learning books, you will come across some of these algorithms:
Category | Learner Example | What it does | Key Use Case |
---|---|---|---|
Regression | Linear and Nonlinear regression | Establish a linear or nonlinear relationship between the inputs and outputs. | Average spend modeled as a function of income, family size and geographic location. |
Regression | ARIMA (Time Series Forecasting) | Very popular in time series forecasting. It captures the relationship between a time series and its lagged values (Auto-Regressive). The "Integrated" component is the differencing needed to make the time series stationary. The "Moving Average" component models the current value on past forecast errors (residuals). | Segment trends, revenue trends, web traffic trends. |
Clustering | k-means | Key idea is to start with 'K' centroids and iteratively assign points to these K clusters and keep updating the centroids of the points assigned. | Customer segmentation, anomaly detection |
Classification | Support Vector Machines (SVM) | SVMs seek to find the optimal hyperplane that separates data points into different classes. SVMs are capable of fitting to nonlinear data, as you saw earlier in this section. SVMs use a clever technique in order to fit to nonlinear data: the kernel trick. A kernel is a mathematical construct that can “warp” the space where the data lives. The algorithm can then find a linear boundary in this warped space, making the boundary nonlinear in the original space. | Customer segmentation, anomaly detection |
Classification | K-Nearest Neighbors | Predict the value for a new data point based on the majority class (for classification) or the average (for regression) of nearby data points. | Location-Based Targeting |
Classification | Naive Bayes | The key idea is to assume that all features are independent of each other (the naive assumption) and then use the simplified conditional probabilities to classify a new event. | Email opt-out analysis |
Classification | Logistic Regression | This is very similar to numerical regression except that you have a discrete output variable. | Propensity modeling |
Classification | Old-Fashioned Decision Trees | Key idea is to partition the feature space so that you can localize the output possibilities. Very similar to nested IF THEN ELSE semantics. | Customer segmentation, lead scoring, market basket analysis, recommendations |
Classification | Random Forest | The idea is to combine the predictions of multiple decision trees (an ensemble) to make better predictions than a single decision tree. | Customer segmentation, lead scoring, market basket analysis, recommendations |
So Many Options, So Little Time: All of the machine learning algorithms mentioned are simply mathematical models representing real-world phenomena. As a result, many of them can be used to tackle the same set of problems. For instance, you could apply a deep learning algorithm to solve a basic linear regression task. However, the general rule of thumb is to opt for the simplest algorithm and prioritize gathering as much data as possible. In the long run, data volume trumps algorithmic complexity. In the world of AI, the real battle is always about the data, not the algorithm.
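To make one row of the table above concrete, here is a minimal sketch of the k-means idea: start with K centroids, assign each point to its nearest centroid, and keep updating the centroids until they stop moving. The 2-D points are made up for illustration, and the function name kmeans and its parameters are assumptions, not part of any particular library.

```python
import random

def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def assign(points, centroids):
    """Group the points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Made-up data: two visibly separate groups of customers (e.g. spend vs. visits).
low = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
high = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(low + high, k=2)
print(centroids)
```

On data this cleanly separated the two clusters recover the two groups; on real data the result depends on the starting centroids and on the choice of K, which is one reason heuristics and domain knowledge still matter.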
Heuristics: No matter which algorithm you use to generate results, applying heuristics or practical judgment is essential. This is crucial in machine learning, whether you're doing feature engineering or choosing an algorithm strategy. The core idea is to find solutions efficiently by using simple rules of thumb or shortcuts that draw on prior knowledge and experience. Keep in mind that this knowledge is often domain-specific and can vary based on the specific problem you're trying to solve.
Bagging or Bootstrapping
In the early days of machine learning, people would often advocate for their preferred algorithm. In new scenarios, they would compare multiple algorithms to find the best one. Nowadays, the standard approach is a "combo" technique known as "ensembles." The idea is similar to a team-based effort, where each model in the ensemble works together—either by covering each other's weaknesses or combining their outputs to achieve better results. Interestingly, this teamwork strategy, even with less individually powerful models, often outperforms using the best individual model alone. However, the challenge lies in determining, from a strategic perspective, which model should handle which part of the task. Two key concepts come into play here:
Bagging or Bootstrap Aggregating: In bagging, multiple instances of the same type of base model are trained in parallel, each on a different subset of the data created through random resampling, and their predictions are combined. As mentioned above, the Random Forest is a bagging ensemble of decision trees. Each decision tree is different and trained independently, and their predictions are aggregated to make the final prediction.
Boosting: In boosting, base models are trained sequentially, and each subsequent model is trained to correct the errors made by the previous ones. The base models in boosting can be different types of models, and they are weighted based on their performance. AdaBoost (Adaptive Boosting) can use a variety of base models, such as decision trees, linear models, or other classifiers, in a sequential manner.
Bootstrapping: You'll hear this term frequently in machine learning. The concept involves random sampling with replacement, meaning that each sample can be selected multiple times, while others may not be selected at all.
For example, suppose you have a list of numbers: [55, 56, 57, 58, 59, 60]. Your task is to create three samples of size 3 using the bootstrap method:
Sample 1 could be: [55, 55, 55]
Sample 2 could be: [55, 56, 57]
Sample 3 could be: [55, 59, 59]
In this case, some numbers, like 55, appeared multiple times, while others weren’t selected at all. This happens because we used the "with replacement" approach—after each selection, the chosen number (e.g., 55) is placed back into the pool, making it available for selection in subsequent draws.
Bootstrapping is particularly useful for small datasets since it allows you to generate as many samples as needed. It also avoids the need to make assumptions about the underlying distribution of the data, although it does assume that the observations are independent of each other. That independence assumption may not hold for time series data, where dependencies like seasonal fluctuations could exist.
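The sampling described above can be sketched in a few lines of Python using the standard library's random.choices, which draws with replacement. The seed value and the follow-on use (estimating how much the mean varies across resamples) are illustrative choices, not part of the original example.

```python
import random
import statistics

data = [55, 56, 57, 58, 59, 60]
rng = random.Random(7)  # fixed seed so the run is repeatable

# Three bootstrap samples of size 3, drawn with replacement:
# the same number can appear more than once within a sample.
samples = [rng.choices(data, k=3) for _ in range(3)]
for i, sample in enumerate(samples, start=1):
    print(f"Sample {i}: {sample}")

# A common use of bootstrapping: resample the full dataset many times
# and look at how much a statistic (here, the mean) varies across resamples.
boot_means = [statistics.mean(rng.choices(data, k=len(data))) for _ in range(1000)]
std_error = statistics.stdev(boot_means)
print(f"bootstrap standard error of the mean: {std_error:.2f}")
```

Because every draw is independent of the previous ones, this sketch carries the same independence assumption as the text; for time series data a different resampling scheme would be needed.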
Mix and Match Algorithms: In ensemble machine learning, the models you combine don’t need to be of the same type. One of the key benefits of ensemble methods is their ability to improve predictive performance by leveraging diverse models. Ensembles work by aggregating the predictions of multiple base models (often called "base learners" or "weak learners") to make a final prediction. These base models can vary in type and even use entirely different algorithms, allowing for a more robust and versatile approach to problem-solving.
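Here is a rough sketch of a bagging ensemble, assuming a deliberately weak base learner: a "decision stump" that picks a single threshold on one feature. Each stump is trained on its own bootstrap sample, and the ensemble predicts by majority vote. The toy data and the names fit_stump and fit_bagging are hypothetical, and for simplicity all base models here are of the same type.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Weak learner: choose the threshold on x that classifies the most rows correctly."""
    best_threshold, best_correct = None, -1
    for threshold, _ in rows:
        correct = sum((x >= threshold) == bool(label) for x, label in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return lambda x: int(x >= best_threshold)

def fit_bagging(rows, n_models=25, seed=0):
    """Train each stump on a bootstrap sample, then combine by majority vote."""
    rng = random.Random(seed)
    models = [fit_stump(rng.choices(rows, k=len(rows))) for _ in range(n_models)]

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Toy data (made up): the label is 1 for large x, with one noisy point at x = 5.
rows = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0),
        (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]

ensemble = fit_bagging(rows)
print([ensemble(x) for x in (1, 5.5, 10)])
```

Each stump sees a slightly different view of the data, so the stumps disagree near the noisy region and the vote smooths out their individual quirks. Swapping fit_stump for a mix of different learners would give the "mix and match" flavor described above.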
Performance Metrics
The choice of performance metrics is determined by the goals of the specific machine learning task and the preferences of the data scientist. Here are the typical steps involved in selecting performance metrics for a machine learning algorithm:
Business Objectives: First and foremost, the metrics should align with the broader business objectives. For example, in an e-commerce recommendation system, the goal might be to maximize conversion rates. This must be weighed against the cost of developing and maintaining the algorithm and data.
Type of Problem: The second step is to define the ML problem. Different problems require different metrics. For example:
Regression: For regression tasks (e.g., predicting house prices), common metrics include mean squared error (MSE) and R-squared (R²)
Classification: If you are solving a classification problem (e.g., spam detection), you might use metrics like accuracy, precision, recall, F1-score, and ROC AUC.
Clustering: Clustering algorithms might use metrics like silhouette score or Davies-Bouldin index.
Industry Specific: Consider industry-specific knowledge and constraints. Some metrics may be more relevant or meaningful in certain domains. For instance, in medical diagnosis, sensitivity and specificity may be crucial.
Data Characteristics: The nature of your data can influence metric selection. Noisy data might need robust metrics.
Be cautious of imbalanced data. This occurs when one class is significantly underrepresented compared to another, which can create challenges during model training and evaluation due to skewed class distributions. In contrast, balanced data provides a more even distribution across classes, making it easier to train models and evaluate their performance accurately. If avoiding imbalanced data isn't possible, you may need to use alternative performance metrics to properly assess your model's effectiveness.
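A small illustration of why imbalanced data calls for alternative metrics, using hypothetical labels: with 95 negatives and 5 positives, a "model" that always predicts the majority class looks excellent on accuracy while finding none of the positives.

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
actual = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
predicted = [0] * 100

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)  # share of actual positives that were found

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.95, recall=0.00
```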
Linear/Nonlinear Numerical Regression
Some common metrics used are Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²).
Mean Absolute Error (MAE): MAE shows how much the model’s predictions differ from the actual values, on average. A smaller MAE means the predictions are generally closer to the real values. It gives an idea of the "average" error the model makes, without worrying whether it predicted too high or too low.
Mean Squared Error (MSE): MSE measures how far the predicted values are from the actual values, but it squares the differences, so bigger mistakes have a much larger impact. This makes MSE especially useful when you want to place more importance on larger errors, making sure the model doesn’t make any big mistakes.
R-squared (R²): Also known as the Coefficient of Determination, R² shows how well a regression model fits the actual data. It tells you how much of the variation in the outcome (the thing you're trying to predict) is explained by the input variables (the things you're using to make the prediction). In simple terms, it helps you understand how well your model matches the real data points.
Root Mean Squared Error (RMSE): RMSE is the square root of MSE. Like MSE, it penalizes larger errors more heavily, but because it is expressed in the same units as the target variable, it is easier to interpret.
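The four metrics above can be computed directly from their standard definitions. The sketch below uses made-up actual and predicted values; the function name regression_metrics is just a label for this example.

```python
import math

def regression_metrics(actual, predicted):
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n   # average size of the error
    mse = sum(e * e for e in errors) / n    # squaring amplifies big errors
    rmse = math.sqrt(mse)                   # back in the units of the target

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation
    r2 = 1 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up values, e.g. average spend in dollars.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 200.0, 280.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

In this example the errors are -10, 10, 0, and -30. The MAE is 12.5, while the MSE is 275 because the single error of 30 contributes 900 of the 1,100 total squared error, which shows how MSE and RMSE emphasize the largest mistakes.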
Classification Algorithms
Let’s frame this section in terms of taking a test with a mix of easy and hard questions. As the test taker, you don’t know in advance which questions are easy and which are hard—you have to make educated guesses and decide where to focus your time. You can’t answer every question, so you need to choose a strategy. Let’s say your strategy is to go after the hard questions because that’s where your strengths lie. The key metrics you’ll use to evaluate your performance on this test are similar to how machine learning models are assessed: Accuracy, Precision, Recall, F1-Score, and ROC AUC.
Just as strategy plays a vital role in balancing speed versus stability for something like a self-balancing robot, your approach here balances several factors: how well you find the hard questions (recall), avoid unnecessary mistakes (precision), and how close your overall performance is to reality (accuracy). These metrics help you evaluate whether your test-taking approach is effective, much like they help measure the performance of machine learning models.
Accuracy: Accuracy represents your overall performance on the test. It’s like getting a percentage score based on how many questions you answered correctly out of the total attempted. However, accuracy can be misleading if the test is imbalanced—for instance, if there are mostly easy questions and just a few difficult ones. Even if you get all the easy ones right, your accuracy might still look high, but you may have missed the more challenging, critical questions. Remember, this metric doesn't penalize you for questions that you left unattempted.
Precision: Precision is about avoiding mistakes, or false positives. In this test analogy, it’s like ensuring you only answer the hard questions and don’t waste time on the easy ones. Precision measures how well you stayed focused on answering the difficult questions correctly without mistakenly tackling the easy ones. If you end up answering questions you shouldn’t have, your precision drops. If you didn’t attempt certain questions, they aren't factored into precision.
Recall (Sensitivity) or the True Positive Rate: In terms of our test analogy, recall is like your ability to find and answer all the difficult questions (true positives). If you miss some of these tough questions (false negatives), your recall suffers. High recall means you've successfully identified and attempted most of the hard questions, which is key if your goal is to tackle the challenging ones. Note that this metric does capture the unattempted ones as your failure to identify them is a recall failure.
Fall-Out (False Positive Rate): This refers to how often you mistakenly answered easy questions (false positives) when your goal was to focus on the hard ones. It measures the proportion of easy questions you incorrectly attempted out of all the easy questions on the test.
ROC AUC (Receiver Operating Characteristic - Area Under the Curve) helps evaluate how well you balance answering the difficult questions (high recall) while avoiding the easy ones (low fall-out) in our test analogy. The ROC curve plots Recall (True Positive Rate) against Fall-Out (False Positive Rate) at different decision points, or thresholds, representing how you performed under varying strategies. At the end of the day, a higher AUC combined with a high true positive rate (recall) means you are good at answering the important, difficult questions while staying focused and avoiding unnecessary errors with the easier ones. This makes AUC a comprehensive measure of how well your test-taking strategy worked.
Unwanted/Easy Questions as False Positives: The "easy" questions that you mistakenly answered are your false positives. If you answer many easy questions that you weren’t supposed to, especially if you get them wrong, your precision decreases because you've deviated from your focus on the difficult ones.
In multi-class classification, precision is typically calculated independently for each class, and then you might calculate an average precision across all classes if needed.
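Outside the test analogy, these metrics are usually computed from the counts of true positives, false positives, false negatives, and true negatives. The sketch below uses the standard definitions on made-up labels and scores, where 1 stands for a "hard" question and a score of 0.5 or more counts as a positive prediction; the threshold and all names are assumptions for illustration. The ROC AUC is computed as the chance that a randomly chosen positive gets a higher score than a randomly chosen negative, which is equivalent to the area under the ROC curve.

```python
def confusion_counts(actual, predicted):
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    return tp, fp, fn, tn

def classification_metrics(actual, predicted):
    # Note: this sketch does not guard against division by zero
    # (e.g. when the model predicts no positives at all).
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "fall_out": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }

def roc_auc(actual, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 3 positives and 5 negatives with model scores.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
predicted = [int(s >= 0.5) for s in scores]

results = classification_metrics(actual, predicted)
auc = roc_auc(actual, scores)
print(results, f"auc={auc:.3f}")
```

With this data there are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so accuracy is 0.75, precision and recall are both 2/3, and fall-out is 0.2. Moving the 0.5 threshold changes precision, recall, and fall-out, but not the AUC, which summarizes performance across all thresholds.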