STATSML 101: Learn Basic Python Online
The goal of this module is to teach you basic Python so that you can understand any code that you come across.
Remember, the goal of connecting to Data Distiller from Python is to extract a table for analysis. This table is typically a sample that is stored as a "DataFrame" (a table within Python), and subsequent operations operate on this local DataFrame. This is no different from downloading the results of a SQL query in Data Distiller as a local file and then reading it into Python. For our training, we will assume that you have extracted this table locally as a CSV file.
To learn Python on the go, we will be using the notebook editor at Kaggle. All you need is an email address to log in.
Warning: Do not upload any client or customer data to Kaggle, even if you are only sampling the data to prototype an algorithm. Kaggle is owned by Google and the data is kept on its cloud. Use Kaggle only for learning Python with example data. If you want to prototype with customer data, your best option is a local installation of Python with JupyterLab as the frontend UI. But make sure you know what the governance policies are in your organization or department.
To upload the data, click on the "+" on the homepage and select "New Dataset". Upload the dataset from your local machine and name it.
Make sure you name the notebook and also add the dataset you uploaded.
To find the path to this file, you will need to click through on the dataset and check the path:
Warning: As you go through this tutorial, you will need to copy and paste the code line by line into a notebook so that the code works correctly. I have intentionally made it this way so that you do not skip key learning points.
The word "pandas" originated from "panel data", a term used for describing multi-dimensional data that vary over time.
Pandas is perhaps the most important library as far as we are concerned, as it enables data manipulation and analysis in the Python programming language.
The key features that we will be using:
Data Structures: Pandas introduces two main data structures, the Series and the DataFrame. A Series is essentially a one-dimensional array with labeled indices, while a DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or a SQL table.
SQL-Like Operations:
Much like SQL engines, Pandas provides powerful tools for manipulating and transforming data. You can filter, sort, aggregate, and reshape data easily.
It has functions to handle missing data, duplicate values, and data type conversions, making data-cleaning tasks more manageable.
You can combine multiple data sources through operations like merging, joining, and concatenating DataFrames.
If you need to use Pandas locally, you'll need to install it first. You can install it using the following command in your Python environment:
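In a terminal (or prefixed with `!` in a notebook cell):

```
pip install pandas
```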
Tip: SQLAlchemy is similar in functionality to Pandas, but it is a library geared more toward SQL users with object-centric thinking, where SQL constructs like TABLE are first-class constructs. Even if you love SQL, Pandas is important for you to learn.
Execute the following piece of code, making sure that the CSV file path is specified correctly as mentioned earlier:
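A minimal sketch consistent with the explanation below; the dictionary values are invented sample data, and the CSV path is a placeholder for the Kaggle path you located above:

```python
import pandas as pd

# Build a small DataFrame from a dictionary (sample values are hypothetical)
data = {
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 35],
    'Gender': ['Female', 'Male', 'Female'],
}
df = pd.DataFrame(data)
print(df)

# Or load the CSV you uploaded (replace with your actual Kaggle path)
# df = pd.read_csv('/kaggle/input/your-dataset/your-file.csv')
```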
You should see the results as:
Note the following:
- `import pandas`: This part indicates that you want to use the functionality provided by the Pandas library in your program.
- `as pd`: This part gives an alias to the imported library. By using `pd`, you're saying that instead of typing out "pandas" every time you want to use a Pandas function, you can use the shorter alias "pd".
- `DataFrame`: As mentioned earlier, this is a class provided by the Pandas library that represents a tabular data structure, similar to a table in a database or a spreadsheet.
- `data`: This is a variable that holds the data you want to use to create the DataFrame. It's usually in the form of a dictionary, where the keys represent the column names and the values represent the data for each column. Here the keys are Name, Age, and Gender.
- `print(df)` will display the entire DataFrame by default if it's not too large. However, if the DataFrame is large, Pandas will display a summarized view, showing only a subset of rows and columns with an ellipsis (`...`) to indicate that there's more data not shown.
Let us now execute:
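That is, the `describe()` call explained below:

```python
# Generate summary statistics for the numerical columns
df.describe()
```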
The only column that will have statistics is the id column.
`df.describe()` will generate statistics on the numerical columns of the DataFrame. This is very similar to the ANALYZE TABLE command for computing statistics in Data Distiller.
- `count`: The number of non-null values in each column.
- `mean`: The mean (average) value of each column.
- `std`: The standard deviation, which measures the dispersion of values around the mean.
- `min`: The minimum value in each column.
- `25%`: The 25th percentile (also known as the first quartile).
- `50%`: The 50th percentile (also known as the median).
- `75%`: The 75th percentile (also known as the third quartile).
- `max`: The maximum value in each column.
Let us try to preview the first 10 rows by executing:
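That is, using `head()`:

```python
# Preview the first 10 rows of the DataFrame
df.head(10)
```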
Let us count the number of each gender type in the population:
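A sketch, assuming the gender column is named `Gender`:

```python
# Group rows by gender and count occurrences in each group
grouped_gender = df.groupby('Gender').count()
print(grouped_gender)
```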
Remember that `grouped_gender` is a DataFrame. When you use the `groupby()` function and then apply an aggregation function like `count()`, it returns a DataFrame with the counts of occurrences for each gender type. The above code is very similar to an aggregation `COUNT` with `GROUP BY` in SQL.
The answer that you will get should look like this:
Other functions that you could have used in place of `count()` are `sum()`, `mean()`, `std()`, `var()`, `min()`, `max()`, and `median()`.
Let us create a function that computes the percentage of total for all the gender types:
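A minimal sketch, assuming the total comes from the number of rows in `df`:

```python
# Convert a count into a percentage of the total number of rows
def percent_of_total(value):
    return value / len(df) * 100

# Apply the function to every element of the grouped counts
percent_of_total_gender = grouped_gender.apply(percent_of_total)
print(percent_of_total_gender)
```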
Note the following:
- The hash sign `#` is used to create comments.
- Note that the `def` line ends with a colon.
- `return` should be indented properly.
- The function `percent_of_total` is applied to each individual element in the column.
- `percent_of_total_gender` is also a DataFrame, as will be obvious from the answers below.
The answers you will get will look like this:
To just retrieve a single column, let us use:
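For example (the column name `Name` is an assumption; use any column present in your data):

```python
# Select a single column of the grouped result
print(percent_of_total_gender['Name'])
```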
This gives
Alternatively, we could have created a Series object instead of a DataFrame for `percent_of_total_gender`:
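A sketch: counting a single column returns a Series rather than a DataFrame (again, `Name` is an assumed column):

```python
# Counting one column yields a Series; apply() then works element by element
percent_of_total_gender = df.groupby('Gender')['Name'].count().apply(percent_of_total)
print(percent_of_total_gender)
```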
And that would give us the exact same answer.
Let us persist these results, which are a Series object, into a new DataFrame:
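One way to do this (the variable and column names are hypothetical):

```python
# to_frame() converts a Series into a single-column DataFrame
df_percent = percent_of_total_gender.to_frame(name='percent_of_total')
print(df_percent)
```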
The results are:
We are going to emulate the random number generation as in the example here:
Also, let us take this new column and add it to the DataFrame. Let us execute this code:
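A sketch, assuming one random integer per row and a hypothetical column name:

```python
import random

# Add a new column of random integers, one per row of the DataFrame
df['random_number'] = [random.randint(1, 100) for _ in range(len(df))]
print(df)
```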
The results show that a new column was added:
To learn more about the `random` library, read this.
Let us make our first foray into visualizing the histogram:
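A minimal sketch, assuming the hypothetical `random_number` column added above:

```python
import matplotlib.pyplot as viz

# Plot a histogram of the random numbers
viz.hist(df['random_number'], bins=10)
viz.xlabel('random_number')
viz.ylabel('Frequency')
viz.show()
```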
`matplotlib.pyplot` is a visualization library in Python. It is unfortunate that its name sounds very similar to MATLAB, which also has plot commands. To plot a chart like the histogram, you can use this site as a reference.
The code is no different from what we used for creating a DataFrame. You first initialize a handle on a library and then access the functions within that library. The function names should be self-explanatory as to what they do.
The results look like this:
The histogram looks messy so let us clean this up:
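The same histogram with the two clean-up calls discussed below:

```python
viz.hist(df['random_number'], bins=10)
viz.xlabel('random_number')
viz.ylabel('Frequency')
viz.tick_params(axis='x', rotation=45)  # rotate the x-labels by 45 degrees
viz.tight_layout()                      # improve spacing to avoid overlap
viz.show()
```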
We added two extra functions:
- `viz.tick_params(axis='x', rotation=45)` rotates the x-labels by 45 degrees.
- `viz.tight_layout()` improves the spacing and layout of the plot elements to avoid overlapping.
There is one last thing we want to do here. What if we wanted to plot a histogram and a bar graph together at the same time?
The answer is that if you have ever used MATLAB, the following code will seem similar:
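A sketch of side-by-side subplots, reusing the hypothetical `random_number` column and the gender percentages from earlier:

```python
fig, axes = viz.subplots(1, 2, figsize=(12, 5))

# Left panel: histogram of the random numbers
axes[0].hist(df['random_number'], bins=10)
axes[0].set_title('Histogram of random_number')

# Right panel: bar chart of the percent-of-total by gender
axes[1].bar(percent_of_total_gender.index, percent_of_total_gender.values)
axes[1].set_title('Percent of total by gender')

fig.tight_layout()
viz.show()
```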
The results will look like:
Note the following in the code:
- The heart of this code is `fig, axes = viz.subplots(1, 2, figsize=(12, 5))`. Much like MATLAB, this function call creates a grid of subplots in a single figure.
  - `1`: The number of rows in the grid.
  - `2`: The number of columns in the grid.
  - `figsize=(12, 5)`: This specifies the size of the entire figure in inches. `(12, 5)` means the figure will be 12 inches wide and 5 inches tall.
- The function returns two objects:
  - `fig`: The figure object, which represents the entire figure.
  - `axes`: An array of subplot axes. In this case, it's a 1x2 arrangement, meaning there's one row and two columns of subplot axes.
- `fig.tight_layout()` is called at the entire figure level rather than on individual charts. That is how this library has been designed.
The random library

Generating random data is a good skill to acquire, especially in the world of data science. The `random` library in Python is a built-in module that provides functions for generating random numbers and performing various random operations.

Here are some of the key functions provided by the `random` library:
Generating Random Numbers:
- `random.random()`: Generates a random float between 0 and 1.
- `random.randint(a, b)`: Generates a random integer between `a` and `b` (inclusive).
- `random.uniform(a, b)`: Generates a random float between `a` and `b`.

Generating Random Sequences:
- `random.choice(sequence)`: Returns a random element from the given sequence.
- `random.sample(sequence, k)`: Returns a list of `k` unique random elements from the sequence.
- `random.shuffle(sequence)`: Shuffles the elements in the sequence randomly.

Random Selection:
- `random.choices(population, weights=None, k=1)`: Returns a list of `k` elements randomly selected from the population, possibly with specified weights.

Randomness Simulation:
- `random.seed(a=None)`: Initializes the random number generator with a seed. Providing the same seed will produce the same sequence of random numbers.
These functions generate pseudo-random numbers, which appear random but are actually determined by an initial state (seed) of the random number generator.
Here is some example code to try out:
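A small, self-contained sketch exercising the functions above:

```python
import random

random.seed(42)                    # same seed -> same sequence of results
print(random.random())             # float between 0 and 1
print(random.randint(1, 6))        # integer between 1 and 6, inclusive
print(random.uniform(0, 10))       # float between 0 and 10

colors = ['red', 'green', 'blue']
print(random.choice(colors))       # one random element
print(random.sample(colors, 2))    # two unique random elements
print(random.choices(colors, weights=[5, 1, 1], k=4))  # weighted picks
random.shuffle(colors)             # shuffle the list in place
print(colors)
```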
Remember the syntax for a `for` loop:
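For example:

```python
# Iterate directly over the elements of a list
for color in ['red', 'green', 'blue']:
    print(color)
```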
You can also use a `for` loop with the `range()` function to iterate over a sequence of numbers:
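For example:

```python
# range(5) yields 0, 1, 2, 3, 4
for i in range(5):
    print(i)
```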
Scientific Computing

NumPy: Similar in spirit to MATLAB, this is a fundamental package for scientific computing with Python. It provides support for arrays and matrices, along with mathematical functions to operate on these structures efficiently.
Machine Learning
Scikit-learn: A machine learning library that provides simple and efficient tools for data mining and data analysis. It includes a wide variety of machine-learning algorithms and tools for tasks like classification, regression, clustering, and more.
TensorFlow: An open-source machine learning framework developed by Google. It's widely used for building and training deep learning models, especially neural networks.
PyTorch: Another popular open-source machine learning framework, developed by Facebook's AI Research lab. It's known for its dynamic computation graph and ease of use in building neural networks. It is very popular in the research community.
Databases and APIs

SQLAlchemy: A SQL toolkit and Object-Relational Mapping (ORM) library for Python. It simplifies database interactions and allows you to work with databases in a more Pythonic way. This is required for Data Distiller.

Requests: A simple and elegant HTTP library for making HTTP requests to interact with web services and APIs. This is useful for working with Adobe Experience Platform APIs.