
STATSML 101: Learn Basic Python Online

The goal of this module is to teach you basic Python so that you can understand any code that you come across.

Scenario

Remember, the goal of connecting to Data Distiller from Python is to extract a table for analysis. This table is typically a sample that will be stored as a "DataFrame" (a table within Python), and subsequent operations will operate on this local DataFrame. This is no different from downloading the results of a SQL query in Data Distiller as a local file and then reading it into Python. For our training, we will assume that you have extracted this table as a CSV file locally.
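The end-to-end idea can be sketched in a few lines. The column names and values below are hypothetical stand-ins for your exported file; in practice you would pass the real file path to pd.read_csv:

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV exported from Data Distiller;
# in practice you would pass a real file path, e.g. pd.read_csv('my_export.csv')
csv_text = io.StringIO(
    "id,gender\n"
    "1,Male\n"
    "2,Female\n"
    "3,Female\n"
)

df = pd.read_csv(csv_text)  # the local DataFrame that later steps operate on
print(df.shape)  # (3, 2): three rows, two columns
```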

Create an Account in Kaggle

To learn Python on the go, we will be using the notebook editor at Kaggle. All you need is an email address to log in.

Create a New Notebook. You can also add a New Dataset.

Upload Test Data

To upload the data, click on the "+" on the Homepage and "New Dataset". Upload the dataset from your local machine and name it.

Create a New Notebook

Make sure you name the notebook and also add the dataset you uploaded.

Add Python101Data data source along with the CSV data to your notebook.

To find the path to this file, you will need to click through on the datasets and see the path:

Make sure you click on the data source and access the CSV file to get the full path.

What is Pandas?

The word "pandas" originated from "panel data", a term used for describing multi-dimensional data that vary over time.

Pandas is perhaps the most important library as far as we are concerned, as it allows for data manipulation and analysis in the Python programming language.

The key features that we will be using:

  1. Data Structures: Pandas introduces two main data structures, the Series and the DataFrame. A Series is essentially a one-dimensional array with labeled indices, while a DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or a SQL table.

  2. SQL-Like Operations:

    1. Much like SQL engines, Pandas provides powerful tools for manipulating and transforming data. You can filter, sort, aggregate, and reshape data easily.

    2. It has functions to handle missing data, duplicate values, and data type conversions, making data-cleaning tasks more manageable.

    3. You can combine multiple data sources through operations like merging, joining, and concatenating DataFrames.
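As a minimal sketch of these two structures and the SQL-like operations, using a small hand-built table rather than our CSV:

```python
import pandas as pd

# A Series: a one-dimensional array with labeled indices
ages = pd.Series([25, 32, 41], index=["alice", "bob", "carol"])

# A DataFrame: a two-dimensional table, like a SQL table
df = pd.DataFrame({"name": ["alice", "bob", "carol"], "age": [25, 32, 41]})

over_30 = df[df["age"] > 30]    # like: SELECT * FROM df WHERE age > 30
by_age = df.sort_values("age")  # like: ORDER BY age
avg_age = df["age"].mean()      # like: SELECT AVG(age) FROM df

print(over_30)
print(avg_age)
```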

If you need to use Pandas locally, you'll need to install it first. You can install it using the following command in your Python environment:

pip install pandas

Tip: SQLAlchemy is similar in functionality to Pandas but is a library geared more toward SQL users with object-centric thinking, where SQL constructs like TABLE are first-class citizens. Even if you love SQL, Pandas is important for you to learn.

Create a DataFrame

Execute the following piece of code and ensure that the CSV file path is specified correctly, as mentioned earlier.

import pandas as pd

# Create a variable by reading the CSV file
data = pd.read_csv('Python101Data')
# Create a DataFrame
df = pd.DataFrame(data)

# Print the full contents of the DataFrame
print(df)

You should see the results as:

print(df) summarizes the results much like a SELECT * on a table.

Note the following:

  • import pandas: This part indicates that you want to use the functionality provided by the Pandas library in your program.

  • as pd: This part gives an alias to the imported library. By using pd, you're saying that instead of typing out "pandas" every time you want to use a Pandas function, you can use the shorter alias "pd."

  • DataFrame: As mentioned earlier, this is a class provided by the Pandas library that represents a tabular data structure, similar to a table in a database or a spreadsheet.

  • data: This is a variable that holds the data used to create the DataFrame. Here, pd.read_csv already returns a DataFrame, so wrapping it in pd.DataFrame is redundant but harmless. When building a DataFrame by hand, data is usually a dictionary whose keys become the column names and whose values become the data for each column.

  • print(df) will display the entire DataFrame by default if it's not too large. However, if the DataFrame is large, Pandas will display a summarized view, showing only a subset of rows and columns with an ellipsis (...) to indicate that there's more data not shown.

Show Statistics

Let us now execute:

print(df.describe())
Descriptive statistics of the DataFrame

The only column that will have statistics is the id column.

df.describe() will generate statistics on the numerical columns of the DataFrame. This is very similar to the ANALYZE TABLE command for computing statistics in Data Distiller.

  • count: The number of non-null values in each column.

  • mean: The mean (average) value of each column.

  • std: The standard deviation, which measures the dispersion of values around the mean.

  • min: The minimum value in each column.

  • 25%: The 25th percentile (also known as the first quartile).

  • 50%: The 50th percentile (also known as the median).

  • 75%: The 75th percentile (also known as the third quartile).

  • max: The maximum value in each column.
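On a tiny hand-built DataFrame (the values here are hypothetical), you can see that describe() summarizes only the numeric column by default and returns a DataFrame of its own, indexed by the statistics listed above:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Other"],
})

stats = df.describe()  # only the numeric 'id' column is summarized by default
print(stats)

# stats is indexed by count, mean, std, min, 25%, 50%, 75%, max
print(stats.loc["mean", "id"])  # 2.5
```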

Preview the DataFrame

Let us try to preview the first 10 rows by executing:

print(df.head(10))
Preview the first 10 rows. We can change the parameter from 10 to a higher or lower number.

Aggregation Functions

Let us count the number of each gender type in the population:

grouped_gender = df.groupby('gender').count()
print(grouped_gender)

Remember that grouped_gender is a DataFrame. When you use the groupby() function and then apply an aggregation function like count(), it returns a DataFrame with the counts of occurrences for each gender type. The above code is very similar to an aggregation COUNT with GROUP BY in SQL.

The answer that you will get should look like this:

Various types of gender present in the dataset.

Other functions that you could have used in place of count() are sum(), mean(), std(), var(), min(), max(), and median().
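For instance, swapping count() for mean() computes the average of a numeric column per group, much like AVG with GROUP BY in SQL. The purchase figures below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "YearlyPurchases": [1000, 4000, 2000, 3000],
})

# Same GROUP BY, different aggregate: AVG instead of COUNT
mean_purchases = df.groupby("gender")["YearlyPurchases"].mean()
print(mean_purchases)
# Female: (4000 + 2000) / 2 = 3000.0
# Male:   (1000 + 3000) / 2 = 2000.0
```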

Define a Function

Let us create a function that computes the percentage of total for all the gender types:

# Define the function
def percent_of_total(column):
    return 100 * column / column.sum()

# Apply the function to the grouped DataFrame
percent_of_total_gender = percent_of_total(grouped_gender)
print(percent_of_total_gender)

Note the following:

  1. The hash sign # is used to create comments.

  2. Note that the def line ends with a colon, not a semicolon.

  3. return should be indented properly.

  4. The function percent_of_total operates element-wise: each value in the column is divided by the column's total.

  5. percent_of_total_gender is also a DataFrame as will be obvious from the answers below.

The answers you will get will look like this:

Percentage of total computation.

To just retrieve a single column, let us use:

print(percent_of_total_gender['id'])

This gives

Percent of total result

Alternatively, we could have created a Series object instead of a DataFrame for percent_of_total_gender:

percent_of_total_gender = percent_of_total(grouped_gender['id'])
print(percent_of_total_gender)

And that would give us the exact same answer.
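A quick way to convince yourself of the equivalence is to apply the function to a hand-built Series (the counts here are hypothetical) and check that the percentages sum to 100:

```python
import pandas as pd

def percent_of_total(column):
    return 100 * column / column.sum()

# Hypothetical per-gender counts, as groupby().count() would produce
counts = pd.Series([2, 3, 5], index=["Female", "Male", "Other"], name="id")

pct = percent_of_total(counts)
print(pct)        # 20.0, 30.0, 50.0 -- still a Series
print(pct.sum())  # 100.0
```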

Let us persist these results, which are a Series object, into a new DataFrame:

percent_of_total_df = percent_of_total_gender.to_frame(name='Percentage')
print(percent_of_total_df)

The results are:

New DataFrame object created from the Series object.

Generate a Randomized Yearly Purchase Column

We are going to emulate the random number generation as in the example here:

Also, let us take this new column and add it to the DataFrame. Let us execute this code:

import random

# Function to generate a random purchase amount; the row value passed in is ignored
def generate_random_purchases(column):
    return random.randint(1000, 10000)  # Modify the range as needed

# Apply the function to generate purchases for each row
df['YearlyPurchases'] = df['id'].apply(generate_random_purchases)

print(df)

The results show that a new column was added:

Adding a column to a DataFrame.

To learn more about the random library, read this.

Visualize the Results

Let us make our first foray into visualizing the histogram:

import matplotlib.pyplot as viz
viz.hist(df['gender'], bins=10, edgecolor='black')
viz.xlabel('Gender')
viz.ylabel('Frequency')
viz.title('Histogram of Gender')
viz.show()

matplotlib.pyplot is a visualization library in Python. Its interface was deliberately modeled on MATLAB's plot commands, which is why the two look so similar. To plot a chart like the histogram, you can use this site as a reference.

The code is no different from what we used for creating a DataFrame. You first initialize a handle on a library and then access the functions within that library. The function names should be self-explanatory as to what they do.

The results look like this:

Histogram of gender type.

The histogram looks messy so let us clean this up:

import matplotlib.pyplot as viz
viz.hist(df['gender'], bins=10, edgecolor='black')
viz.xlabel('Gender')
viz.ylabel('Frequency')
viz.title('Histogram of Gender')
viz.tick_params(axis='x', rotation=45)
viz.tight_layout()
viz.show()

We added two extra functions:

  • The viz.tick_params(axis='x', rotation=45) rotates the x-labels by 45 degrees

  • The viz.tight_layout() improves the spacing and layout of the plot elements to avoid overlapping.

(Extra Credit) Advanced Visualizations

There is one last thing we want to do here. What if we wanted to plot a histogram and a bar graph together at the same time?

The answer is that if you have ever used MATLAB, the following code will seem similar:

# Create a 1x2 grid of plots
fig, axes = viz.subplots(1, 2, figsize=(12, 5))

# Histogram Plot
axes[0].hist(df['gender'], bins=10, edgecolor='black', color='skyblue')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Histogram of Gender')
axes[0].tick_params(axis='x', rotation=45)


# Bar Plot
gender_counts = df['gender'].value_counts()
axes[1].bar(gender_counts.index, gender_counts.values, color='salmon')
axes[1].set_xlabel('Gender')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Bar Plot of Gender')
axes[1].tick_params(axis='x', rotation=45)

# Adjust layout for better spacing
fig.tight_layout()

#Display
viz.show()

The results will look like:

Plotting two different visualizations side by side.

Note the following in the code:

  1. The heart of this code is fig, axes = viz.subplots(1, 2, figsize=(12, 5)). Much like MATLAB, this function call creates a grid of subplots in a single figure.

    • 1: The number of rows in the grid.

    • 2: The number of columns in the grid.

    • figsize=(12, 5): This specifies the size of the entire figure in inches. (12, 5) means the figure will be 12 inches wide and 5 inches tall.

    • The function returns two objects:

      • fig: The figure object, which represents the entire figure.

      • axes: An array of subplot axes. With one row and two columns, it is a one-dimensional array of two axes, indexed as axes[0] and axes[1].

  2. fig.tight_layout() is called at the entire figure level rather than on individual charts. That is how this library has been designed.

(Extra Credit) Exploring the random library

Generating random data is a good skill to acquire especially in the world of data science. The random library in Python is a built-in module that provides functions for generating random numbers and performing various random operations.

Here are some of the key functions provided by the random library:

  1. Generating Random Numbers:

    • random.random(): Generates a random float in the interval [0.0, 1.0), i.e. 0 is included but 1 is excluded.

    • random.randint(a, b): Generates a random integer between a and b (inclusive).

    • random.uniform(a, b): Generates a random float between a and b.

  2. Generating Random Sequences:

    • random.choice(sequence): Returns a random element from the given sequence.

    • random.sample(sequence, k): Returns a list of k unique random elements from the sequence.

    • random.shuffle(sequence): Shuffles the elements in the sequence randomly.

  3. Random Selection:

    • random.choices(population, weights=None, k=1): Returns a list of k elements randomly selected from the population, possibly with specified weights.

  4. Randomness Simulation:

    • random.seed(a=None): Initializes the random number generator with a seed. Providing the same seed will produce the same sequence of random numbers.

    • Note that these functions generate pseudo-random numbers, which appear random but are actually determined by the initial state (seed) of the random number generator.

Here is some example code to try out:

import random

libraries = ["NumPy", "Pandas", "Matplotlib", "TensorFlow", "Scikit-learn", "PyTorch"]

# Choose a random library from the list
random_library_choice = random.choice(libraries)

# Choose 2 random libraries without replacement (no duplicates)
random_library_sequence = random.sample(libraries, 2)

# Shuffle the list of libraries in place
random.shuffle(libraries)
random_library_shuffle = libraries

print("Randomly selected library:", random_library_choice)
print("Randomly selected sequence of libraries:", random_library_sequence)
print("Shuffled list of libraries:", random_library_shuffle)

# Set the seed for reproducibility
seed_value = 23
# Initialize the random number generator
random.seed(seed_value)
# Generate random float between 0 and 1
random_numbers = [random.random() for i in range(1,10,1)]

# Print the values
print("Random numbers generated with seed", seed_value, ":", random_numbers)

Remember the syntax for a for loop:

for element in iterable:
    # Code block to be executed for each element
    # Indentation is crucial in Python to define the block of code inside the loop

You can also use a for loop with the range() function to iterate over a sequence of numbers:

for number in range(1, 10, 1):
    print(number)
    # Code to be executed in each iteration
    # The starting value is 1 of the sequence and it will be included
    # The ending value of the sequence is 10 and it will be excluded 
    # The step size between each number. It's optional; the default step is 1.

Appendix

Other Important Python Libraries

  1. Scientific

    1. NumPy: Similar to MATLAB. A fundamental package for scientific computing with Python. It provides support for arrays and matrices, along with mathematical functions to operate on these structures efficiently.

  2. Machine Learning

    • Scikit-learn: A machine learning library that provides simple and efficient tools for data mining and data analysis. It includes a wide variety of machine-learning algorithms and tools for tasks like classification, regression, clustering, and more.

    • TensorFlow: An open-source machine learning framework developed by Google. It's widely used for building and training deep learning models, especially neural networks.

    • PyTorch: Another popular open-source machine learning framework, developed by Facebook's AI Research lab. It's known for its dynamic computation graph and ease of use in building neural networks. It is very popular in the research community.

  3. SQLAlchemy: A SQL toolkit and Object-Relational Mapping (ORM) library for Python. It simplifies database interactions and allows you to work with databases in a more Pythonic way. This is required for Data Distiller.

  4. Requests: A simple and elegant HTTP library for making HTTP requests to interact with web services and APIs. This is useful for working with Adobe Experience Platform APIs.

Download Notebook Code
