STATSML 100: Python & JupyterLab Setup for Data Distiller

Learn how to set up Python and JupyterLab to connect to Data Distiller.

Prerequisites

Even if you are a seasoned data scientist, it is recommended that you install DBVisualizer so that you can execute your queries directly against the Data Distiller database rather than having to query through Python. The query strings you develop in DBVisualizer can then be pasted back into Python, making data access simpler.

Also, to execute the sample code, you will need a dataset in Adobe Experience Platform. Follow the steps to upload the dataset here:

Install Python: Anaconda Distribution

Download the free version of Python and its associated libraries for your operating system here:

There is a wizard that will walk you through the steps of the installation here:

Remember that you will be using pip to install packages for your Python distribution as the use cases require. You do not need to have every package installed up front; you can install them at any time. pip (the Package Installer for Python) is already included in the Anaconda distribution:
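For example, the connection code later in this guide uses the psycopg2 adapter. If it is not already in your environment, a typical install from the Terminal or Anaconda Prompt looks like the following; psycopg2-binary is shown here as one common packaging of the adapter, not the only option:

pip install psycopg2-binary
pip show psycopg2-binary

The second command simply confirms that the package is visible to your active Python environment.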

Install JupyterLab as the Data Science Editor

Follow the instructions here:

Launch JupyterLab

Open a Terminal window on Mac or a Command Prompt on Windows, navigate to the directory where you want to start JupyterLab, and then type:

jupyter lab

It should look like this:

JupyterLab initial screen.
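Once JupyterLab is running, you can optionally confirm from a notebook cell that the psycopg2 adapter used in the next section imports cleanly. This is a minimal sanity check, not a required step:

# Optional check: confirm psycopg2 is importable from the notebook
import psycopg2
print(psycopg2.__version__)

If the import fails, install the package with pip as described above.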

Connect to Data Distiller

In your notebook, you need to establish a connection. This is done using the following code:

import psycopg2

# Establish a connection to Data Distiller; fill in the values from the Credentials UI
conn = psycopg2.connect('''sslmode=require host=<YOUR_HOST_CREDENTIAL> port=80 dbname=prod:all user=<USERNAME> password=<YOUR_PASSWORD>''')
  1. The information for these fields is available by navigating to Data Management->Queries->Credentials.

Access the Data Distiller credentials UI
  2. Click and copy the fields over to your connection string.

Map the fields as shown above.

Tip: When you copy and paste the password, it will be long, so do not panic!
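If you prefer not to hard-code the credentials in the notebook, one option is to read them from environment variables. The following is a minimal sketch, assuming you have exported DD_HOST, DD_USER, and DD_PASSWORD in your shell before launching JupyterLab; these variable names are placeholders you choose yourself, not values from the Data Distiller UI:

import os
import psycopg2

# Read the Data Distiller credentials from environment variables
# (DD_HOST, DD_USER, DD_PASSWORD are hypothetical names you set beforehand)
host = os.environ["DD_HOST"]
user = os.environ["DD_USER"]
password = os.environ["DD_PASSWORD"]

# Build the same connection string as above, keeping secrets out of the notebook
conn = psycopg2.connect(
    f"sslmode=require host={host} port=80 dbname=prod:all "
    f"user={user} password={password}"
)

Set the variables in the same Terminal session you use to launch jupyter lab so that the notebook process inherits them.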

Execute Sample Code

Execute the following code:

import psycopg2

# Establish a connection to the database
conn = psycopg2.connect('''sslmode=require host=ZZZZ port=80 dbname=prod:all user=YYYYY@AdobeOrg password=XXXXX''')

# Create a cursor object for executing SQL commands
cursor = conn.cursor()

# Example query
query = "SELECT * FROM movie_data;"
cursor.execute(query)

# Alternative: fetch all the results at once instead of in chunks
# results = cursor.fetchall()

# Fetch the results in chunks
chunk_size = 50
while True:
    # chunk is a list of up to 50 rows; each call to fetchmany() advances the cursor, it is not a running index
    chunk = cursor.fetchmany(chunk_size)
    # Break out of the while loop if there are no more rows to be fetched
    if not chunk:
        break
    # Print each row. print() ends its output with a newline by default.
    for row in chunk:
        print(row)

# Close the cursor and connection
cursor.close()
conn.close()

These are the results that you should see:

Fetching results

Note the following:

  1. The psycopg2 module is a popular PostgreSQL adapter for Python. It allows Python programs to interact with PostgreSQL databases, enabling you to connect to a database, execute SQL queries, and more.

  2. ZZZZ is the name of the host.

  3. YYYYY is the username, which is the same as the IMS Org name.

  4. XXXXX is the password.

  5. In the context of database programming, a "cursor" is an object that allows you to interact with the database and manage the execution of SQL queries. The line of code cursor = conn.cursor() is used to create a cursor object associated with a database connection (conn).

  6. query = "SELECT * FROM movie_data;": This line prepares the SQL query to select all columns from the movie_data table.

  7. cursor.execute(query): This line executes the SQL query using the cursor. It sends the query to Data Distiller, which begins processing the query and returning the results.

  8. Fetching results in chunks (an alternative using pandas is sketched after this list):

    • Note that the query is executed once at the beginning to retrieve all the results, and then you're fetching those results in chunks using fetchmany() in the loop.

    • Keep in mind that fetchmany() retrieves already-fetched rows from the cursor's internal buffer, so you're not re-executing the query with each call to fetchmany(). You're simply fetching chunks of data that have already been retrieved from the database.

    • The loop starts, and chunk = cursor.fetchmany(chunk_size) fetches a chunk of rows from the result set. The number of rows fetched is determined by the chunk_size parameter.

    • If there are still rows left in the result set, the loop continues to the next iteration, fetching the next chunk.

    • If there are no more rows left to fetch, cursor.fetchmany() returns an empty list, which leads to the condition if not chunk: being satisfied.

    • The loop breaks and the fetching process stops.
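If you would rather work with the results as a table than through a fetch loop, a common alternative is to load the query output into a pandas DataFrame. This is a minimal sketch outside the original walkthrough; it assumes pandas is installed and that conn is the open psycopg2 connection from the sample code:

import pandas as pd

# Load the query results directly into a DataFrame.
# pandas may warn that a raw psycopg2 connection is not a SQLAlchemy connectable,
# but the query still runs.
df = pd.read_sql("SELECT * FROM movie_data;", conn)

print(df.head())   # Inspect the first few rows
print(df.shape)    # (number of rows, number of columns)

For very large result sets, the chunked loop in the sample code is still the safer choice, since read_sql loads the entire result into memory at once.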
