STATSML 100: Python & JupyterLab Setup for Data Distiller
Learn how to setup Python and JupyterLab to connect to Data Distiller.
Last updated
Learn how to setup Python and JupyterLab to connect to Data Distiller.
Last updated
Even if you are a seasoned data scientist, it is recommended that you install DBVisualizer so that you can execute your queries directly on the Data Distiller database rather than having to query through python. The query strings you obtain from DBVisualizer can be pasted back into Python making the data access simpler.
Also, to execute the sample code, you will need a dataset in Adobe Experience Platform. Follow the steps to upload the dataset here:
Download the free version of Python and its associated libraries for your operating system here:
There is a wizard that will walk you through the steps of the installation here:
Remember that you will be using pip to download packages for your Python distribution as the use cases may require. It is not necessary that you have all the packages installed. You can do this anytime. pip (Package Installer for Python) is already included in the Anaconda distribution:
Follow the instructions here:
At the Terminal window on Mac or Command Prompt in Windows. Navigate to the directory where you want to start JupyterLab and then type:
It should look like this:
Warning: When you launch a Jupyterlab notebook, make sure you are in the directory where you can read and write files. So change your directory to that location and then launch it there.
In your notebook, you need to establish a connection. This is done using the following code:
The information for these fields will be available by navigating to Data Management->Queries->Credentials.
Click and copy the fields over to your connection strings.
Tip: When you copy and paste the password, it will be long - so do not panic!
Warning: Remember that <PASSWORD> above should be replaced by the string value of the password. Do not keep the < and >.
Execute the following code:
These are the results that you should see:
Note the following:
psycopg2
module is a popular PostgreSQL adapter for Python. It allows Python programs to interact with PostgreSQL databases, enabling you to perform various database operations such as connecting to a database, executing SQL queries, and more.
ZZZZ
is the name of the host
YYYY
is the username which is the same as the IMSOrg name
XXXX
is the password.
In the context of database programming, a "cursor" is an object that allows you to interact with the database and manage the execution of SQL queries. The line of code cursor = conn.cursor()
is used to create a cursor object associated with a database connection (conn
).
query = "SELECT * FROM movie_data;"
: This line prepares the SQL query to select all columns from the movie_data
table.
cursor.execute(query)
: This line executes the SQL query using the cursor. It sends the query to Data Distiller which begins processing the query and fetching the results.
Fetching results in chunks:
Note that the query is executed once at the beginning to retrieve all the results, and then you're fetching those results in chunks using fetchmany()
in the loop.
Keep in mind that fetchmany()
retrieves already-fetched rows from the cursor's internal buffer, so you're not re-executing the query with each call to fetchmany()
. You're simply fetching chunks of data that have already been retrieved from the database.
The loop starts, and chunk = cursor.fetchmany(chunk_size)
fetches a chunk of rows from the result set. The number of rows fetched is determined by the chunk_size
parameter.
If there are still rows left in the result set, the loop continues to the next iteration, fetching the next chunk.
If there are no more rows left to fetch, cursor.fetchmany()
returns an empty list, which leads to the condition if not chunk:
being satisfied.
The loop breaks and the fetching process stops.