
EXPLORE 500: Incremental Data Extraction with Data Distiller Cursors

Learn to Navigate Data Efficiently with Incremental Extraction Using Data Distiller Cursors

Prerequisites

Load the sample dataset using the following tutorial:

PREP 500: Ingesting CSV Data into Adobe Experience Platform

Case Study Overview

The dataset is very simple: 5,000 records with three columns.

  • id: Unique identifier for each entry.

  • name: Name of the individual or entity.

  • value: Numeric value associated with each entry.
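
To get a quick feel for the data before paginating through it, you can preview a handful of rows. This is a minimal sketch; it assumes the dataset was ingested as sample_dataset, the table name used in the cursor examples later in this tutorial.

-- Preview a few rows of the sample dataset
SELECT id, name, value FROM sample_dataset LIMIT 10;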

We’ll be using the Data Distiller Query Pro Mode Editor, a web-based editor that lets you view query results with a limited set of row-count options: 50, 100, 150, 300, or 500. While this may seem limiting, keep in mind that the editor runs in your browser, which has memory constraints for displaying results.

If you use a dedicated client installed locally on your machine, such as DbVisualizer, you can handle much larger result sets of 100,000 rows or more. However, even with local clients, you’ll eventually hit application memory limits.

This brings us to an interesting challenge: how can we efficiently paginate through the result set when the client editor imposes a limit on how much data can be displayed at once?

The answer is Data Distiller Cursors.

The Show Results setting limits the number of rows you can display in the final result.

What is a Data Distiller Cursor?

A cursor in Data Distiller is a database object used to retrieve, manipulate, and traverse through a set of rows returned by a query one row at a time. Cursors are particularly useful when you need to process each row individually, allowing for row-by-row processing that can be controlled programmatically.

  1. Sequential Processing: Cursors allow you to sequentially access rows in a result set, which is helpful when you need to handle each row individually.

  2. Row-by-Row Operations: Unlike standard SQL, which typically processes whole result sets, a cursor can fetch a limited number of rows at a time, letting you work with rows individually in a controlled manner.

  3. Memory Efficiency: When working with very large datasets, fetching rows in smaller chunks (instead of loading all at once) can help manage memory usage and improve performance.

Common Uses of Cursors

  • Batch Processing: When you need to process rows in smaller batches, especially with large datasets.

  • Row-by-Row Operations: For complex operations that require checking or modifying each row one at a time.

  • Data Migration or Transformation: Cursors can be helpful when copying data from one table to another while applying transformations.

  • Procedural Logic: Used in stored procedures or scripts where specific row-based logic or conditions need to be applied iteratively.

How a Data Distiller Cursor Works

  • DECLARE: Defines the cursor data_cursor with a specific SQL query.

  • FETCH: Retrieves a specified number of rows (e.g., 5) from the cursor.

  • CLOSE: Releases the cursor when no longer needed.

This process is especially valuable when working with large datasets, as it helps control memory usage by processing smaller chunks of data at a time.
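
Putting these three commands together, a minimal cursor session looks like the sketch below, fetching 5 rows at a time as in the description above. The full 500-row walkthrough follows in the next sections.

-- Declare a cursor over the sample dataset
DECLARE data_cursor CURSOR FOR
SELECT id, name, value FROM sample_dataset;

-- Retrieve the first 5 rows
FETCH 5 FROM data_cursor;

-- Release the cursor when no longer needed
CLOSE data_cursor;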

Declare a Data Distiller Cursor

Before you start, you need to open the Data Distiller Query Pro Mode Editor by navigating to

AEP UI->Queries->Create Query

Make sure you set the Show Results option to 500, as we will use this limit to paginate through the results.

We will declare a cursor that selects all rows from the sample_dataset table. This cursor will allow us to retrieve a limited number of rows at a time.

-- Declare the cursor
DECLARE data_cursor CURSOR FOR
SELECT id, name, value FROM sample_dataset;
Define the cursor.

Once you've declared the cursor, you can start fetching rows from it.

Fetch Rows Using the Cursor

You can now fetch rows in batches of 500 rows. This is particularly useful if you're working with large datasets and want to process data in smaller chunks.

-- Fetch the first 500 rows
FETCH 500 FROM data_cursor;

You should see the first 500 rows; note the values in the id column.

Show the first 500 rows

Now let us fetch the next 500 rows and observe how the id column advances. Each subsequent FETCH returns the next block of rows; with 5,000 records, repeating the command walks through the entire dataset, and a FETCH that returns no rows indicates you have reached the end of the result set.

-- To fetch the next 500 rows, repeat this command
FETCH 500 FROM data_cursor;
Extract the next 500 rows.

Close the Cursor

After you’ve fetched all the rows you need, close the cursor to free up resources.

-- Close the cursor when done
CLOSE data_cursor;
Close the cursor.

Here is the entire code for your reference, which you can also use as a template:

-- Declare the cursor
DECLARE data_cursor CURSOR FOR
SELECT id, name, value FROM sample_dataset;

-- Fetch the first 500 rows
FETCH 500 FROM data_cursor;

-- To fetch the next 500 rows, repeat this command
FETCH 500 FROM data_cursor;

-- Close the cursor when done
CLOSE data_cursor;

Appendix: Cursors in Python

The example in the tutorial below allows you to extract a chunk of rows at a time:

STATSML 100: Python & JupyterLab Setup for Data Distiller

If you want to persist the results from that tutorial incrementally to a CSV file:

import psycopg2
import csv

# Establish a connection to the database
conn = psycopg2.connect('''sslmode=require host=ZZZZ port=80 dbname=prod:all user=YYYYY@AdobeOrg password=XXXXX''')

# Create a cursor object for executing SQL commands
cursor = conn.cursor()

# Example query
query = "SELECT * FROM movie_data;"
cursor.execute(query)

# File to save the data
output_file = "movie_data.csv"

# Open the file in write mode
with open(output_file, mode='w', newline='', encoding='utf-8') as file:
    csv_writer = None  # Initialize the CSV writer variable
    chunk_size = 50

    while True:
        # Fetch the results in chunks
        chunk = cursor.fetchmany(chunk_size)

        # Break the while loop if there are no rows to be fetched
        if not chunk:
            break

        # Write the header row only once, when processing the first chunk
        if csv_writer is None:
            column_names = [desc[0] for desc in cursor.description]
            csv_writer = csv.writer(file)
            csv_writer.writerow(column_names)  # Write the header row

        # Write each row of the chunk to the CSV file
        csv_writer.writerows(chunk)

# Close the cursor and connection
cursor.close()
conn.close()

print(f"Data has been successfully written to {output_file}.")
