EXPLORE 500: Incremental Data Extraction with Data Distiller Cursors
Learn to Navigate Data Efficiently with Incremental Extraction Using Data Distiller Cursors
Prerequisites
Load the sample dataset using the following tutorial:
PREP 500: Ingesting CSV Data into Adobe Experience Platform
Case Study Overview
The dataset is very simple and has 5000 records, with three columns:
id: Unique identifier for each entry.
name: Name of the individual or entity.
value: Numeric value associated with each entry.
We’ll be using the Data Distiller Query Pro Mode Editor, a web-based editor that lets you view the results of a query with a limited set of row-display options: 50, 100, 150, 300, or 500. While this may seem limiting, it’s important to remember that the editor operates within your browser, which has memory constraints for displaying results.
If you use a dedicated client installed locally on your machine, such as DbVisualizer, you can handle much larger result sets, up to 100,000 rows or more. However, even with local clients, you’ll eventually hit application memory limits.
This brings us to an interesting challenge: how can we efficiently paginate through the result set when the client editor imposes a limit on how much data can be displayed at once?
The answer is Data Distiller Cursors.

What is a Data Distiller Cursor?
A cursor in Data Distiller is a database object used to retrieve, manipulate, and traverse through a set of rows returned by a query one row at a time. Cursors are particularly useful when you need to process each row individually, allowing for row-by-row processing that can be controlled programmatically.
Sequential Processing: Cursors allow you to sequentially access rows in a result set, which is helpful when you need to handle each row individually.
Row-by-Row Operations: Unlike standard SQL, which typically processes whole result sets, a cursor can fetch a limited number of rows at a time, letting you work with rows individually in a controlled manner.
Memory Efficiency: When working with very large datasets, fetching rows in smaller chunks (instead of loading all at once) can help manage memory usage and improve performance.
Common Uses of Cursors
Batch Processing: When you need to process rows in smaller batches, especially with large datasets (see the sketch after this list).
Row-by-Row Operations: For complex operations that require checking or modifying each row one at a time.
Data Migration or Transformation: Cursors can be helpful when copying data from one table to another while applying transformations.
Procedural Logic: Used in stored procedures or scripts where specific row-based logic or conditions need to be applied iteratively.
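To make the batch-processing and transformation use cases concrete, here is a minimal, self-contained sketch. It uses Python's built-in sqlite3 module so it runs anywhere without credentials; the table, data, and per-batch transformation are hypothetical stand-ins, but the fetchmany loop is the same pattern you would use against Data Distiller through a client library such as psycopg2 (see the appendix).
import sqlite3

# Self-contained demo: an in-memory table standing in for a large dataset
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_dataset (id INTEGER, name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sample_dataset VALUES (?, ?, ?)",
    [(i, f"entry_{i}", i * 1.5) for i in range(1, 5001)],
)

cursor = conn.cursor()
cursor.execute("SELECT id, name, value FROM sample_dataset")

chunk_size = 500
batch_number = 0
while True:
    # fetchmany keeps only one chunk in application memory at a time
    chunk = cursor.fetchmany(chunk_size)
    if not chunk:
        break
    batch_number += 1
    # Apply a hypothetical per-batch transformation, e.g. doubling `value`
    transformed = [(row_id, name, value * 2) for row_id, name, value in chunk]
    print(f"Batch {batch_number}: processed {len(transformed)} rows")

cursor.close()
conn.close()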
How a Data Distiller Cursor Works
DECLARE: Defines the cursor data_cursor with a specific SQL query.
FETCH: Retrieves a specified number of rows (e.g., 5) from the cursor.
CLOSE: Releases the cursor when no longer needed.
This process is especially valuable when working with large datasets, as it helps control memory usage by processing smaller chunks of data at a time.
Declare a Data Distiller Cursor
Before you start, you need to open the Data Distiller Query Pro Mode Editor by navigating to AEP UI->Queries->Create Query.
Make sure you set the Show Results option to 500, as we will use this limit to paginate through the results.
We will declare a cursor to select all rows from the sample_dataset table. This cursor will allow us to retrieve a limited number of rows at a time.
-- Declare the cursor
DECLARE data_cursor CURSOR FOR
SELECT id, name, value FROM sample_dataset;

Once you've declared the cursor, it is ready for fetching rows.
Fetch Rows Using the Cursor
You can now fetch rows in batches of 500. This is particularly useful when you're working with large datasets and want to process data in smaller chunks.
-- Fetch the first 500 rows
FETCH 500 FROM data_cursor;
You should see the first 500 rows; observe the values in the id column.

Now let's fetch the next 500 rows and observe how the id column advances:
-- To fetch the next 500 rows, repeat this command
FETCH 500 FROM data_cursor;

Close the Cursor
After you’ve fetched all the rows you need, close the cursor to free up resources.
-- Close the cursor when done
CLOSE data_cursor;

Here is the entire script, for your reference and for reuse as a template:
-- Declare the cursor
DECLARE data_cursor CURSOR FOR
SELECT id, name, value FROM sample_dataset;
-- Fetch the first 500 rows
FETCH 500 FROM data_cursor;
-- To fetch the next 500 rows, repeat this command
FETCH 500 FROM data_cursor;
-- Close the cursor when done
CLOSE data_cursor;
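If you'd rather drive the same DECLARE/FETCH/CLOSE loop from a script instead of the editor, the following is a minimal sketch using psycopg2 with the placeholder connection string from the appendix below (ZZZZ, YYYYY, and XXXXX are stand-ins for your own connection parameters). That Data Distiller accepts these cursor statements over a psycopg2 session exactly as the editor does is an assumption worth verifying in your environment.
import psycopg2

# Placeholder credentials, matching the appendix below -- replace with your own
conn = psycopg2.connect('''sslmode=require host=ZZZZ port=80 dbname=prod:all user=YYYYY@AdobeOrg password=XXXXX''')
cursor = conn.cursor()

# Declare the cursor, then page through it 500 rows at a time until exhausted
cursor.execute("DECLARE data_cursor CURSOR FOR SELECT id, name, value FROM sample_dataset;")
total_rows = 0
while True:
    cursor.execute("FETCH 500 FROM data_cursor;")
    rows = cursor.fetchall()
    if not rows:
        break  # FETCH returned nothing: the cursor is exhausted
    total_rows += len(rows)
    print(f"Fetched {len(rows)} rows (running total: {total_rows})")

# Release the cursor's resources when done
cursor.execute("CLOSE data_cursor;")
cursor.close()
conn.close()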
Appendix: Cursors in Python
The example in the tutorial below allows you to extract a chunk of rows at a time:
STATSML 100: Python & JupyterLab Setup for Data Distiller
If you wanted to persist the results from that tutorial incrementally:
import psycopg2
import csv

# Establish a connection to the database
conn = psycopg2.connect('''sslmode=require host=ZZZZ port=80 dbname=prod:all user=YYYYY@AdobeOrg password=XXXXX''')

# Create a cursor object for executing SQL commands
cursor = conn.cursor()

# Example query
query = "SELECT * FROM movie_data;"
cursor.execute(query)

# File to save the data
output_file = "movie_data.csv"

# Open the file in write mode
with open(output_file, mode='w', newline='', encoding='utf-8') as file:
    csv_writer = None  # Initialize the CSV writer variable
    chunk_size = 50

    while True:
        # Fetch the results in chunks
        chunk = cursor.fetchmany(chunk_size)

        # Break the while loop if there are no rows to be fetched
        if not chunk:
            break

        # Write the header row only once, when processing the first chunk
        if csv_writer is None:
            column_names = [desc[0] for desc in cursor.description]
            csv_writer = csv.writer(file)
            csv_writer.writerow(column_names)  # Write the header row

        # Write each row of the chunk to the CSV file
        csv_writer.writerows(chunk)

# Close the cursor and connection
cursor.close()
conn.close()

print(f"Data has been successfully written to {output_file}.")