
STATSML 200: Unlock Dataset Metadata Insights via Adobe Experience Platform APIs and Python

This chapter covers the essential steps for installing necessary libraries, generating access tokens, and making authenticated API requests.

Use Case Overview

Our objective is to create a snapshot of the datasets available in the Adobe Experience Platform (AEP) catalog and enrich this snapshot with additional dimensions such as the number of rows, dataset size, and whether the dataset is managed by the customer or the system. While some of this information is available through the AEP UI under the Datasets tab, our goal is to extract this data programmatically, store it in a data lake, and leverage Data Distiller for advanced slice-and-dice analysis.

By doing so, we will be able to track the growth of datasets over time, with a particular focus on those marked for the Profile Store and Identity Store. This use case falls under the category of metadata analytics, involving potentially hundreds or thousands of datasets. Although AEP exposes some of this metadata via system datasets, our requirements demand a customized view of the base dataset. In such scenarios, a Python-based solution to extract, format, and ingest the data into the platform becomes a highly effective approach.

REST APIs in the Platform

A fundamental aspect of Adobe Experience Platform's architecture is its use of modular services that interact through REST APIs to perform tasks. This service-oriented approach forms the backbone of the platform. So, if you're unable to find specific data or information in the UI or product documentation, the REST APIs should be your go-to resource. In essence, every UI feature and function is powered by underlying API calls.

Adobe Experience Platform APIs: This is the page you need to keep a watch on.

My approach to working with APIs begins by creating a Developer Project and obtaining the necessary credentials and IMS organization details. I then use Python to extract, process, and format the data. Finally, I leverage Python's PostgreSQL connector to write the data back into Adobe Experience Platform. I could use Postman as well, but having spent a significant amount of time working at MathWorks, I have a deep appreciation for both MATLAB and Python. This is why I tend to favor Python for my workflow; it's just a personal preference!

Python vs Postman: Choosing the Right Tool for REST API Interactions

Both Python and Postman offer powerful ways to interact with REST APIs, but they serve different purposes depending on the context. Postman is ideal for quick testing, debugging, and prototyping of API calls. It provides an intuitive user interface for crafting requests, viewing responses, and managing collections of API calls without needing to write code. This makes it particularly useful for developers during the early stages of API development or testing, as well as for non-developers who need to work with APIs. However, Python excels when you need to integrate API calls into a larger workflow, automate tasks, or manipulate the response data programmatically. With libraries like requests, Python enables greater flexibility and customization, allowing for complex logic, error handling, and scheduled tasks. The trade-off is that Python requires more setup and knowledge of coding, making it more suitable for repeatable, automated processes rather than ad-hoc testing. In summary, Postman shines in quick, manual interactions with APIs, while Python is the go-to for automation and advanced API workflows.

Prerequisites

STATSML 100: Python & JupyterLab Setup for Data Distiller

PREP 500: Ingesting CSV Data into Adobe Experience Platform

BI 200: Create Your First Data Model in the Data Distiller Warehouse for Dashboarding

Before You Create a Project

Reach out to an Admin Console administrator within your organization and request to be added as a developer to an Experience Platform product profile through the Admin Console. Also make sure that you are added as a user for a product profile. Without these permissions, you will not be able to create a project or move any further with the APIs.

Once you are done creating a project, you will be generating the following that you will use to gain access:

  1. {ACCESS_TOKEN}: A short-lived token used to authenticate API requests. It is obtained through an OAuth flow or Adobe's IMS (Identity Management System) and is required for authorization to interact with AEP services. The token expires after 24 hours, so each day you need to request a new one using your API credentials to continue making authenticated API requests to Adobe Experience Platform. It is generated after successful authentication using the Client ID, Client Secret, and Scopes, and is then sent with each API request to authorize the interaction.

  2. {API_KEY}: Also known as the {CLIENT ID}, this is a unique identifier assigned to your application when you create a Developer Project in Adobe Developer Console. It is used to authenticate and identify the application making the API requests.

  3. {CLIENT SECRET}: The Client Secret is used along with the Client ID to authenticate your application when requesting an access token. It acts as a password for your app and should be kept secure. When using OAuth Server-to-Server authentication, you need both the Client ID and Client Secret to securely generate the access token.

  4. {SCOPES}: Scopes define the permissions your application is requesting. They specify which Adobe services or APIs your access token can interact with. Without the appropriate scopes, your token may not have the necessary access to certain resources in the platform.

  5. {ORG_ID}: The Organization ID (IMS Organization ID) uniquely identifies your Adobe Experience Platform organization. This ID is required in API requests to specify which organization's data you are interacting with.

  6. {SANDBOX_NAME}: Some APIs require the sandbox name to identify the specific sandbox environment you are working in within Adobe Experience Platform. Sandboxes allow you to segment and manage different environments (such as development, staging, and production) to safely test, develop, and deploy features without affecting live data.

  7. {TECHNICAL_ACCOUNT_ID}: This ID belongs to a machine account (not a human user). It is used when your application needs to perform automated tasks, such as fetching data or executing processes in Adobe Experience Platform, without any user interaction.

You will first need to generate a token every 24 hours by using the Client ID, Client Secret, and Scopes.
Subsequent calls will use the access token and other parameters along with the API call to get the results. They no longer require the technical account ID or the client secret.

Stuff I Wish They Told Me

What is an access token really?

It grants the bearer permission to interact with the AEP services for a limited time (typically 24 hours). The token includes encoded information about the user or application, the permissions granted, and its expiration time.

In simple terms, the access token serves as a "proof of identity" that you or your application can present when calling AEP APIs, ensuring that only authorized users can perform actions or retrieve data from the platform.

Why do I need an API key if I have the access token?

When we are making API calls from Python (or any other client), both the API key and the access token play different roles:

  1. API Key Tracking

When you make API calls, the API key (also known as the Client ID) is sent along with your requests. This helps Adobe track which application is making the requests. The API key is typically passed in the headers or query parameters of each request. It allows Adobe to monitor usage, enforce rate limits, and tie your requests to the specific developer project associated with your API key.

  2. Access Token for Every API Request

The access token is used in every single API request to authenticate and authorize your actions. It's sent in the headers of your API calls, typically in the form of a "Bearer" token (a kind of token used in HTTP authentication). Without a valid access token, your API calls will be rejected because the platform needs to confirm your identity and the scope of permissions granted to you.
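To make the two roles concrete, here is a minimal sketch of the headers a Python client might send with each Platform API call. The header names `x-api-key`, `x-gw-ims-org-id`, and `x-sandbox-name` follow the convention used in Adobe's API documentation; the function name `build_headers` and the placeholder values are hypothetical.

```python
def build_headers(access_token, api_key, org_id, sandbox_name=None):
    """Return the HTTP headers for one authenticated Platform API request."""
    headers = {
        # The access token proves who you are and what you may do.
        "Authorization": f"Bearer {access_token}",
        # The API key (Client ID) identifies the calling application.
        "x-api-key": api_key,
        # The IMS Organization ID selects the organization's data.
        "x-gw-ims-org-id": org_id,
    }
    if sandbox_name:
        # Only some APIs need a sandbox; add it when one is given.
        headers["x-sandbox-name"] = sandbox_name
    return headers


# Placeholder values only; substitute the ones from your own project.
example = build_headers("<ACCESS_TOKEN>", "<API_KEY>", "<ORG_ID>", "prod")
print(sorted(example))
```

Keeping the headers in one helper means a refreshed access token only has to be swapped in at a single place.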

Wait, where are these API calls going in the Adobe system?

When you make API requests to Adobe Experience Platform (AEP) or even other Adobe services, those requests are routed through Adobe I/O, Adobe’s central API gateway.

Adobe I/O is not a traditional web server; it functions as an API gateway and integration platform, providing a centralized access point. It works with Adobe’s Identity Management System (IMS) to validate your API keys and access tokens, ensuring you have the proper permissions to access the service. Once authenticated, Adobe I/O directs your requests to the correct Adobe service, such as AEP or Analytics.

Additionally, Adobe I/O manages traffic by enforcing rate limits to ensure fair usage and protect Adobe’s infrastructure.

What is this OAuth Server-to-Server?

OAuth is a security protocol that lets you grant limited access to your data or services to an app (the Python client we will use) without giving away your personal login details. Instead of sharing your password, you authorize the app to get an access token, which the app then uses to make secure API requests. In the Server-to-Server variant, the app authenticates with its own client ID and client secret, so no user has to log in interactively. OAuth access tokens are short-lived and easily revoked, reducing the risk if a token is compromised.

Create a Developer Project

Here are the steps to create a project:

  1. In the Adobe Developer Console, you will be greeted by a screen that looks like this. Click on Create Project.

  2. The new Project screen will look like this:

New Project screen
  3. Add API and choose Adobe Experience Platform and Experience Platform API.

Choose the APIs
  4. Click on Next and you will see the following screen. Choose OAuth Server-to-Server.

Choose your authentication protocol
  5. Now you need to choose the product profiles. A product profile lets you define what products you have access to. The product profile is typically created by the administrator at https://adminconsole.adobe.com/.

Choosing a product profile

Previously, permissions for role-based access control on AEP objects, such as dataset access, were managed within the product profile. However, this has now changed with Permissions being introduced as a separate feature in the left navigation panel of the AEP UI. It's important to confirm with your admin that you've selected a product profile that grants you developer access to Adobe Experience Platform.

  6. Click Save configured API and you will see the following. There will be an API Key that you can copy. You will also get an option to Generate access token.

Generate the API key and the access token
  7. You can also edit the project if you would like to change the name or description.

Change the name or description if you need to.
  8. Also, if you clicked on OAuth Server-to-Server, you will be able to see new information such as CLIENT ID, CLIENT SECRET and SCOPES. Copy them. You will use this information to generate the access token every 24 hours to connect to the Platform APIs. Alternatively, you can log into this project every day and generate the access token. Check the SCOPES to make sure you have the right SCOPE for the service that you are querying.

On the project screen, you will get extra details on client ID, client secret and scopes.
  9. The API requests will be sent to an endpoint. To access that endpoint, click on the View cURL command.

Getting the endpoint where we will send the API requests to.
  10. If you scroll down, you will get additional details on ORG_ID and TECHNICAL_ACCOUNT_ID which you should copy. Note the Credential name. You will need this later in the Permissions UI.

More details available on scroll.

11. Go to the AEP UI or have the admin add a role with the right permissions to your Technical Account ID. Click on Permissions->Roles->Create Role

Click on Create Role
  12. Name the Role.

Name the role just to track it in the system
  13. Add the All permission in Data Management and make sure you choose the right Sandboxes. Also add all permissions for Sandbox Administration, Data Ingestion and Query Service.

Add all the permissions and the sandboxes
  14. Scroll to API Credentials. The credential name is the same one that we saw in the project. Click on the credential name from your project and go to Roles and add the role you created.

Add a role
Choose the role you created

Get the Access Token

  1. Let us write some Python code to do so. Copy and paste the following in JupyterLab and make sure you have all the parameters from the previous section.

  2. You should see a response that looks like this:

JupyterLab output
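A minimal sketch of such a token request, assuming the OAuth Server-to-Server (client credentials) flow described earlier. The endpoint URL below is an assumption based on Adobe's public IMS documentation; prefer the one shown by View cURL command in your own project. The function names and placeholder credentials are hypothetical, and only the standard library is used so the cell runs without extra installs.

```python
import json
import urllib.parse
import urllib.request

# Assumption: replace with the endpoint from "View cURL command" if it differs.
TOKEN_URL = "https://ims-na1.adobelogin.com/ims/token/v3"


def build_token_request(client_id, client_secret, scopes, token_url=TOKEN_URL):
    """Prepare (but do not send) the form-encoded POST that asks for a token."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scopes,  # comma-separated string copied from the project
    }).encode("utf-8")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_access_token(client_id, client_secret, scopes):
    """Send the request and return the access token from the JSON reply."""
    request = build_token_request(client_id, client_secret, scopes)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload["access_token"]


# Building the request needs no network access, so it is safe to inspect.
preview = build_token_request("<CLIENT_ID>", "<CLIENT_SECRET>", "<SCOPES>")
print(preview.get_method(), preview.full_url)
```

Calling `fetch_access_token` with real credentials performs the network call; rerun it whenever the 24-hour token has expired.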

Retrieve Datasets Information Across All Sandboxes

  1. We will use the Sandbox Management APIs to retrieve the list of sandboxes. Then we will loop through these sandboxes and make calls to the Catalog API to get the dataset information.

  2. In a separate cell, copy and paste the following. Make sure the cell from the previous section that defines the header variables has already been executed.

  3. You will see output in your Python environment that looks like this:

  4. Open up datasets.json in a notepad-like application and you should see something similar to this. Use the JSON file to get a sense of the various fields and their values. We do not need all of these values.

JSON file opened in Visual Studio Code

The most important part of the above code is the pagination code for getting the sandboxes and the datasets (in sets of 50). If you miss the pagination code, your answers will be wrong. It also helps to print the datasets in a sandbox to compare your results.
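The pagination idea can be sketched independently of any HTTP library. The loop below assumes a hypothetical `fetch_page(start, limit)` callable that returns one page as a dictionary keyed by ID (the shape the Catalog API uses for datasets), and keeps requesting pages until a short page signals the end. The `limit` and `start` query parameters and the Catalog URL are assumptions to verify against the API reference.

```python
from urllib.parse import urlencode

# Assumption: Catalog Service endpoint for listing datasets.
CATALOG_URL = "https://platform.adobe.io/data/foundation/catalog/dataSets"


def catalog_page_url(start, limit):
    """Build the URL for one page of datasets."""
    return f"{CATALOG_URL}?{urlencode({'limit': limit, 'start': start})}"


def fetch_all(fetch_page, page_size=50):
    """Collect every item from a paged endpoint.

    fetch_page(start, limit) must return a dict with at most `limit` items.
    """
    items = {}
    start = 0
    while True:
        page = fetch_page(start, page_size)
        items.update(page)
        if len(page) < page_size:  # a short (or empty) page is the last one
            break
        start += page_size
    return items


# Stand-in for the real API: 120 made-up datasets served 50 at a time.
fake_ids = [f"dataset-{i}" for i in range(120)]


def fake_fetch_page(start, limit):
    return {ds_id: {"name": ds_id} for ds_id in fake_ids[start:start + limit]}


print(len(fetch_all(fake_fetch_page)))
```

Without the loop, a single call would silently stop at the first 50 datasets, which is exactly the wrong-answer scenario described above.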

Data Processing of JSON into a Flat File

  1. Copy, paste, and execute the following code:

  2. The result should look similar to this in the editor:

Tip: If you try to ingest this CSV file, you will need to use a source schema that has no spaces in its column names. If you use the manual CSV upload workflow, you will need to reformat the column names to exclude the spaces. But this is what real life is: dirty column names. That is the reason why we are being extra cautious in how we do our naming.
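As an illustration of the flattening step, the sketch below turns a nested `{sandbox: {dataset_id: dataset}}` dictionary into flat rows with space-free column names and serializes them as CSV. The field paths (`classification.managedBy`, `tags.unifiedProfile`, `tags.unifiedIdentity`) are assumptions about the Catalog response; check them against your own datasets.json. All column and function names are hypothetical.

```python
import csv
import io

# Column names deliberately avoid spaces so the CSV ingests cleanly.
COLUMNS = ["sandbox", "dataset_id", "dataset_name", "managed_by",
           "profile_enabled", "identity_enabled"]


def _tag_enabled(dataset, tag):
    """True when the dataset carries the given tag with 'enabled:true'."""
    return "enabled:true" in (dataset.get("tags") or {}).get(tag, [])


def flatten(datasets_by_sandbox):
    """Turn {sandbox: {dataset_id: dataset}} into a list of flat rows."""
    rows = []
    for sandbox, datasets in datasets_by_sandbox.items():
        for dataset_id, ds in datasets.items():
            rows.append({
                "sandbox": sandbox,
                "dataset_id": dataset_id,
                "dataset_name": ds.get("name", ""),
                "managed_by": (ds.get("classification") or {}).get("managedBy", ""),
                "profile_enabled": _tag_enabled(ds, "unifiedProfile"),
                "identity_enabled": _tag_enabled(ds, "unifiedIdentity"),
            })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample record, for illustration only.
sample = {"prod": {"abc123": {
    "name": "Web Events",
    "classification": {"managedBy": "CUSTOMER"},
    "tags": {"unifiedProfile": ["enabled:true"]},
}}}
print(to_csv(flatten(sample)))
```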

Get Row and Count Statistics on the Datasets

  1. To generate the statistics of the dataset, you can use the Statistics endpoint:

  2. The results will be the following:
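Once the statistics have been fetched, they have to be joined back onto the flat dataset rows. The sketch below deliberately makes no assumption about the raw Statistics response; it assumes you have already reduced each response to a hypothetical `{dataset_id: {"size_bytes": ..., "row_count": ...}}` mapping. The names `attach_statistics`, `size_bytes`, and `row_count` are illustrative.

```python
def attach_statistics(rows, stats_by_id):
    """Return new rows enriched with size and row count.

    Datasets without statistics get zeros instead of being dropped,
    so the snapshot still lists every dataset in the catalog.
    """
    enriched = []
    for row in rows:
        stats = stats_by_id.get(row["dataset_id"], {})
        enriched.append({
            **row,
            "size_bytes": stats.get("size_bytes", 0),
            "row_count": stats.get("row_count", 0),
        })
    return enriched


# Made-up values, for illustration only.
rows = [{"dataset_id": "abc123"}, {"dataset_id": "def456"}]
stats = {"abc123": {"size_bytes": 2048, "row_count": 16}}
print(attach_statistics(rows, stats))
```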

Upload the CSV into Data Landing Zone

We will now upload this CSV into the Data Landing Zone, with the expectation that a schema and a dataflow were created as per the prerequisite tutorial.

  1. You need to extract the SAS URI from the AEP UI. Go to Sources->Catalog->Data Landing Zone. Click on the card gently and a right panel will appear. Scroll down to get the SAS URI. Note that the SAS URI already contains the SAS Token.

SAS URI and SAS Token from the AEP UI
  2. Just copy and paste the SAS URI into the following code as the value of the full_sas_url variable. Also observe the csv_file_path variable and how it is injected into the SAS URI in the variable sas_url_with_file.

  3. The response you will see is the following:

  4. If you click into the Data Landing Zone, i.e. Sources->Catalog->Data Landing Zone->Add data, you will see:

CSV file has been uploaded into Data Landing Zone

Tip: The great thing about this approach is that if we keep running this Python notebook on schedule, then the files dropped into Data Landing Zone will be picked up by the dataflow runs.
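The file-name injection mentioned in the steps above can be sketched as follows. It assumes the SAS URI has the usual Azure Blob Storage shape (container URL, then `?` and the SAS token), so the file name has to go into the path, before the token. The upload uses the Put Blob convention of an `x-ms-blob-type: BlockBlob` header. The function names and the sample URI are hypothetical.

```python
import os
import urllib.request
from urllib.parse import quote, urlsplit, urlunsplit


def add_file_to_sas_url(full_sas_url, csv_file_path):
    """Insert the file name into the path, leaving the SAS token untouched."""
    parts = urlsplit(full_sas_url)
    file_name = quote(os.path.basename(csv_file_path))
    new_path = parts.path.rstrip("/") + "/" + file_name
    return urlunsplit(parts._replace(path=new_path))


def build_upload_request(full_sas_url, csv_file_path, payload):
    """Prepare (but do not send) the PUT request that uploads the CSV bytes."""
    sas_url_with_file = add_file_to_sas_url(full_sas_url, csv_file_path)
    return urllib.request.Request(
        sas_url_with_file,
        data=payload,
        headers={"x-ms-blob-type": "BlockBlob", "Content-Type": "text/csv"},
        method="PUT",
    )


# Made-up SAS URI; paste the real one from the AEP UI instead.
full_sas_url = "https://example.blob.core.windows.net/dlz-container?sv=2020&sig=abc"
request = build_upload_request(full_sas_url, "output/datasets.csv", b"a,b\n1,2\n")
print(request.get_method(), request.full_url)
```

Sending it is a matter of `urllib.request.urlopen(request)`; Azure's Put Blob operation answers with HTTP 201 (Created) on success.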

Analysis in Python

Number of Datasets by Sandbox and Ownership

  1. Execute the following piece of code:

  2. The result is:

Number of datasets broken down by sandbox and ownership
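A dependency-free sketch of this breakdown, assuming flat rows with hypothetical `sandbox` and `managed_by` columns; a pandas `groupby` would be an equally valid route. The sample rows are made up.

```python
from collections import Counter


def count_by_sandbox_and_owner(rows):
    """Count datasets for every (sandbox, managed_by) combination."""
    return Counter((row["sandbox"], row["managed_by"]) for row in rows)


# Made-up rows, for illustration only.
rows = [
    {"sandbox": "prod", "managed_by": "CUSTOMER"},
    {"sandbox": "prod", "managed_by": "SYSTEM"},
    {"sandbox": "prod", "managed_by": "CUSTOMER"},
    {"sandbox": "dev", "managed_by": "SYSTEM"},
]
for (sandbox, owner), count in sorted(count_by_sandbox_and_owner(rows).items()):
    print(sandbox, owner, count)
```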

Total Volume Used in GB Across All Sandboxes Split By Ownership

  1. Copy, paste, and execute the following piece of code:

  2. The results will be:

Data volume split by ownership
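The same split can be sketched with a running total per owner. This assumes hypothetical `managed_by` and `size_bytes` columns and treats 1 GB as 1024³ bytes; switch the constant to 10**9 if you prefer decimal gigabytes.

```python
from collections import defaultdict

BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def volume_gb_by_owner(rows):
    """Sum dataset sizes per owner and express the totals in GB."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["managed_by"]] += row["size_bytes"] / BYTES_PER_GB
    return dict(totals)


# Made-up rows, for illustration only.
rows = [
    {"managed_by": "CUSTOMER", "size_bytes": 2 * BYTES_PER_GB},
    {"managed_by": "CUSTOMER", "size_bytes": 1 * BYTES_PER_GB},
    {"managed_by": "SYSTEM", "size_bytes": BYTES_PER_GB // 2},
]
print(volume_gb_by_owner(rows))
```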

Top 10 Datasets by Size in GB

  1. Remember the size is in bytes and needs to be converted to GB or TB. Copy, paste, and execute the following piece of code:

  2. The result will be:

The chart reveals a long tail indicating fragmentation
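The ranking behind such a chart can be sketched as a sort followed by the bytes-to-GB conversion. The column names `dataset_name` and `size_bytes` are hypothetical, and the conversion again assumes 1 GB = 1024³ bytes.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def top_datasets_by_size(rows, n=10):
    """Return (dataset_name, size_in_gb) pairs for the n largest datasets."""
    ranked = sorted(rows, key=lambda row: row["size_bytes"], reverse=True)
    return [(row["dataset_name"], row["size_bytes"] / BYTES_PER_GB)
            for row in ranked[:n]]


# Made-up rows, for illustration only.
rows = [
    {"dataset_name": "small", "size_bytes": BYTES_PER_GB // 4},
    {"dataset_name": "large", "size_bytes": 8 * BYTES_PER_GB},
    {"dataset_name": "medium", "size_bytes": 2 * BYTES_PER_GB},
]
for name, size_gb in top_datasets_by_size(rows, n=2):
    print(f"{name}: {size_gb:.2f} GB")
```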

Retrieve a List of Dataset Sizes and Rows by Sandbox

  1. Copy, paste, and execute the following piece of code:

  2. The result will look like:

Average Record Richness by Sandbox (bytes per record)

  1. Copy, paste, and execute the following piece of code:

  2. The result will look like:
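One way to compute bytes per record is to divide each sandbox's total size by its total row count, as sketched below. This is a choice: averaging the per-dataset ratios instead would weight small datasets more heavily. Sandboxes with no rows are skipped to avoid dividing by zero. Column names are hypothetical.

```python
from collections import defaultdict


def bytes_per_record_by_sandbox(rows):
    """Total bytes divided by total records, per sandbox."""
    total_bytes = defaultdict(int)
    total_records = defaultdict(int)
    for row in rows:
        total_bytes[row["sandbox"]] += row["size_bytes"]
        total_records[row["sandbox"]] += row["row_count"]
    return {
        sandbox: total_bytes[sandbox] / total_records[sandbox]
        for sandbox in total_bytes
        if total_records[sandbox] > 0  # skip sandboxes holding no records
    }


# Made-up rows, for illustration only.
rows = [
    {"sandbox": "prod", "size_bytes": 1000, "row_count": 10},
    {"sandbox": "prod", "size_bytes": 3000, "row_count": 30},
    {"sandbox": "dev", "size_bytes": 500, "row_count": 0},
]
print(bytes_per_record_by_sandbox(rows))
```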

Histogram of Dataset Sizes Across All Sandboxes

  1. Copy, paste, and execute the following code:

  2. The result will look like:

Histogram is showing typical distribution for a demo environment
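Because dataset sizes usually span many orders of magnitude, one dependency-free way to look at the distribution is to bin by power of ten rather than by fixed-width intervals. This text-only sketch is an alternative to a plotted histogram, not a reproduction of the chart above; the bucket labels and function names are hypothetical.

```python
from collections import Counter


def size_bucket(size_bytes):
    """Label a size by its order of magnitude, e.g. 1500 -> '10^3 B'."""
    if size_bytes <= 0:
        return "0 B"
    # The digit count of the integer part gives floor(log10) exactly.
    exponent = len(str(int(size_bytes))) - 1
    return f"10^{exponent} B"


def size_histogram(rows):
    """Count datasets per order-of-magnitude bucket."""
    return Counter(size_bucket(row["size_bytes"]) for row in rows)


# Made-up sizes, for illustration only.
rows = [{"size_bytes": size} for size in (0, 5, 50, 999, 1000, 1500)]
for bucket, count in size_histogram(rows).items():
    print(f"{bucket:>8} {'#' * count}")
```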

Profile-Enabled Datasets (GB) By Sandbox

  1. Copy, paste, and execute the following code:

  2. The result will look like:
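A sketch of the filter-then-aggregate logic, assuming a hypothetical boolean `profile_enabled` column alongside `sandbox` and `size_bytes`. If the rows are read back from CSV, the flag arrives as text and must be converted to a boolean first.

```python
from collections import defaultdict

BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def profile_enabled_gb_by_sandbox(rows):
    """Sum the size of Profile-enabled datasets per sandbox, in GB."""
    totals = defaultdict(float)
    for row in rows:
        if row["profile_enabled"]:  # expects a real boolean, not "True"
            totals[row["sandbox"]] += row["size_bytes"] / BYTES_PER_GB
    return dict(totals)


# Made-up rows, for illustration only.
rows = [
    {"sandbox": "prod", "profile_enabled": True, "size_bytes": 4 * BYTES_PER_GB},
    {"sandbox": "prod", "profile_enabled": False, "size_bytes": 9 * BYTES_PER_GB},
    {"sandbox": "dev", "profile_enabled": True, "size_bytes": BYTES_PER_GB},
]
print(profile_enabled_gb_by_sandbox(rows))
```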

User Contributions & Improvements

David Teko Kangni

Many thanks to David Teko Kangni, who modularized the code and also fixed the warnings, which I had intentionally not fixed.
