Unit 8: DATA DISTILLER STATISTICS & MACHINE LEARNING

STATSML 201: Securing Data Distiller Access with Robust IP Whitelisting

Secure Access, Simplified: Protect Data Distiller with IP Whitelisting

Overview

This tutorial provides a comprehensive guide to implementing IP whitelisting to enhance the security of Data Distiller. IP whitelisting is a crucial security feature that allows you to define and manage specific IP ranges that are permitted to interact with your data, ensuring that only authorized networks and devices have access. This is particularly important when using tools like BI dashboards, Apache Superset, and JupyterLab, which are often installed on local or organizational machines.

How IP Whitelisting Secures Analytical Tools

IP whitelisting creates a secure perimeter around Data Distiller, ensuring that only devices within approved networks or with specific IP addresses can connect. Here’s how this benefits tools like Apache Superset, JupyterLab, and other BI platforms:

  1. Controlled Access for BI Tools: Business intelligence (BI) tools like Apache Superset, Tableau, and Power BI frequently query data from platforms like Data Distiller for visualization and reporting. With IP whitelisting, you can restrict access to these tools based on their hosting environment's IP addresses, ensuring that only known installations can connect.

  2. Enhanced Security for JupyterLab: Data scientists using JupyterLab often interact with Data Distiller for advanced analytics and model training. By enforcing IP whitelisting, you ensure that only authorized JupyterLab instances—installed on approved devices or in secure environments—can access the data lake, reducing the risk of data leakage.

  3. Restricting Access to Organizational Machines: IP whitelisting ensures that access is limited to machines within the organization's network or specific cloud environments. This is especially useful for:

    • On-premises setups: Ensuring that only devices within your local office network can access Data Distiller.

    • Cloud deployments: Restricting access to specific VMs or instances running on platforms like AWS or Azure.

  4. Minimizing Unauthorized Access: By defining allowed IP ranges, unauthorized users or devices attempting to connect from outside the whitelisted range are automatically denied access. This provides an additional layer of security beyond user credentials and API keys.

  5. Simplifying Monitoring and Auditing: With IP whitelisting in place, all access attempts are limited to approved sources. This makes it easier to monitor and audit access logs for suspicious activities or anomalies, ensuring compliance with data governance policies.

The Role of IP Whitelisting in Enhancing Security of Data Distiller

Here’s why IP whitelisting is a cornerstone of robust data security:

Enhanced Access Control: IP whitelisting ensures that only devices or networks operating within an approved IP range can access Data Distiller. This prevents unapproved external networks and malicious actors from attempting to infiltrate sensitive data resources. Tools like Apache Superset and JupyterLab installed on approved machines are securely confined within the authorized IP range, blocking access from any other source.

Alignment with Company Policies: Organizations can align their IP whitelisting configurations with corporate security policies by defining ranges for office locations, corporate VPNs, or other controlled environments. This ensures that only personnel using organization-sanctioned networks can query and process data, providing consistency and control over access permissions.

Minimized Risk of Credential Misuse: Even if user credentials are compromised, IP whitelisting adds a second layer of defense by denying login attempts from non-whitelisted IP addresses. This prevents attackers from exploiting stolen credentials if their devices are outside the approved network.

Integration with Monitoring and Auditing: Coupling IP whitelisting with Adobe Experience Platform’s Audit Service amplifies security capabilities by providing detailed visibility into data access:

  • Detecting Anomalous Behavior: Unauthorized attempts from IPs outside the whitelist can be flagged for investigation, helping identify potential security breaches or misconfigurations.

  • Auditing Access Patterns: Logs of access requests from whitelisted IPs enable organizations to monitor usage patterns, detect anomalies, and address insider threats or credential misuse.

Customizable and Scalable: IP whitelisting is highly adaptable, allowing organizations to dynamically update approved IP ranges as infrastructure changes occur. Whether adding new office locations, onboarding cloud resources, or updating partner access, companies can ensure that their security measures evolve alongside their operational needs without compromising control.

Strengthened Regulatory Compliance: Many industries demand stringent controls over access to sensitive data to meet regulatory standards like GDPR, HIPAA, and financial security mandates. IP whitelisting supports these requirements by ensuring that data access is limited to explicitly authorized environments, reducing the risk of compliance violations.

Computer Monitoring: A Complementary Measure

While IP whitelisting provides perimeter-level security, monitoring access behavior within the whitelist is crucial. By implementing comprehensive monitoring systems:

  • Administrators can track user activities, query logs, and data access patterns.

  • Security teams can identify potential misuse or unauthorized queries within approved IP ranges.

  • Real-time alerts can help mitigate threats faster than periodic audits.

Together, IP whitelisting and computer monitoring form a multi-layered defense strategy for securing access to Data Distiller. This approach ensures that access is limited to company-approved networks while maintaining visibility into how services are being utilized, empowering organizations to proactively protect their data assets.

Understanding the Scope of Data Distiller IP Whitelisting

Data Distiller’s IP whitelisting is a robust mechanism designed specifically to secure access to its Postgres connectors, which are the backbone of integrations with BI tools such as Tableau, Power BI, Apache Superset, and analytical environments like JupyterLab. By allowing only predefined IP ranges to connect via these tools, organizations can ensure secure and controlled access to Query Service. However, it’s important to note the limitations of this feature, particularly when it comes to API access.

Scope of IP Whitelisting

  • Postgres Connectors: IP whitelisting is implemented exclusively for Postgres connectors. These connectors are commonly used by tools like JupyterLab and Apache Superset to interact with the Data Distiller environment. This ensures that only trusted networks or devices within approved IP ranges can execute queries or retrieve data through these tools.

  • APIs Not Covered: The IP whitelisting feature does not extend to APIs. APIs in Data Distiller can be used to:

    • Create and access Data Distiller jobs/schedules on the Data Lake.

    • Access data stored in the Data Distiller Warehouse, also known as Accel Store.

Implications of API Access

Since APIs bypass the IP whitelisting rules, their misuse could expose the system to potential security risks. If access tokens are compromised or misused, unauthorized entities could exploit API endpoints to manipulate jobs, access data, or extract information from the warehouse.

Recommended Security Measures for APIs

To mitigate risks and secure API access:

  1. Disable Token Access:

    • Restrict access to the APIs by ensuring that no access tokens are issued to users or applications that do not require API functionality.

    • This effectively blocks unauthorized API requests, as tokens are a prerequisite for authenticating API calls.

  2. Network-Level Restrictions: Implement additional network-level security measures such as firewalls or VPNs to limit access to the API endpoints from unauthorized environments.

  3. API Gateway and Monitoring:

    • Deploy an API gateway to enforce stricter access controls and logging.

    • Monitor API usage patterns for anomalies, such as unexpected data extraction or job creation requests, to proactively identify potential misuse.

Prerequisites

Obtain the necessary authentication credentials, including ACCESS_TOKEN, API_KEY, ORG_ID, and SANDBOX_NAME, by following the instructions in STATSML 200: Unlock Dataset Metadata Insights via Adobe Experience Platform APIs and Python.

Set the IP Allow List Permissions

Ensure you have the necessary permissions to manage allowed IP ranges. The Manage Allowed List permission is required.

To enable this permission:

  1. In the AEP UI, navigate to Permissions->Roles->[Your_Role]->Edit.

  2. Add the Manage Allowed List permission, then Save.

Generate the Access Token

Run the following code to request an access token from Adobe IMS:

!pip install requests

import requests

# Replace with your Adobe credentials
client_id = 'your_client_id'
client_secret = 'your_client_secret'
org_id = 'your_org_id'  # also used later in the API request headers
scope = 'your_scope'  # copy the scope string from your Developer Console project credential
auth_endpoint = 'https://ims-na1.adobelogin.com/ims/token/v2'

# Prepare the data for the access token request
data = {
    'grant_type': 'client_credentials',
    'client_id': client_id,
    'client_secret': client_secret,
    'scope': scope  # Specify the scope relevant to your API usage
}

# Request the access token from Adobe IMS
response = requests.post(auth_endpoint, data=data)

if response.status_code == 200:
    access_token = response.json()['access_token']
    print(f"Access Token: {access_token}")
else:
    print(f"Failed to obtain access token. Status Code: {response.status_code}, Error: {response.text}")

Retrieve Sandbox Name

To find the sandbox name, navigate in the AEP UI to Sandboxes->Browse. Click on a sandbox and its name appears in the right panel.

Enable Debugging in Python

The code below lets you see everything happening behind the scenes when your program talks to a server over the internet. It’s like turning on a flashlight: you see exactly what your program sends to the server (requests), what the server sends back (responses), and any errors or unexpected behavior along the way. This is especially helpful when you're trying to figure out why something isn’t working, such as when your program isn't sending the right information or the server isn’t responding as expected.

We set up HTTP debugging in Python to monitor and log detailed information about HTTP requests and responses. By enabling the debuglevel attribute of HTTPConnection from the http.client module, the code allows low-level HTTP communications, such as request headers, response headers, and the raw data being transmitted, to be outputted to the console. The logging.basicConfig(level=logging.DEBUG) command configures the Python logging module to capture and display debug-level logs. This setup is particularly useful for troubleshooting API integrations, as it provides visibility into the request and response lifecycle, helping developers identify and resolve issues like incorrect headers, payload formatting, or unexpected server responses.

import logging
import http.client as http_client

http_client.HTTPConnection.debuglevel = 1
logging.basicConfig(level=logging.DEBUG)

Fetch All IP Ranges

IP Range Formats: The allowedIpRanges field can include two types of IP specifications:

  • CIDR: Standard CIDR notation (e.g., "136.23.110.0/23") to define IP ranges.

  • Fixed IP: Single IPs for individual access permissions (e.g., "101.10.1.1").
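Both formats can be sanity-checked locally before you submit them to the API. This is a minimal sketch using Python's standard ipaddress module, with the example values from the bullets above:

```python
import ipaddress

# A CIDR entry covers a contiguous block of addresses.
cidr = ipaddress.ip_network("136.23.110.0/23")
print(cidr.num_addresses)                               # 512 addresses in a /23
print(ipaddress.ip_address("136.23.111.200") in cidr)   # True: inside the /23 block

# A fixed IP is equivalent to a /32 network containing exactly one address.
single = ipaddress.ip_network("101.10.1.1/32")
print(single.num_addresses)                             # 1
```

Validating entries this way catches malformed ranges early: `ip_network` raises a ValueError for an invalid CIDR string instead of letting it reach the API.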

To manage allowed IP ranges for Data Distiller using Python, you can utilize the requests library to interact with the IP Access API. This API enables you to fetch, set, and delete IP ranges associated with your organization's ID.

You can retrieve the list of all IP ranges configured for your sandbox. If no IP ranges are set, all IPs are allowed by default.

import requests

# Define your credentials and headers
headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-gw-ims-org-id': org_id,  # Replace with your actual Org ID
    'x-api-key': 'acp_queryService_auth',  # DO NOT REPLACE THIS
    'x-sandbox-name': 'prod',  # Replace with your sandbox name
    'Content-Type': 'application/json',  # Indicates JSON content in the request
    'Accept': 'application/json'  # Indicates that JSON is expected in the response
}

# Define the API endpoint
url = 'https://platform.adobe.io/data/foundation/queryauth/security/ip-access'

# Make the GET request
response = requests.get(url, headers=headers)

# Handle the response
if response.status_code == 200:
    print("Request was successful!")
    ip_ranges = response.json().get('allowedIpRanges', [])
    print("Allowed IP Ranges:", ip_ranges)
else:
    print(f"Request failed with status code {response.status_code}")
    print(f"Error message: {response.text}")

All requests are made to the /queryauth/security/ip-access endpoint of the Adobe Experience Platform.

Remember to use only the following hard-coded value for the API key:

'x-api-key': 'acp_queryService_auth',  # DO NOT REPLACE THIS

A successful request returns the configured IP address ranges.

Set New IP Ranges

You can overwrite existing IP ranges by setting a new list for the sandbox. This operation requires a complete list of IP ranges, including any that remain unchanged.

import requests

# Define your credentials and headers
headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-api-key': 'acp_queryService_auth',  # DO NOT REPLACE THIS
    'x-gw-ims-org-id': org_id,  # Replace with your actual Org ID
    'x-sandbox-name': 'prod',  # Replace with your sandbox name
    'Content-Type': 'application/json',  # Indicates JSON content in the request
    'Accept': 'application/json'  # Indicates that JSON is expected in the response
}

# Define the new IP ranges to set with the correct key 'allowedIpRanges'
ip_ranges = {
    "allowedIpRanges": [ 
        {"ipRange": "136.23.110.0/23", "description": "VPN-1 gateway IPs"},
        {"ipRange": "17.102.17.0/23", "description": "VPN-2 gateway IPs"},
        {"ipRange": "101.10.1.1"},  # Single IP address
        {"ipRange": "163.77.30.9", "description": "Test server IP"}  # Single IP address with a description
    ]
}

# Define the API endpoint for IP range configuration
url = 'https://platform.adobe.io/data/foundation/queryauth/security/ip-access'

# Make the PUT request
response = requests.put(url, headers=headers, json=ip_ranges)

# Handle the response
if response.status_code == 200:
    print("Successfully set new IP ranges.")
    print("Response:", response.json())
else:
    print(f"Failed to set IP ranges: {response.status_code}")
    print(f"Error message: {response.text}")

A successful response confirms that the IP ranges are now set for accepting queries into Data Distiller.

Update IP Ranges

You can replace the old list of IP ranges with the new list provided in the updated_ip_ranges payload. The PUT request is designed to overwrite the current configuration of allowedIpRanges in Data Distiller with the payload specified in the request. Any existing IP ranges that are not included in the new payload will be removed from the configuration. The new list (updated_ip_ranges) will become the complete set of allowed IP ranges.
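Because the PUT request replaces the full configuration, a defensive pattern is to fetch the current list first (as in the GET example above) and merge it with your additions before sending. Below is a minimal sketch of that merge step; merge_ip_ranges is a hypothetical helper, not part of any Adobe SDK:

```python
def merge_ip_ranges(current, additions):
    """Combine currently configured ranges with new entries,
    de-duplicating on the 'ipRange' value, so the PUT payload
    always contains the complete allow list."""
    merged = {entry["ipRange"]: entry for entry in current}
    for entry in additions:
        merged[entry["ipRange"]] = entry  # new entries win on conflict
    return {"allowedIpRanges": list(merged.values())}

# Ranges already configured (e.g., from the GET response)
current = [
    {"ipRange": "101.10.1.1"},
    {"ipRange": "163.77.30.9", "description": "Test server IP"},
]
# A new range to add without dropping the existing ones
additions = [{"ipRange": "136.23.110.0/23", "description": "VPN-1 gateway IPs"}]

payload = merge_ip_ranges(current, additions)
print(len(payload["allowedIpRanges"]))  # 3
```

The resulting payload can then be sent with the same PUT request shown below, ensuring no existing range is accidentally removed.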

import requests

# Define your credentials and headers
headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-api-key': 'acp_queryService_auth',  # DO NOT REPLACE THIS
    'x-gw-ims-org-id': org_id,  # Replace with your actual Org ID
    'x-sandbox-name': 'prod',  # Replace with your sandbox name
    'Content-Type': 'application/json',  # Indicates JSON content in the request
    'Accept': 'application/json'  # Indicates that JSON is expected in the response
}

# Define the updated IP ranges to exclude the top two entries
updated_ip_ranges = {
    "allowedIpRanges": [
        {"ipRange": "101.10.1.1"},  # Single IP address
        {"ipRange": "163.77.30.9", "description": "Test server IP"}  # Single IP address with a description
    ]
}

# Define the API endpoint for IP range configuration
url = 'https://platform.adobe.io/data/foundation/queryauth/security/ip-access'

# Make the PUT request to update the IP ranges
response_put = requests.put(url, headers=headers, json=updated_ip_ranges)

# Handle the response for the PUT request
if response_put.status_code == 200:
    print("Successfully updated IP ranges.")
    print("Response:", response_put.json())
else:
    print(f"Failed to update IP ranges: {response_put.status_code}")
    print(f"Error message: {response_put.text}")

The response echoes the updated IP range configuration.

Validate an IP Address

You can use the IP Validation API endpoint to determine whether a specific IP address is authorized to access a designated sandbox for Data Distiller usage. This ensures clarity on whether access restrictions are in place and if the IP address has the necessary permissions to interact with data within the sandbox.

import requests

# Define your credentials and headers
headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-gw-ims-org-id': org_id,  # Replace with your actual Org ID
    'x-api-key': 'acp_queryService_auth',  # DO NOT REPLACE THIS
    'x-sandbox-name': 'prod',  # Replace with your sandbox name
    'Content-Type': 'application/json',  # Indicates JSON content in the request
    'Accept': 'application/json'  # Indicates that JSON is expected in the response
}

# Define the Validate API endpoint
url = 'https://platform.adobe.io/data/foundation/queryauth/security/validate/ip-access'

# Define the request body
payload = {
    "ipAddress": "197.2.0.2"  # Replace with the IP address to validate
}

# Make the POST request
response = requests.post(url, headers=headers, json=payload)

# Handle the response
if response.status_code == 200:
    print("Validation was successful!")
    print("Response:", response.json())
elif response.status_code == 404:
    print("Endpoint not found. Verify the URL and ensure your organization has access to this API.")
else:
    print(f"Validation failed with status code {response.status_code}")
    print(f"Error message: {response.text}")

In this example, 197.2.0.2 does not lie in the allowed list, so the result is false.
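You can also approximate the same check locally against a list of configured ranges, which is handy for quick debugging before calling the API. This is a sketch; ip_in_allow_list is a hypothetical helper, and the validation endpoint remains the source of truth:

```python
import ipaddress

def ip_in_allow_list(ip, allowed_ip_ranges):
    """Return True if `ip` falls inside any configured CIDR block
    or matches a fixed IP entry in the allow list."""
    addr = ipaddress.ip_address(ip)
    for entry in allowed_ip_ranges:
        # A bare IP like "101.10.1.1" parses as a /32 network
        network = ipaddress.ip_network(entry["ipRange"], strict=False)
        if addr in network:
            return True
    return False

ranges = [
    {"ipRange": "101.10.1.1"},
    {"ipRange": "163.77.30.9", "description": "Test server IP"},
]
print(ip_in_allow_list("197.2.0.2", ranges))   # False
print(ip_in_allow_list("101.10.1.1", ranges))  # True
```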

Delete All IP Ranges

You can remove all configured IP ranges for the sandbox. This action deletes the IP ranges and returns the deleted IP list.

import requests

# Define your credentials and headers
headers = {
    'Authorization': f'Bearer {access_token}',  # Replace with your actual access token
    'x-api-key': 'acp_queryService_auth',  # DO NOT REPLACE THIS
    'x-gw-ims-org-id': org_id,  # Replace with your actual Org ID
    'x-sandbox-name': 'prod',  # Replace with your sandbox name
    'Content-Type': 'application/json'  # Indicates JSON content in the request
}

# Define the API endpoint for IP range deletion
url = 'https://platform.adobe.io/data/foundation/queryauth/security/ip-access'

# Make the DELETE request
response_delete = requests.delete(url, headers=headers)

# Handle the response
if response_delete.status_code == 200:
    print("Successfully deleted all IP ranges.")
    deleted_ip_ranges = response_delete.json().get('deletedIpRanges', [])
    print("Deleted IP Ranges:", deleted_ip_ranges)
else:
    print(f"Failed to delete IP ranges: {response_delete.status_code}")
    print(f"Error message: {response_delete.text}")

The response returns the list of all deleted IP ranges.
