BI 200: Create Your First Data Model in the Data Distiller Warehouse for Dashboarding

Creating your first table in the Accelerated Store involves defining and setting up a star schema containing tables to store and manage data.

Prerequisites

You should read the Architecture Overview to understand the various query engines supported by Data Distiller:

PREP 302: Key Topics Overview: Architecture, MDM, Personas

What is the Accelerated Store?

In September 2023, Data Distiller added a new SQL compute engine (called Query Accelerated Engine or the Data Distiller Warehousing Engine) specifically geared toward business intelligence, dashboarding, and reporting use cases. The storage and compute layer sits parallel to the data lake and its associated SQL engines (Ad Hoc Query and Batch). As a Data Distiller user, you will see all of the tables i.e. AEP datasets in a single catalog view.

A portion of the Accelerated Store is used by AEP to power the out-of-the-box dashboards seen in Adobe Real-Time CDP (RTCDP) Overviews pages of Profile, Audiences, and Destinations. You will also see similar dashboards show up on Adobe Journey Optimizer (AJO) pages for reporting on journeys and campaigns. The tables used in creating the star schemas (or datagroups as Data Distiller calls them) are available for SQL querying with those apps. There is no customization available at the data layer for these models.

In this example, we will visit the use case of fully building our own star schema as we would have done in our enterprise warehouse except that we will be doing it within AEP.

There are a few things you need to be aware of so that you can implement a robust design for your end users:

Data Storage: The Query Accelerated layer has a limitation of 1 TB of data that you can keep in the star schemas. This excludes all of the star schemas that were created for out-of-the-box dashboards for RTCDP or AJO. If you copy raw data into this layer, be aware of this size limitation.
Query Concurrency: There can be a maximum of 4 concurrent queries in the system. These queries include read, write, and update operations. If you do long-running ETL jobs that are computing and writing data into the accelerated layer, it can impact the BI dashboard performance. This is not uncommon in such systems and you should make sure you have designed this well.
User Concurrency: User concurrency is unlimited, unlike the Ad Hoc Query engine. Make sure you create a Technical User Account for each of the dashboards you publish.
Caching: The Query Accelerated layer has a caching layer that caches the queries across all users so you should see a performance boost for the same dashboard used across multiple users. The cold-start problem where the first user in the system may have the longest loading time is something to be aware of.
Flat Tables: You cannot create nested data structures within the Query Accelerated Layer. The whole idea is to keep the data in fact and dimension tables much like what is seen in data warehousing practice.
Built-in BI Dashboards: Data Distiller also integrates with (User-Defined) Dashboards which you can see in the AEP UI. All the star schemas you build along with the ones that are shipped by AEP are accessible here. If your use case demands that access to the insights be accessible within AEP or you have resource constraints to get a BI license, then this is the tool for you. It is not a fully-fledged BI dashboarding capability but has enough to get you the visualization you are looking for.
More Capacity: If you are looking for more capacity (storage & query concurrency) than what ships today, then you need to contact an Adobe rep.

Scenario

Our scenario is pretty straightforward as we will be showing the mechanics of creating a fact table with aggregated snapshot data of purchases by country code. There is a lookup table that has the country ID mapped to the country code. This lookup table will function as a dimension table. The goal is to create a star schema in the Accelerated Store.

The files that you will need are the following

11KB

73KB

Follow the steps for importing smaller-sized CSV files (1GB or less) as datasets outlined here:

Create a Database and a Schema

Tables contain the data. Think of a database as a logical grouping of tables and schemas as the next-level logical grouping underneath it. The database grouping is useful when you want to keep the data separated for various reporting use cases. From a BI tool perspective, this means that you should not be able to answer questions only specific to this database grouping.

Note: Irrespective of the database and schema groupings that you create, all of these tables are accessible just like the datasets in the data lake for querying and joining. The logical grouping makes the most sense when you access these databases within a BI tool like Tableau or PowerBI.

Navigate to the AEP navigation bar. Click on Queries->Create Query
Make sure you choose the prod: all database in the database dropdown

Let us first create a database called testexample

CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

Let us take a look at the definition in the SQL metadata commands:

TYPE=QSACCEL: This specifies that the database and the schemas are custom-built and reside in the customizable layer of the Accelerated Store
ACCOUNT=acp_query_batch: This specifies that you have the Data Distiller license in order to create the tables.

Note: The customization feature works only if you have the Data Distiller license.

Best Practice: As you start thinking of putting the data model in production, consider prototyping on the AEP Data Lake and use SQL as a means to make sure that all the queries work including the ones in the charts. All of the steps in this tutorial apply. The only difference is that instead of using CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch)

you will use CREATE DATABASE testexample

This will create the database, the schema and the data model on the data lake. When going to production, drop all the databases, schemas and tables

We will create two logical partitions (just because we want to) to separate this database into fact and lookup schemas

CREATE SCHEMA testexample.lookups;
CREATE SCHEMA testexample.facttables;

Each combination of database and schema i.e. testexample.lookups and testexample.facttables are separate data models.

Let us now rename this data models into a friendly name for access within dashboards:

ALTER MODEL testexample.facttables RENAME TO test_purchase;

ALTER MODEL testexample.facttables RENAME TO test_country;

Also, note that if you made a mistake in creating a database or schemas, you can do the following:

First, you need to access the schemas in a database testexample:

SHOW SCHEMAS;

For each schema, you should delete them:

DROP SCHEMA IF EXISTS testexample.lookups CASCADE;
DROP SCHEMA IF EXISTS testexample.facttables CASCADE;

After all the schemas are dropped, you can drop the database:

DROP DATABASE IF EXISTS testexample;

The keyword CASCADE is a helpful keyword because it not just deletes the schemas but also deletes the underlying tables.

Create Tables

Tables actually carry the data we will need for our dashboarding/reporting use cases. Unlike the data lake tables, you will need to predefine the data types for these tables.

Let us define the lookups table first:

CREATE TABLE testexample.lookups.country_lookup AS
       SELECT
          cast(null as string) country_code,
          cast(null as string) country_name
       WHERE false;

Let us define the fact tables first:

CREATE TABLE testexample.facttables.crm_table AS
       SELECT
          cast(null as int)    purchase,
          cast(null as string)   country_code
WHERE false;

If you ever need to surgically drop a table, you can use the following:

DROP TABLE IF EXISTS testexample.lookups.country_lookup;

DROP TABLE IF EXISTS testexample.facttables.crm_table;

Tip: Postgres SQL treats names of objects (database, schemas, and tables) in a case-insensitive manner. If you use uppercase letters in naming, it will treat it as if its in lowercase.

Build Relationships Between Fact and Dimension Tables

This relationship building is required for the BI tool to make sense of this star schema. The User-Defined Dashboards also use these associations in "joining" these lookups on the fly which is helpful when building visualizations.

The code is specified in the following way:

ALTER TABLE crm_table ADD CONSTRAINT FOREIGN KEY (country_code) REFERENCES country_lookup(country_code) NOT enforced;

Warning: If you use the full path specification such as testexample.facttables.crm_table and testexample.lookups.country_lookup in the ALTER TABLE command, you will get errors.

The key points to note are the following:

crm_table is the core fact table that has a key relationship on country_code
REFERENCES means that crm_table is joining in country_lookup on the country_code key.
NOT enforced means that the JOIN can be a 1-to-many on the country_lookup table. If there are two values for a country_code to country_name, we will do the JOIN on both. It is your responsibility to make sure that the country_lookup does not have duplicates for a single country_code value.

Tip: In DBVisualizer, if you use metadata commands like ALTER TABLE, executing a subsequent SELECT command will not return any results. You will need to disconnect and reconnect to the Data Distiller database. Metadata commands make changes to the Data Distiller database and these changes are not pushed out into the client i.e. DBVisualizer. The client has to make the calls to get thesee updates.

Cheat Sheet

You could execute all of this code; query by query bu highlighting the query and then using the Run Selected Query feature in the Data Distiller Query Pro Mode Editor:

Cheat code:

-- Create the Database in Accelerated Store
CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

-- Create the Schemas on the Database
CREATE schema testexample.lookups;
CREATE schema testexample.facttables;

-- Create the Tabbles
CREATE TABLE testexample.facttables.crm_table AS
       SELECT
          cast(null as int)    purchase,
          cast(null as string)   country_code
WHERE false;

CREATE TABLE testexample.lookups.country_lookup AS
       SELECT
          cast(null as string) country_code,
          cast(null as string) country_name
       WHERE false;

-- Define a key relationship between the country code and the lookup
ALTER TABLE crm_table ADD CONSTRAINT FOREIGN KEY (country_code) REFERENCES country_lookup(country_code) NOT enforced;

-- Rename the data models to make them user friendly names in Query Pro Mode
ALTER MODEL testexample.facttables RENAME TO test_purchase;
ALTER MODEL testexample.lookups RENAME TO test_country;

-- Hydrate the Tables
INSERT INTO testexample.facttables.crm_table (country_code, purchase)
(SELECT country, purchase_amount FROM purchases_dataset_crm_data);

INSERT INTO testexample.lookups.country_lookup(country_code, country_name)
(select country_code, country from(
select count(*), country_code, country from country_codes
group by country_code, country));

Always ensure you are in the correct database when using the Data Distiller Query Pro Mode Editor. If you are querying tables from the AEP Data Lake or creating tables in the Accelerated Store, make sure you are in the prod;all database. When querying tables in the Accelerated Store, you should be in the specific database that corresponds to the data model containing the tables you're querying.

Federation Access, Reads and Writes Across AEP Data Lake and Accelerated Store

A Note on Federation

Data Distiller enables you to read and write datasets across both the AEP Data Lake and the Accelerated Store. From the query engine's perspective, when using the Data Distiller Query Pro Mode Editor (or a third party client), you can seamlessly join tables from both stores and persist the results across them. To ensure the query engine writes to the Accelerated Store rather than the AEP Data Lake, you need to qualify the data model using specific parameters, such as WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch). This syntax signals to the system that the target is the Accelerated Store.

A Note on Dashboards

The Data Distiller Query Pro Mode Editor used for chart creation cannot access tables in the AEP Data Lake due to a 60-second timeout limit for chart queries. If a query takes longer to run because of the latency in reading from the data lake, it will time out. This is one of the reasons why the Accelerated Store was built—to mitigate these latencies caused by the data infrastructure layer. While data lakes are excellent for storing large volumes of data, running exploration queries that can take minutes, and executing batch jobs to create new datasets, they are not optimized for dashboarding. In contrast, the Accelerated Store's data layer is designed to efficiently serve queries for dashboards.

However, in the main Data Distiller Query Pro Mode Editor within the AEP UI (i.e., Queries → Create Query), queries can be executed across both the AEP Data Lake and the Accelerated Store in a federated manner. These queries can read from and write to both stores without the 60-second timeout restriction, although they may time out after 10 minutes. This flexibility allows for longer-running queries that can return results in minutes, avoiding the limitations present in chart creation.

Hydrate the Tables

Let us hydrate the tables

INSERT INTO testexample.facttables.crm_table (country_code, purchase)
(SELECT country, purchase_amount FROM purchases_dataset_crm_data)

INSERT INTO testexample.lookups.country_lookup(country_code, country_name)
(SELECT country_code, country FROM(
SELECT count(*), country_code, country FROM country_codes
GROUP BY country_code, country))

Observe that the data is being read from the AEP Data Lake but it is being inserted into the Accelerated Store. The Data Distiller Batch Query Engine is read from one store (AEP Data Lake) and writing in a federated fashion into the

Note the following:

The order of the specification of the columns (country_code, purchase) in the crm_table destination table needs to match the same order of the columns (country, purchase_amount) in the source table purchases_dataset_crm_data.
The order of the specification of the columns (country_code, country_name) in the country_lookup destination table needs to match the same order of the columns (country_code, country) in the source table country_codes.

Let us test some queries. Since these tables are in the Accelerated Store, we should now choose the database as testexample

SELECT * FROM testexample.facttables.crm_table;

The results are the following:

Also execute the following query as well:

SELECT * FROM testexample.lookups.country_lookup;

The results are the following:

Note that all the tables that get created here need not have a unique name spanning both the Data Lake and the Accelerated Store. Hence, you need to specify the full path in the CREATE TABLE metadata command.

Federation Across Data Models in Accelerated Store

A Note on Federation Across Data Models

Since data models are logical partitions, all tables that are part of data models within the Accelerated Store are accessible in Data Distiller Query Pro Mode for mixing and matching. These tables can also be used for dashboarding purposes.

The full path of a table i.e.database_name. schema_name.table_name has to be unique across the AEP Data Lake and Accelerated Store. The system will not let you create duplicate tables underneath the same data model (database_name. schema_name)

Resolve Country ID

Let us do a join query between the fact and the dimension tables. Our goal is to combine records from two tables whenever there are matching values in a field common to both tables which is equivalent to an INNER JOIN.

SELECT a.purchase,a.country_code,b.country_name FROM crm_table a
INNER JOIN country_lookup b ON a.country_code = b.country_code

The results will be:

Aggregation Queries

Let us run the aggregation query. Our goal again is to combine records from two tables whenever there are matching values in a field common to both tables which is equivalent to an INNER JOIN.

SELECT sum(purchase) AS purchase, country_name FROM testexample.facttables.crm_table 
INNER JOIN testexample.lookups.country_lookup ON testexample.facttables.crm_table.country_code = testexample.lookups.country_lookup.country_code
GROUP BY country_name;

Tip: In DBVisualizer, note that you need to ALIAS or AS construct to name the column. For example, sum(purchase) should be aliased as purchase. The limitation does not exist in the Data Distiller SQL Query Editor.

The results will be:

Last updated 1 year ago