BI 200: Create Your First Data Model in the Data Distiller Warehouse for Dashboarding
Creating your first table in the Accelerated Store involves defining and setting up a star schema containing tables to store and manage data.
You should read the Architecture Overview to understand the various query engines supported by Data Distiller:
In September 2023, Data Distiller added a new SQL compute engine (called the Query Accelerated Engine or the Data Distiller Warehousing Engine) specifically geared toward business intelligence, dashboarding, and reporting use cases. Its storage and compute layer sits parallel to the data lake and its associated SQL engines (Ad Hoc Query and Batch). As a Data Distiller user, you will see all of the tables, i.e., AEP datasets, in a single catalog view.
A portion of the Accelerated Store is used by AEP to power the out-of-the-box dashboards seen on the Adobe Real-Time CDP (RTCDP) Overview pages for Profile, Audiences, and Destinations. You will also see similar dashboards show up on Adobe Journey Optimizer (AJO) pages for reporting on journeys and campaigns. The tables used in creating these star schemas (or datagroups, as Data Distiller calls them) are available for SQL querying within those apps. There is no customization available at the data layer for these models.
In this example, we will walk through the use case of building our own star schema end to end, just as we would have done in an enterprise warehouse, except that we will be doing it within AEP.
There are a few things you need to be aware of so that you can implement a robust design for your end users:
Data Storage: The Query Accelerated layer has a limitation of 1 TB of data that you can keep in the star schemas. This excludes all of the star schemas that were created for out-of-the-box dashboards for RTCDP or AJO. If you copy raw data into this layer, be aware of this size limitation.
Query Concurrency: There can be a maximum of 4 concurrent queries in the system. These queries include read, write, and update operations. If you do long-running ETL jobs that are computing and writing data into the accelerated layer, it can impact the BI dashboard performance. This is not uncommon in such systems and you should make sure you have designed this well.
User Concurrency: User concurrency is unlimited, unlike the Ad Hoc Query engine. Make sure you create a Technical User Account for each of the dashboards you publish.
Caching: The Query Accelerated layer has a caching layer that caches the queries across all users so you should see a performance boost for the same dashboard used across multiple users. The cold-start problem where the first user in the system may have the longest loading time is something to be aware of.
Flat Tables: You cannot create nested data structures within the Query Accelerated Layer. The whole idea is to keep the data in fact and dimension tables much like what is seen in data warehousing practice.
Built-in BI Dashboards: Data Distiller also integrates with (User-Defined) Dashboards, which you can see in the AEP UI. All the star schemas you build, along with the ones shipped by AEP, are accessible here. If your use case demands that the insights be accessible within AEP, or you have resource constraints around getting a BI license, then this is the tool for you. It is not a fully-fledged BI dashboarding capability but has enough to get you the visualizations you are looking for.
More Capacity: If you are looking for more capacity (storage & query concurrency) than what ships today, then you need to contact an Adobe rep.
Our scenario is pretty straightforward as we will be showing the mechanics of creating a fact table with aggregated snapshot data of purchases by country code. There is a lookup table that has the country ID mapped to the country code. This lookup table will function as a dimension table. The goal is to create a star schema in the Accelerated Store.
The files that you will need are the following:
Follow the steps for importing smaller-sized CSV files (1GB or less) as datasets outlined here:
Tables contain the data. Think of a database as a logical grouping of tables, and schemas as the next-level logical grouping underneath it. The database grouping is useful when you want to keep the data separated for various reporting use cases. From a BI tool perspective, this means that you should only be able to answer questions specific to this database grouping.
Note: Irrespective of the database and schema groupings that you create, all of these tables are accessible just like the datasets in the data lake for querying and joining. The logical grouping makes the most sense when you access these databases within a BI tool like Tableau or PowerBI.
Navigate to the AEP navigation bar. Click on Queries → Create Query.
Make sure you choose the prod:all database in the database dropdown.
Let us first create a database called testexample:
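The command looks like this; the two WITH parameters are explained right below:

```sql
-- Create a custom database in the customizable layer of the Accelerated Store
CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);
```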
Let us take a look at the definition in the SQL metadata commands:
TYPE=QSACCEL:
This specifies that the database and the schemas are custom-built and reside in the customizable layer of the Accelerated Store
ACCOUNT=acp_query_batch:
This specifies that you have the Data Distiller license, which is required to create these tables.
Note: The customization feature works only if you have the Data Distiller license.
Best Practice: As you start thinking of putting the data model into production, consider prototyping on the AEP Data Lake and use SQL to make sure that all the queries work, including the ones in the charts. All of the steps in this tutorial apply. The only difference is that instead of using CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch)
you will use CREATE DATABASE testexample
This will create the database, the schemas, and the data model on the data lake. When going to production, drop all of the databases, schemas, and tables you prototyped on the data lake and recreate them in the Accelerated Store.
We will create two logical partitions (just because we want to) to separate this database into fact and lookup schemas:
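A minimal sketch of the two schema definitions, assuming the database-qualified CREATE SCHEMA form:

```sql
-- Two logical partitions inside the testexample database
CREATE SCHEMA testexample.lookups;
CREATE SCHEMA testexample.facttables;
```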
Each combination of database and schema, i.e. testexample.lookups and testexample.facttables, is a separate data model.
Let us now rename these data models to friendlier names for access within dashboards:
Also, note that if you made a mistake in creating a database or schemas, you can do the following:
First, you need to access the schemas in the database testexample:
For each schema, you should delete it:
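A sketch of the schema cleanup, assuming the two schemas created above; the CASCADE keyword is discussed right after:

```sql
-- CASCADE also removes any tables that live inside each schema
DROP SCHEMA testexample.lookups CASCADE;
DROP SCHEMA testexample.facttables CASCADE;
```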
After all the schemas are dropped, you can drop the database:
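Once the database is empty, something along these lines should do it:

```sql
-- Remove the now-empty custom database
DROP DATABASE testexample;
```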
The CASCADE keyword is helpful because it not only deletes the schemas but also the underlying tables.
Tables actually carry the data we will need for our dashboarding/reporting use cases. Unlike the data lake tables, you will need to predefine the data types for these tables.
Let us define the lookups table first:
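A sketch of what the definition could look like; the column names (country_code, country_name) come from later steps in this tutorial, while the data types are assumptions you should adapt to your data:

```sql
-- Dimension table: maps a country code to a country name
CREATE TABLE testexample.lookups.country_lookup (
    country_code TEXT,
    country_name TEXT
);
```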
Let us now define the fact table:
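And a matching sketch for the fact table (country_code and purchase are referenced later in this tutorial; the data types are again assumptions):

```sql
-- Fact table: aggregated purchase snapshot keyed by country code
CREATE TABLE testexample.facttables.crm_table (
    country_code TEXT,
    purchase     DECIMAL(18, 2)
);
```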
If you ever need to surgically drop a table, you can use the following:
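For example, something along these lines would remove the fact table just created:

```sql
DROP TABLE testexample.facttables.crm_table;
```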
Tip: PostgreSQL treats the names of objects (databases, schemas, and tables) in a case-insensitive manner. If you use uppercase letters in a name, it will be treated as if it were in lowercase.
This relationship building is required for the BI tool to make sense of this star schema. The User-Defined Dashboards also use these associations in "joining" these lookups on the fly which is helpful when building visualizations.
The code is specified in the following way:
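A sketch of the relationship definition, based on the points listed below. The primary-key statement on the lookup table is an assumption (a key is typically declared before it can be referenced), and the table names are left unqualified per the warning that follows:

```sql
-- Declare the (unenforced) key on the lookup table, then point the fact table at it
ALTER TABLE country_lookup ADD CONSTRAINT PRIMARY KEY (country_code) NOT ENFORCED;
ALTER TABLE crm_table ADD CONSTRAINT FOREIGN KEY (country_code) REFERENCES country_lookup(country_code) NOT ENFORCED;
```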
Warning: If you use the full path specification such as testexample.facttables.crm_table and testexample.lookups.country_lookup in the ALTER TABLE command, you will get errors.
The key points to note are the following:
crm_table is the core fact table that has a key relationship on country_code
REFERENCES means that crm_table is joined to country_lookup on the country_code key.
NOT ENFORCED means that the JOIN can be 1-to-many on the country_lookup table. If a single country_code maps to two country_name values, the JOIN will match both. It is your responsibility to make sure that country_lookup does not have duplicate rows for a single country_code value.
Tip: In DBVisualizer, if you use metadata commands like ALTER TABLE, executing a subsequent SELECT command will not return any results. You will need to disconnect and reconnect to the Data Distiller database. Metadata commands make changes to the Data Distiller database, and these changes are not pushed out to the client, i.e. DBVisualizer. The client has to make the calls to get these updates.
You can execute all of this code query by query, by highlighting a query and then using the Run Selected Query feature in the Data Distiller Query Pro Mode Editor:
Cheat code:
Always ensure you are in the correct database when using the Data Distiller Query Pro Mode Editor. If you are querying tables from the AEP Data Lake or creating tables in the Accelerated Store, make sure you are in the prod:all database. When querying tables in the Accelerated Store, you should be in the specific database that corresponds to the data model containing the tables you're querying.
Data Distiller enables you to read and write datasets across both the AEP Data Lake and the Accelerated Store. From the query engine's perspective, when using the Data Distiller Query Pro Mode Editor (or a third party client), you can seamlessly join tables from both stores and persist the results across them. To ensure the query engine writes to the Accelerated Store rather than the AEP Data Lake, you need to qualify the data model using specific parameters, such as WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch). This syntax signals to the system that the target is the Accelerated Store.
The Data Distiller Query Pro Mode Editor used for chart creation cannot access tables in the AEP Data Lake due to a 60-second timeout limit for chart queries. If a query takes longer to run because of the latency in reading from the data lake, it will time out. This is one of the reasons why the Accelerated Store was built—to mitigate these latencies caused by the data infrastructure layer. While data lakes are excellent for storing large volumes of data, running exploration queries that can take minutes, and executing batch jobs to create new datasets, they are not optimized for dashboarding. In contrast, the Accelerated Store's data layer is designed to efficiently serve queries for dashboards.
However, in the main Data Distiller Query Pro Mode Editor within the AEP UI (i.e., Queries → Create Query), queries can be executed across both the AEP Data Lake and the Accelerated Store in a federated manner. These queries can read from and write to both stores without the 60-second timeout restriction, although they may time out after 10 minutes. This flexibility allows for longer-running queries that can return results in minutes, avoiding the limitations present in chart creation.
Let us hydrate the tables:
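A sketch of the hydration queries, assuming the dataset names from the CSV ingestion step (purchases_dataset_crm_data and country_codes, as referenced in the notes below). The inserts are positional, which is why the column-order notes below matter:

```sql
-- Read from the AEP Data Lake, write into the Accelerated Store
INSERT INTO testexample.facttables.crm_table
SELECT country, purchase_amount FROM purchases_dataset_crm_data;

INSERT INTO testexample.lookups.country_lookup
SELECT country_code, country FROM country_codes;
```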
Observe that the data is being read from the AEP Data Lake but inserted into the Accelerated Store. The Data Distiller Batch Query Engine is reading from one store (the AEP Data Lake) and writing in a federated fashion into the other (the Accelerated Store).
Note the following:
The order of the specification of the columns (country_code, purchase) in the crm_table destination table needs to match the same order of the columns (country, purchase_amount) in the source table purchases_dataset_crm_data.
The order of the specification of the columns (country_code, country_name) in the country_lookup destination table needs to match the same order of the columns (country_code, country) in the source table country_codes.
Let us test some queries. Since these tables are in the Accelerated Store, we should now choose testexample as the database:
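A simple verification query could look like this (with testexample selected, the schema-qualified name should be enough):

```sql
SELECT * FROM facttables.crm_table;
```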
The results are the following:
Execute the following query as well:
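And similarly for the lookup table, something like:

```sql
SELECT * FROM lookups.country_lookup;
```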
The results are the following:
Note that the names of the tables created here are not required to be unique across the Data Lake and the Accelerated Store. Hence, you need to specify the full path in the CREATE TABLE metadata command.
Since data models are logical partitions, all tables that are part of data models within the Accelerated Store are accessible in Data Distiller Query Pro Mode for mixing and matching. These tables can also be used for dashboarding purposes.
The full path of a table, i.e. database_name.schema_name.table_name, has to be unique across the AEP Data Lake and the Accelerated Store. The system will not let you create duplicate tables underneath the same data model (database_name.schema_name).
Let us do a join query between the fact and the dimension tables. Our goal is to combine records from two tables whenever there are matching values in a field common to both tables which is equivalent to an INNER JOIN.
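A sketch of that join, with assumed aliases f and l:

```sql
-- INNER JOIN: keep only rows whose country_code appears in both tables
SELECT l.country_name, f.purchase
FROM facttables.crm_table f
INNER JOIN lookups.country_lookup l
  ON f.country_code = l.country_code;
```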
The results will be:
Let us now run an aggregation query. Our goal is again to combine records from the two tables wherever there are matching values in the common field (an INNER JOIN), but this time we aggregate the purchase amounts.
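A sketch of the aggregation; note the explicit alias on the aggregate, which the tip below calls out:

```sql
-- Sum purchases per country; the aggregate is explicitly aliased as purchase
SELECT l.country_name, SUM(f.purchase) AS purchase
FROM facttables.crm_table f
INNER JOIN lookups.country_lookup l
  ON f.country_code = l.country_code
GROUP BY l.country_name;
```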
Tip: In DBVisualizer, note that you need to use an alias (the AS construct) to name an aggregated column. For example, sum(purchase) should be aliased as purchase. This limitation does not exist in the Data Distiller SQL Query Editor.
The results will be: