Insights 102: Creating Your First Table in the Accelerated Store

Creating your first table in the Accelerated Store involves defining and setting up a star schema containing tables to store and manage data.

Prerequisites

You should read the Architecture Overview to understand the various query engines supported by Data Distiller:

pagePrereq 102: Key Topics Overview: Architecture, MDM, Personas

What is the Accelerated Store?

In September 2023, Data Distiller added a new SQL compute engine (called Query Accelerated Engine) specifically geared toward business intelligence, dashboarding, and reporting use cases. The storage and compute layer sits parallel to the data lake and its associated SQL engines (Ad Hoc Query and Batch). As a Data Distiller user, you will see all of the tables i.e. AEP datasets in a single catalog view.

A portion of the Accelerated Store is used by AEP to power the out-of-the-box dashboards seen in Adobe Real-Time CDP (RTCDP) Overviews pages of Profile, Audiences, and Destinations. You will also see similar dashboards show up on Adobe Journey Optimizer (AJO) pages for reporting on journeys and campaigns. The tables used in creating the star schemas (or datagroups as Data Distiller calls them) are available for SQL querying with those apps. There is no customization available at the data layer for these models.

In this example, we will visit the use case of fully building our own star schema as we would have done in our enterprise warehouse except that we will be doing it within AEP.

There are a few things you need to be aware of so that you can implement a robust design for your end users:

Data Storage: The Query Accelerated layer has a limitation of 1 TB of data that you can keep in the star schemas. This excludes all of the star schemas that were created for out-of-the-box dashboards for RTCDP or AJO. If you copy raw data into this layer, be aware of this size limitation.
Query Concurrency: There can be a maximum of 4 concurrent queries in the system. These queries include read, write, and update operations. If you do long-running ETL jobs that are computing and writing data into the accelerated layer, it can impact the BI dashboard performance. This is not uncommon in such systems and you should make sure you have designed this well.
User Concurrency: User concurrency is unlimited, unlike the Ad Hoc Query engine. Make sure you create a Technical User Account for each of the dashboards you publish.
Caching: The Query Accelerated layer has a caching layer that caches the queries across all users so you should see a performance boost for the same dashboard used across multiple users. The cold-start problem where the first user in the system may have the longest loading time is something to be aware of.
Flat Tables: You cannot create nested data structures within the Query Accelerated Layer. The whole idea is to keep the data in fact and dimension tables much like what is seen in data warehousing practice.
Built-in BI Dashboards: Data Distiller also integrates with (User-Defined) Dashboards which you can see in the AEP UI. All the star schemas you build along with the ones that are shipped by AEP are accessible here. If your use case demands that access to the insights be accessible within AEP or you have resource constraints to get a BI license, then this is the tool for you. It is not a fully-fledged BI dashboarding capability but has enough to get you the visualization you are looking for.
More Capacity: If you are looking for more capacity (storage & query concurrency) than what ships today, then you need to contact an Adobe rep.

Scenario

Our scenario is pretty straightforward as we will be showing the mechanics of creating a fact table with aggregated snapshot data of purchases by country code. There is a lookup table that has the country ID mapped to the country code. This lookup table will function as a dimension table. The goal is to create a star schema in the Accelerated Store.

The files that you will need are the following

Follow the steps for importing smaller-sized CSV files (1GB or less) as datasets outlined here:

pageProfile 101: Data Distillation for Movie Genre Targeting with Email Lists

Create a Database and a Schema

Tables contain the data. Think of a database as a logical grouping of tables and schemas as the next-level logical grouping underneath it. The database grouping is useful when you want to keep the data separated for various reporting use cases. From a BI tool perspective, this means that you should not be able to answer questions only specific to this database grouping.

Note: Irrespective of the database and schema groupings that you create, all of these tables are accessible just like the datasets in the data lake for querying and joining. The logical grouping makes the most sense when you access these databases within a BI tool like Tableau or PowerBI.

Let us first create a database called testexample

CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

Let us take a look at the definition in the SQL metadata commands:

TYPE=QSACCEL: This specifies that the database and the schemas are custom-built and reside in the customizable layer of the Accelerated Store
ACCOUNT=acp_query_batch: This specifies that you have the Data Distiller license in order to create the tables.

Note: The customization feature works only if you have the Data Distiller license.

We will create two logical partitions (just because we want to) to separate this database into fact and lookup schemas

CREATE schema testexample.lookups;
CREATE schema testexample.facttables;

Note that if you made a mistake in creating a database or schemas, you can do the following:

DROP DATABASE IF EXISTS testexample; 
DROP SCHEMA IF EXISTS testexample.lookups CASCADE;
DROP SCHEMA IF EXISTS testexample.facttables CASCADE;

The keyword CASCADE is a helpful keyword because it not just deletes the schemas but also deletes the underlying tables.

Create Tables

Tables actually carry the data we will need for our dashboarding/reporting use cases. Unlike the data lake tables, you will need to predefine the data types for these tables.

Let us define the lookups table first:

CREATE TABLE testexample.lookups.country_lookup AS
       SELECT
          cast(null as string) country_code,
          cast(null as string) country_name
       WHERE false;

Let us define the fact tables first:

CREATE TABLE testexample.facttables.crm_table AS
       SELECT
          cast(null as int)    purchase,
          cast(null as string)   country_code
WHERE false;

If you ever need to surgically drop a table, you can use the following:

DROP TABLE IF EXISTS testexample.lookups.country_lookup;

DROP TABLE IF EXISTS testexample.facttables.crm_table;

Tip: Postgres SQL treats names of objects (database, schemas, and tables) in a case-insensitive manner. If you use uppercase letters in naming, it will treat it as if its in lowercase.

Build Relationships Between Fact and Dimension Tables

This relationship building is required for the BI tool to make sense of this star schema. The User-Defined Dashboards also use these associations in "joining" these lookups on the fly which is helpful when building visualizations.

The code is specified in the following way:

ALTER TABLE crm_table ADD CONSTRAINT FOREIGN KEY (country_code) REFERENCES country_lookup(country_code) NOT enforced;

Warning: If you use the full path specification such as testexample.facttables.crm_table and testexample.lookups.country_lookup in the ALTER TABLE command, you will get errors.

The key points to note are the following:

crm_table is the core fact table that has a key relationship on country_code
REFERENCES means that crm_table is joining in country_lookup on the country_code key.
NOT enforced means that the JOIN can be a 1-to-many on the country_lookup table. If there are two values for a country_code to country_name, we will do the JOIN on both. It is your responsibility to make sure that the country_lookup does not have duplicates for a single country_code value.

Tip: In DBVisualizer, if you use metadata commands like ALTER TABLE, executing a subsequent SELECT command will not return any results. You will need to disconnect and reconnect to the Data Distiller database. Metadata commands make changes to the Data Distiller database and these changes are not pushed out into the client i.e. DBVisualizer. The client has to make the calls to get thesee updates.

Hydrate the Tables

Let us hydrate the tables

INSERT INTO testexample.facttables.crm_table (country_code, purchase)
(SELECT country, purchase_amount FROM purchases_dataset_crm_data)

INSERT INTO testexample.lookups.country_lookup(country_code, country_name)
(select country_code, country from(
select count(*), country_code, country from country_codes
group by country_code, country))

Note the following:

The order of the specification of the columns (country_code, purchase) in the crm_table destination table needs to match the same order of the columns (country, purchase_amount) in the source table purchases_dataset_crm_data.
The order of the specification of the columns (country_code, country_name) in the country_lookup destination table needs to match the same order of the columns (country_code, country) in the source table country_codes.

Let us test some queries:

select * from testexample.facttables.crm_table 
select * from testexample.lookups.country_lookup

The results are the following:

Note that all the tables that get created here will have a unique name spanning both the Data Lake and the Accelerated Store and you need to specify this once in the CREATE TABLE metadata command. This means that we could have run all of the following queries by dropping the full dot notation:

Resolve Country ID

Let us do a join query between the fact and the dimension tables. Our goal is to combine records from two tables whenever there are matching values in a field common to both tables which is equivalent to an INNER JOIN.

SELECT a.purchase,a.country_code,b.country_name FROM crm_table a
INNER JOIN country_lookup b ON a.country_code = b.country_code

The results will be:

Aggregation Queries

Let us run the aggregation query. Our goal again is to combine records from two tables whenever there are matching values in a field common to both tables which is equivalent to an INNER JOIN.

SELECT sum(purchase) AS purchase, country_name FROM testexample.facttables.crm_table 
INNER JOIN testexample.lookups.country_lookup ON testexample.facttables.crm_table.country_code = testexample.lookups.country_lookup.country_code
GROUP BY country_name;

Tip: In DBVisualizer, note that you need to ALIAS or AS construct to name the column. For example, sum(purchase) should be aliased as purchase. The limitation does not exist in the Data Distiller SQL Query Editor.

The results will be:

Last updated 7 months ago

Insights 102: Creating Your First Table in the Accelerated Store

Creating your first table in the Accelerated Store involves defining and setting up a star schema containing tables to store and manage data.

Prerequisites

You should read the Architecture Overview to understand the various query engines supported by Data Distiller:

pagePrereq 102: Key Topics Overview: Architecture, MDM, Personas

What is the Accelerated Store?

In this example, we will visit the use case of fully building our own star schema as we would have done in our enterprise warehouse except that we will be doing it within AEP.

There are a few things you need to be aware of so that you can implement a robust design for your end users:

Data Storage: The Query Accelerated layer has a limitation of 1 TB of data that you can keep in the star schemas. This excludes all of the star schemas that were created for out-of-the-box dashboards for RTCDP or AJO. If you copy raw data into this layer, be aware of this size limitation.
Query Concurrency: There can be a maximum of 4 concurrent queries in the system. These queries include read, write, and update operations. If you do long-running ETL jobs that are computing and writing data into the accelerated layer, it can impact the BI dashboard performance. This is not uncommon in such systems and you should make sure you have designed this well.
User Concurrency: User concurrency is unlimited, unlike the Ad Hoc Query engine. Make sure you create a Technical User Account for each of the dashboards you publish.
Caching: The Query Accelerated layer has a caching layer that caches the queries across all users so you should see a performance boost for the same dashboard used across multiple users. The cold-start problem where the first user in the system may have the longest loading time is something to be aware of.
Flat Tables: You cannot create nested data structures within the Query Accelerated Layer. The whole idea is to keep the data in fact and dimension tables much like what is seen in data warehousing practice.
Built-in BI Dashboards: Data Distiller also integrates with (User-Defined) Dashboards which you can see in the AEP UI. All the star schemas you build along with the ones that are shipped by AEP are accessible here. If your use case demands that access to the insights be accessible within AEP or you have resource constraints to get a BI license, then this is the tool for you. It is not a fully-fledged BI dashboarding capability but has enough to get you the visualization you are looking for.
More Capacity: If you are looking for more capacity (storage & query concurrency) than what ships today, then you need to contact an Adobe rep.

Scenario

The files that you will need are the following

Follow the steps for importing smaller-sized CSV files (1GB or less) as datasets outlined here:

pageProfile 101: Data Distillation for Movie Genre Targeting with Email Lists

Create a Database and a Schema

Let us first create a database called testexample

CREATE DATABASE testexample WITH (TYPE=QSACCEL, ACCOUNT=acp_query_batch);

Let us take a look at the definition in the SQL metadata commands:

TYPE=QSACCEL: This specifies that the database and the schemas are custom-built and reside in the customizable layer of the Accelerated Store
ACCOUNT=acp_query_batch: This specifies that you have the Data Distiller license in order to create the tables.

Note: The customization feature works only if you have the Data Distiller license.

We will create two logical partitions (just because we want to) to separate this database into fact and lookup schemas

CREATE schema testexample.lookups;
CREATE schema testexample.facttables;

Note that if you made a mistake in creating a database or schemas, you can do the following:

DROP DATABASE IF EXISTS testexample; 
DROP SCHEMA IF EXISTS testexample.lookups CASCADE;
DROP SCHEMA IF EXISTS testexample.facttables CASCADE;

The keyword CASCADE is a helpful keyword because it not just deletes the schemas but also deletes the underlying tables.

Create Tables

Tables actually carry the data we will need for our dashboarding/reporting use cases. Unlike the data lake tables, you will need to predefine the data types for these tables.

Let us define the lookups table first:

CREATE TABLE testexample.lookups.country_lookup AS
       SELECT
          cast(null as string) country_code,
          cast(null as string) country_name
       WHERE false;

Let us define the fact tables first:

CREATE TABLE testexample.facttables.crm_table AS
       SELECT
          cast(null as int)    purchase,
          cast(null as string)   country_code
WHERE false;

If you ever need to surgically drop a table, you can use the following:

DROP TABLE IF EXISTS testexample.lookups.country_lookup;

DROP TABLE IF EXISTS testexample.facttables.crm_table;

Tip: Postgres SQL treats names of objects (database, schemas, and tables) in a case-insensitive manner. If you use uppercase letters in naming, it will treat it as if its in lowercase.

Build Relationships Between Fact and Dimension Tables

The code is specified in the following way:

ALTER TABLE crm_table ADD CONSTRAINT FOREIGN KEY (country_code) REFERENCES country_lookup(country_code) NOT enforced;

Warning: If you use the full path specification such as testexample.facttables.crm_table and testexample.lookups.country_lookup in the ALTER TABLE command, you will get errors.

The key points to note are the following:

crm_table is the core fact table that has a key relationship on country_code
REFERENCES means that crm_table is joining in country_lookup on the country_code key.
NOT enforced means that the JOIN can be a 1-to-many on the country_lookup table. If there are two values for a country_code to country_name, we will do the JOIN on both. It is your responsibility to make sure that the country_lookup does not have duplicates for a single country_code value.

Hydrate the Tables

Let us hydrate the tables

INSERT INTO testexample.facttables.crm_table (country_code, purchase)
(SELECT country, purchase_amount FROM purchases_dataset_crm_data)

INSERT INTO testexample.lookups.country_lookup(country_code, country_name)
(select country_code, country from(
select count(*), country_code, country from country_codes
group by country_code, country))

Note the following:

The order of the specification of the columns (country_code, purchase) in the crm_table destination table needs to match the same order of the columns (country, purchase_amount) in the source table purchases_dataset_crm_data.
The order of the specification of the columns (country_code, country_name) in the country_lookup destination table needs to match the same order of the columns (country_code, country) in the source table country_codes.

Let us test some queries:

select * from testexample.facttables.crm_table 
select * from testexample.lookups.country_lookup

The results are the following:

Resolve Country ID

SELECT a.purchase,a.country_code,b.country_name FROM crm_table a
INNER JOIN country_lookup b ON a.country_code = b.country_code

The results will be:

Aggregation Queries

Let us run the aggregation query. Our goal again is to combine records from two tables whenever there are matching values in a field common to both tables which is equivalent to an INNER JOIN.

SELECT sum(purchase) AS purchase, country_name FROM testexample.facttables.crm_table 
INNER JOIN testexample.lookups.country_lookup ON testexample.facttables.crm_table.country_code = testexample.lookups.country_lookup.country_code
GROUP BY country_name;

The results will be:

Last updated 7 months ago