PREP 302: Key Topics Overview: Architecture, MDM, Personas
To understand the architecture of Data Distiller, it is important to understand a few things:
Adobe Experience Platform is built on a Service-Oriented Architecture foundation. This means that every component is a separate service that can both call other services and be called by them.
Query Service is the service name of the SQL capabilities in the Adobe Experience Platform.
Data Distiller is the packaging of these capabilities that is sold to customers. Some Data Distiller capabilities are also included as part of the Apps themselves. To understand what comes with an App and what comes with the standalone Data Distiller product, you will need to talk to an Adobe representative.
If you have the Data Distiller product, you have all of these capabilities in one place. For this book, we will assume that you indeed do.
For the rest of this discussion, we will be talking about Query Service architecture so that you know what pieces are involved and why the query execution behaves the way it does.
There are three query engine implementations in Data Distiller, each tuned for a specific set of use cases. Together they give you the flexibility to address a wide spectrum of customer data processing and insights use cases.
The query engine implementations are:
Ad Hoc Query Engine: This query engine implementation enables users to type SELECT queries on the structured and unstructured data in the data lake. The scale of data being queried is far larger than what you would query in your warehouse. Queries time out after 10 minutes of execution (waiting time is not included). The system auto-scales as more users come into the system, so that they are not waiting on cluster initialization time. If you use TEMP tables for data exploration, the data and the results can be cached.
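The exploration pattern above can be sketched as follows. This is a minimal sketch: `web_events` and the field path used here are hypothetical stand-ins for a dataset and XDM field in your own sandbox.

```sql
-- Exploratory read on a (hypothetical) data lake dataset; runs on the
-- Ad Hoc Query Engine and is subject to the 10-minute execution limit.
SELECT web.webPageDetails.name AS page_name,
       COUNT(*)               AS hits
FROM   web_events
GROUP  BY web.webPageDetails.name
ORDER  BY hits DESC
LIMIT  10;

-- Cache an intermediate result in a TEMP table so subsequent
-- exploratory queries can reuse it instead of rescanning the lake.
CREATE TEMP TABLE top_pages AS
SELECT web.webPageDetails.name AS page_name,
       COUNT(*)               AS hits
FROM   web_events
GROUP  BY web.webPageDetails.name;
```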
Batch Query Engine: This is a batch processing engine implementation that creates or adds new data to the data lake. In this case, depending on the query and the size of the data to be processed, we spin up a separate cluster with the required resources for the execution of the query. The SQL statements CREATE TABLE AS and INSERT INTO will use this engine. This is very similar to the "T" (transform) step you will see in state-of-the-art ETL engines. Queries can execute for a maximum of 24 hours with no limits on the concurrency of jobs (scheduled or otherwise).
Accelerated Query Engine: This is an MPP engine designed specifically to address BI dashboard-style queries, and it has its own accelerated store. The query engine together with the store is called the Data Distiller Warehouse. This is very similar to what you would see in state-of-the-art warehousing engines. Results are cached and reused across other similar queries. User concurrency is not limited, but today there are limits on query concurrency (4) and on the size of the data (1 TB).
Let us now look at the routing of the various kinds of queries.
All queries whose outermost statement is a SELECT are essentially "read" queries, whether they execute subqueries or complex conditions.
If you look at the above diagram, it means that you can either read large datasets from the Data Lake via the Ad Hoc Query engine path or you could read compact aggregated datasets from the Accelerated Store. Here is how you would differentiate between the queries:
All datasets across the Data Lake and Accelerated Store are treated as if they belong to the same storage layer. This means that dataset names are unique across these data layers. It also means that by looking at a dataset or table name, you cannot tell where it is located. You do not need to: the Data Distiller engine routes the query automatically.
All datasets in the Accelerated Store have to be created with the following declaration clause:
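A sketch of that declaration is shown below. Assumption: at the time of writing, Query Service documentation expresses this as a `TABLE_TYPE` hint in a `WITH` clause; verify the exact clause against your Adobe documentation. The table name and query body are hypothetical.

```sql
-- Assumed syntax: the WITH (TABLE_TYPE = 'QSACCEL') hint targets the
-- Accelerated Store; without it the table is created on the Data Lake.
CREATE TABLE monthly_revenue
WITH (TABLE_TYPE = 'QSACCEL') AS
SELECT region,
       DATE_TRUNC('month', order_date) AS order_month,
       SUM(amount)                     AS total_revenue
FROM   orders
GROUP  BY region, DATE_TRUNC('month', order_date);
```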
If you want to know which dataset is where simply type:
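Assuming the standard Query Service syntax, the command is:

```sql
-- Lists all tables visible to Data Distiller; the description column
-- indicates whether a table lives in the Accelerated Store.
SHOW TABLES;
```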
The results will look like this in DBVisualizer:
If the description says "Data Warehouse" table, it means that it is in the Accelerated Store. If it says "null", it means that it is on the Data Lake. Accelerated Store tables will be queried via the Query Accelerated Engine. Data Lake tables will be queried via the Ad Hoc Query Engine.
Hint: Another way to detect if a table is on the Data Lake or Accelerated Store is to see if it is a flat table or not. If it is a nested or complex table, then it is on the Data Lake. Accelerated Store requires that datasets or tables be flat as it supports only relational structures.
Any SQL statement that contains "CREATE TABLE AS" or "INSERT INTO" will be routed to the Batch Query Engine.
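Both statement forms are sketched below. The table names (`enriched_orders`, `orders`, `customers`) are hypothetical; the point is only the shape of the statements that trigger Batch Query Engine routing.

```sql
-- CREATE TABLE AS: routed to the Batch Query Engine; creates a new dataset.
CREATE TABLE enriched_orders AS
SELECT o.order_id, o.amount, c.region
FROM   orders o
JOIN   customers c ON o.customer_id = c.customer_id;

-- INSERT INTO: also routed to the Batch Query Engine; appends a new
-- batch of rows to the existing dataset.
INSERT INTO enriched_orders
SELECT o.order_id, o.amount, c.region
FROM   orders o
JOIN   customers c ON o.customer_id = c.customer_id
WHERE  o.order_date >= '2023-01-01';
```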
The Batch Query Engine can write to either the Data Lake or the Accelerated Store. The data layer it writes to is determined by the same routing rule used for reads: if the target table exists in the Accelerated Store, the engine writes there; otherwise it writes to the Data Lake.
Note: Data Distiller allows you to mix and match tables in your query across the Data Lake and Accelerated Store. This means you can reuse the results of your work in the Accelerated Store to create richer datasets.
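A cross-store join can be sketched as follows. Both table names are hypothetical: `regional_revenue_agg` stands in for an Accelerated Store aggregate and `commerce_events` for a Data Lake dataset; Data Distiller routes each side automatically.

```sql
-- Join a compact Accelerated Store aggregate with a raw Data Lake
-- dataset in a single query; no explicit routing is needed.
SELECT a.region,
       a.total_revenue,
       COUNT(e.event_id) AS event_count
FROM   regional_revenue_agg a   -- lives in the Accelerated Store
JOIN   commerce_events e        -- lives in the Data Lake
       ON a.region = e.region
GROUP  BY a.region, a.total_revenue;
```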
Data Distiller SQL conforms to PostgreSQL syntax. SQL is a popular relational database language that was first standardized in 1986 by the American National Standards Institute (ANSI); in 1987, the International Organization for Standardization (ISO) adopted SQL as an international standard. PostgreSQL is compliant with ANSI SQL standards: it is compatible with SQL:2008 and supports most of the major features of SQL:2016. However, the syntax accepted by PostgreSQL differs slightly from that of commercial engines.
Master Data Management (MDM) is a method and a set of tools used to manage an organization's critical data. MDM focuses on ensuring that essential data is consistently defined, shared, and used throughout an organization, which can help improve data quality, streamline data integration, and enable more accurate reporting and analytics. Data Distiller is not an MDM tool but it has features that can replicate MDM-like features on datasets in the data lake in the Adobe Experience Platform.
Data Scope: Note that MDM covers the entire enterprise data while the scope of data that can be covered by Data Distiller is only the data brought into the Adobe Experience Platform. Hence, the MDM-like functionality is restricted to the data that is available.
MDM Concept | Supported? | Data Distiller Implementation |
---|---|---|
Data Governance: MDM involves establishing data governance policies and procedures to ensure that data is accurate, consistent, and secure. MDM helps organizations comply with data privacy regulations, such as GDPR or HIPAA, by ensuring that sensitive data is properly managed and protected. | Yes | Data Governance in Data Distiller is always within the context of the Data Lake, Accelerated Store, and the Apps (Adobe Real-Time CDP, etc.). Compliance with GDPR and HIPAA are supported. |
Data Quality: MDM aims to improve data quality by cleansing and standardizing data. | Yes but manual | You will need to implement this per dataset. You can templatize the logic and reuse it for multiple datasets. |
Data Matching and Deduplication: MDM tools use algorithms to identify and merge duplicate records | Yes but manual | You will need to implement this per dataset. You can templatize the logic and reuse it for multiple datasets. |
Data Enrichment: MDM can involve enriching data with additional information. For example, appending geographical coordinates to customer addresses to enable location-based analytics. | Yes but manual | You will need to implement this per dataset. You can templatize the logic and reuse it for multiple datasets. |
Data Integration: MDM helps integrate data from various sources, making it accessible and usable across the organization. | Yes | This is covered by the Sources functionality in Adobe Experience Platform. When you get a license to an App, you get access to the same set of sources. Data Distiller can leverage the same input data sources. |
Hierarchical Data Management: MDM can manage hierarchical relationships, such as product categories and subcategories. | Yes | XDM modeling gives you the flexibility to model a wide range of relationships on the data lake. The closest Data Distiller gets is with star or snowflake schema modeling with primary and secondary key relationships between datasets. |
Customer 360: One common example is building a "Customer 360" view, where all relevant customer information, including demographics, purchase history, and support interactions, is consolidated into a single, unified profile. | Yes but manual | This is supported by the Real-Time Customer Profile and hence Data Distiller has access to the same data. |
Product Information Management (PIM): In e-commerce and retail, MDM is used to manage product data, ensuring consistent and complete product information across various sales channels. | Limited | Data Distiller's functionality is closer to that of an OLAP database than an OLTP database. You cannot UPDATE records. |
Supplier Data Management: In supply chain management, MDM can be used to maintain accurate and up-to-date information about suppliers, including contact details, certifications, and performance metrics. | Limited | Data Distiller's functionality is closer to that of an OLAP database than an OLTP database. You cannot UPDATE records. |
Financial Data Management: MDM can be applied to financial data, ensuring that financial reports and statements are based on accurate and consistent data from various sources. | Limited | Data Distiller's functionality is closer to that of an OLAP database than an OLTP database. You cannot UPDATE records. |
Centralized User Experience for Master Data Management use cases | Not supported | Data Distiller is still a data processing and analytics tool. |
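The manual "templatize and reuse" pattern that the table references for data matching and deduplication can be sketched as follows. The dataset and columns (`customers_raw`, `email`, `last_updated`) are hypothetical; the window-function logic is what you would templatize per dataset.

```sql
-- Manual deduplication sketch: keep only the most recent record
-- per email address, writing the result as a new dataset.
CREATE TABLE customers_deduped AS
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
               PARTITION BY email
               ORDER BY last_updated DESC
           ) AS rn
    FROM customers_raw c
) ranked
WHERE rn = 1;
```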
Operations | Data Lake | Accelerated Store |
---|---|---|
CREATE | Supported: you can replace a dataset or add new batches of data. | Supported |
READ | Supported | Supported |
UPDATE | Not supported as the unit of update is a "batch of records" in the data lake. You will need to replay the data. | Supported |
DELETE | Record-level delete is not supported, dataset level delete is supported. You will need to replay the data in order to delete the records you do not want. | Supported |
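The "replay" workaround for record-level deletes on the Data Lake can be sketched as follows. The table names (`orders`, `deletion_requests`, `orders_clean`) are hypothetical; the pattern is to rewrite the dataset without the unwanted rows rather than deleting them in place.

```sql
-- Replay the dataset, filtering out records flagged for deletion;
-- the rewritten table then replaces the original in downstream use.
CREATE TABLE orders_clean AS
SELECT *
FROM   orders
WHERE  customer_id NOT IN (
    SELECT customer_id FROM deletion_requests
);
```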
One of the patterns you will see in the world of data is the convergence of multiple domains of expertise into one. The overlaps are strong, and the traditional thinking that a single area of expertise is the future (for example, that AI engineers will replace everyone else, or that data science will replace analysis) is misguided. You can give your team whatever fancy titles you want, but you will need a team to pull off these tasks. Focus on the expertise people bring rather than their persona. Your team will be lacking some of this expertise, and that should be an area of investment for you.