EXPLORE 101: Exploring Ingested Batches in a Dataset with Data Distiller
It is important for you to understand how the data ingestion process works and why interrogating the records ingested in a batch is an important tool in your arsenal for addressing downstream issues.
Data Ingestion Primer
One of the key tasks you will need to perform at some point is verifying and validating the records within a batch that has been successfully ingested into the Adobe Experience Platform.
Remember that a "batch" is a data ingestion concept: a collection of records, delivered in a file or otherwise, via batch or streaming ingestion, that is materialized as a single "unit" on the data lake. In essence, it is a materialization construct used by AEP.
Records that are ingested have to pass through several checks before such materialization can take place. This is handled during the mapping part of the data ingestion process. There are several categories of issues that can arise, and you need to be aware of them. They will manifest themselves as error codes if you peek into a dataset.
Navigate to the Datasets pane and, if you are unlucky enough to have one, click on a batch that has failed:
You will perhaps see a bunch of errors that look like this:
Some bad things have happened to our data ingestion. Let us understand the error codes:
ERROR: These are the most egregious errors possible, where the data is corrupted or does not conform to the expected format. Such failures are serious, and the entire batch will fail.
DCVS: Not seen in the above example, but these are less serious than data corruption issues, for example a missing required field. All of these rows are simply skipped. A separate dataset containing such records is NOT available on the data lake; these records are kept in a separate location and are accessible through the error diagnostics tools (UI or API). The reality of dealing with such situations is that if those skipped records are critical for your use case, you will need to surgically identify them in the source system and re-ingest the data. And if that is not possible, you will have to account for the gap in your data downstream.
MAPPER: These appear to be the least harmful of the three, but you need to pay attention to them because these are rows that make it into the final dataset BUT whose data may have been altered in the process. The mapping process tries to convert the string data at its input to the output data type. When it cannot do so because of a malformed string, it writes NULLs into the result. If you were not paying attention, you now have a field full of NULLs that possibly could have been rectified by you. Thus batches with MAPPER warnings become good candidates for some data exploration to see what is going on, as sketched below.
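A quick NULL count on the affected field tells you how many rows the mapper touched. The following is a minimal sketch, assuming a hypothetical column release_date in the movie_data dataset used later in this lesson; substitute whichever field the MAPPER warning flagged:
-- release_date is a hypothetical example column; replace it with the field flagged by the MAPPER warning
select count(*) as total_rows,
       count(release_date) as non_null_rows,
       count(*) - count(release_date) as nulled_rows
from movie_data;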
Accessing Dataset Batch Metadata
In order to see what system fields are available in the dataset, set the following in a session:
set drop_system_columns=false;
select * from movie_data;
By doing so, you will see two new columns appear to the far right: _acp_system_metadata and _ACP_BATCHID.
As data gets ingested into the platform, a logical partition is assigned to it based on what is coming in at the input. _acp_system_metadata.sourceBatchId identifies that logical partition, and _ACP_BATCHID identifies the physical partition once the data has been mastered onto the data lake in Adobe Experience Platform.
Let us execute the following query:
select _acp_system_metadata, count(distinct _ACP_BATCHID)
from movie_data
group by _acp_system_metadata;
The results are:
This means that the number of batches at the input need not correspond to the number of batches written. In fact, the system decides the most efficient way to batch and master the data onto the data lake. You can check this relationship directly with the query sketched below, and the example that follows makes it concrete.
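To compare the count of distinct logical (input) batches against the count of distinct physical (output) batches on your own dataset, a sketch like the following can be used; it assumes the sourceBatchId field described above is populated:
select count(distinct _acp_system_metadata.sourceBatchId) as input_batches,
       count(distinct _ACP_BATCHID) as output_batches
from movie_data;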
Let's run this on a different dataset below. For those of you who are motivated, you need to ingest this data using XDM mapping into the Adobe Experience Platform.
This file is a deeply nested set of 35,000 records and they look like this:
select * from drug_orders;
Let us generate some batch-based statistics on this dataset:
select _acp_system_metadata,
       count(distinct _ACP_BATCHID) as numoutputbatches,
       count(_ACP_BATCHID) as recordcount
from drug_orders
group by _acp_system_metadata;
The answers look like this:
The above shows that I created 3 input batches, ingesting 2,000, 24,000, and 9,000 records respectively. However, when they got mastered, each input batch resulted in just one unique output batch.
Remember that all records visible within a dataset are the ones that were successfully ingested. That does not mean that all the records sent from the source are present. You will need to look at the data ingestion failures to find the batches/records that did not make it in.
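As a quick cross-check, you can count the records that landed in each physical batch and compare those numbers against what you sent from the source. A minimal sketch on the drug_orders dataset from above:
select _ACP_BATCHID, count(*) as ingested_records
from drug_orders
group by _ACP_BATCHID
order by ingested_records desc;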
Querying a Batch in a Dataset
If you want to simulate the creation of a batch, go to the Movie Genre Targeting Example and complete the section on ingesting CSV files.
If you open up the dataset pane, you will see this:
Copy the batch ID by going to the panel on the right:
Now use the following query to retrieve all the records that made it into the dataset as part of that batch:
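A sketch of that query follows; the batch ID string is a placeholder for the one you copied, and movie_data stands in for your dataset:
select *
from movie_data
where _ACP_BATCHID = '<your_batch_id>'  -- paste the batch ID you copied from the panel
limit 100;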
_ACP_BATCHID is the column we use to filter on the batch ID. The LIMIT clause is useful if you want to restrict the number of rows displayed, though a filter condition is generally more desirable.
If you execute this query in the Query Editor in the Adobe Experience Platform, the results will be truncated at 100 rows; the editor was designed as a quick preview tool. To retrieve up to 50,000 rows, you need to use a third-party tool such as DBVisualizer (my favorite) or DBeaver, another popular choice. Keep in mind that these editor tools are quite capable and mostly free.