What is the Dataset Structuring Engine? #846
The Dataset Structuring Engine is responsible for properly structuring a dataset before its analysis. This includes:
## Prerequisites

Before a dataset can be handled by the Dataset Structuring Engine, it must be:
These operations are usually best handled by a Data Architect using the Data Integration Engine.

## Partitioning Configuration

Once a dataset has been prepared for the Dataset Structuring Engine as described above, the Data Architect must select the columns that should be used for partitioning. A dataset can include one or multiple partitionings, and each partitioning can be defined with one or multiple columns. For example, a temporal partitioning could be defined with a timestamp column, while a geospatial partitioning could be defined with a pair of latitude and longitude columns. When a single column is used to define a partitioning, its range of possible values must be such that it can be binned (discretized) so that subdividing the dataset into as many partitions as there are bins produces compressed Parquet partitions whose average size is about 1GB (see the sizing sketch below). When multiple columns are used to define a partitioning, the range of possible combinations of values must satisfy the same constraints. Furthermore:
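To make the sizing constraint concrete, here is a minimal sketch of how a target bin count could be derived from the compressed size of a dataset; the `estimate_bin_count` helper is hypothetical and not part of the engine's API.

```python
import math

def estimate_bin_count(compressed_dataset_bytes: int,
                       target_partition_bytes: int = 1 * 1024**3) -> int:
    """Number of bins needed so that each compressed Parquet partition
    averages roughly `target_partition_bytes` (~1GB by default)."""
    return max(1, math.ceil(compressed_dataset_bytes / target_partition_bytes))

# Hypothetical example: a 100TB compressed dataset partitioned on a timestamp
# column would be discretized into roughly 100,000 temporal bins of ~1GB each.
print(estimate_bin_count(100 * 1024**4))  # 102400, i.e. on the order of 100,000
```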
## Parquet Sub-Partitioning

When two or three partitionings are defined, rows within individual Parquet partitions are grouped using Parquet row groups defined across another partitioning dimension, and as many sets of partitions are stored on the Object Store as there are ordered pairs of partitionings. Therefore, if a single partitioning is defined, a single set of partitions is stored. If two partitionings are defined, two sets of partitions are stored. And if three partitionings are defined, six sets of partitions are stored. For example, if partitionings are defined across dimensions
## Multi-Partitioned Summaries

Summary statistics and summary charts are systematically computed for every column of the structured dataset, for every partition. Furthermore, when multiple partitionings are produced, these summaries are systematically computed for every partition|row-group pair in relation to every pair of partitionings. This OLAP-style pre-computation of multi-dimensional aggregations ensures that all column-level summaries can be quickly reduced from partition|row-group summaries for any selection of partitions and row groups.
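To illustrate the kind of reduction this pre-computation enables, here is a minimal sketch (hypothetical field names, not the engine's actual summary schema) showing how column-level statistics for an arbitrary selection can be folded from pre-computed partition|row-group summaries without rescanning raw data.

```python
from dataclasses import dataclass

@dataclass
class ColumnSummary:
    """Pre-computed summary of one column within one partition|row-group."""
    count: int
    total: float
    minimum: float
    maximum: float

def reduce_summaries(summaries: list[ColumnSummary]) -> ColumnSummary:
    """Fold the summaries of the selected partitions/row groups into a single
    column-level summary; cost is proportional to the selection size only."""
    merged = ColumnSummary(0, 0.0, float("inf"), float("-inf"))
    for s in summaries:
        merged.count += s.count
        merged.total += s.total
        merged.minimum = min(merged.minimum, s.minimum)
        merged.maximum = max(merged.maximum, s.maximum)
    return merged

# Any selection of partition|row-group summaries reduces in one pass:
selection = [ColumnSummary(10, 55.0, 1.0, 10.0), ColumnSummary(5, 20.0, 2.0, 6.0)]
overall = reduce_summaries(selection)
print(overall.count, overall.total / overall.count)  # 15 5.0
```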
## Requirements

The Dataset Structuring Engine must be:

## Reference Dataset

For design purposes, we consider the following hypothetical Reference Dataset:
## Repartitioning

The biggest challenge faced by the Dataset Structuring Engine is the repartitioning (shuffling) of a large dataset. For example, the Reference Dataset could be a set of geospatial time series partitioned over time that needs to be repartitioned over space. In this particular example, we need to generate 100,000 new partitions from the set of 100,000 original partitions. Because the dataset is so large, its repartitioning is handled in two steps running on multiple machines in parallel.

### 1. Macro-Partition Generation

This first phase consists of generating a set of macro-partitions from the original partitions. The Dataset Structuring Engine is implemented using DuckDB. Unfortunately, this great engine has two critical limitations:
Therefore, the Dataset Structuring Engine uses multiple parallel processes, each running its own instance of the DuckDB engine and generating a single file. Furthermore, data is loaded into memory to accelerate read and write operations, and it remains in compressed form to reduce the amount of memory required for the workload. Nevertheless, the dataset remains too large to be loaded on a single machine, so multiple machines must be used.
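As a sketch of what one of these worker processes could look like under this design (the paths, the `final_partition_id` column, and the modulo bucketing are hypothetical illustrations, not the engine's actual code):

```python
import duckdb

def write_macro_partition(worker_id: int, num_workers: int = 100) -> None:
    """One of the parallel worker processes on a machine: scans the machine's
    in-memory share of the dataset once and writes a single macro-partition
    file covering its bucket of 1,000 final partitions."""
    con = duckdb.connect()  # each process runs its own DuckDB instance
    con.execute(f"""
        COPY (
            SELECT *
            FROM read_parquet('/mnt/ram/original_partitions/*.parquet')
            -- keep only the rows belonging to this worker's 1,000 final partitions
            WHERE final_partition_id % {num_workers} = {worker_id}
        )
        TO '/mnt/ram/macro_partitions/macro_{worker_id:03d}.parquet'
        (FORMAT PARQUET)
    """)
```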
When using Amazon Web Services, the x2idn.32xlarge instance is best suited to the task, thanks to its high network bandwidth and high RAM-to-vCPU ratio:

50 of these instances are needed for 100TB of compressed data. The loading of data from the Object Store into the machines' memory is done in parallel and should take about 200s assuming an 80Gbps effective bandwidth per machine. 100 processes are then launched on every machine, each scanning the machine's in-memory data once and generating a single macro-partition that holds that machine's share of the data for 1,000 final partitions and is about 20GB in size. These macro-partitions are written to the Object Store on the fly. 5,000 macro-partitions of 20GB are now available on the Object Store.
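The arithmetic behind these figures can be checked with a short back-of-the-envelope sketch (the inputs are the Reference Dataset assumptions quoted above):

```python
# Back-of-the-envelope check of the macro-partition generation figures.
DATASET_BYTES = 100 * 10**12           # 100TB of compressed data
MACHINES = 50                          # x2idn.32xlarge instances
PROCESSES_PER_MACHINE = 100            # one DuckDB instance per process
BANDWIDTH_BPS = 80 * 10**9             # 80Gbps effective per machine

bytes_per_machine = DATASET_BYTES // MACHINES                        # 2TB in memory
load_seconds = bytes_per_machine / (BANDWIDTH_BPS / 8)               # ~200s
macro_partitions = MACHINES * PROCESSES_PER_MACHINE                  # 5,000 files
macro_partition_bytes = bytes_per_machine // PROCESSES_PER_MACHINE   # 20GB each

print(load_seconds, macro_partitions, macro_partition_bytes)
# 200.0 5000 20000000000
```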
### 2. Final Partition Generation

This second phase consists of consolidating sets of macro-partitions and splitting them into final partitions. Because the first phase was done in parallel, each logical macro-partition of 1TB is split across 50 physical macro-partitions of 20GB. Therefore, a machine with 1TB of RAM would be perfectly suited to the task. Furthermore, because final partitions will require the sorting of columns and the computation of summary statistics (two extremely compute-intensive tasks), the use of an accelerated instance with GPUs is recommended. When using Amazon Web Services, the p4de.24xlarge instance is ideal:

100 of these instances are required, and 1,000 final partitions of 1GB are generated by each machine. This is achieved by assigning batches of 10 to 11 partitions to 96 parallel processes on each machine. Since the instance has 152GB more RAM than is required for the macro-partitions it is holding, 96 final partitions can easily be stored in memory before being written to the Object Store. This allows such partitions to be processed in parallel by the GPUs, which sort their columns and compute their summary statistics; the statistics are also stored on the Object Store.

🗒️ Note: If so many machines cannot be provisioned at the same time, each parallel step can be made sequential.
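The per-machine bookkeeping described above can be sanity-checked with a short sketch (figures taken from the text; the round-robin assignment is only one illustrative way to form the batches of 10 to 11 partitions):

```python
# Back-of-the-envelope check of the final partition generation figures.
INSTANCE_RAM_GB = 1152         # p4de.24xlarge
MACRO_PARTITION_RAM_GB = 1000  # one logical macro-partition (50 x 20GB)
FINAL_PARTITIONS = 1000        # 1GB each, generated per machine
PROCESSES = 96                 # parallel processes per machine

headroom_gb = INSTANCE_RAM_GB - MACRO_PARTITION_RAM_GB  # 152GB spare
assert headroom_gb >= PROCESSES                          # room for 96 x 1GB in memory

# Round-robin assignment: each process receives 10 to 11 final partitions.
batches = [list(range(p, FINAL_PARTITIONS, PROCESSES)) for p in range(PROCESSES)]
print(min(len(b) for b in batches), max(len(b) for b in batches))  # 10 11
```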