What is the Dataset Structuring Engine? #846
The Dataset Structuring Engine is responsible for properly structuring a dataset before its analysis. This includes:
## Prerequisites

Before a dataset can be handled by the Dataset Structuring Engine, it must be:
These operations are usually best handled by a Data Architect using the Data Integration Engine.

## Partitioning Configuration

Once a dataset has been prepared for the Dataset Structuring Engine as described above, the Data Architect must select the columns that should be used for partitioning. A dataset can include one or multiple partitionings, and each partitioning can be defined with one or multiple columns. For example, a temporal partitioning could be defined with a timestamp column, while a geospatial partitioning could be defined with a pair of latitude and longitude columns. When a single column is used to define a partitioning, its range of possible values must be such that it can be binned (discretized) so that subdividing the dataset into as many partitions as there are bins produces compressed Parquet partitions whose average size is about 1GB (see the sizing sketch below). When multiple columns are used to define a partitioning, the range of possible combinations of values must satisfy the same constraints. Furthermore:
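To make the sizing constraint concrete, here is a minimal sketch of how a target bin count could be derived from the compressed size of a dataset; the `estimate_bin_count` helper is hypothetical and not part of the engine's API.

```python
import math

def estimate_bin_count(compressed_dataset_bytes: int,
                       target_partition_bytes: int = 1 * 1024**3) -> int:
    """Number of bins needed so that each compressed Parquet partition
    averages roughly `target_partition_bytes` (~1GB by default)."""
    return max(1, math.ceil(compressed_dataset_bytes / target_partition_bytes))

# Hypothetical example: a 100TB compressed dataset partitioned on a timestamp
# column would be discretized into roughly 100,000 temporal bins of ~1GB each.
print(estimate_bin_count(100 * 1024**4))  # 102400, i.e. on the order of 100,000
```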
## Parquet Sub-Partitioning

When two or three partitionings are defined, rows within individual Parquet partitions are grouped using Parquet row groups defined across another partitioning dimension, and as many sets of partitions are stored on the Object Store as there are ordered pairs of partitionings. Therefore, if a single partitioning is defined, a single set of partitions is stored. If two partitionings are defined, two sets of partitions are stored. And if three partitionings are defined, six sets of partitions are stored. For example, if partitionings are defined across dimensions
## Multi-Partitioned Summaries

Summary statistics and summary charts are systematically computed for every column of the structured dataset, for every partition. Furthermore, when multiple partitionings are produced, these summaries are systematically computed for every partition|row-group pair in relation to every pair of partitionings. This OLAP-style pre-computation of multi-dimensional aggregations ensures that all column-level summaries can be quickly reduced from partition|row-group summaries for any selection of partitions and row groups.
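To illustrate the kind of reduction this pre-computation enables, here is a minimal sketch (hypothetical field names, not the engine's actual summary schema) showing how column-level statistics for an arbitrary selection can be folded from pre-computed partition|row-group summaries without rescanning raw data.

```python
from dataclasses import dataclass

@dataclass
class ColumnSummary:
    """Pre-computed summary of one column within one partition|row-group."""
    count: int
    total: float
    minimum: float
    maximum: float

def reduce_summaries(summaries: list[ColumnSummary]) -> ColumnSummary:
    """Fold the summaries of the selected partitions/row groups into a single
    column-level summary; cost is proportional to the selection size only."""
    merged = ColumnSummary(0, 0.0, float("inf"), float("-inf"))
    for s in summaries:
        merged.count += s.count
        merged.total += s.total
        merged.minimum = min(merged.minimum, s.minimum)
        merged.maximum = max(merged.maximum, s.maximum)
    return merged

# Any selection of partition|row-group summaries reduces in one pass:
selection = [ColumnSummary(10, 55.0, 1.0, 10.0), ColumnSummary(5, 20.0, 2.0, 6.0)]
overall = reduce_summaries(selection)
print(overall.count, overall.total / overall.count)  # 15 5.0
```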
## Requirements

The Dataset Structuring Engine must be:

## Reference Dataset

For design purposes, we consider the following hypothetical Reference Dataset:
## Repartitioning

The biggest challenge faced by the Dataset Structuring Engine is the repartitioning (shuffling) of a large dataset. For example, the Reference Dataset could be a set of geospatial time series partitioned over time that needs to be repartitioned over space. In this particular example, we need to generate 100,000 new partitions from the set of 100,000 original partitions. Because the dataset is so large, its repartitioning is handled in two steps running on multiple machines in parallel.

### 1. Macro-Partition Generation

This first phase consists of generating a set of macro-partitions from the original partitions. The Dataset Structuring Engine is implemented using DuckDB. Unfortunately, this great engine has two critical limitations:
Therefore, the Dataset Structuring Engine uses multiple parallel processes, each running its own instance of the DuckDB engine and generating a single file. Furthermore, data is loaded into memory to accelerate read and write operations, and it remains in compressed form to reduce the amount of memory required for the workload. Nevertheless, the dataset remains too large to be loaded on a single machine, so multiple machines must be used.
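As a sketch of what one of these worker processes could look like under this design (the paths, the `final_partition_id` column, and the modulo bucketing are hypothetical illustrations, not the engine's actual code):

```python
import duckdb

def write_macro_partition(worker_id: int, num_workers: int = 100) -> None:
    """One of the parallel worker processes on a machine: scans the machine's
    in-memory share of the dataset once and writes a single macro-partition
    file covering its bucket of 1,000 final partitions."""
    con = duckdb.connect()  # each process runs its own DuckDB instance
    con.execute(f"""
        COPY (
            SELECT *
            FROM read_parquet('/mnt/ram/original_partitions/*.parquet')
            -- keep only the rows belonging to this worker's 1,000 final partitions
            WHERE final_partition_id % {num_workers} = {worker_id}
        )
        TO '/mnt/ram/macro_partitions/macro_{worker_id:03d}.parquet'
        (FORMAT PARQUET)
    """)
```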
When using Amazon Web Services, the x2idn.32xlarge instance is best suited to the task, thanks to its high network bandwidth and high RAM-to-vCPU ratio:

50 of these instances are needed for 100TB of compressed data. The loading of data from the Object Store into the machines' memory is done in parallel and should take about 200s assuming an 80Gbps effective bandwidth per machine. 100 processes are then launched on every machine, each scanning the machine's in-memory data once and generating a single macro-partition that holds that machine's share of the data for 1,000 final partitions and is about 20GB in size. These macro-partitions are written to the Object Store on the fly. 5,000 macro-partitions of 20GB are now available on the Object Store.
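The arithmetic behind these figures can be checked with a short back-of-the-envelope sketch (the inputs are the Reference Dataset assumptions quoted above):

```python
# Back-of-the-envelope check of the macro-partition generation figures.
DATASET_BYTES = 100 * 10**12           # 100TB of compressed data
MACHINES = 50                          # x2idn.32xlarge instances
PROCESSES_PER_MACHINE = 100            # one DuckDB instance per process
BANDWIDTH_BPS = 80 * 10**9             # 80Gbps effective per machine

bytes_per_machine = DATASET_BYTES // MACHINES                        # 2TB in memory
load_seconds = bytes_per_machine / (BANDWIDTH_BPS / 8)               # ~200s
macro_partitions = MACHINES * PROCESSES_PER_MACHINE                  # 5,000 files
macro_partition_bytes = bytes_per_machine // PROCESSES_PER_MACHINE   # 20GB each

print(load_seconds, macro_partitions, macro_partition_bytes)
# 200.0 5000 20000000000
```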
### 2. Final Partition Generation

This second phase consists of consolidating sets of macro-partitions and splitting them into final partitions. Because the first phase was done in parallel, each logical macro-partition of 1TB is split across 50 physical macro-partitions of 20GB. Therefore, a machine with 1TB of RAM would be perfectly suited to the task. Furthermore, because final partitions will require the sorting of columns and the computation of summary statistics (two extremely compute-intensive tasks), the use of an accelerated instance with GPUs is recommended. When using Amazon Web Services, the p4de.24xlarge instance is ideal:

100 of these instances are required, and 1,000 final partitions of 1GB are generated by each machine. This is achieved by assigning batches of 10 to 11 partitions to 96 parallel processes on each machine. Since the instance has 152GB more RAM than is required for the macro-partitions it is holding, 96 final partitions can easily be stored in memory before being written to the Object Store. This allows such partitions to be processed in parallel by the GPUs, which sort their columns and compute their summary statistics; the statistics are also stored on the Object Store.

🗒️ Note: If so many machines cannot be provisioned at the same time, each parallel step can be made sequential.
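The per-machine bookkeeping described above can be sanity-checked with a short sketch (figures taken from the text; the round-robin assignment is only one illustrative way to form the batches of 10 to 11 partitions):

```python
# Back-of-the-envelope check of the final partition generation figures.
INSTANCE_RAM_GB = 1152         # p4de.24xlarge
MACRO_PARTITION_RAM_GB = 1000  # one logical macro-partition (50 x 20GB)
FINAL_PARTITIONS = 1000        # 1GB each, generated per machine
PROCESSES = 96                 # parallel processes per machine

headroom_gb = INSTANCE_RAM_GB - MACRO_PARTITION_RAM_GB  # 152GB spare
assert headroom_gb >= PROCESSES                          # room for 96 x 1GB in memory

# Round-robin assignment: each process receives 10 to 11 final partitions.
batches = [list(range(p, FINAL_PARTITIONS, PROCESSES)) for p in range(PROCESSES)]
print(min(len(b) for b in batches), max(len(b) for b in batches))  # 10 11
```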