From 4f1e04a33fac0cb204f73e79c5a23a403f78164a Mon Sep 17 00:00:00 2001 From: Matthew Powers Date: Sun, 21 Jan 2024 13:07:39 -0500 Subject: [PATCH] docs: delta lake file skipping --- .../delta-lake-file-skipping.md | 183 ++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 184 insertions(+) create mode 100644 docs/how-delta-lake-works/delta-lake-file-skipping.md diff --git a/docs/how-delta-lake-works/delta-lake-file-skipping.md b/docs/how-delta-lake-works/delta-lake-file-skipping.md new file mode 100644 index 0000000000..f6562e3952 --- /dev/null +++ b/docs/how-delta-lake-works/delta-lake-file-skipping.md @@ -0,0 +1,183 @@ +# Delta Lake File Skipping + +Delta tables store file-level metadata information, which allows for a powerful optimization called file skipping. + +This page explains how Delta Lake implements file skipping, how to optimize your tables to maximize file skipping, and the benefits of file skipping. + +Let’s start by looking at the file-level metadata in Delta tables. + +## Delta Lake file metadata + +Delta Lake stores metadata about each file's min/max values in the table. Query engines can skip entire files when they don’t contain data that’s relevant to the query. + +Suppose you have a Delta table with data stored in two files and has the following metadata. + +``` +filename min_name max_name min_age max_age +fileA alice joy 12 46 +fileB allan linda 34 78 +``` + +Suppose you want to run the following query: `select * from the_table where age < 20`. + +The engine only needs to read `fileA` to execute this query. `fileB` has a `min_age` of 34, so we know there aren’t any rows of data with an `age` less than 20. + +The benefit of file skipping depends on the query and the data layout of the Delta table. Some queries cannot take advantage of any file skipping. Here’s an example query that does not benefit from file skipping: `select * from the_table group by age`. + +Let’s recreate this example with Polars to drive the point home. + +Start by writing out one file of data: + +```python +import polars as pl +from deltalake import DeltaTable + +df = pl.DataFrame({"name": ["alice", "cat", "joy"], "age": [12, 35, 46]}) +df.write_delta("tmp/a_table") +``` + +Now, write out another file of data: + +```python +df = pl.DataFrame({"name": ["allan", "brian", "linda"], "age": [34, 35, 78]}) +df.write_delta("tmp/a_table", mode="append") +``` + +Here are the contents of the Delta table: + +``` +tmp/a_table +├── 0-7d414a88-a634-4c2f-9c5b-c29b6ee5f524-0.parquet +├── 1-0617ef60-b17b-46a5-9b0f-c7dda1b73eee-0.parquet +└── _delta_log + ├── 00000000000000000000.json + └── 00000000000000000001.json +``` + +Now run a query to fetch all the records where the age is less than 20: + +```python +pl.scan_delta("tmp/a_table").filter(pl.col("age") < 20).collect() +``` + +``` ++-------+-----+ +| name | age | +| --- | --- | +| str | i64 | ++=============+ +| alice | 12 | ++-------+-----+ +``` + +Polars can use the Delta table metadata to skip the file that does not contain data relevant to the query. + +## How Delta Lake implements file skipping + +Here’s how engines execute queries on Delta tables: + +* Start by reading the transaction log to get the file paths, file sizes, and min/max value for each column +* Parse the query and push down the predicates to skip files +* Read the minimal subset of the files needed for the query + +Some file formats don’t allow for file skipping. For example, CSV files don’t have file-level metadata, so query engines can’t read a minimal subset of the data. The query engine has to check all the files, even if they don’t contain any relevant data. + +When data is in Parquet files, the query engine can open up all the files, read the footers, build the file-level metadata, and perform file skipping. Fetching metadata in each file is slower than grabbing the pre-built file-level metadata from the transaction log. + +Now, let’s see how to structure your tables to allow for more file skipping. + +## File skipping for different file sizes + +Delta tables store data in files. Smaller files allow for more file skipping compared to bigger files. + +However, an excessive number of small files isn’t good because it creates I/O overhead and slows down queries. + +Your Delta tables should have files that are “right-sized”. For a table with 150 GB of data, 5 GB files would probably be too large, and 10 KB files would be too small. It’s generally best to store data in files that are between 100 MB and 1 GB. + +Delta Lake has [an optimize function](https://delta-io.github.io/delta-rs/usage/optimize/small-file-compaction-with-optimize/) that performs small file compaction, so you don’t need to program this logic yourself. + +Now, let's investigate how to store data in files to maximize the file skipping opportunities. + +## How to maximize file skipping + +You can maximize file-skipping by colocating similar data in the same files. + +Suppose you have a table with test scores and frequently run queries that filter based on the `test_score` column. + +``` +filename min_test_score max_test_score +fileA 45 100 +fileB 65 98 +fileC 50 96 +``` + +Suppose you want to run the following query: `select * from exams where test_score > 90`. + +This query cannot skip files, given the current organization of the data. You can rearrange the data to colocate similar test scores in the same files to allow for file skipping. Here’s the new layout: + +``` +filename min_test_score max_test_score +fileD 45 70 +fileE 55 80 +fileF 78 100 +``` + +The query (`select * from exams where test_score > 90`) can skip two of the three files with the new Delta table layout. The query engine only has to read `fileF` for this query. + +Now, let’s look at how file skipping works with string values. + +## How file skipping works with strings + +File skipping is also effective when filtering on string values. + +Suppose you have a table with `person_name` and `country` columns. There are millions of rows of data. Here are the first three rows of data: + +``` +person_name country +person1 angola +person2 china +person3 mexico +``` + +The Delta table contains three files with the following metadata: + +``` +filename min_country max_country +fileA albania mali +fileB libia paraguay +fileC oman zimbabwe +``` + +Suppose you want to run the following query: `select * from some_people where country = 'austria'`. + +You only need to read the data in `fileA` to run this query. The `min_country` value for `fileB` and `fileC` are greater than “austria”, so we know those files don’t contain any data relevant to the query. + +File skipping can also be a robust optimization for string values. Now, let’s see how file skipping works for partitioned tables. + +## File skipping for partitioned tables + +You can partition Delta tables for file skipping as well. Suppose we have the same data as in the previous section, but the table is partitioned by `country`. + +Here’s the Delta table: + +``` +filename partition +fileA albania +fileB libia +fileC oman +fileD jamaica +fileE albania +fileF oman +``` + +Suppose you want to run the following query on this partitioned table: `select * from some_partitioned_table where country = 'albania'`. + +You only need to read `fileA` and `fileE` to execute this query. Delta Lake provides the file-level partition metadata in the transaction log so that this query will run quickly. + +## Conclusion + +Delta Lake allows for file skipping, which is a powerful performance optimization. + +Delta Lake also provides built-in utilities to colocate data in the same files like partitioning, Z Ordering, and compaction to improve file skipping. + +Delta Lake users need to know how to assess the tradeoffs of these techniques to optimize file skipping. Users also need to understand the most frequent query patterns of their tables to best allow for file maximal file skipping. diff --git a/mkdocs.yml b/mkdocs.yml index 971cbfe103..b657434a25 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -83,6 +83,7 @@ nav: - How Delta Lake works: - Architecture: how-delta-lake-works/architecture-of-delta-table.md - Transactions: how-delta-lake-works/delta-lake-acid-transactions.md + - File skipping: how-delta-lake-works/delta-lake-file-skipping.md not_in_nav: | /_build/