Parquet checksum calculation horribly slow with arrow FileSystem wrapper #856

Closed · fjetter opened this issue Feb 8, 2024 · 1 comment · Fixed by #882

Comments

fjetter commented Feb 8, 2024

We're calculating a checksum of the parquet files here: https://github.com/dask-contrib/dask-expr/blob/d1c4ed1da01642df6802881d62998a5a81519b85/dask_expr/io/parquet.py#L550-L567. This relies on the fsspec dir_cache: the file info is already cached by the time the checksum is requested, so each fs.checksum(file) call is cheap. The cache is implemented for the ordinary S3FS filesystem but not for the arrow wrapper, and there are likely other implementations where this fails as well.

Without this cache, every fs.checksum(file) call turns into a separate remote request, which makes computing the checksum infeasibly slow.

This was introduced in #798
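
For illustration, a minimal sketch of the difference, assuming s3fs and pyarrow are installed (bucket and paths are hypothetical): with a plain fsspec filesystem a directory listing populates the dir_cache, so per-file checksum() calls are answered locally, while fsspec's ArrowFSWrapper keeps no such cache and pays one remote round-trip per file.

```python
import fsspec
from fsspec.implementations.arrow import ArrowFSWrapper
from pyarrow.fs import S3FileSystem

# Plain s3fs filesystem: ls() fills the instance's dircache, so the
# default checksum() (a hash of the cached info() entry) needs no
# further network calls.
fs = fsspec.filesystem("s3")
files = fs.ls("my-bucket/dataset/")       # one LIST request, fills the dircache
fast = [fs.checksum(f) for f in files]    # served from the cache

# Arrow wrapper: there is no dircache, so every checksum() triggers a
# fresh info() lookup, i.e. O(number of files) remote requests.
afs = ArrowFSWrapper(S3FileSystem())
slow = [afs.checksum(f) for f in files]   # one remote request per file
```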

fjetter commented Feb 8, 2024

FWIW, I'm currently running with the patch below in order to run the tests.

```diff
@@ -559,12 +677,12 @@ class ReadParquet(PartitionsFiltered, BlockwiseIO):
         else:
             files_for_checksum = dataset_info["ds"].files

-        for file in files_for_checksum:
-            # The checksum / file info is usually already cached by the fsspec
-            # FileSystem dir_cache since this info was already asked for in
-            # _collect_dataset_info
-            checksum.append(fs.checksum(file))
-        dataset_info["checksum"] = tokenize(checksum)
+        # for file in files_for_checksum:
+        #     # The checksum / file info is usually already cached by the fsspec
+        #     # FileSystem dir_cache since this info was already asked for in
+        #     # _collect_dataset_info
+        #     checksum.append(fs.checksum(file))
+        dataset_info["checksum"] = tokenize(files_for_checksum)

         # Infer meta, accounting for index and columns arguments.
         meta = self.engine._create_dd_meta(dataset_info)
```
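
As a side note on the trade-off in that patch (a sketch of the idea, not the committed fix; the paths below are hypothetical): tokenizing the file list is a single local hash, so the cost no longer scales with remote round-trips, but the resulting token only changes when paths are added, removed, or renamed, not when a file's contents are rewritten in place.

```python
from dask.base import tokenize

# Hypothetical file listing; in the patch this comes from the dataset info.
files_for_checksum = [
    "my-bucket/dataset/part.0.parquet",
    "my-bucket/dataset/part.1.parquet",
]

# One local, deterministic hash over the path list -- no network calls.
token = tokenize(files_for_checksum)

# Unlike per-file fs.checksum() calls, this token is blind to in-place
# content changes: same paths, same token.
assert token == tokenize(list(files_for_checksum))
```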
