Parquet checksum calculation horribly slow with arrow FileSystem wrapper #856

Closed · fjetter opened this issue Feb 8, 2024 · 1 comment · Fixed by #882

Comments

fjetter commented Feb 8, 2024

We're calculating a checksum of the parquet files here: https://github.com/dask-contrib/dask-expr/blob/d1c4ed1da01642df6802881d62998a5a81519b85/dask_expr/io/parquet.py#L550-L567. This relies on the fsspec dir_cache: the file info is already cached by the time the checksum is requested, so each fs.checksum(file) call is cheap. The cache is implemented for the ordinary S3FS filesystem but not for the arrow wrapper, and there are likely other implementations where this fails as well.

Without this cache, every fs.checksum(file) call turns into a separate remote request, which makes computing the checksum infeasibly slow.

This was introduced in #798
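
For illustration, a minimal sketch of the difference, assuming s3fs and pyarrow are installed (bucket and paths are hypothetical): with a plain fsspec filesystem a directory listing populates the dir_cache, so per-file checksum() calls are answered locally, while fsspec's ArrowFSWrapper keeps no such cache and pays one remote round-trip per file.

```python
import fsspec
from fsspec.implementations.arrow import ArrowFSWrapper
from pyarrow.fs import S3FileSystem

# Plain s3fs filesystem: ls() fills the instance's dircache, so the
# default checksum() (a hash of the cached info() entry) needs no
# further network calls.
fs = fsspec.filesystem("s3")
files = fs.ls("my-bucket/dataset/")       # one LIST request, fills the dircache
fast = [fs.checksum(f) for f in files]    # served from the cache

# Arrow wrapper: there is no dircache, so every checksum() triggers a
# fresh info() lookup, i.e. O(number of files) remote requests.
afs = ArrowFSWrapper(S3FileSystem())
slow = [afs.checksum(f) for f in files]   # one remote request per file
```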

fjetter commented Feb 8, 2024

FWIW, I'm currently running with the patch below in order to run the tests.

```diff
@@ -559,12 +677,12 @@ class ReadParquet(PartitionsFiltered, BlockwiseIO):
         else:
             files_for_checksum = dataset_info["ds"].files

-        for file in files_for_checksum:
-            # The checksum / file info is usually already cached by the fsspec
-            # FileSystem dir_cache since this info was already asked for in
-            # _collect_dataset_info
-            checksum.append(fs.checksum(file))
-        dataset_info["checksum"] = tokenize(checksum)
+        # for file in files_for_checksum:
+        #     # The checksum / file info is usually already cached by the fsspec
+        #     # FileSystem dir_cache since this info was already asked for in
+        #     # _collect_dataset_info
+        #     checksum.append(fs.checksum(file))
+        dataset_info["checksum"] = tokenize(files_for_checksum)

         # Infer meta, accounting for index and columns arguments.
         meta = self.engine._create_dd_meta(dataset_info)
```
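
As a side note on the trade-off in that patch (a sketch of the idea, not the committed fix; the paths below are hypothetical): tokenizing the file list is a single local hash, so the cost no longer scales with remote round-trips, but the resulting token only changes when paths are added, removed, or renamed, not when a file's contents are rewritten in place.

```python
from dask.base import tokenize

# Hypothetical file listing; in the patch this comes from the dataset info.
files_for_checksum = [
    "my-bucket/dataset/part.0.parquet",
    "my-bucket/dataset/part.1.parquet",
]

# One local, deterministic hash over the path list -- no network calls.
token = tokenize(files_for_checksum)

# Unlike per-file fs.checksum() calls, this token is blind to in-place
# content changes: same paths, same token.
assert token == tokenize(list(files_for_checksum))
```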
