FWIW, I'm currently running with the patch below to be able to run tests:
```diff
@@ -559,12 +677,12 @@ class ReadParquet(PartitionsFiltered, BlockwiseIO):
             else:
                 files_for_checksum = dataset_info["ds"].files

-            for file in files_for_checksum:
-                # The checksum / file info is usually already cached by the fsspec
-                # FileSystem dir_cache since this info was already asked for in
-                # _collect_dataset_info
-                checksum.append(fs.checksum(file))
-            dataset_info["checksum"] = tokenize(checksum)
+            # for file in files_for_checksum:
+            #     # The checksum / file info is usually already cached by the fsspec
+            #     # FileSystem dir_cache since this info was already asked for in
+            #     # _collect_dataset_info
+            #     checksum.append(fs.checksum(file))
+            dataset_info["checksum"] = tokenize(files_for_checksum)

             # Infer meta, accounting for index and columns arguments.
             meta = self.engine._create_dd_meta(dataset_info)
```
We're calculating a checksum of the parquet files here https://github.com/dask-contrib/dask-expr/blob/d1c4ed1da01642df6802881d62998a5a81519b85/dask_expr/io/parquet.py#L550-L567, which relies on the fsspec dir_cache. That cache is implemented for the ordinary S3FS filesystem but not for the Arrow filesystem wrapper, and there are possibly other filesystem implementations where this fails as well.
Without this cache, computing the checksum is not feasible, since it would require a separate remote request for every file.
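For illustration, here is a minimal sketch of the idea behind the patch: only compute per-file checksums when the filesystem appears to have a populated listing cache, and otherwise fall back to tokenizing the file paths (which the commented-out patch above does unconditionally). The helper name `_tokenize_dataset_files` and the `dircache` emptiness check are my own illustration, not dask-expr API; `tokenize` is `dask.base.tokenize` and `fs.checksum` is the standard fsspec method.

```python
from dask.base import tokenize


def _tokenize_dataset_files(fs, files_for_checksum):
    # Hypothetical fallback: fsspec's AbstractFileSystem keeps directory
    # listings in `fs.dircache`. If that cache has entries, fs.checksum()
    # can usually be answered from cached file info without extra remote
    # requests; otherwise tokenize the file paths only.
    dircache = getattr(fs, "dircache", None)
    if dircache:  # illustrative check only; not how dask-expr decides this
        checksums = [fs.checksum(f) for f in files_for_checksum]
        return tokenize(checksums)
    return tokenize(files_for_checksum)
```

This keeps the stronger checksum when it is cheap and degrades gracefully for wrappers like the Arrow filesystem that don't populate the cache.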
This was introduced in #798