File paths get duplicated by list_files_by_fsspec pipeline if folder path starts with az:// #840

Closed
sgrigory opened this issue Oct 19, 2022 · 0 comments
Labels
good first issue Good for newcomers


🐛 Describe the bug

Context

When reading files from Azure Blob storage Gen2, fsspec accepts the path prefixes abfs:// and az:// as synonyms. Paths starting with abfs:// work fine for us, but paths starting with az:// come back with the folder path duplicated when passed to the list_files_by_fsspec pipeline.

This first showed up in #836

Example

from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(["az://public/curated/covid-19/"]).list_files_by_fsspec(account_name="pandemicdatalake")
list(dp)[0]

Output:

'public/curated/covid-19/public/curated/covid-19/bing_covid-19_data'

instead of the expected path:

'public/curated/covid-19/bing_covid-19_data'

Possible reason

This probably has to do with how FSSpecFileListerIterDataPipe.__iter__ decides whether the path is local:

  • If a path starts with az://, the variable fs.protocol is still abfs
  • So the condition root.startswith(protocol) is false, and is_local is true
  • As a result, the path gets "doubled" when the output path is built
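
The three steps above can be reproduced without any Azure access. The following is a simplified sketch of a prefix-based locality check, not the actual torchdata code: the function name list_with_prefix_check and its parameters are hypothetical, and the listing returned by the filesystem is assumed to already contain the folder path.

```python
import posixpath


def list_with_prefix_check(root, protocol, stripped_path, listed_names):
    """Hypothetical simplification of a prefix-based locality check."""
    # The path counts as local when the root does not start with the
    # filesystem's reported protocol (assumed to be "abfs" for az:// paths).
    is_local = protocol == "file" or not root.startswith(protocol)
    results = []
    for name in listed_names:
        if is_local:
            # The folder is joined onto a name that already contains it.
            results.append(posixpath.join(stripped_path, name))
        else:
            results.append(name)
    return results


# Assumed listing: names already include the folder path.
listed = ["public/curated/covid-19/bing_covid-19_data"]
path = "public/curated/covid-19/"

az_result = list_with_prefix_check("az://public/curated/covid-19/", "abfs", path, listed)
abfs_result = list_with_prefix_check("abfs://public/curated/covid-19/", "abfs", path, listed)
print(az_result)
print(abfs_result)
```

Under these assumptions, the az:// root yields the doubled path from the example, while the abfs:// root yields the expected one.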

Perhaps we need a different way of checking whether the path is local, one that does not rely on matching the beginning of the path with fs.protocol.
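
One possible direction, shown here only as a sketch: decide locality from the scheme of the root path itself, so that protocol aliases such as az:// and abfs:// are treated the same way. The helper name looks_local is hypothetical, and this is not necessarily how the fix should be implemented in torchdata.

```python
from urllib.parse import urlsplit


def looks_local(root: str) -> bool:
    """Hypothetical helper: guess locality from the root's own scheme."""
    scheme = urlsplit(root).scheme
    # No scheme means a plain path; a one-letter scheme is most likely
    # a Windows drive letter; "file" is the explicit local scheme.
    return len(scheme) <= 1 or scheme == "file"


print(looks_local("az://public/curated/covid-19/"))
print(looks_local("abfs://public/curated/covid-19/"))
print(looks_local("/tmp/data"))
print(looks_local("file:///tmp/data"))
```

Because the check never compares the root against fs.protocol, both Azure prefixes are classified as remote, and only paths without a scheme or with file:// are classified as local.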

Versions

Collecting environment information...
PyTorch version: 1.14.0.dev20221018
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.6 (arm64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.102)
CMake version: version 3.22.1
Libc version: N/A

Python version: 3.9.13 (main, Oct 13 2022, 16:12:19) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-12.6-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy==0.982
[pip3] mypy-extensions==0.4.3
[pip3] torch==1.14.0.dev20221018
[conda] pytorch 1.14.0.dev20221018 py3.9_0 pytorch-nightly
