Document ObjectStore::list Ordering #3975

Closed
tiago-ssantos opened this issue Mar 29, 2023 · 1 comment · Fixed by #3973
Labels: enhancement (Any new improvement worthy of an entry in the changelog), object-store (Object Store Interface)

Comments

@tiago-ssantos

Describe the bug
In DataFusion, when listing all files (https://github.com/apache/arrow-datafusion/blob/c8a3d589889dd1e67047de89db8b4ff56f90f04c/datafusion/core/src/datasource/listing/url.rs#L151) using a LocalFileSystem object store, the order of the results differs depending on the OS.

To Reproduce
Having a folder:
[image: screenshot of the folder contents]

and requesting to list the content of the folder using:

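use futures::TryStreamExt; // provides `try_collect` on the listing stream
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};

async fn list_test() {
    // Resolve the local folder into an object store path.
    let path = Path::from_filesystem_path("./files").unwrap();

    let integration = LocalFileSystem::new();
    let list_stream = integration.list(Some(&path)).await.unwrap();

    // Collect the stream; the order of the entries is whatever the store yields.
    let res: Vec<_> = list_stream.try_collect().await.unwrap();
    res.iter().for_each(|file| println!("{}", file.location));
}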

On Windows/Ubuntu the result is:

files/file1.parquet
files/file3.parquet

but on macOS Ventura:

files/file3.parquet
files/file1.parquet

Expected behavior
We expect the result to be the same on every OS. This code is called when inferring the schema (https://github.com/apache/arrow-datafusion/blob/c8a3d589889dd1e67047de89db8b4ff56f90f04c/datafusion/core/src/datasource/listing/table.rs#L431), and the ordering of multiple files is important because the schemas of all the files are merged.

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Mar 29, 2023
@tustvold tustvold changed the title LocalFileSystem::list returning different results from different OS Return sorted results from ObjectStore::list Mar 29, 2023
@tustvold
Contributor

tustvold commented Mar 29, 2023

The returned order is not defined, because filesystems and object stores have different notions of what sorting means: lexicographic by full path, or by path segment. Additionally, many filesystems provide no guarantees on output ordering at all; in fact, Windows is the only one that does, IIRC.

If you require a consistent sort order, I would recommend collecting the results (or using list_with_delimiter) and sorting the output.

This definitely should be highlighted more clearly in the docs.
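
To make the two notions of ordering concrete, here is a minimal std-only sketch, assuming the listed locations have already been collected into a `Vec<String>` of `/`-separated paths. The helper names (`sort_by_full_path`, `cmp_by_segment`, `sort_by_segment`) are hypothetical and are not part of the object_store API; this is only an illustration of the distinction, not how any particular store sorts.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: order paths by comparing each one as a single string
/// (byte-wise lexicographic order over the full path).
fn sort_by_full_path(paths: &mut [String]) {
    paths.sort();
}

/// Hypothetical helper: compare two paths one segment at a time,
/// splitting on '/', so a directory name is compared as a whole.
fn cmp_by_segment(a: &str, b: &str) -> Ordering {
    a.split('/').cmp(b.split('/'))
}

/// Hypothetical helper: order paths segment by segment.
fn sort_by_segment(paths: &mut [String]) {
    paths.sort_by(|a, b| cmp_by_segment(a, b));
}
```

The two orders can disagree whenever a segment contains a byte that sorts before `/`, such as `-`: sorted by full path, `files/a-b/x.parquet` comes before `files/a/y.parquet`, while sorted by segment it comes after, because the segment `a` sorts before `a-b`. Whichever rule is chosen, applying it to the collected listing gives the same order on every OS.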

@tustvold tustvold changed the title Return sorted results from ObjectStore::list Document ObjectStore::list Ordering Mar 29, 2023
@tustvold tustvold added the object-store Object Store Interface label Mar 30, 2023