Different schemas when inferring from local system in different OS #5779

Closed
tiago-ssantos opened this issue Mar 29, 2023 · 1 comment · Fixed by #6629
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed

Comments

@tiago-ssantos

Describe the bug

When inferring a schema, list_all_files uses an object store to list the files, and no sort order is applied to the result.
When the object store is a LocalFileSystem, there is no guarantee of any file ordering (the list returned on macOS is ordered differently from the one returned on Windows). This means that the inferred schema can differ for the same set of files.

We contacted the object store maintainers (apache/arrow-rs#3975), who pointed out that the solution should be implemented in the caller of the method, by applying a sort of some kind, to keep the results consistent across file systems.
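To illustrate what "sorting in the caller" could look like, here is a minimal sketch using only the standard library. It assumes the listing has already been collected into a vector of path strings; `sorted_listing` is a hypothetical helper name, not a function from object_store or DataFusion.

```rust
/// Hypothetical helper (not part of object_store or DataFusion): puts a
/// collected listing into lexicographic order so that every platform
/// processes the files in the same sequence.
fn sorted_listing(mut paths: Vec<String>) -> Vec<String> {
    // `sort` compares the strings byte by byte, so the result does not
    // depend on the order in which the file system returned the entries.
    paths.sort();
    paths
}
```

With this in place, a listing returned as file3.parquet, file1.parquet and one returned as file1.parquet, file3.parquet would both be processed starting from file1.parquet.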

To Reproduce

Given two Parquet files in the filesystem with the following schemas:

  • file1.parquet
{
  "type" : "record",
  "name" : "root",
  "fields" : [ {
    "name" : "year",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "description",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "code",
    "type" : [ "null", "long" ],
    "default" : null
  } ]
}
  • file3.parquet
{
  "type" : "record",
  "name" : "root",
  "fields" : [ {
    "name" : "description",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "code",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "year",
    "type" : [ "null", "int" ],
    "default" : null
  } ]
}

and executing:

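use std::sync::Arc;

use datafusion::datasource::file_format::parquet::ParquetFormat;
use datafusion::datasource::listing::{ListingOptions, ListingTableUrl};
use datafusion::prelude::SessionContext;

#[tokio::test]
async fn infer_schema() {
    // Directory containing file1.parquet and file3.parquet
    let path = ListingTableUrl::parse("./files").unwrap();
    let ctx = SessionContext::new();
    let state = ctx.state();
    let options = ListingOptions::new(Arc::new(ParquetFormat::default()));

    let schema = options.infer_schema(&state, &path).await.unwrap();

    // Print the inferred field names in schema order
    schema.fields.iter().for_each(|field| println!("{0}", field.name()));
}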

The result on macOS Ventura:

description
code
year

The first file picked up was file3.parquet.
On Windows, the result is:

year
code
description

The first file picked up was file1.parquet.

Expected behavior

The same schema regardless of the OS on which the code is run. A sort should be enforced, or it should at least be possible to pass a sort function.
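As a rough sketch of the second option (passing a sort function), the example below sorts the listed files with a caller-supplied comparator before merging. Everything here is hypothetical and for illustration only: `ListedFile`, `merged_field_names` and `by_path` are made-up names, and the first-seen merge of field names is a simplification, not DataFusion's actual schema merge.

```rust
use std::cmp::Ordering;

/// One listed file: its path and the field names of its schema, in file order.
/// Hypothetical type, used only for this illustration.
struct ListedFile {
    path: String,
    fields: Vec<String>,
}

impl ListedFile {
    fn new(path: &str, fields: &[&str]) -> Self {
        ListedFile {
            path: path.to_string(),
            fields: fields.iter().map(|f| f.to_string()).collect(),
        }
    }
}

/// Sorts the files with the caller-supplied comparator, then merges the
/// field names in first-seen order (a simplified stand-in for schema merging).
fn merged_field_names<F>(mut files: Vec<ListedFile>, compare: F) -> Vec<String>
where
    F: FnMut(&ListedFile, &ListedFile) -> Ordering,
{
    // After this sort, the order returned by the file system no longer matters.
    files.sort_by(compare);

    let mut merged: Vec<String> = Vec::new();
    for file in &files {
        for name in &file.fields {
            if !merged.contains(name) {
                merged.push(name.clone());
            }
        }
    }
    merged
}

/// A possible default comparator: lexicographic by path.
fn by_path(a: &ListedFile, b: &ListedFile) -> Ordering {
    a.path.cmp(&b.path)
}
```

Calling `merged_field_names(files, by_path)` with the two files above gives the same field order whether the listing starts with file1.parquet or with file3.parquet, and a caller who needs a different order can pass a different comparator.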

Additional context

No response

@tiago-ssantos tiago-ssantos added the bug Something isn't working label Mar 29, 2023
@tiago-ssantos tiago-ssantos changed the title Different schemas when inferring schema from folder Different schemas when inferring schema from local system in different OS Mar 29, 2023
@tiago-ssantos tiago-ssantos changed the title Different schemas when inferring schema from local system in different OS Different schemas when inferring from local system in different OS Mar 29, 2023
@tustvold tustvold added good first issue Good for newcomers help wanted Extra attention is needed labels Mar 30, 2023
@thomas-k-cameron
Contributor

I just created a PR for this issue. There is still some work to do (e.g. tests), but I hope it works!
