Parquet file content is different if ~/.cargo is in a git checkout #589

Closed
carols10cents opened this issue Jul 21, 2021 · 0 comments · Fixed by #590

Describe the bug

I check my home directory into git. My home directory contains .cargo, my CARGO_HOME directory. When I write a Parquet file, its FileMetaData contains:

created_by: Some(
    "parquet-rs version 5.0.0 (build 3ef76a677716df403a13964a58351abe37c1754d)",
),

That SHA is of a commit in my home directory, not in Parquet, and not in the project using Parquet.

I have a test in the project that verifies the size of the Parquet file data, and the test was failing for me because the content was 49 bytes larger than expected, exactly the size of the extra build suffix above. I verified that in CI the test passes, and the FileMetaData under test contains:

created_by: Some(
    "parquet-rs version 5.0.0",
),
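The 49-byte figure can be checked directly from the two strings shown above. This is only a small sketch of the arithmetic; the constant names and the `extra_bytes` helper are made up for illustration and are not part of Parquet.

```rust
// The two `created_by` values quoted in this issue.
const LOCAL: &str =
    "parquet-rs version 5.0.0 (build 3ef76a677716df403a13964a58351abe37c1754d)";
const CI: &str = "parquet-rs version 5.0.0";

/// Hypothetical helper: how many more bytes `longer` has than `shorter`.
fn extra_bytes(longer: &str, shorter: &str) -> usize {
    longer.len().saturating_sub(shorter.len())
}
```

The difference is the suffix: a space, `(build `, the 40-character SHA, and the closing parenthesis, so `extra_bytes(LOCAL, CI)` comes to 49.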

To Reproduce

  • Check your home directory into git, or alternatively set CARGO_HOME to a directory in a git repository.
  • Generate a parquet file and check the metadata.
  • Observe the created_by contains a hash from the git directory CARGO_HOME is in.
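These steps work because git, when run inside a directory that is not itself a repository, looks for one in the parent directories. The sketch below is a rough approximation of that lookup, not git's actual algorithm: it ignores things like GIT_DIR, GIT_CEILING_DIRECTORIES and bare repositories, and `find_enclosing_repo` is a hypothetical name.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: walks from `start` up toward the filesystem root and
/// returns the first directory that contains a `.git` entry, if any.
fn find_enclosing_repo(start: &Path) -> Option<PathBuf> {
    start
        .ancestors()
        .find(|dir| dir.join(".git").exists())
        .map(Path::to_path_buf)
}
```

With a home directory checked into git, a crate unpacked somewhere under `~/.cargo` would resolve to the home directory's repository, which is presumably where the unexpected SHA comes from.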

I'm not sure if it's going to be possible to create a failing test for this given the environmental aspect... the current test only checks that the created_by value is the value of the PARQUET_CREATED_BY environment variable, but the problem is what gets put in the PARQUET_CREATED_BY environment variable in the first place.

Expected behavior

I expected to get the exact same Parquet file content whether my home directory is checked into Git or not 🤣

Additional context

The PARQUET_CREATED_BY environment variable is set in the build script if git rev-parse HEAD returns a value. Considering this is only getting set if you have a non-standard setup like I do, I think this should just be removed entirely. I'm going to prepare a PR for discussion with this solution :)
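For readers who haven't looked at the build script, here is a minimal sketch of the kind of logic described above. It is not the actual Parquet build script: the function names are hypothetical, and the only things taken from this issue are the PARQUET_CREATED_BY variable, the `git rev-parse HEAD` call, and the format of the resulting string. `cargo:rustc-env=` is the standard Cargo directive a build script prints to set a compile-time environment variable.

```rust
use std::process::Command;

/// Hypothetical helper: formats `created_by`, adding the build suffix only
/// when a git hash is available.
fn created_by(version: &str, git_hash: Option<&str>) -> String {
    match git_hash {
        Some(hash) => format!("parquet-rs version {} (build {})", version, hash),
        None => format!("parquet-rs version {}", version),
    }
}

/// Hypothetical helper: runs `git rev-parse HEAD` in the current directory
/// and returns the trimmed hash, or None if git is missing or fails.
fn git_head() -> Option<String> {
    let output = Command::new("git")
        .arg("rev-parse")
        .arg("HEAD")
        .output()
        .ok()?;
    if !output.status.success() {
        return None;
    }
    let hash = String::from_utf8(output.stdout).ok()?.trim().to_string();
    if hash.is_empty() {
        None
    } else {
        Some(hash)
    }
}

/// Hypothetical helper: the line a build script would print so the value is
/// baked into the crate at compile time.
fn cargo_directive(version: &str) -> String {
    format!(
        "cargo:rustc-env=PARQUET_CREATED_BY={}",
        created_by(version, git_head().as_deref())
    )
}
```

In this sketch the output depends on whichever repository git finds from the build directory, which is the environmental dependence described here. Removing the git lookup amounts to always taking the `None` branch, so the string is the same on every machine.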

carols10cents added a commit to integer32llc/arrow-rs that referenced this issue Jul 21, 2021
So that Parquet files will contain the same content whether or not your
home directory is checked into Git or not ;)

Fixes apache#589.
nevi-me pushed a commit that referenced this issue Jul 22, 2021
So that Parquet files will contain the same content whether or not your
home directory is checked into Git or not ;)

Fixes #589.
alamb pushed a commit that referenced this issue Jul 25, 2021
So that Parquet files will contain the same content whether or not your
home directory is checked into Git or not ;)

Fixes #589.
alamb added a commit that referenced this issue Jul 26, 2021
So that Parquet files will contain the same content whether or not your
home directory is checked into Git or not ;)

Fixes #589.

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>