Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][Parquet] pa.schema() silently drops metadata if a schema object is passed #38575

Closed
igozali opened this issue Nov 3, 2023 · 3 comments

Comments

@igozali
Copy link

igozali commented Nov 3, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Seems like a regression from arrow 13 -> 14

Here's a simple repro script

import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)

metadata = {b"foo": b"bar"}
schema = pa.schema([pa.field("foo", pa.int32())])
wrapped_schema = pa.schema(schema, metadata=metadata)

w = pq.ParquetWriter(
    "foo.parquet", 
    wrapped_schema,
    flavor="spark", 
    compression="snappy"
)
w.close()

s = pq.read_schema("foo.parquet")
print(f"{s.metadata=} {metadata=}")
print(f"{s.metadata == metadata=}")

Running above script on pyarrow 13 and 14 gives these outputs:

(arrow13)
[11:43:42 | last: 43s] (  0) | ~
igozali@host $ python test.py
13.0.0
s.metadata={b'foo': b'bar'} metadata={b'foo': b'bar'}
s.metadata == metadata=True

(arrow13)
[11:43:44 | last: 0s] (  0) | ~
igozali@host $ conda activate arrow14

(arrow14)
[11:43:49 | last: 0s] (  0) | ~
igozali@host $ python test.py
14.0.0
s.metadata=None metadata={b'foo': b'bar'}
s.metadata == metadata=False

Few workarounds:

  1. Use schema.with_metadata() instead of pa.schema()
  2. Instead of pa.schema(old_schema, metadata=...), passing a list of fields seems to fix it too e.g. pa.schema(list(old_schema), metadata=...)

Component(s)

Parquet, Python

@igozali igozali changed the title pa.schema() silently drops metadata if a schema object is passed [Python][Parquet] pa.schema() silently drops metadata if a schema object is passed Nov 3, 2023
@JacobHayes
Copy link
Contributor

JacobHayes commented May 4, 2024

I came across this too - it also appears to also happen if you pass a pa.struct into pa.schema, eg:

$ python3
Python 3.12.2 (main, May  3 2024, 17:50:09) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> pa.__version__
'16.0.0'
>>> pa.schema([pa.field("test", pa.int64())], metadata={"a": "b"})
test: int64
-- schema metadata --
a: 'b'
>>> pa.schema(pa.struct([pa.field("test", pa.int64())]), metadata={"a": "b"})
test: int64

@JacobHayes
Copy link
Contributor

I think it was introduced in #37797, which short-circuits the creation without setting the metadata:

arrow/python/pyarrow/types.pxi

Lines 5093 to 5094 in d68f8e2

elif hasattr(fields, "__arrow_c_schema__"):
return Schema._import_from_c_capsule(fields.__arrow_c_schema__())

JacobHayes added a commit to JacobHayes/arrow that referenced this issue May 4, 2024
JacobHayes added a commit to JacobHayes/arrow that referenced this issue May 4, 2024
JacobHayes added a commit to JacobHayes/arrow that referenced this issue May 4, 2024
JacobHayes added a commit to JacobHayes/arrow that referenced this issue May 6, 2024
JacobHayes added a commit to JacobHayes/arrow that referenced this issue May 6, 2024
JacobHayes added a commit to JacobHayes/arrow that referenced this issue May 7, 2024
JacobHayes added a commit to JacobHayes/arrow that referenced this issue May 7, 2024
jorisvandenbossche added a commit that referenced this issue May 17, 2024
…psule (#41538)

### Rationale for this change

Fixes the dropped `pa.schema` metadata reported in #38575, which was introduced in #37797.

### What changes are included in this PR?

Passes through the `metadata` to the short-circuited `Schema` created with `_import_from_c_capsule`.

### Are these changes tested?

Yes - added `metadata` to the existing test.

### Are there any user-facing changes?

I'm not sure this quite rises to the `(b) a bug that caused incorrect or invalid data to be produced,` condition, but I added that note to be safe since the resulting schema is "incorrect" (and broke some round-trip tests on my end after a pyarrow update):

**This PR contains a "Critical Fix".**

* GitHub Issue: #38575

Lead-authored-by: Jacob Hayes <jacob.r.hayes@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 17.0.0 milestone May 17, 2024
@jorisvandenbossche
Copy link
Member

Issue resolved by pull request 41538
#41538

vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…m PyCapsule (apache#41538)

### Rationale for this change

Fixes the dropped `pa.schema` metadata reported in apache#38575, which was introduced in apache#37797.

### What changes are included in this PR?

Passes through the `metadata` to the short-circuited `Schema` created with `_import_from_c_capsule`.

### Are these changes tested?

Yes - added `metadata` to the existing test.

### Are there any user-facing changes?

I'm not sure this quite rises to the `(b) a bug that caused incorrect or invalid data to be produced,` condition, but I added that note to be safe since the resulting schema is "incorrect" (and broke some round-trip tests on my end after a pyarrow update):

**This PR contains a "Critical Fix".**

* GitHub Issue: apache#38575

Lead-authored-by: Jacob Hayes <jacob.r.hayes@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants