Conversation

@aihuaxu
Contributor

@aihuaxu aihuaxu commented Oct 16, 2025

Rationale for this change

According to the Variant specification, the specification_version field must be set to 1 to indicate Variant encoding version 1. Currently, this field defaults to 0, which violates the specification. Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.

What changes are included in this PR?

This PR changes the default specification version to 1.

Are these changes tested?

The change is covered by a unit test.

Are there any user-facing changes?

Parquet files produced now carry the Variant logical type annotation VARIANT(1).

Schema:
message schema {
  optional group V (VARIANT(1)) = 1 {
    required binary metadata;
    required binary value;
  }
}
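
For illustration only, here is a minimal sketch of how a version default of 1 leads to the `VARIANT(1)` annotation shown in the schema above. `VariantAnnotationSketch`, `kDefaultSpecVersion`, and `DefaultVersionExample` are hypothetical names invented for this example; they are not the actual Arrow/Parquet C++ API touched by this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a Variant logical type annotation. This is NOT
// the real Arrow/Parquet C++ class, only an illustration of the default.
class VariantAnnotationSketch {
 public:
  // Version 1 is the only Variant specification version discussed in this PR.
  static constexpr int8_t kDefaultSpecVersion = 1;

  explicit VariantAnnotationSketch(int8_t spec_version = kDefaultSpecVersion)
      : spec_version_(spec_version) {}

  int8_t spec_version() const { return spec_version_; }

  // Renders the annotation the way it appears in the schema dump.
  std::string ToString() const {
    return "VARIANT(" + std::to_string(static_cast<int>(spec_version_)) + ")";
  }

 private:
  int8_t spec_version_;
};

// Usage sketch: a default-constructed annotation reports version 1.
inline void DefaultVersionExample() {
  VariantAnnotationSketch annotation;
  assert(annotation.spec_version() == 1);
  assert(annotation.ToString() == "VARIANT(1)");
}
```

With a default of 0 instead, the same rendering would yield `VARIANT(0)`, which is the out-of-spec annotation this PR sets out to avoid.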

@aihuaxu aihuaxu requested a review from wgtmac as a code owner October 16, 2025 17:52
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@aihuaxu
Contributor Author

aihuaxu commented Oct 16, 2025

@wgtmac Can you take a look?

@aihuaxu aihuaxu changed the title Set Variant specification version to 1 to align with the variant spec GH-47838: [C++] Set Variant specification version to 1 to align with the variant spec Oct 16, 2025
@github-actions

⚠️ GitHub issue #47838 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 16, 2025
@aihuaxu aihuaxu changed the title GH-47838: [C++] Set Variant specification version to 1 to align with the variant spec GH-47838: [C++][Parquet] Set Variant specification version to 1 to align with the variant spec Oct 17, 2025
@aihuaxu aihuaxu requested a review from wgtmac October 17, 2025 04:45
@aihuaxu aihuaxu force-pushed the aixu-update-spec-version branch from a7c4d0d to 3450f1c Compare October 17, 2025 05:00
Member

@wgtmac wgtmac left a comment

+1

I'm fine with the current change for now. Eventually we may need to add a setter and getter for specification_version to VariantLogicalType and validate its value.

cc @pitrou @raulcd
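
As a rough sketch of the accessor-plus-validation idea raised in this review comment, the example below exposes a getter and validates the version once, in a factory, instead of offering a mutable setter. That design choice, the use of `std::invalid_argument`, and the names `VersionedVariantSketch`, `kSupportedSpecVersion`, and `ValidationExample` are all assumptions made for illustration; they do not describe what Arrow actually implements.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical sketch, not the real VariantLogicalType.
class VersionedVariantSketch {
 public:
  static constexpr int8_t kSupportedSpecVersion = 1;

  // Factory that rejects any version this sketch does not know about.
  static std::shared_ptr<const VersionedVariantSketch> Make(
      int8_t spec_version = kSupportedSpecVersion) {
    if (spec_version != kSupportedSpecVersion) {
      throw std::invalid_argument(
          "unsupported Variant specification version: " +
          std::to_string(static_cast<int>(spec_version)));
    }
    return std::shared_ptr<const VersionedVariantSketch>(
        new VersionedVariantSketch(spec_version));
  }

  // Getter only: the object is immutable once it has been validated.
  int8_t spec_version() const { return spec_version_; }

 private:
  explicit VersionedVariantSketch(int8_t spec_version)
      : spec_version_(spec_version) {}

  int8_t spec_version_;
};

// Usage sketch: the default is accepted, version 0 is rejected.
inline void ValidationExample() {
  assert(VersionedVariantSketch::Make()->spec_version() == 1);
  bool rejected = false;
  try {
    VersionedVariantSketch::Make(0);
  } catch (const std::invalid_argument&) {
    rejected = true;
  }
  assert(rejected);
}
```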

@pitrou
Member

pitrou commented Oct 17, 2025

> Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.

Why is it not the case for our own reader?

@aihuaxu
Contributor Author

aihuaxu commented Oct 17, 2025

> Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.

> Why is it not the case for our own reader?

You mean the Parquet C++ reader? Parquet C++ hasn't implemented Variant read/write yet. So far it only supports the Variant logical type, and we are not doing the check. The Variant spec is currently at version 1 (the only version), and other readers may or may not add the check.
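
To make the difference between readers that check the version and readers that don't concrete, here is a small sketch. `VersionPolicy`, `AcceptsVariantAnnotation`, `kVariantSpecVersion`, and `ReaderPolicyExample` are hypothetical names; no existing Parquet reader is claimed to expose such a switch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reader behaviours with respect to the version field.
enum class VersionPolicy { kLenient, kStrict };

// The only Variant specification version mentioned in this thread.
constexpr int8_t kVariantSpecVersion = 1;

// Returns true if a reader following `policy` would accept an annotation
// carrying `spec_version`. A lenient reader ignores the field entirely.
constexpr bool AcceptsVariantAnnotation(int8_t spec_version, VersionPolicy policy) {
  return policy == VersionPolicy::kLenient || spec_version == kVariantSpecVersion;
}

// Usage sketch: version 0 passes a lenient reader but fails a strict one.
inline void ReaderPolicyExample() {
  assert(AcceptsVariantAnnotation(0, VersionPolicy::kLenient));
  assert(!AcceptsVariantAnnotation(0, VersionPolicy::kStrict));
  assert(AcceptsVariantAnnotation(1, VersionPolicy::kStrict));
}
```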

@aihuaxu aihuaxu force-pushed the aixu-update-spec-version branch from 4b65ada to 38a919d Compare October 17, 2025 18:37
@aihuaxu
Contributor Author

aihuaxu commented Oct 17, 2025

@wgtmac and @pitrou I added specification_version to VariantLogicalType. Please take another look. Thanks.

@raulcd
Member

raulcd commented Oct 17, 2025

> You mean the Parquet C++ reader? Parquet C++ hasn't implemented Variant read/write yet. So far it only supports the Variant logical type, and we are not doing the check. The Variant spec is currently at version 1 (the only version), and other readers may or may not add the check.

Sorry, I might be missing something obvious, as I am not too familiar with this part of the codebase. If we haven't implemented Variant write in Parquet C++ yet, I am not sure I understand how a user would be able to create Variant files with an incorrect logical type annotation using Parquet C++ if we release without this fix.

@aihuaxu
Contributor Author

aihuaxu commented Oct 17, 2025

> You mean the Parquet C++ reader? Parquet C++ hasn't implemented Variant read/write yet. So far it only supports the Variant logical type, and we are not doing the check. The Variant spec is currently at version 1 (the only version), and other readers may or may not add the check.

> Sorry, I might be missing something obvious, as I am not too familiar with this part of the codebase. If we haven't implemented Variant write in Parquet C++ yet, I am not sure I understand how a user would be able to create Variant files with an incorrect logical type annotation using Parquet C++ if we release without this fix.

The engines implement the reader/writer parts themselves but use the Variant logical type defined in Arrow's Parquet C++ library. That causes the engines to write an incorrect annotation. That's what I'm seeing internally.

@raulcd
Member

raulcd commented Oct 20, 2025

We haven't been able to publish Python 3.14 wheels for PyArrow and the community is eager to get those. @pitrou @wgtmac can you help with this issue?
Is the issue a blocker? If it is, can we get it fixed so we can generate a new RC soon? If it is not, can you share your thoughts on the ML thread so we can proceed with RC0?

}

std::shared_ptr<const LogicalType> VariantLogicalType::Make(const int8_t specVersion) {
  auto logical_type = std::shared_ptr<VariantLogicalType>(new VariantLogicalType());
Member

Can we use std::make_shared? It's more terse and more efficient.

Has this been addressed or resolved?

Member

Unfortunately the VariantLogicalType constructor is private, so std::make_shared cannot work with it.
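
The access problem can be reproduced in isolation. In the sketch below, `PrivateCtorSketch` and `FactoryExample` are hypothetical names used only to illustrate the general C++ rule; the real `VariantLogicalType` has different members.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical class with a private constructor and a public factory.
class PrivateCtorSketch {
 public:
  static std::shared_ptr<const PrivateCtorSketch> Make(int value) {
    // The `new` expression is evaluated inside a member function, so the
    // private constructor is accessible here.
    return std::shared_ptr<const PrivateCtorSketch>(new PrivateCtorSketch(value));
    // std::make_shared<PrivateCtorSketch>(value) would not compile: the
    // object is constructed inside the standard library, which has no
    // access to the private constructor.
  }

  int value() const { return value_; }

 private:
  explicit PrivateCtorSketch(int value) : value_(value) {}

  int value_;
};

// Code outside the class (including std::make_shared) cannot construct it.
static_assert(!std::is_constructible<PrivateCtorSketch, int>::value,
              "the constructor is private");

// Usage sketch: the factory is the only way to obtain an instance.
inline void FactoryExample() {
  assert(PrivateCtorSketch::Make(7)->value() == 7);
}
```

The trade-off is the one mentioned above: `std::shared_ptr<T>(new T(...))` performs two allocations (object and control block), whereas `std::make_shared` can combine them, but only when the constructor is accessible to it.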

@pitrou
Member

pitrou commented Oct 20, 2025

> We haven't been able to publish Python 3.14 wheels for PyArrow and the community is eager to get those. @pitrou @wgtmac can you help with this issue?

You mean help with the 3.14 wheels? I'm not sure I understand your message correctly.

> Is the issue a blocker?

I would not call it a blocker entirely, but we would certainly rather have it.

@raulcd
Member

raulcd commented Oct 20, 2025

> You mean help with the 3.14 wheels? I'm not sure I understand your message correctly.

No, sorry, I meant this issue which is currently holding the release and holding things like publishing the Python 3.14 wheels.

@pitrou pitrou force-pushed the aixu-update-spec-version branch from 38a919d to cc1a45b Compare October 20, 2025 07:52
Member

@pitrou pitrou left a comment

+1 now that comments are addressed. Let's wait for CI.

@pitrou
Member

pitrou commented Oct 20, 2025

CI failures are unrelated, so I'll merge. Thanks for spotting and fixing this, @aihuaxu!

@pitrou pitrou merged commit 5f616db into apache:main Oct 20, 2025
111 of 160 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Oct 20, 2025
@pitrou
Member

pitrou commented Oct 20, 2025

@raulcd I'll let you make the final decision, but it would be nice if a new RC could be issued with this fix.

@raulcd
Member

raulcd commented Oct 20, 2025

I'll create a new RC. Thanks everyone for the quick fix!

raulcd pushed a commit that referenced this pull request Oct 20, 2025
…ign with the variant spec (#47835)

### Rationale for this change
According to the [Variant specification](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md), the specification_version field must be set to 1 to indicate Variant encoding version 1. Currently, this field defaults to 0, which violates the specification. Parquet readers that strictly enforce specification version validation will fail to read files containing Variant types.

### What changes are included in this PR?
This PR changes the default specification version to 1.
### Are these changes tested?
The change is covered by a unit test.
### Are there any user-facing changes?
Parquet files produced now carry the Variant logical type annotation `VARIANT(1)`.

```
Schema:
message schema {
  optional group V (VARIANT(1)) = 1 {
    required binary metadata;
    required binary value;
  }
}
```

* GitHub Issue: #47838

Lead-authored-by: Aihua <aihua.xu@snowflake.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@wgtmac
Member

wgtmac commented Oct 20, 2025

Thanks @pitrou and @raulcd!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 5f616db.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Oct 20, 2025
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Nov 5, 2025
… to align with the variant spec (apache#47835)
