
GH-35746: [Parquet][C++][Python] Switch default Parquet version to 2.6 #36137

Merged
9 commits merged into apache:main on Jul 6, 2023

Conversation

anjakefala
Collaborator

@anjakefala anjakefala commented Jun 16, 2023

Change the default parquet version to 2.6.

Discussed in:
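For context, the behaviour change can be boiled down to a small sketch (plain Python with illustrative names, not the pyarrow implementation): format version 2.4 has no nanosecond timestamp logical type, so nanoseconds get coerced down to microseconds, while 2.6 stores them natively.

```python
# Simplified model of the version bump: which timestamp units each
# Parquet format version can store natively (illustrative, not pyarrow API).
SUPPORTED_UNITS = {
    "2.4": {"ms", "us"},
    "2.6": {"ms", "us", "ns"},
}

def written_unit(requested_unit, version="2.6"):
    """Unit that actually lands in the file: under 2.4, nanoseconds are
    coerced down to microseconds; under 2.6 they are preserved."""
    if requested_unit in SUPPORTED_UNITS[version]:
        return requested_unit
    return "us"
```

With the old default, `written_unit("ns")` would have silently produced microseconds; with 2.6 as the default, nanoseconds round-trip unchanged.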

@github-actions

⚠️ GitHub issue #35746 has been automatically assigned in GitHub to PR creator.

@jorisvandenbossche
Member

Taking a quick look at some of the Python failures:

  • test_write_error_deletes_incomplete_file: I assume the test relied on parquet.write_table raising an error because nanoseconds were present and would get truncated. So either we need to add something else that errors, or we can add version="2.4" so it keeps using the older parquet version and thus still errors
  • test_timestamp_restore_timezone_nanosecond: it creates the expected result with microseconds, but the data now comes back as nanoseconds. I think here we want to test explicitly that the timezone is preserved when the resolution changes, so I would also add version="2.4" to keep the test as it was. But you could parametrize the test just above to use different resolutions (ms, us, ns) to cover all of them.
  • test_parquet_version_timestamp_differences: also needs to be updated because the expected result now contains nanoseconds instead of microseconds (though some of the comments there seem a bit outdated ..)

@jorisvandenbossche
Member

For C++, the test failures are in

 [  FAILED  ] 2 tests, listed below:
[  FAILED  ] TestArrowReadWrite.DateTimeTypes
[  FAILED  ] TestArrowReadWrite.ParquetVersionTimestampDifferences

and I suppose those will also be updated to reflect the new defaults (nanoseconds are preserved), or updated to test both the old default of 2.4 and the new default of 2.6

@anjakefala
Collaborator Author

anjakefala commented Jun 27, 2023

@jorisvandenbossche This was so helpful!

Yeah, with the _check_roundtrip calls, it seems that when the version is not explicitly set, it uses the default for write_table. In some places the assumption was that this default version was still '1.0'. The tested behaviour didn't change between 2.4 and 2.6, so it didn't matter that the new default version was being used for _check_roundtrip.
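The pitfall described here reduces to a default-argument sketch (illustrative names, not the real test helpers): when a helper falls back to write_table's default, the tests silently track whatever that default happens to be.

```python
# Stand-in for pyarrow's write_table default (was "1.0" long ago,
# then "2.4", now "2.6") -- illustrative, not the real constant.
DEFAULT_WRITE_VERSION = "2.6"

def check_roundtrip_version(version=None):
    """Version a _check_roundtrip-style helper effectively uses:
    an explicit argument wins; otherwise the writer default applies."""
    return version if version is not None else DEFAULT_WRITE_VERSION
```

A test that never passes `version=` changes behaviour the moment the default is bumped, which is exactly what surfaced in the Python failures above.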

Member

@jorisvandenbossche jorisvandenbossche left a comment


The python changes all look good!

@mapleFU could you have a look at the C++ test changes?
At the moment it is mostly updating the tests for the new default, not fully sure how well the explicit versions are covered in other tests.

  expected = pa.Table.from_arrays([a_us, a_us, a_us, a_us], names)
- _check_roundtrip(table, expected, version='2.6', coerce_timestamps='us')
+ for ver in all_versions:
+     _check_roundtrip(table, expected, version=ver, coerce_timestamps='us')

  # TODO: after pyarrow allows coerce_timestamps='ns', tests like the
  # following should pass ...


Could also be left for a follow-up, but we could now allow this (in _create_arrow_writer_properties in _parquet.pyx we would need to update the cython bindings to pass it through; I think on the C++ side it's already implemented)
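On the Python side, such a follow-up would mostly amount to relaxing the argument validation. A pure-Python sketch of the idea (hypothetical helper name; the real change would live in _create_arrow_writer_properties in _parquet.pyx):

```python
# Hypothetical validation helper mirroring what the cython binding
# would need to accept; 'ns' is the newly allowed value.
_VALID_COERCE_UNITS = ("ms", "us", "ns")

def validate_coerce_timestamps(unit):
    """Accept 'ns' alongside 'ms'/'us' once the C++ writer property
    supports it; None means no coercion."""
    if unit is not None and unit not in _VALID_COERCE_UNITS:
        raise ValueError(
            f"Invalid value for coerce_timestamps: {unit!r} "
            f"(expected one of {_VALID_COERCE_UNITS})")
    return unit
```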

@@ -3334,7 +3334,7 @@ TEST(ArrowReadWrite, NestedRequiredOuterOptional) {
   auto arrow_writer_props = ArrowWriterProperties::Builder();
   arrow_writer_props.store_schema();
   if (inner_type->id() == ::arrow::Type::UINT32) {
-    writer_props.version(ParquetVersion::PARQUET_2_4);
+    writer_props.version(ParquetVersion::PARQUET_2_6);


The default writer_props created above with WriterProperties::Builder() will now use 2_6 by default, so potentially this could be removed altogether

@jorisvandenbossche jorisvandenbossche added this to the 13.0.0 milestone Jul 3, 2023
@github-actions github-actions bot added the awaiting changes and awaiting review labels and removed the awaiting review label Jul 3, 2023
cpp/src/parquet/column_writer_test.cc (review thread resolved)
@@ -891,7 +890,7 @@ TEST_F(TestConvertArrowSchema, ArrowFields) {
        LogicalType::Timestamp(true, LogicalType::TimeUnit::MICROS), ParquetType::INT64,
        -1},
       {"timestamp(nanosecond, CET)", ::arrow::timestamp(::arrow::TimeUnit::NANO, "CET"),
-       LogicalType::Timestamp(true, LogicalType::TimeUnit::MICROS), ParquetType::INT64,
+       LogicalType::Timestamp(true, LogicalType::TimeUnit::NANOS), ParquetType::INT64,


Should we extract NANOS, and test with both 2_4 and 2_6? Since 2.4 might still be used for some time

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review and awaiting changes labels Jul 3, 2023
@mapleFU
Member

mapleFU commented Jul 3, 2023

The C++ tests look ok. But I'm afraid we might lose some test coverage for 2.4, so maybe we can separate them and test both 2.4 and 2.6?

@anjakefala anjakefala marked this pull request as ready for review July 5, 2023 22:04
@anjakefala
Collaborator Author

AsofJoinTest needs to be updated as well.

@jorisvandenbossche
Member

@github-actions crossbow submit -g integration

@github-actions

github-actions bot commented Jul 6, 2023

Revision: 932de52

Submitted crossbow builds: ursacomputing/crossbow @ actions-34c5b677fb

Task Status
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-master Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.1.2 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-conda-python-3.9-spark-v3.2.0 Github Actions

cpp/src/parquet/column_writer_test.cc (review thread resolved)
@@ -1973,7 +1983,9 @@ TEST(TestArrowReadWrite, ParquetVersionTimestampDifferences) {
field("ts:us", t_us), field("ts:ns", t_ns)});
auto input_table = Table::Make(input_schema, {a_s, a_ms, a_us, a_ns});

auto parquet_version_1_properties = ::parquet::default_writer_properties();


nice catch!

Co-authored-by: mwish <1506118561@qq.com>
@jorisvandenbossche
Member

> AsofJoinTest needs to be updated as well.

I think this failure is not related, the AsofJoinTest regularly fails on main as well. See #36482

@github-actions github-actions bot added the awaiting merge label and removed the awaiting committer review label Jul 6, 2023
@jorisvandenbossche
Member

There might still be a few cases where the test coverage for all options can be improved (see some of the comments), but if needed, let's do that in a follow-up PR.
I am going to merge this now, because we have another PR waiting on this one that we also want to get in before the feature freeze.

@jorisvandenbossche jorisvandenbossche merged commit 22f7e59 into apache:main Jul 6, 2023
30 of 31 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting merge label Jul 6, 2023
@jorisvandenbossche
Member

Thanks @anjakefala!

@conbench-apache-arrow

Conbench analyzed the 6 benchmark runs on commit 22f7e597.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.


Successfully merging this pull request may close these issues.

[Parquet][C++][Python] Bump the default format version from 2.4 -> 2.6
3 participants