-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35746: [Parquet][C++][Python] Switch default Parquet version to 2.6 #36137
Conversation
|
Taking a quick look at some of the Python failures:
|
For C++, the test failures are in
and I suppose those will also be updated to reflect the new defaults (nanoseconds are preserved), or updated to test both old default of 2.4 and new of 2.6 |
@jorisvandenbossche This was so helpful! Yeah, with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python changes all look good!
@mapleFU could you have a look at the C++ test changes?
At the moment it is mostly updating the tests for the new default, not fully sure how well the explicit versions are covered in other tests.
expected = pa.Table.from_arrays([a_us, a_us, a_us, a_us], names) | ||
_check_roundtrip(table, expected, version='2.6', coerce_timestamps='us') | ||
for ver in all_versions: | ||
_check_roundtrip(table, expected, version=ver, coerce_timestamps='us') | ||
|
||
# TODO: after pyarrow allows coerce_timestamps='ns', tests like the | ||
# following should pass ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also be left for a follow-up, but we could now allow this (in _create_arrow_writer_properties
in _parquet.pyx we would need to update the cython bindings to pass it through, I think on C++ side it's already implemented)
@@ -3334,7 +3334,7 @@ TEST(ArrowReadWrite, NestedRequiredOuterOptional) { | |||
auto arrow_writer_props = ArrowWriterProperties::Builder(); | |||
arrow_writer_props.store_schema(); | |||
if (inner_type->id() == ::arrow::Type::UINT32) { | |||
writer_props.version(ParquetVersion::PARQUET_2_4); | |||
writer_props.version(ParquetVersion::PARQUET_2_6); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default writer_props created above with WriterProperties::Builder();
will now use 2_6 by default, so potentially this could be removed alltogether
@@ -891,7 +890,7 @@ TEST_F(TestConvertArrowSchema, ArrowFields) { | |||
LogicalType::Timestamp(true, LogicalType::TimeUnit::MICROS), ParquetType::INT64, | |||
-1}, | |||
{"timestamp(nanosecond, CET)", ::arrow::timestamp(::arrow::TimeUnit::NANO, "CET"), | |||
LogicalType::Timestamp(true, LogicalType::TimeUnit::MICROS), ParquetType::INT64, | |||
LogicalType::Timestamp(true, LogicalType::TimeUnit::NANOS), ParquetType::INT64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we extract NANOS, and test with both 2_4 and 2_6? Since 2.4 might still be used for some time
The test in C++ looks ok. But I'm afraid that we might leak some test under 2.4, so maybe we can separate and test both 2.4 and 2.6? |
|
@github-actions crossbow submit -g integration |
Revision: 932de52 Submitted crossbow builds: ursacomputing/crossbow @ actions-34c5b677fb |
@@ -1973,7 +1983,9 @@ TEST(TestArrowReadWrite, ParquetVersionTimestampDifferences) { | |||
field("ts:us", t_us), field("ts:ns", t_ns)}); | |||
auto input_table = Table::Make(input_schema, {a_s, a_ms, a_us, a_ns}); | |||
|
|||
auto parquet_version_1_properties = ::parquet::default_writer_properties(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch!
Co-authored-by: mwish <1506118561@qq.com>
I think this failure is not related, the AsofJoinTest regularly fails on main as well. See #36482 |
There might still be a few cases where the test coverage for all options can be improved (see some of the comments), but if needed, let's do that in a follow-up PR. |
Thanks @anjakefala! |
Conbench analyzed the 6 benchmark runs on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Change the default parquet version to 2.6.
Discussed in:
ML
Issue
Closes: [Parquet][C++][Python] Bump the default format version from 2.4 -> 2.6 #35746