@scovich scovich commented Aug 21, 2025

Which issue does this PR close?

Rationale for this change

VariantArrayBuilder had a very complex choreography with the VariantBuilder API that required lots of manual drop glue to deal with ownership transfers between it and the VariantArrayVariantBuilder it delegates the actual work to. Rework the whole thing to use a (now-reusable) MetadataBuilder and ValueBuilder, with rollbacks largely handled by ParentState -- just like the other builders in the parquet-variant crate.

What changes are included in this PR?

Five changes (curated as five commits that reviewers may want to examine individually):

  1. Make a bunch of parquet-variant builder infrastructure public, so that VariantArrayBuilder can access it from the parquet-variant-compute crate.
  2. Make MetadataBuilder reusable. Its finish method appends the bytes of a new serialized metadata dictionary to the underlying buffer and resets the remaining builder state. The builder is thus ready to create a brand new metadata dictionary whose serialized bytes will also be appended to the underlying buffer once finished.
  3. Rework VariantArrayBuilder to use MetadataBuilder and ValueBuilder, coordinated via ParentState. This is the main feature of the PR and also the most complicated/subtle.
  4. Delete now-unused code that had been added previously in order to support the old implementation of VariantArrayBuilder.
  5. Add missing doc comments for now-public types and methods.
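
To illustrate the "reusable builder" idea from item 2, here is a minimal sketch. It is a toy model, not the parquet-variant API: `ToyMetadataBuilder`, `upsert`, `build_two`, and the byte encoding (a count byte followed by length-prefixed names) are all hypothetical. The only point is the shape of `finish`: append the serialized dictionary to the shared buffer, reset the per-dictionary state, and leave the builder ready for the next dictionary.

```rust
use std::collections::HashMap;

/// Toy stand-in for a reusable dictionary builder (hypothetical, not the real MetadataBuilder).
struct ToyMetadataBuilder {
    /// Shared output buffer; every finished dictionary is appended here.
    buffer: Vec<u8>,
    /// Field names of the dictionary currently being built, in insertion order.
    names: Vec<String>,
    /// Name -> id lookup for the dictionary currently being built.
    ids: HashMap<String, u32>,
}

impl ToyMetadataBuilder {
    fn new() -> Self {
        Self { buffer: Vec::new(), names: Vec::new(), ids: HashMap::new() }
    }

    /// Returns the id of `name`, registering it first if it is new.
    fn upsert(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.ids.get(name) {
            return id;
        }
        let id = self.names.len() as u32;
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), id);
        id
    }

    /// Appends the serialized dictionary to `buffer`, resets the
    /// per-dictionary state, and returns the new end of the buffer.
    fn finish(&mut self) -> usize {
        // Made-up encoding: one count byte, then length-prefixed names.
        self.buffer.push(self.names.len() as u8);
        for name in self.names.drain(..) {
            self.buffer.push(name.len() as u8);
            self.buffer.extend_from_slice(name.as_bytes());
        }
        self.ids.clear();
        self.buffer.len()
    }
}

/// Builds two dictionaries back to back with the same builder.
fn build_two() -> (Vec<u8>, Vec<usize>) {
    let mut builder = ToyMetadataBuilder::new();
    let mut offsets = vec![0];
    builder.upsert("a");
    builder.upsert("b");
    builder.upsert("a"); // duplicate, reuses id 0
    offsets.push(builder.finish());
    builder.upsert("c"); // fresh dictionary, ids start again at 0
    offsets.push(builder.finish());
    (builder.buffer, offsets)
}

fn main() {
    let (buffer, offsets) = build_two();
    println!("buffer = {:?}, offsets = {:?}", buffer, offsets);
}
```

In this sketch the caller keeps the offsets, so each row's dictionary is just a slice of one contiguous buffer.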

Are these changes tested?

Existing variant array builder tests cover the change.

Are there any user-facing changes?

A lot of builder-related types and methods from the parquet-variant crate are now public.

/// Builder for the in progress variant value, temporarily owns the buffers
/// from `array_builder`
variant_builder: VariantBuilder,
parent_state: ParentState<'a>,

@scovich scovich Aug 21, 2025

NOTE: I finally figured out why my pathfinding PR couldn't get the variant array builder to work correctly when storing end-offsets instead of start-offsets: It's because I wasn't using a top-level parent state here, so rollbacks were unreliable. Out of pure luck, the unit test that validates nested rollbacks wrote the same number of bytes for the rolled back and finalized nested builders, and by storing the starting offset we got lucky to observe the correct byte slice in spite of the bug.

Updated diagnosis: When storing starting offset, the extra bytes that failed to roll back were harmlessly "appended" to the previous row as padding after the "real" variant value. But when storing ending offset, the extra bytes were "prepended" to the next row, whose own value was thus wrongly ignored as padding.
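
The difference can be sketched with a toy buffer. This is an illustration under assumptions, not the crate's code: `row_by_start` and `row_by_end` are hypothetical helpers, and the scenario assumes the stray bytes (`xx`) landed between row 0 (`AA`) and row 1 (`BB`), with a reader that decodes one value from the front of each slice and treats any trailing bytes as padding.

```rust
/// Slices row `i` when `starts[i]` is the START of row `i`;
/// the row ends at the next row's start, or at the end of the buffer.
fn row_by_start<'a>(buf: &'a [u8], starts: &[usize], i: usize) -> &'a [u8] {
    let end = starts.get(i + 1).copied().unwrap_or(buf.len());
    &buf[starts[i]..end]
}

/// Slices row `i` when `ends[i]` is the END of row `i`;
/// the row begins at the previous row's end, or at 0 for the first row.
fn row_by_end<'a>(buf: &'a [u8], ends: &[usize], i: usize) -> &'a [u8] {
    let start = if i == 0 { 0 } else { ends[i - 1] };
    &buf[start..ends[i]]
}

fn main() {
    // Layout in this toy scenario: row 0 = "AA", stray bytes = "xx", row 1 = "BB".
    let buf = b"AAxxBB";
    let starts = [0, 4]; // recorded where each row's value begins
    let ends = [2, 6]; // recorded where each row's value finishes

    // Start offsets: the stray bytes trail row 0 and are ignored as padding.
    println!("{:?}", std::str::from_utf8(row_by_start(buf, &starts, 0)));
    println!("{:?}", std::str::from_utf8(row_by_start(buf, &starts, 1)));

    // End offsets: the stray bytes lead row 1, so a reader that decodes from
    // the front of the slice sees them instead of the real value.
    println!("{:?}", std::str::from_utf8(row_by_end(buf, &ends, 0)));
    println!("{:?}", std::str::from_utf8(row_by_end(buf, &ends, 1)));
}
```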

Anyway, the important thing is -- adding this parent_state here makes rollbacks reliable, and so it no longer matters whether the builder stores starting or ending offsets.
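
For readers unfamiliar with the pattern, a rollback-on-drop guard can be sketched as follows. This is a simplified, hypothetical `RollbackGuard`, not the actual `ParentState` type: it only assumes the general idea that a state object remembers the buffer length at creation and truncates back to it unless the work is explicitly finished.

```rust
/// Hypothetical guard: remembers the buffer length at creation and
/// truncates back to it on drop unless `finish` was called.
struct RollbackGuard<'a> {
    buffer: &'a mut Vec<u8>,
    saved_len: usize,
    finished: bool,
}

impl<'a> RollbackGuard<'a> {
    fn new(buffer: &'a mut Vec<u8>) -> Self {
        let saved_len = buffer.len();
        Self { buffer, saved_len, finished: false }
    }

    fn write(&mut self, bytes: &[u8]) {
        self.buffer.extend_from_slice(bytes);
    }

    /// Commits the bytes written so far; the guard is consumed and its
    /// `Drop` impl then sees `finished == true` and leaves the buffer alone.
    fn finish(mut self) {
        self.finished = true;
    }
}

impl Drop for RollbackGuard<'_> {
    fn drop(&mut self) {
        if !self.finished {
            // Roll back everything written through this guard.
            self.buffer.truncate(self.saved_len);
        }
    }
}

/// Writes one committed row, one abandoned row, and another committed row.
fn demo() -> Vec<u8> {
    let mut buf = Vec::new();
    {
        let mut row = RollbackGuard::new(&mut buf);
        row.write(b"AA");
        row.finish();
    }
    {
        let mut abandoned = RollbackGuard::new(&mut buf);
        abandoned.write(b"xx");
        // Dropped without `finish`, so "xx" is rolled back here.
    }
    {
        let mut row = RollbackGuard::new(&mut buf);
        row.write(b"BB");
        row.finish();
    }
    buf
}

fn main() {
    println!("{:?}", String::from_utf8(demo()));
}
```

With a guard at the top level, no stray bytes survive between rows, so the choice between start and end offsets stops affecting correctness.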

@codephage2020 codephage2020 left a comment

LGTM overall! There are only two minor documentation typos. Thanks for the contribution!

alamb commented Aug 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reusable-metadata-builder (88dfc47) to cec24a0 diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=reusable-metadata-builder
Results will be posted here when complete

alamb commented Aug 23, 2025

🤖: Benchmark completed

Details

group                                                                main                                   reusable-metadata-builder
-----                                                                ----                                   -------------------------
batch_json_string_to_variant json_list 8k string                     1.00     25.9±0.13ms        ? ?/sec    1.05     27.2±0.10ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.01    298.1±6.03ms        ? ?/sec    1.00    295.7±2.26ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.02      7.8±0.16ms        ? ?/sec    1.00      7.7±0.02ms        ? ?/sec
variant_get_primitive                                                1.24   1085.1±3.29µs        ? ?/sec    1.00    878.3±8.23µs        ? ?/sec

alamb commented Aug 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reusable-metadata-builder (88dfc47) to cec24a0 diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=reusable-metadata-builder
Results will be posted here when complete

alamb commented Aug 23, 2025

🤖: Benchmark completed

Details

group                                       main                                   reusable-metadata-builder
-----                                       ----                                   -------------------------
bench_extend_metadata_builder               1.00     52.2±2.06ms        ? ?/sec    1.06     55.2±1.58ms        ? ?/sec
bench_object_field_names_reverse_order      1.06     19.7±0.78ms        ? ?/sec    1.00     18.6±0.96ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1215.4±14.33µs        ? ?/sec    1.00  1212.0±14.88µs        ? ?/sec
bench_object_list_same_schema               1.01     24.1±0.19ms        ? ?/sec    1.00     23.9±0.30ms        ? ?/sec
bench_object_list_unknown_schema            1.00     13.1±0.07ms        ? ?/sec    1.00     13.1±0.10ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.2±0.01ms        ? ?/sec    1.00      3.2±0.01ms        ? ?/sec
bench_object_same_schema                    1.00     37.0±0.14ms        ? ?/sec    1.00     37.2±0.08ms        ? ?/sec
bench_object_unknown_schema                 1.00     15.9±0.04ms        ? ?/sec    1.00     16.0±0.05ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.6±0.01ms        ? ?/sec    1.00      2.6±0.01ms        ? ?/sec
iteration/validated_iteration               1.13     55.3±0.12µs        ? ?/sec    1.00     49.0±0.24µs        ? ?/sec
validation/unvalidated_construction         1.00      6.7±0.02µs        ? ?/sec    1.00      6.7±0.02µs        ? ?/sec
validation/validated_construction           1.00     60.2±0.79µs        ? ?/sec    1.01     61.0±0.13µs        ? ?/sec
validation/validation_cost                  1.00     53.3±0.07µs        ? ?/sec    1.02     54.3±0.14µs        ? ?/sec

alamb commented Aug 23, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reusable-metadata-builder (88dfc47) to cec24a0 diff
BENCH_NAME=variant_validation
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_validation
BENCH_FILTER=
BENCH_BRANCH_NAME=reusable-metadata-builder
Results will be posted here when complete

alamb commented Aug 23, 2025

🤖: Benchmark completed

Details

group                               main                                   reusable-metadata-builder
-----                               ----                                   -------------------------
bench_validate_complex_object       1.00    230.1±0.48µs        ? ?/sec    1.06    243.0±0.24µs        ? ?/sec
bench_validate_large_nested_list    1.01     19.3±0.05ms        ? ?/sec    1.00     19.2±0.04ms        ? ?/sec
bench_validate_large_object         1.00     54.3±0.08ms        ? ?/sec    1.02     55.4±0.09ms        ? ?/sec

@alamb alamb left a comment

Thank you @scovich and @codephage2020 -- I think this looks very nice 👏

@alamb alamb merged commit 32b385b into apache:main Aug 23, 2025
12 checks passed
Labels
parquet-variant parquet-variant* crates
Development

Successfully merging this pull request may close these issues.

VariantArrayBuilder uses ParentState for simpler rollbacks
3 participants