

Added support for Hudi #2

Merged (2 commits) on Feb 10, 2023

Conversation


@alexeykudinkin alexeykudinkin commented Jan 4, 2023

Description

This PR adds support for Hudi to the BrooklynData benchmarking suite.

How was this patch tested?

After adding Hudi support to the BrooklynData benchmarks, we ran the suite on EMR with the same configuration called out in this blog, obtaining the following results:

[Screenshots of benchmark results omitted]

Things to note:

  • Hudi by default adds materialized meta-fields (such as the record key, partition path, etc.) to every record. These help speed up subsequent updates (for example, by making it possible to leverage the Bloom index) and power Hudi's other stand-out features such as incremental reads (enabling incremental ETL, for example). This comes at the expense of a slightly slower bulk-insert operation (in CREATE TABLE AS SELECT).
  • Hudi's upsert performance (step 2) is dominated by an unnecessary conversion to Avro, which the community is currently focusing on rectifying.
  • Hudi performs better in workloads with higher selectivity (step 3, i.e. reading just a handful of files), since it relies on the Bloom index.
  • Hudi sets a lower target file-size limit of 120 MB to balance query performance against write amplification. This is lower than Delta's default value (256 MB), so Hudi produces more files, which in turn affects its query performance relative to Delta. We deliberately did not tune this setting up, to keep the comparison fair.
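The trade-offs above correspond to real Hudi write configs. A minimal sketch (an illustration, not the PR's actual code) of how they could be expressed as PySpark-style write options; the option keys are genuine Hudi configs, and the values here simply restate the defaults discussed above:

```python
# Illustrative Hudi write options (assumptions, not the benchmark's code).
hudi_write_options = {
    # Meta-fields (record key, partition path, ...) are materialized into
    # every record by default; disabling them would speed up bulk-insert
    # but give up Bloom-index lookups and incremental reads.
    "hoodie.populate.meta.fields": "true",
    # Hudi's ~120 MB target base-file size, vs. Delta's 256 MB default
    # mentioned above: smaller files trade some query speed for less
    # write amplification.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
}

# Usage (sketch): df.write.format("hudi").options(**hudi_write_options).save(path)
```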

Does this PR introduce any user-facing changes?

N/A

| 'hoodie.table.keygenerator.class' = 'org.apache.hudi.keygen.ComplexKeyGenerator',
| 'hoodie.parquet.compression.codec' = 'snappy',
| 'hoodie.datasource.write.hive_style_partitioning' = 'true',
| 'hoodie.sql.insert.mode'= 'non-strict',
@alexeykudinkin (Author)

The following 3 options actually shouldn't need to be specified explicitly; however, in our recent round of testing we discovered a regression (HUDI-5499) that prompted us to set them here for a proper comparison.
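For context, the quoted options sit in the TBLPROPERTIES clause of the benchmark's CREATE TABLE statement. A hedged sketch of how such a statement could be assembled (the table and source names are hypothetical, not the benchmark's actual schema; the property keys are the real Hudi configs quoted above):

```python
# Hypothetical sketch: assembling a Spark SQL CREATE TABLE statement that
# carries the quoted Hudi options in its TBLPROPERTIES clause.
hudi_tblproperties = {
    "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.parquet.compression.codec": "snappy",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.sql.insert.mode": "non-strict",
}

props = ",\n  ".join(f"'{k}' = '{v}'" for k, v in hudi_tblproperties.items())
create_stmt = (
    "CREATE TABLE store_sales_hudi\n"   # table name is illustrative
    "USING hudi\n"
    f"TBLPROPERTIES (\n  {props}\n)\n"
    "AS SELECT * FROM store_sales_source"  # source name is illustrative
)
```

The statement string would then be executed via `spark.sql(create_stmt)` in the benchmark harness.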

@vinothchandar

Can we please get a review on this?

@jecolvin jecolvin merged commit 7448dd8 into brooklyn-data:master Feb 10, 2023
@jecolvin (Member)

Sorry for the delay! Approved and merged!
