

Added support for Hudi #2

Merged (2 commits) on Feb 10, 2023

Conversation


@alexeykudinkin alexeykudinkin commented Jan 4, 2023

Description

This PR adds support for Hudi to the BrooklynData benchmarking suite.

How was this patch tested?

After adding Hudi support to the BrooklynData benchmarks, we ran the suite on EMR with the same configuration called out in this blog, obtaining the following results:

[Screenshots of benchmark results omitted]

Things to note:

  • Hudi by default adds materialized meta-fields (such as the record key, partition path, etc.) to every record. These help speed up subsequent updates (for example, by making it possible to leverage the Bloom index) and power Hudi's other stand-out features such as incremental reads (enabling incremental ETL, for example). This comes at the expense of a slightly slower bulk-insert operation (in CREATE TABLE AS SELECT).
  • Hudi's upsert performance (step 2) is dominated by an unnecessary conversion to Avro, which the community is currently focusing on rectifying.
  • Hudi performs better in workloads with higher selectivity (step 3, i.e. reading just a handful of files), since it relies on the Bloom index.
  • Hudi sets a lower target file-size limit of 120 MB to balance query performance against write amplification. This is lower than Delta's default value (256 MB), so Hudi produces more files, which in turn affects its query performance relative to Delta. We deliberately did not tune this setting up, to keep the comparison fair.
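The trade-offs above correspond to real Hudi write configs. A minimal sketch (an illustration, not the PR's actual code) of how they could be expressed as PySpark-style write options; the option keys are genuine Hudi configs, and the values here simply restate the defaults discussed above:

```python
# Illustrative Hudi write options (assumptions, not the benchmark's code).
hudi_write_options = {
    # Meta-fields (record key, partition path, ...) are materialized into
    # every record by default; disabling them would speed up bulk-insert
    # but give up Bloom-index lookups and incremental reads.
    "hoodie.populate.meta.fields": "true",
    # Hudi's ~120 MB target base-file size, vs. Delta's 256 MB default
    # mentioned above: smaller files trade some query speed for less
    # write amplification.
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
}

# Usage (sketch): df.write.format("hudi").options(**hudi_write_options).save(path)
```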

Does this PR introduce any user-facing changes?

N/A

| 'hoodie.table.keygenerator.class' = 'org.apache.hudi.keygen.ComplexKeyGenerator',
| 'hoodie.parquet.compression.codec' = 'snappy',
| 'hoodie.datasource.write.hive_style_partitioning' = 'true',
| 'hoodie.sql.insert.mode'= 'non-strict',
@alexeykudinkin (Author)

The following 3 options actually shouldn't need to be specified explicitly; however, in our recent round of testing we discovered a regression (HUDI-5499) that prompted us to set them here for a proper comparison.
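For context, the quoted options sit in the TBLPROPERTIES clause of the benchmark's CREATE TABLE statement. A hedged sketch of how such a statement could be assembled (the table and source names are hypothetical, not the benchmark's actual schema; the property keys are the real Hudi configs quoted above):

```python
# Hypothetical sketch: assembling a Spark SQL CREATE TABLE statement that
# carries the quoted Hudi options in its TBLPROPERTIES clause.
hudi_tblproperties = {
    "hoodie.table.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.parquet.compression.codec": "snappy",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.sql.insert.mode": "non-strict",
}

props = ",\n  ".join(f"'{k}' = '{v}'" for k, v in hudi_tblproperties.items())
create_stmt = (
    "CREATE TABLE store_sales_hudi\n"   # table name is illustrative
    "USING hudi\n"
    f"TBLPROPERTIES (\n  {props}\n)\n"
    "AS SELECT * FROM store_sales_source"  # source name is illustrative
)
```

The statement string would then be executed via `spark.sql(create_stmt)` in the benchmark harness.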

@vinothchandar

Can we please get a review on this?

@jecolvin jecolvin merged commit 7448dd8 into brooklyn-data:master Feb 10, 2023
@jecolvin (Member)

Sorry for the delay! Approved and merged!
