fix: hybrid storage encode bug in multi record batch #426

Merged (5 commits, Dec 1, 2022)
Conversation

@chunshao90 (Contributor) commented on Nov 27, 2022:

Which issue does this PR close?

Closes #403

Rationale for this change

Fixes the bug described in #403.
Hybrid storage compresses multiple record batches into one record batch. There was a bug where `collapsible_col_arrays` in `TsidBatch` stored only the data of the first record batch; the data of the remaining record batches was lost.
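
To illustrate, here is a minimal sketch of that bug pattern, with `Vec<f64>` standing in for the real Arrow arrays (the `TsidBatch` and `collapsible_col_arrays` names come from this description; everything else is simplified and differs from the actual code in analytic_engine/src/sst/parquet/hybrid.rs):

```rust
// Sketch only: Vec<f64> stands in for Arrow arrays.
struct TsidBatch {
    // One accumulated array per collapsible (field) column.
    collapsible_col_arrays: Vec<Vec<f64>>,
}

impl TsidBatch {
    fn new(num_collapsible_cols: usize) -> Self {
        Self {
            collapsible_col_arrays: vec![Vec::new(); num_collapsible_cols],
        }
    }

    // The buggy version effectively did `self.collapsible_col_arrays = cols;`,
    // so only the first record batch survived. The fix is to append the data
    // of every record batch that belongs to the same tsid.
    fn append_batch(&mut self, cols: Vec<Vec<f64>>) {
        for (dst, src) in self.collapsible_col_arrays.iter_mut().zip(cols) {
            dst.extend(src);
        }
    }
}

fn main() {
    let mut batch = TsidBatch::new(1);
    batch.append_batch(vec![vec![1.0, 2.0]]); // first record batch
    batch.append_batch(vec![vec![3.0, 4.0]]); // second record batch
    // All batches must survive the merge, not just the first one.
    assert_eq!(batch.collapsible_col_arrays[0], vec![1.0, 2.0, 3.0, 4.0]);
}
```

In this simplified form, the fix is the `extend` in `append_batch` rather than an assignment that overwrites the data of earlier batches.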

What changes are included in this PR?

Are there any user-facing changes?

No.

How is this change tested?

Some tests were modified to cover this bug.

@chunshao90 changed the title from "fix: fix hybridstorage ecode bug in multi record batch" to "fix: fix hybridstorage encode bug in multi record batch" on Nov 28, 2022
@jiacai2050 changed the title from "fix: fix hybridstorage encode bug in multi record batch" to "fix: hybrid storage encode bug in multi record batch" on Nov 28, 2022
@jiacai2050 (Contributor) commented:

@chunshao90 I thought of one critical issue with the hybrid format: it is not compatible with the bloom filter.

We build a bloom filter for each row group, and each row group must contain exactly `num_rows_per_row_group` rows, since `ParquetWriter` does not flush a row group until the accumulated rows reach `num_rows_per_row_group`; until then it just keeps them pending.

The problem is that hybrid storage shrinks the number of rows in a row group. For example, if we set `num_rows_per_row_group` to 8192, there will be fewer than 8192 rows after hybrid conversion, so `ParquetWriter` will never flush this row group because 8192 is never reached.
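
Roughly, the flush logic behaves like the following sketch (a simplified model under the assumptions above, not the actual `ParquetWriter` internals):

```rust
// Simplified model of per-row-group buffering; real writer internals differ.
struct RowGroupBuffer {
    num_rows_per_row_group: usize,
    pending_rows: usize,
}

impl RowGroupBuffer {
    /// Returns true when a full row group can be flushed (and a bloom
    /// filter built for it); otherwise the rows stay pending.
    fn add_rows(&mut self, rows: usize) -> bool {
        self.pending_rows += rows;
        self.pending_rows >= self.num_rows_per_row_group
    }
}

fn main() {
    let mut buf = RowGroupBuffer {
        num_rows_per_row_group: 8192,
        pending_rows: 0,
    };
    // 8192 input rows can collapse to far fewer rows after hybrid
    // conversion, so the threshold is never reached and nothing flushes.
    assert!(!buf.add_rows(1024));
}
```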

Three review threads on analytic_engine/src/sst/parquet/hybrid.rs (outdated, resolved).
@chunshao90 (Contributor, Author) replied:

> @chunshao90 I thought of one critical issue with the hybrid format: it is not compatible with the bloom filter. […]

I created issue #435.

@jiacai2050 (Contributor) left a review:

LGTM

@chunshao90 merged commit f3ba886 into apache:main on Dec 1, 2022
chunshao90 added a commit to chunshao90/ceresdb referencing this pull request on May 15, 2023:

* fix: fix hybridstorage ecode bug in multi record batch
* cargo fmt
* refactor code
* refactor by CR
Successfully merging this pull request may close these issues:

* write hybrid column data cause index out of bounds