Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(streaming): use key encoding in SerializedKey #5596

Merged
merged 8 commits into from
Sep 29, 2022

Conversation

Gun9niR
Copy link
Contributor

@Gun9niR Gun9niR commented Sep 27, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

As title.

Because value encoding requires DataType in deserialization, in deserialize_to_builders the types of the hash key needs to be known. I have two choices: pass in &[DataType] from the caller or infer the type from ArrayBuilderImpl's variant. I choose the former, because:

  • The former method can utilize existing Vec<DataType> stored in the caller instead of creating new instances.
  • Mapping ArrayBuilderImpl's variant to DataType involves more code change, including the for_all_variant macro and all related macros, and one more enum dispatching via macro, which makes the code harder to read.
  • If DataType has some fields in the variant (such as list or struct, although they don't seem to be used), it's very troublesome to derive them from ArrayBuilderImpl. We need to pass the complete DataType into the builder only for the purpose of deserialization, which is not very necessary.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.

Types of user-facing changes

Please keep the types that apply to your changes, and remove those that do not apply.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.

Refer to a related PR or issue link (optional)

Closes #5526

@codecov
Copy link

codecov bot commented Sep 27, 2022

Codecov Report

Merging #5596 (f4e8009) into main (960e11b) will increase coverage by 0.08%.
The diff coverage is 95.16%.

❗ Current head f4e8009 differs from pull request most recent head 8c1fe9d. Consider uploading reports for the commit 8c1fe9d to get more accurate results

@@            Coverage Diff             @@
##             main    #5596      +/-   ##
==========================================
+ Coverage   74.24%   74.32%   +0.08%     
==========================================
  Files         915      907       -8     
  Lines      143288   143033     -255     
==========================================
- Hits       106379   106308      -71     
+ Misses      36909    36725     -184     
Flag Coverage Δ
rust 74.32% <95.16%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/batch/src/executor/hash_agg.rs 93.08% <ø> (ø)
src/common/src/array/error.rs 0.00% <0.00%> (ø)
src/common/src/hash/key.rs 84.95% <100.00%> (+1.42%) ⬆️
src/stream/src/executor/hash_agg.rs 97.88% <100.00%> (ø)
src/stream/src/executor/managed_state/join/mod.rs 94.50% <100.00%> (+1.64%) ⬆️
src/source/src/connector_source.rs 68.46% <0.00%> (-14.98%) ⬇️
src/utils/pgwire/src/pg_response.rs 68.29% <0.00%> (-1.83%) ⬇️
src/meta/src/hummock/compaction_scheduler.rs 81.38% <0.00%> (-1.05%) ⬇️
src/meta/src/hummock/test_utils.rs 94.79% <0.00%> (-0.72%) ⬇️
...rontend/src/scheduler/distributed/query_manager.rs 13.63% <0.00%> (-0.39%) ⬇️
... and 75 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Gun9niR Gun9niR requested review from liurenjie1024 and fuyufjh and removed request for liurenjie1024 September 27, 2022 16:14
@fuyufjh
Copy link
Member

fuyufjh commented Sep 28, 2022

I'm afraid that requires key encoding aka. memcomparable. (I had wrote this in #5526)

Key encoding ensures the equality checking (a = b) is constant with encoded data (serialize(a) = serialize(a)). For example, if you run a Hash Join on A.d = B.d where d is a DECIMAL column, you have to ensure 1.0 can be joined with 1.00, while 1.0 and 1.00 have identical key encoding but different value encoding.

@@ -443,12 +456,12 @@ impl<'a> HashKeySerDe<'a> for StructRef<'a> {
type S = Vec<u8>;

/// This should never be called
fn serialize(self) -> Self::S {
fn fixed_size_serialize(self) -> Self::S {
todo!()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe out of the scope of this PR, but this code is super anti-pattern 😇 As it said "this should never be called", then StructRef should not implement HashKeySerDe.

May need further investigation on how to refactor

Copy link
Contributor Author

@Gun9niR Gun9niR Sep 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's more, the trait bound D: HashKeySerDe in HashKeySerializer::apend and HashKeyDeserializer::deserialize feels as if HashKeySerDe is only encoding that should be used, hence the comment on HashKeySerDe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's open another issue after this PR being merged

@Gun9niR
Copy link
Contributor Author

Gun9niR commented Sep 28, 2022

I'm afraid that requires key encoding aka. memcomparable. (I had wrote this in #5526)

Key encoding ensures the equality checking (a = b) is constant with encoded data (serialize(a) = serialize(a)). For example, if you run a Hash Join on A.d = B.d where d is a DECIMAL column, you have to ensure 1.0 can be joined with 1.00, while 1.0 and 1.00 have identical key encoding but different value encoding.

Didn't realize that. I was just thinking about performance 😢

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Gun9niR Gun9niR changed the title refactor(streaming): use value encoding in SerializedKey refactor(streaming): use key encoding in SerializedKey Sep 28, 2022
@Gun9niR Gun9niR requested a review from fuyufjh September 28, 2022 10:28
Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

refactor: SerializedKey should serialize key
3 participants