feat(storage): implement read pruning by vnode #2882

Merged (18 commits) on Jun 9, 2022

Contributor

@xx01cyx xx01cyx commented May 27, 2022

What's changed and what's your intention?

Summarize your change

  • Add vnode parameter to read-interfaces of keyspace and state store.
  • Use vnode bitmap info to initialize keyspace.
  • Use new keyspace (the one with vnode) in certain streaming executors.

After this PR gets merged, read pruning by vnode will work properly in both point-get and range-scan.
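
As a rough illustration of what read pruning by vnode means, here is a minimal sketch. It assumes each SST records which vnodes its keys belong to; `VnodeSet`, `SstMeta`, `prune_ssts`, and the 64-vnode limit are hypothetical names and values chosen for the example, not the types used in this PR.

```rust
// Illustrative sketch only: these types are NOT the actual storage code.

/// Assumed total number of virtual nodes (a made-up value for this example).
const VNODE_COUNT: usize = 64;

/// A set of vnodes stored as one bit per vnode.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct VnodeSet(u64);

impl VnodeSet {
    /// Builds a set from a list of vnode indices.
    fn from_vnodes(vnodes: &[usize]) -> Self {
        let mut bits = 0u64;
        for &v in vnodes {
            assert!(v < VNODE_COUNT, "vnode out of range");
            bits |= 1u64 << v;
        }
        VnodeSet(bits)
    }

    /// True if the two sets share at least one vnode.
    fn intersects(self, other: VnodeSet) -> bool {
        self.0 & other.0 != 0
    }
}

/// Hypothetical per-SST metadata: the vnodes whose keys the table contains.
struct SstMeta {
    id: u64,
    vnodes: VnodeSet,
}

/// Returns the ids of SSTs that may hold data for the reader's vnodes.
/// SSTs with no overlapping vnode are skipped without being read.
fn prune_ssts(ssts: &[SstMeta], read_vnodes: VnodeSet) -> Vec<u64> {
    ssts.iter()
        .filter(|sst| sst.vnodes.intersects(read_vnodes))
        .map(|sst| sst.id)
        .collect()
}
```

A reader that owns only vnode 1 would pass `VnodeSet::from_vnodes(&[1])` and get back only the SSTs whose set contains vnode 1.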

Limitations

Read pruning does NOT work in the batch executor yet. This will be implemented in the future.

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests

Contributor

@skyzh skyzh left a comment

I would prefer to have a new set of interfaces (scan_with_vnode) instead of changing the existing ones. We should migrate little by little.

Contributor

skyzh commented May 28, 2022

And I think vnode information should be recorded on Keyspace (keyspace::new_with_vnode) instead of passing it everywhere.

Contributor Author

xx01cyx commented May 28, 2022

I would prefer to have a new set of interfaces (scan_with_vnode) instead of changing the existing ones. We should migrate little by little.

Indeed. I'll fix this.

Contributor Author

xx01cyx commented May 28, 2022

And I think vnode information should be recorded on Keyspace (keyspace::new_with_vnode) instead of passing it everywhere.

The vnodes that one executor owns are likely to change when the cluster scales in or scales out. Then we'll have to maintain the vnode info in a multi-version way in keyspace.

Contributor

skyzh commented May 28, 2022

The vnodes that one executor owns are likely to change when the cluster scales in or scales out

If there's scale-in and scale-out, the executor will be re-created. 😇🥰

codecov bot commented May 28, 2022

Codecov Report

Merging #2882 (8b83502) into main (9af55da) will decrease coverage by 0.04%.
The diff coverage is 60.09%.

❗ Current head 8b83502 differs from pull request most recent head 6d3797d. Consider uploading reports for the commit 6d3797d to get more accurate results

@@            Coverage Diff             @@
##             main    #2882      +/-   ##
==========================================
- Coverage   73.47%   73.42%   -0.05%     
==========================================
  Files         736      736              
  Lines      100716   101010     +294     
==========================================
+ Hits        73997    74163     +166     
- Misses      26719    26847     +128     
Flag Coverage Δ
rust 73.42% <60.09%> (-0.05%) ⬇️

Impacted Files Coverage Δ
src/bench/ss_bench/operations/get.rs 0.00% <0.00%> (ø)
...rc/bench/ss_bench/operations/prefix_scan_random.rs 0.00% <0.00%> (ø)
src/common/src/hash/dispatcher.rs 91.89% <ø> (ø)
src/common/src/hash/key.rs 85.38% <ø> (ø)
src/ctl/src/cmd_impl/hummock/list_kv.rs 0.00% <0.00%> (ø)
src/meta/src/manager/hash_mapping.rs 97.39% <ø> (ø)
src/meta/src/stream/meta.rs 47.04% <ø> (ø)
src/meta/src/stream/scheduler.rs 88.53% <ø> (ø)
src/meta/src/stream/stream_manager.rs 68.82% <ø> (ø)
src/storage/src/hummock/snapshot_tests.rs 94.68% <ø> (ø)
... and 30 more

Contributor Author

xx01cyx commented May 28, 2022

If there's scale-in and scale-out, the executor will be re-created. 😇🥰

An executor can only use the latest version of its vnodes to query data. If a re-created executor wants to query data written before scaling, it will use the wrong set of vnodes and thus get incorrect results.

Contributor

skyzh commented May 28, 2022

An executor can only use the latest version of its vnodes to query data. If a re-created executor wants to query data written before scaling, it will use the wrong set of vnodes and thus get incorrect results.

New executors will always need to include their previous vnodes. We will need a separate barrier to signal that compaction is complete and to update their vnodes.

Contributor Author

xx01cyx commented May 28, 2022

New executors will always need to include their previous vnodes. We will need a separate barrier to signal that compaction is complete and to update their vnodes.

If a fragment scales out from a parallelism of 5 to 10, the number of vnodes owned by one parallel unit will inevitably decrease by half (since the total number of vnodes is fixed). How can we ensure that new executors always include their previous vnodes?
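
The arithmetic behind this question can be sketched as follows. The total of 2048 vnodes used in the usage note and the even split across parallel units are assumptions for illustration; the thread only states that the total number of vnodes does not change. `vnodes_per_unit` is a hypothetical helper.

```rust
/// Number of vnodes each parallel unit owns when `total_vnodes` are spread
/// as evenly as possible (an assumed distribution policy for this sketch).
fn vnodes_per_unit(total_vnodes: usize, parallelism: usize) -> Vec<usize> {
    assert!(parallelism > 0, "parallelism must be positive");
    let base = total_vnodes / parallelism;
    let extra = total_vnodes % parallelism;
    // The first `extra` units take one additional vnode each.
    (0..parallelism)
        .map(|i| base + usize::from(i < extra))
        .collect()
}
```

With 2048 vnodes, `vnodes_per_unit(2048, 5)` gives units of 409 or 410 vnodes, while `vnodes_per_unit(2048, 10)` gives units of 204 or 205, so a unit after scale-out cannot cover everything a unit covered before.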

Contributor Author

xx01cyx commented May 28, 2022

New executors will always need to include their previous vnodes. We will need a separate barrier to signal that compaction is complete and to update their vnodes.

I think I get what you mean. new executor vnode set = UNION OF previous vnode set AND current vnode set, until all relevant compactions are done, right?

Contributor

skyzh commented May 28, 2022

Well, my fault, please ignore my comments.

An executor can only use the latest version of its vnodes to query data. If a re-created executor wants to query data written before scaling, it will use the wrong set of vnodes and thus get incorrect results.

This should never happen. Executors will only read data belonging to their own distribution. During scale-out, executors will operate on a completely different set of keys. Therefore, they will not query data written before.

Contributor

skyzh commented May 28, 2022

And we do not need to include previous vnode.

@xx01cyx xx01cyx requested a review from fuyufjh May 30, 2022 01:31
&'a self,
key: &'a [u8],
epoch: u64,
_vnode: Option<VirtualNode>,

Member

@fuyufjh fuyufjh May 30, 2022

Considering there might not be such a "PointGet" operator, I think the type of vnodes should also be Vec<VirtualNode>. Never mind, it's not a big problem.

Contributor

I'd prefer always using &'a VirtualNode, so that it will function efficiently even when vnode mapping is large.

Contributor Author

I'd prefer always using &'a VirtualNode, so that it will function efficiently even when vnode mapping is large.

Will Option<&'a VirtualNode> still cause some overhead due to the construction of Option?

Contributor

Nope, it has exactly the same value size as &'a VirtualNode.
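
The size claim can be checked with `std::mem::size_of`. In this sketch `VirtualNode` is a stand-in alias (its real definition is not shown in this thread); the result for `Option<&T>` holds for any sized `T`, because Rust uses the null pointer value to represent `None`. The two helper functions are hypothetical.

```rust
use std::mem::size_of;

/// Stand-in for the real `VirtualNode` type; an assumption for this example.
type VirtualNode = u16;

/// `Option<&T>` uses the null pointer as `None`, so it needs no extra tag.
fn option_ref_has_no_size_overhead() -> bool {
    size_of::<Option<&VirtualNode>>() == size_of::<&VirtualNode>()
}

/// By contrast, wrapping a plain integer in `Option` needs extra bytes,
/// since every bit pattern of the integer is already a valid value.
fn option_value_overhead_bytes() -> usize {
    size_of::<Option<VirtualNode>>() - size_of::<VirtualNode>()
}
```

The first function returns `true` on every platform; the second returns a positive number whose exact value depends on layout.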

Comment on lines 149 to 157
pub async fn get_with_vnode(
    &self,
    key: impl AsRef<[u8]>,
    epoch: u64,
    vnode: VirtualNode,
) -> StorageResult<Option<Bytes>> {
    self.store
        .get(&self.prefixed_key(key), epoch, Some(vnode))
        .await

Member

This function uses the given vnode instead of self.vnode. In what scenario should it be used?

Contributor Author

It is used when an executor does a point-get with a vnode.

Contributor

I think point-get can already be optimized by the bloom filter (its false-positive rate is currently only 0.01). Maybe we don't need the vnode for it. But it would also be okay to use the vnode for a sanity check, e.g. an executor should not point-get keys outside its vnode range.

Member

@fuyufjh fuyufjh left a comment

LGTM

@@ -230,15 +231,15 @@ mod tests {
vec![DataType::Int64].into(),
);
assert!(!managed_state.is_dirty());
let columns = vec![

Contributor

Accidentally reverted the change?

@@ -170,16 +185,34 @@ impl<S: StateStore> CellBasedTable<S> {

pub async fn get_row_by_scan(&self, pk: &Row, epoch: u64) -> StorageResult<Option<Row>> {
// get row by state_store scan
let vnode = self

Contributor

Why does CellBasedTable need to compute the vnode? I think this should be provided by the executors that create the CellBasedTable.

Contributor Author

It's just that the cell-based table provides an interface like this:

async fn batch_write_rows_inner<const WITH_VALUE_META: bool>

which I think indicates whether to compute value meta in cell based table. 🤔

Contributor

@skyzh skyzh May 30, 2022

Value meta needs to be computed on writes, of course. But for reads, isn't it true that all executors and their state table objects already have value meta assigned to them? For both point-get and scan, we should use the vnode provided by the executor to filter, instead of computing it.

Contributor Author

@xx01cyx xx01cyx May 30, 2022

So we should use the same set of vnodes to do pruning, regardless of the type of read operation (point-get or range-scan). Will this lead to any avoidable inefficiency (e.g. fewer SSTs being pruned out) when we do a point-get? cc. @fuyufjh

Collaborator

For reads performed on a single vnode (e.g. we know the dist key beforehand), I think computing the vnode on the fly makes sense. In other cases, I think we should just use the vnodes of the executor, which should be set when the CellBasedTable is initialized.

vnode: VirtualNode,
) -> StorageResult<Option<Bytes>> {
// Construct vnode bitmap.
let mut bitmap_inner = [0; VNODE_BITMAP_LEN];

Contributor

This code seems to appear in multiple places. Is it possible to have a VNodeBitmap::new(vnode, table_id) and let the caller provide a VNodeBitmap?

Contributor Author

VNodeBitmap is actually a proto type. Maybe we should define a non-proto type for it.
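
A non-proto type along the lines suggested above might look like the following sketch. Everything here is an assumption made for illustration: the field layout, the constructor name `new_single`, the `contains` helper, the `u16`/`u32` widths, and the value of 2048 vnodes are not taken from the PR. Only the names `VNodeBitmap` and `VNODE_BITMAP_LEN` appear in the discussion.

```rust
/// Assumed total number of vnodes (hypothetical value).
const VNODE_COUNT: usize = 2048;
/// One bit per vnode, packed into bytes (assumed layout).
const VNODE_BITMAP_LEN: usize = VNODE_COUNT / 8;

/// Hypothetical plain-Rust bitmap, built once by the caller instead of
/// repeating the construction code at every read site.
#[derive(Clone, Debug, PartialEq, Eq)]
struct VNodeBitmap {
    table_id: u32,
    bitmap: [u8; VNODE_BITMAP_LEN],
}

impl VNodeBitmap {
    /// Builds a bitmap with exactly one vnode set, as a point-get would need.
    fn new_single(table_id: u32, vnode: u16) -> Self {
        let idx = vnode as usize;
        assert!(idx < VNODE_COUNT, "vnode out of range");
        let mut bitmap = [0u8; VNODE_BITMAP_LEN];
        bitmap[idx / 8] |= 1 << (idx % 8);
        Self { table_id, bitmap }
    }

    /// True if `vnode` is set; usable for the kind of sanity check
    /// mentioned earlier in this review.
    fn contains(&self, vnode: u16) -> bool {
        let idx = vnode as usize;
        (idx < VNODE_COUNT) && (self.bitmap[idx / 8] & (1 << (idx % 8)) != 0)
    }
}
```

A caller would construct the value once, for example `VNodeBitmap::new_single(table_id, vnode)`, and pass a reference to each read call.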

@xx01cyx xx01cyx changed the title from "feat(storage): add vnode to read-interface of keyspace and state store" to "feat(storage): implement read pruning by vnode" May 30, 2022
@hzxa21 hzxa21 self-requested a review May 31, 2022 05:20

Contributor

skyzh commented Jun 1, 2022

For reads performed on a single vnode (e.g. we know the dist key beforehand), I think computing the vnode on the fly makes sense. In other cases, I think we should just use the vnodes of the executor, which should be set when the CellBasedTable is initialized.

I believe the bloom filter can already achieve a relatively low false-positive rate. I would prefer to use the executor-provided vnode in all cases.

Collaborator

hzxa21 commented Jun 9, 2022

For reads performed on a single vnode (e.g. we know the dist key beforehand), I think computing the vnode on the fly makes sense. In other cases, I think we should just use the vnodes of the executor, which should be set when the CellBasedTable is initialized.

I believe the bloom filter can already achieve a relatively low false-positive rate. I would prefer to use the executor-provided vnode in all cases.

Correct me if I am wrong, but on second thought, I think there is no case in which we don't know the dist key beforehand. Therefore, we should always compute and provide a single vnode to the read interface.
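
Computing a single vnode from a known dist key could be sketched like this. The hash function and the vnode count are assumptions: this example uses the standard library's `DefaultHasher` and 2048 vnodes purely for illustration, and `compute_vnode` is a hypothetical helper. The real system must use the same hash on the read path as on the write path.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assumed total number of vnodes (hypothetical value).
const VNODE_COUNT: u64 = 2048;

/// Maps a distribution key to one vnode by hashing it and reducing the
/// hash modulo the vnode count. The same key always yields the same vnode
/// within one build of the program.
fn compute_vnode<K: Hash + ?Sized>(dist_key: &K) -> u16 {
    let mut hasher = DefaultHasher::new();
    dist_key.hash(&mut hasher);
    // The result is below 2048, so it always fits in a u16.
    (hasher.finish() % VNODE_COUNT) as u16
}
```

A point-get on a row would then call something like `compute_vnode(&dist_key)` and pass the result to the read interface, rather than passing the executor's whole vnode set.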

@xx01cyx xx01cyx enabled auto-merge (squash) June 9, 2022 12:57
@xx01cyx xx01cyx merged commit 49c207d into main Jun 9, 2022
@xx01cyx xx01cyx deleted the cyx/read-by-vnode-api branch June 9, 2022 13:09
5 participants