Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(storage): merge next_batch_inner and next_batch_inner_with_filter in rowset_iterator #424

Merged
merged 6 commits into from
Feb 9, 2022

Conversation

zzl200012
Copy link
Member

Signed-off-by: asuka 312856403@qq.com

Close: #290

…ilter in rowset_iterator

Signed-off-by: asuka <312856403@qq.com>
@skyzh skyzh requested review from likg227 and skyzh February 8, 2022 07:49
Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. In fact, we can do a further optimization:

  • Applying delete vector and applying filtering expression are both a kind of filter scan.
  • Therefore, we can firstly generate delete map.
  • Then, apply filter map to the delete map.
  • And finally, do filter scan.

Therefore, if there are a lot of rows being deleted, we can leverage filter scan to avoid scanning deleted rows.

let (expr, filter_column) = self.filter_expr.as_ref().unwrap();

let has_filter = self.filter_expr.is_some();
let filter_context = if self.filter_expr.is_some() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks the same as:

let filter_context = self.filter_expr.as_ref();

if x != (row_id, array.len()) {
panic!("unmatched rowid from column iterator");
for (id, column_ref) in self.column_refs.iter().enumerate() {
if has_filter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer using if let Some(filter_context) = filter_context, instead of combining .is_some() with unwrap().

}
}
// Apply delete vector to filter_bitmap
if !self.dvs.is_empty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to check empty -- the for loop won't run if dvs is empty.

Signed-off-by: asuka <312856403@qq.com>
@zzl200012
Copy link
Member Author

Generally LGTM. In fact, we can do a further optimization:

  • Applying delete vector and applying filtering expression are both a kind of filter scan.
  • Therefore, we can firstly generate delete map.
  • Then, apply filter map to the delete map.
  • And finally, do filter scan.

Therefore, if there are a lot of rows being deleted, we can leverage filter scan to avoid scanning deleted rows.

Updated. But I'm not sure if I've done this as you expected.

@skyzh
Copy link
Member

skyzh commented Feb 8, 2022

I'll take a look tomorrow, thanks for your contribution!

cc @likg227 might also take a look (if you have time).

@likg227
Copy link
Contributor

likg227 commented Feb 8, 2022

LGTM!
It's better to write some unit tests for this pr.

Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current approach doesn't seem very clear to me. What about applying the following idea in the scanning process?

  • Firstly, we should determine a fetch_size (which has already been done in the current implementation.)
  • After that, we will start to maintain a visibility Option<BitVec>. If this visibility variable is None, it means that at the current stage, we will scan all rows and won't skip any block.
  • In the first step, we generate a visibility map from the deletion vector of the current rowset. If there is no deletion vector, visibility remains to be None.
  • In the second step, we use the delete bitmap to scan the condition column in filter scan. (e.g., select * from table where a > 10, we scan the a column based on delete vector). If there are no filter conditions, we don't do any modification to the visibility map.
  • Now we will get a visibility map: case 1. without filter condition, representing the delete vector; case 2. with filter condition, representing delete vector + filter bitmap.
  • At this time, we can use this visibility map to scan: case 1. without filter condition, scan all columns; case 2. with filter condition, scanning all remaining columns.

I think in this way, we can make the code more understandable -- we always maintain a visibility bitmap, and modifying it little by little when scanning more and more columns.

If you have any question, feel free to discuss either at #risinglight-english or #risinglight-chinese channel.

Thanks for you contribution 🥰

visi.push(true);
}
for dv in &self.dvs {
dv.apply_to(&mut visi, row_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can produce the delete vector visibility map before scanning the filter column? And we may only need to apply delete vector once :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are on the same track, the only problem is that I didn't know how to get the start row_id in this batch without scanning 1 column first, which made some codes confusing... Here the dv would only be applied once as well, I guess. Maybe we could add an interface get_current_row_id for the column_iterator, and call it on the first column to generate the visibility map before doin scan? And after that, adjust the following code logic according to your comment, which would make the logic more clear.

Copy link
Member

@skyzh skyzh Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add an interface get_current_row_id for the column_iterator

We definitely need a current_row_id, but there are multiple places we can add it.

and call it on the first column

We can also record the current_row_id in RowSetIterator. See:

let start_row_id = match seek_pos {
ColumnSeekPosition::RowId(row_id) => row_id,
_ => todo!(),
};

Currently, the current_row_id always starts at 0.

You may choose the best place to add the current_row_id variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a second thought, maybe adding it on ColumnIterator is a better way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, since we already have https://github.com/risinglightdb/risinglight/blob/53a3388ec5/src/storage/secondary/column/concrete_column_iterator.rs#L46, the only work is to encapsulate a method in ColumnIterator

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we already have get_current_row_id, but only exposed for test 🤪

Signed-off-by: asuka <312856403@qq.com>
@zzl200012
Copy link
Member Author

Reworked the code just now, PTAL when you have time~ cc @skyzh

Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the bug, rest LGTM! Other minor fixes can be done in a separate PR (if you want). Thanks for this great work.

// Get the start row id first
let start_row_id = {
let mut row_id = 0;
for (id, column_ref) in self.column_refs.iter().enumerate() {
Copy link
Member

@skyzh skyzh Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All column iterators will have the same row_id at any time -- I think it's okay to simply call fetch_current_row_id for the 0 position column iterator.

// Initialize visibility map and apply delete vector to it
let mut visi = BitVec::new();
visi.resize(fetch_size, true);
for dv in &self.dvs {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The visi can remain None if self.dvs is empty. We can do this optimization later.

};

let mut filter_bitmap = BitVec::with_capacity(bool_array.len());
for i in bool_array.iter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.enumerate()

let mut filter_bitmap = BitVec::with_capacity(bool_array.len());
for i in bool_array.iter() {
if let Some(visi) = visibility_map.as_ref() {
if !visi[filter_bitmap.len()] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bug) should use the index of bool_array instead of .len()?

@skyzh
Copy link
Member

skyzh commented Feb 9, 2022

... also we can remove the original get_current_row_id function (maybe in another PR?)

Signed-off-by: asuka <312856403@qq.com>
@zzl200012
Copy link
Member Author

... also we can remove the original get_current_row_id function (maybe in another PR?)

Alright, I would start a new PR tomorrow to do the stuff above

@skyzh skyzh enabled auto-merge (squash) February 9, 2022 10:09
@skyzh skyzh merged commit 1a026b6 into risinglightdb:main Feb 9, 2022
@zzl200012 zzl200012 deleted the 0208-1 branch February 15, 2022 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: merge next_batch_inner and next_batch_inner_with_filter in rowset_iterator
3 participants