-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(storage): merge next_batch_inner and next_batch_inner_with_filter in rowset_iterator #424
Conversation
…ilter in rowset_iterator Signed-off-by: asuka <312856403@qq.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM. In fact, we can do a further optimization:
- Applying delete vector and applying filtering expression are both a kind of filter scan.
- Therefore, we can firstly generate delete map.
- Then, apply filter map to the delete map.
- And finally, do filter scan.
Therefore, if there are a lot of rows being deleted, we can leverage filter scan to avoid scanning deleted rows.
let (expr, filter_column) = self.filter_expr.as_ref().unwrap(); | ||
|
||
let has_filter = self.filter_expr.is_some(); | ||
let filter_context = if self.filter_expr.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks the same as:
let filter_context = self.filter_expr.as_ref();
if x != (row_id, array.len()) { | ||
panic!("unmatched rowid from column iterator"); | ||
for (id, column_ref) in self.column_refs.iter().enumerate() { | ||
if has_filter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer using if let Some(filter_context) = filter_context
, instead of combining .is_some()
with unwrap()
.
} | ||
} | ||
// Apply delete vector to filter_bitmap | ||
if !self.dvs.is_empty() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to check empty -- the for loop won't run if dvs is empty.
Updated. But I'm not sure if I've done this as you expected. |
I'll take a look tomorrow, thanks for your contribution! cc @likg227 might also take a look (if you have time). |
LGTM! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current approach doesn't seem very clear to me. What about applying the following idea in the scanning process?
- Firstly, we should determine a fetch_size (which has already been done in the current implementation.)
- After that, we will start to maintain a
visibility
Option<BitVec>
. If thisvisibility
variable is None, it means that at the current stage, we will scan all rows and won't skip any block. - In the first step, we generate a visibility map from the deletion vector of the current rowset. If there is no deletion vector,
visibility
remains to beNone
. - In the second step, we use the delete bitmap to scan the condition column in filter scan. (e.g., select * from table where a > 10, we scan the
a
column based on delete vector). If there are no filter conditions, we don't do any modification to the visibility map. - Now we will get a visibility map: case 1. without filter condition, representing the delete vector; case 2. with filter condition, representing delete vector + filter bitmap.
- At this time, we can use this visibility map to scan: case 1. without filter condition, scan all columns; case 2. with filter condition, scanning all remaining columns.
I think in this way, we can make the code more understandable -- we always maintain a visibility
bitmap, and modifying it little by little when scanning more and more columns.
If you have any question, feel free to discuss either at #risinglight-english or #risinglight-chinese channel.
Thanks for you contribution 🥰
visi.push(true); | ||
} | ||
for dv in &self.dvs { | ||
dv.apply_to(&mut visi, row_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can produce the delete vector visibility map before scanning the filter column? And we may only need to apply delete vector once :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are on the same track, the only problem is that I didn't know how to get the start row_id in this batch without scanning 1 column first, which made some codes confusing... Here the dv
would only be applied once as well, I guess. Maybe we could add an interface get_current_row_id
for the column_iterator, and call it on the first column to generate the visibility map before doin scan? And after that, adjust the following code logic according to your comment, which would make the logic more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could add an interface get_current_row_id for the column_iterator
We definitely need a current_row_id
, but there are multiple places we can add it.
and call it on the first column
We can also record the current_row_id
in RowSetIterator
. See:
risinglight/src/storage/secondary/rowset/rowset_iterator.rs
Lines 37 to 40 in 53a3388
let start_row_id = match seek_pos { | |
ColumnSeekPosition::RowId(row_id) => row_id, | |
_ => todo!(), | |
}; |
Currently, the current_row_id
always starts at 0.
You may choose the best place to add the current_row_id
variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a second thought, maybe adding it on ColumnIterator
is a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, since we already have https://github.com/risinglightdb/risinglight/blob/53a3388ec5/src/storage/secondary/column/concrete_column_iterator.rs#L46, the only work is to encapsulate a method in ColumnIterator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that we already have get_current_row_id
, but only exposed for test
🤪
Reworked the code just now, PTAL when you have time~ cc @skyzh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite the bug, rest LGTM! Other minor fixes can be done in a separate PR (if you want). Thanks for this great work.
// Get the start row id first | ||
let start_row_id = { | ||
let mut row_id = 0; | ||
for (id, column_ref) in self.column_refs.iter().enumerate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All column iterators will have the same row_id at any time -- I think it's okay to simply call fetch_current_row_id
for the 0
position column iterator.
// Initialize visibility map and apply delete vector to it | ||
let mut visi = BitVec::new(); | ||
visi.resize(fetch_size, true); | ||
for dv in &self.dvs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The visi
can remain None
if self.dvs
is empty. We can do this optimization later.
}; | ||
|
||
let mut filter_bitmap = BitVec::with_capacity(bool_array.len()); | ||
for i in bool_array.iter() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.enumerate()
let mut filter_bitmap = BitVec::with_capacity(bool_array.len()); | ||
for i in bool_array.iter() { | ||
if let Some(visi) = visibility_map.as_ref() { | ||
if !visi[filter_bitmap.len()] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(bug) should use the index of bool_array
instead of .len()
?
... also we can remove the original |
Alright, I would start a new PR tomorrow to do the stuff above |
Signed-off-by: asuka 312856403@qq.com
Close: #290