refactor(storage): merge next_batch_inner and next_batch_inner_with_filter in rowset_iterator #424

zzl200012 · 2022-02-08T07:37:49Z

Signed-off-by: asuka <312856403@qq.com>

Close: #290

skyzh

Generally LGTM. In fact, we can optimize this further:

Applying the delete vector and applying the filter expression are both a kind of filter scan.
Therefore, we can first generate the delete map.
Then, apply the filter map to the delete map.
And finally, do the filter scan.

This way, if a lot of rows have been deleted, we can leverage the filter scan to avoid scanning deleted rows.
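
To make the idea above concrete, here is a minimal sketch. It assumes both maps are plain `Vec<bool>` "visible" bits (the real code uses `BitVec`), and the helper names `apply_filter_map` and `filter_scan` are made up for illustration, not functions from the codebase.

```rust
/// "Apply the filter map to the delete map": a row stays visible only if
/// it is not deleted AND it passes the filter.
fn apply_filter_map(delete_map: &mut [bool], filter_map: &[bool]) {
    for (visible, &passes) in delete_map.iter_mut().zip(filter_map) {
        *visible = *visible && passes;
    }
}

/// Final filter scan: keep only the values whose visibility bit is set.
fn filter_scan<T: Clone>(column: &[T], visible: &[bool]) -> Vec<T> {
    column
        .iter()
        .zip(visible)
        .filter(|&(_, &keep)| keep)
        .map(|(value, _)| value.clone())
        .collect()
}

fn main() {
    // Delete map for a batch of 4 rows: row 2 has been deleted.
    let mut visible = vec![true, true, false, true];
    // Filter map: only rows 1 and 2 satisfy the predicate.
    let filter_map = [false, true, true, false];
    apply_filter_map(&mut visible, &filter_map);

    let column = ["a", "b", "c", "d"];
    println!("{:?}", filter_scan(&column, &visible));
}
```

Only row 1 is both undeleted and matching, so the other columns need to be read for that single row.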

skyzh · 2022-02-08T08:34:02Z

src/storage/secondary/rowset/rowset_iterator.rs

-        let (expr, filter_column) = self.filter_expr.as_ref().unwrap();
-
+        let has_filter = self.filter_expr.is_some();
+        let filter_context = if self.filter_expr.is_some() {


Looks the same as:

let filter_context = self.filter_expr.as_ref();
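
For reference, a small sketch of why the two forms are equivalent. The diff above is truncated, so this assumes the `if` yields `Some(borrowed pair)` when a filter exists and `None` otherwise. `RowSetIter` and its `(String, usize)` field are hypothetical stand-ins for the real types.

```rust
struct RowSetIter {
    // Hypothetical stand-in for the real `(expr, filter_column)` pair.
    filter_expr: Option<(String, usize)>,
}

impl RowSetIter {
    /// Hand-written version: branch on `is_some()`, then `unwrap()`.
    fn filter_context_verbose(&self) -> Option<&(String, usize)> {
        if self.filter_expr.is_some() {
            Some(self.filter_expr.as_ref().unwrap())
        } else {
            None
        }
    }

    /// Same result in a single call: `Option<T>` -> `Option<&T>`.
    fn filter_context(&self) -> Option<&(String, usize)> {
        self.filter_expr.as_ref()
    }
}

fn main() {
    let with_filter = RowSetIter { filter_expr: Some(("a > 10".to_string(), 0)) };
    let without_filter = RowSetIter { filter_expr: None };
    assert_eq!(with_filter.filter_context_verbose(), with_filter.filter_context());
    assert_eq!(without_filter.filter_context_verbose(), without_filter.filter_context());
    println!("both forms agree");
}
```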

skyzh · 2022-02-08T08:34:38Z

src/storage/secondary/rowset/rowset_iterator.rs

-                        if x != (row_id, array.len()) {
-                            panic!("unmatched rowid from column iterator");
+        for (id, column_ref) in self.column_refs.iter().enumerate() {
+            if has_filter {


I would prefer using if let Some(filter_context) = filter_context, instead of combining .is_some() with unwrap().
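
Roughly what I have in mind, as a sketch: `describe` and the `(String, usize)` context type are hypothetical, the point is only the `if let` shape, which binds the inner values and checks for presence in one step.

```rust
/// Branches on the optional filter context without any `unwrap()`.
fn describe(filter_context: Option<&(String, usize)>) -> String {
    if let Some((expr, column)) = filter_context {
        // `expr` and `column` are only in scope when a filter exists.
        format!("filter `{}` on column {}", expr, column)
    } else {
        "no filter".to_string()
    }
}

fn main() {
    let ctx = ("a > 10".to_string(), 0usize);
    println!("{}", describe(Some(&ctx)));
    println!("{}", describe(None));
}
```

With this shape there is no way to reach the filter branch with a missing context, so the panic path of `unwrap()` disappears.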

skyzh · 2022-02-08T08:35:23Z

src/storage/secondary/rowset/rowset_iterator.rs

-                        }
-                    }
+            // Apply delete vector to filter_bitmap
+            if !self.dvs.is_empty() {


No need to check for emptiness: the for loop simply won't run if dvs is empty.
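
A sketch of the loop without the guard. `DeleteVector` here is a simplified, hypothetical stand-in (a list of deleted row ids), and its `apply_to` only mirrors the call shape seen in this PR, not the real implementation.

```rust
/// Hypothetical stand-in for a delete vector: absolute ids of deleted rows.
struct DeleteVector {
    deleted_row_ids: Vec<u32>,
}

impl DeleteVector {
    /// Clears the visibility bit of every deleted row that falls inside
    /// `[start_row_id, start_row_id + visibility.len())`.
    fn apply_to(&self, visibility: &mut [bool], start_row_id: u32) {
        for &row_id in &self.deleted_row_ids {
            if row_id >= start_row_id {
                let offset = (row_id - start_row_id) as usize;
                if offset < visibility.len() {
                    visibility[offset] = false;
                }
            }
        }
    }
}

fn apply_all(dvs: &[DeleteVector], visibility: &mut [bool], start_row_id: u32) {
    // No `if !dvs.is_empty()` guard: an empty slice yields zero iterations.
    for dv in dvs {
        dv.apply_to(visibility, start_row_id);
    }
}

fn main() {
    let mut visibility = vec![true; 3];
    apply_all(&[], &mut visibility, 10);
    println!("{:?}", visibility); // untouched, the loop body never ran
}
```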

zzl200012 · 2022-02-08T11:17:00Z

Generally LGTM. In fact, we can optimize this further:

Applying the delete vector and applying the filter expression are both a kind of filter scan.

Therefore, we can first generate the delete map.

Then, apply the filter map to the delete map.

And finally, do the filter scan.

This way, if a lot of rows have been deleted, we can leverage the filter scan to avoid scanning deleted rows.

Updated, but I'm not sure whether I've done this the way you expected.

skyzh · 2022-02-08T11:27:02Z

I'll take a look tomorrow, thanks for your contribution!

cc @likg227 might also take a look (if you have time).

likg227 · 2022-02-08T12:54:06Z

LGTM!
It would be better to add some unit tests for this PR.

skyzh

The current approach doesn't seem very clear to me. What about applying the following idea in the scanning process?

First, we determine a fetch_size (the current implementation already does this).
After that, we maintain a visibility variable of type Option<BitVec>. If visibility is None, it means that at the current stage we will scan all rows and won't skip any block.
In the first step, we generate the visibility map from the deletion vectors of the current rowset. If there is no deletion vector, visibility remains None.
In the second step, we use this delete bitmap to do a filter scan on the condition column (e.g., for select * from table where a > 10, we scan column a based on the delete vector) and fold the result into the visibility map. If there is no filter condition, we leave the visibility map unmodified.
Now we have a visibility map: case 1, without a filter condition, it represents the delete vector; case 2, with a filter condition, it represents delete vector + filter bitmap.
Finally, we use this visibility map to scan: case 1, without a filter condition, scan all columns; case 2, with a filter condition, scan all remaining columns.

I think this makes the code more understandable: we always maintain a visibility bitmap and refine it little by little as we scan more and more columns.

If you have any questions, feel free to discuss them in either the #risinglight-english or the #risinglight-chinese channel.

Thanks for your contribution 🥰
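
A rough sketch of the staged flow described above, under simplifying assumptions: `Vec<bool>` replaces `BitVec`, a delete vector is just a list of deleted row ids, columns are `i32` slices, and all three function names are hypothetical. `None` means "every row is visible".

```rust
/// Step 1: build the visibility map from the delete vectors.
/// Stays `None` when there is nothing to delete.
fn visibility_from_deletes(
    fetch_size: usize,
    start_row_id: u32,
    dvs: &[Vec<u32>],
) -> Option<Vec<bool>> {
    if dvs.is_empty() {
        return None;
    }
    let mut visibility = vec![true; fetch_size];
    for dv in dvs {
        for &row_id in dv {
            if row_id >= start_row_id {
                let offset = (row_id - start_row_id) as usize;
                if offset < fetch_size {
                    visibility[offset] = false;
                }
            }
        }
    }
    Some(visibility)
}

/// Step 2: evaluate the predicate on rows that are still visible and
/// fold the result into the visibility map.
fn narrow_with_filter(
    visibility: Option<Vec<bool>>,
    filter_column: &[i32],
    predicate: impl Fn(i32) -> bool,
) -> Option<Vec<bool>> {
    let mut visibility = visibility.unwrap_or_else(|| vec![true; filter_column.len()]);
    for (bit, &value) in visibility.iter_mut().zip(filter_column) {
        // `&&` short-circuits, so deleted rows are never evaluated.
        *bit = *bit && predicate(value);
    }
    Some(visibility)
}

/// Step 3: scan a column under the final visibility map.
fn scan(column: &[i32], visibility: Option<&[bool]>) -> Vec<i32> {
    match visibility {
        None => column.to_vec(),
        Some(vis) => column
            .iter()
            .zip(vis)
            .filter(|&(_, &keep)| keep)
            .map(|(&value, _)| value)
            .collect(),
    }
}

fn main() {
    let a = [5, 20, 30, 7]; // condition column for `where a > 10`
    let b = [1, 2, 3, 4]; // a remaining column
    let vis = visibility_from_deletes(a.len(), 0, &[vec![2]]); // row 2 deleted
    let vis = narrow_with_filter(vis, &a, |v| v > 10);
    println!("{:?}", scan(&b, vis.as_deref()));
}
```

When there is no filter condition, step 2 is skipped entirely and the map from step 1 (possibly `None`) is passed straight to `scan` for every column.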

skyzh · 2022-02-09T05:44:57Z

src/storage/secondary/rowset/rowset_iterator.rs

+                                visi.push(true);
+                            }
+                            for dv in &self.dvs {
+                                dv.apply_to(&mut visi, row_id);


Can we produce the delete vector visibility map before scanning the filter column? Then we may only need to apply the delete vector once :)

I think we are on the same track. The only problem is that I didn't know how to get the start row_id of this batch without scanning one column first, which made some of the code confusing. Here the dv would only be applied once as well, I guess. Maybe we could add an interface get_current_row_id to the column_iterator and call it on the first column to generate the visibility map before doing the scan? After that, I would adjust the following code logic according to your comment, which would make the logic clearer.

Maybe we could add an interface get_current_row_id for the column_iterator

We definitely need a current_row_id, but there are multiple places we can add it.

and call it on the first column

~~We can also record the current_row_id in RowSetIterator. See:~~

risinglight/src/storage/secondary/rowset/rowset_iterator.rs

Lines 37 to 40 in 53a3388

let start_row_id = match seek_pos {
    ColumnSeekPosition::RowId(row_id) => row_id,
    _ => todo!(),
};

Currently, the current_row_id always starts at 0.

You may choose the best place to add the current_row_id variable.

After a second thought, maybe adding it on ColumnIterator is a better way.

Agree, since we already have https://github.com/risinglightdb/risinglight/blob/53a3388ec5/src/storage/secondary/column/concrete_column_iterator.rs#L46, the only work is to encapsulate a method in ColumnIterator

It seems that we already have get_current_row_id, but it is only exposed for tests 🤪

zzl200012 · 2022-02-09T08:46:45Z

Reworked the code just now, PTAL when you have time~ cc @skyzh

skyzh

Apart from the bug, the rest LGTM! Other minor fixes can be done in a separate PR (if you want). Thanks for this great work.

skyzh · 2022-02-09T09:50:46Z

src/storage/secondary/rowset/rowset_iterator.rs

+            // Get the start row id first
+            let start_row_id = {
+                let mut row_id = 0;
+                for (id, column_ref) in self.column_refs.iter().enumerate() {


All column iterators will have the same row_id at any time, so I think it's okay to simply call fetch_current_row_id on the column iterator at position 0.
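
Something along these lines, as a sketch. The `ColumnIterator` trait below is a stripped-down, hypothetical version that only exposes the row id, and `DummyColumn` and `start_row_id` are illustrative names.

```rust
/// Hypothetical, simplified column iterator: only exposes its position.
trait ColumnIterator {
    fn fetch_current_row_id(&self) -> u32;
}

struct DummyColumn {
    current_row_id: u32,
}

impl ColumnIterator for DummyColumn {
    fn fetch_current_row_id(&self) -> u32 {
        self.current_row_id
    }
}

/// All column iterators advance in lockstep, so asking the first one is
/// enough. Returns `None` if there are no columns at all.
fn start_row_id<C: ColumnIterator>(columns: &[C]) -> Option<u32> {
    columns.first().map(|column| column.fetch_current_row_id())
}

fn main() {
    let columns = [
        DummyColumn { current_row_id: 42 },
        DummyColumn { current_row_id: 42 },
    ];
    println!("{:?}", start_row_id(&columns));
}
```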

skyzh · 2022-02-09T09:51:16Z

src/storage/secondary/rowset/rowset_iterator.rs

+            // Initialize visibility map and apply delete vector to it
+            let mut visi = BitVec::new();
+            visi.resize(fetch_size, true);
+            for dv in &self.dvs {


The visi can remain None if self.dvs is empty. We can do this optimization later.

skyzh · 2022-02-09T09:53:06Z

src/storage/secondary/rowset/rowset_iterator.rs

+            };
+
+            let mut filter_bitmap = BitVec::with_capacity(bool_array.len());
+            for i in bool_array.iter() {


.enumerate()

skyzh · 2022-02-09T09:53:16Z

src/storage/secondary/rowset/rowset_iterator.rs

+            let mut filter_bitmap = BitVec::with_capacity(bool_array.len());
+            for i in bool_array.iter() {
+                if let Some(visi) = visibility_map.as_ref() {
+                    if !visi[filter_bitmap.len()] {


(bug) Shouldn't this use the index into bool_array instead of filter_bitmap.len()?
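
A sketch of the `.enumerate()` version. It assumes `bool_array` is a plain `&[bool]` and the visibility map a `&[bool]` slice (the real types are an array of nullable booleans and a `BitVec`), and `build_filter_bitmap` is a hypothetical helper name.

```rust
/// Builds the filter bitmap from the evaluated predicate, masking out
/// rows that the visibility map already hides.
fn build_filter_bitmap(bool_array: &[bool], visibility_map: Option<&[bool]>) -> Vec<bool> {
    let mut filter_bitmap = Vec::with_capacity(bool_array.len());
    for (idx, &matches) in bool_array.iter().enumerate() {
        // Index by the position in `bool_array`. `filter_bitmap.len()` only
        // equals `idx` as long as every iteration pushes exactly one bit.
        let visible = visibility_map.map_or(true, |visi| visi[idx]);
        filter_bitmap.push(visible && matches);
    }
    filter_bitmap
}

fn main() {
    let bool_array = [true, true, false, true];
    let visi = [true, false, true, true];
    println!("{:?}", build_filter_bitmap(&bool_array, Some(&visi[..])));
}
```

Using the loop index keeps the lookup correct even if a later change adds an early `continue` that skips the push.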

skyzh · 2022-02-09T10:01:06Z

... also we can remove the original get_current_row_id function (maybe in another PR?)

zzl200012 · 2022-02-09T10:06:45Z

... also we can remove the original get_current_row_id function (maybe in another PR?)

Alright, I'll open a new PR tomorrow to handle the items above.

refactor(storage): merge next_batch_inner and next_batch_inner_with_f…

ad2f03e

…ilter in rowset_iterator Signed-off-by: asuka <312856403@qq.com>

skyzh requested review from likg227 and skyzh February 8, 2022 07:49

skyzh reviewed Feb 8, 2022

update

9c05830

Signed-off-by: asuka <312856403@qq.com>

skyzh reviewed Feb 9, 2022

rework

7d61cf5

Signed-off-by: asuka <312856403@qq.com>

skyzh reviewed Feb 9, 2022

fix bug

cb88cc6

Signed-off-by: asuka <312856403@qq.com>

skyzh approved these changes Feb 9, 2022

Merge branch 'main' into 0208-1

58f5df9

skyzh enabled auto-merge (squash) February 9, 2022 10:09

Merge branch 'main' into 0208-1

7a5246a

skyzh merged commit 1a026b6 into risinglightdb:main Feb 9, 2022

zzl200012 mentioned this pull request Feb 10, 2022

chore(storage): remove redundant get_current_row_id #438

Merged

zzl200012 deleted the 0208-1 branch February 15, 2022 05:33
