Move SMJ join filtered part out of join_output stage. LeftOuter, LeftSemi #12764

comphead · 2024-10-04T16:49:22Z

Which issue does this PR close?

Related to #11555
Related to #12359

Closes #.

Rationale for this change

Move filtered logic out of join_output stage.

The problem is that filtered joins need an extra step: to calculate the final filter mask for a specific left row, the algorithm needs to know the filter result for every right row processed for that left row.

For example, for a LeftOuter join:
select * from t1 left join t2 on (t1.a = t2.a and t1.b < t2.b)

t1:

a	b
1	10

t2:

a	b
1	9
1	11

Currently the Join will output

a	b	a	b
1	10	NULL	NULL
1	10	1	11

which is incorrect: the first, null-joined row shouldn't be output, because a match exists for the given left row.

The problem is that records are calculated depending on the batch size, and output can be triggered at any time once the output size hits the batch size. So with batch size == 1 in this scenario, the first row will be output because output == 1, which is equal to batch_size, but the algorithm at this stage has no idea that another, matching row is coming.

The idea is to move the filtering logic out of the join stage so that it does not depend on the batch size. Instead, it will be called once the left row index switches to the next index, to be sure every right row has been processed for the given left row. Output will still be produced in batches, although not strictly equal to batch_size.
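
The idea can be illustrated with a minimal, self-contained sketch. It is not the DataFusion implementation: the `PendingGroup` type, its methods, and the use of plain integers in place of Arrow batches are assumptions made for illustration, and left rows without any key match are not modeled. Candidate pairs are buffered per left row and resolved only when the left row index changes, so the decision to emit a null-joined row never depends on the batch size.

```rust
/// One candidate pair produced by the join-key match:
/// (left row index, right row value, whether the join filter passed).
type Candidate = (usize, i32, bool);

/// Output row: left row index plus an optional right value (`None` = null-joined).
type OutputRow = (usize, Option<i32>);

/// Hypothetical buffer that holds candidates for the current left row only.
#[derive(Default)]
struct PendingGroup {
    left_idx: Option<usize>,
    rows: Vec<(i32, bool)>,
}

impl PendingGroup {
    /// Resolve the buffered group with LeftOuter semantics.
    fn flush(&mut self) -> Vec<OutputRow> {
        let Some(left) = self.left_idx.take() else {
            return Vec::new();
        };
        let rows = std::mem::take(&mut self.rows);
        let matched: Vec<OutputRow> = rows
            .iter()
            .filter(|(_, passed)| *passed)
            .map(|(right, _)| (left, Some(*right)))
            .collect();
        if matched.is_empty() {
            // No right row passed the filter: emit a single null-joined row.
            vec![(left, None)]
        } else {
            matched
        }
    }

    /// Add a candidate; finished output is returned only when the left index switches.
    fn push(&mut self, (left, right, passed): Candidate) -> Vec<OutputRow> {
        let out = if self.left_idx.is_some() && self.left_idx != Some(left) {
            self.flush()
        } else {
            Vec::new()
        };
        self.left_idx = Some(left);
        self.rows.push((right, passed));
        out
    }
}

/// Run all candidates through the buffer and collect the final output.
fn left_outer_filtered(candidates: &[Candidate]) -> Vec<OutputRow> {
    let mut group = PendingGroup::default();
    let mut out = Vec::new();
    for c in candidates {
        out.extend(group.push(*c));
    }
    // The last group is resolved once the input is exhausted.
    out.extend(group.flush());
    out
}
```

For the example above, passing the two candidates for the single left row (the first failing the filter, the second passing it) yields only the matched row and no null-joined row.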

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

comphead · 2024-10-04T16:50:51Z

@korowa @viirya Please have a look at the prototype with just the Left Outer join moved out of the join output stage. Let me know your thoughts. With this approach I hope we can handle all the other filtered cases like right, semi, anti.

comphead · 2024-10-04T16:50:57Z

@alamb cc

comphead · 2024-10-04T16:52:06Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+    pub streamed_batch_counter: AtomicUsize,
+}
+
+struct JoinedRecordBatches {


we need more information to track the filtered bitmask across incoming batches

Could you please comment on what the fields represent? In particular, how are filter_mask, row_indices and batch_ids interpreted relative to the batches (do they always have the same row count)? What does the batch represent?

comphead · 2024-10-04T16:52:36Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                    seen_true = true;
+                    corrected_mask.append_value(true);
+                } else if seen_true || !filter_mask.value(i) && !last_index_for_row {
+                    corrected_mask.append_null(); // to be ignored and not set to output


NULL means the row should not go to the output

comphead · 2024-10-04T16:52:51Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                } else if seen_true || !filter_mask.value(i) && !last_index_for_row {
+                    corrected_mask.append_null(); // to be ignored and not set to output
+                } else {
+                    corrected_mask.append_value(false); // to be converted to null joined row


false marks rows that will be output as null-joined rows
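
The three mask states from the two hunks above can be sketched without Arrow as follows. This is a simplified illustration, not the code from the PR: `Option<bool>` stands in for a nullable boolean array, `RowId` and `corrected_mask` are hypothetical names, and resetting `seen_true` at the end of each streamed row is an assumption about how groups are separated.

```rust
/// Identifies a streamed (left) row: (batch id, row index within that batch).
type RowId = (usize, u64);

/// True when position `i` is the last candidate for its streamed row.
fn last_index_for_row(i: usize, row_ids: &[RowId]) -> bool {
    i + 1 == row_ids.len() || row_ids[i] != row_ids[i + 1]
}

/// Three-state mask for a filtered LeftOuter join:
/// `Some(true)`  -> emit the joined row,
/// `Some(false)` -> emit the left row joined with nulls,
/// `None`        -> drop the row from the output.
fn corrected_mask(row_ids: &[RowId], filter_mask: &[bool]) -> Vec<Option<bool>> {
    assert_eq!(row_ids.len(), filter_mask.len());
    let mut corrected = Vec::with_capacity(filter_mask.len());
    let mut seen_true = false;
    for i in 0..filter_mask.len() {
        let last = last_index_for_row(i, row_ids);
        if filter_mask[i] {
            seen_true = true;
            corrected.push(Some(true));
        } else if seen_true || !last {
            // Either a match already exists for this left row, or more
            // candidates are still coming: this row must not be output.
            corrected.push(None);
        } else {
            // Last candidate and nothing passed: keep one null-joined row.
            corrected.push(Some(false));
        }
        if last {
            seen_true = false; // assumed reset before the next left row
        }
    }
    corrected
}
```

With two candidates for the same left row and a filter mask of `[false, true]`, the result is `[None, Some(true)]`: the failing candidate is dropped instead of producing a spurious null-joined row.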

comphead · 2024-10-04T16:53:51Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

@@ -733,7 +786,23 @@ impl Stream for SMJStream {
                        match self.current_ordering {
                            Ordering::Less | Ordering::Equal => {
                                if !streamed_exhausted {
+                                    if self.join_type == JoinType::Left


These ifs are part of the WIP. If the approach is okay, we can move all join types under it, and the only if statement left will check whether the join is filtered or not.

The filtering phase happens when the left row index changes, to ensure all right rows have been processed for the given left row.

comphead · 2024-10-04T16:54:44Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                            {
+                                record_batch
+                            } else {
+                                RecordBatch::new_empty(Arc::clone(&self.schema))


This block is magic, but without it the join gets stuck. I believe it has something to do with output sizes. We can improve it later.

Maybe you could change it to continue to avoid outputting an empty batch 🤔

comphead · 2024-10-04T16:57:18Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

                            self.freeze_streamed()?;
                            self.join_metrics.input_batches.add(1);
                            self.join_metrics.input_rows.add(batch.num_rows());
                            self.streamed_batch =
                                StreamedBatch::new(batch, &self.on_streamed);
+                            self.streamed_batch_counter


We need to track batch ids to prevent an edge case like:
Batch 0 value 1 streamed index 0
Batch 1 value 2 streamed index 0

The streamed index is the same in both batches, so without the batch id it's not possible to tell that the actual values are different.

Can you please put that context into the doc comments of streamed_batch_counter?
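
The edge case described in this thread can be reproduced with a small standalone sketch. The function names are hypothetical and plain slices replace the Arrow arrays; the point is only that row boundaries detected from the streamed index alone merge rows coming from different batches.

```rust
/// Count streamed rows by detecting boundaries on the row index alone.
fn count_rows_by_index(row_indices: &[u64]) -> usize {
    let mut count = 0;
    for i in 0..row_indices.len() {
        if i == 0 || row_indices[i] != row_indices[i - 1] {
            count += 1;
        }
    }
    count
}

/// Count streamed rows by detecting boundaries on (batch id, row index).
fn count_rows_by_batch_and_index(batch_ids: &[usize], row_indices: &[u64]) -> usize {
    assert_eq!(batch_ids.len(), row_indices.len());
    let mut count = 0;
    for i in 0..row_indices.len() {
        let is_new_row = i == 0
            || batch_ids[i] != batch_ids[i - 1]
            || row_indices[i] != row_indices[i - 1];
        if is_new_row {
            count += 1;
        }
    }
    count
}
```

For streamed indices `[0, 0]` coming from batches `[0, 1]`, the first function sees one row while the second correctly sees two.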

comphead · 2024-10-04T16:57:37Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

@@ -1330,10 +1433,10 @@ impl SMJStream {

            let columns = if matches!(self.join_type, JoinType::Right) {
                buffered_columns.extend(streamed_columns.clone());
-                buffered_columns
+                buffered_columns.clone()


clones will be removed

do you still plan to remove the clones?

comphead · 2024-10-04T16:58:23Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                    if self.join_type == JoinType::Left {
+                        self.output_record_batches
+                            .batches
+                            .push(output_batch.clone());


For Left WIP we pass the original non filtered batch because filtering will be done later.

comphead · 2024-10-04T16:58:44Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                            compute::filter_record_batch(&output_batch, mask)?;
+                        self.output_record_batches.batches.push(filtered_batch);
+                    }
+                    self.output_record_batches.filter_mask.extend(mask);


Collect all the necessary extra information across batches

comphead · 2024-10-04T16:59:22Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

-                        .0;
+                    if matches!(self.join_type, JoinType::Right | JoinType::Full) {
+                        // The reverse of the selection mask. For the rows not pass join filter above,
+                        // we need to join them (left or right) with null rows for outer joins.


It's not needed: it no longer handles the left join. Hopefully we can remove it.

comphead · 2024-10-04T17:00:16Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

@@ -1520,9 +1639,79 @@ impl SMJStream {
        } else {
            self.output_size -= record_batch.num_rows();
        }
-        self.output_record_batches.clear();
+        if self.filter.is_none() || self.join_type != JoinType::Left {
+            self.output_record_batches.batches.clear();


For filtered joins we don't need to clear the batches, as we need them later. I'll do it through a flag later.

comphead · 2024-10-04T17:01:18Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

        Ok(record_batch)
    }
+
+    fn filter_joined_batch(&mut self) -> Result<RecordBatch> {


Here is the main logic. In fact, it just consolidates the batches and the filtering information, then does the filtering in a way very similar to what we currently do in freeze_streamed.

comphead · 2024-10-04T17:02:17Z

datafusion/core/tests/fuzz_cases/join_fuzz.rs

+}
+
+#[tokio::test]
+async fn test1() {


to be removed

comphead · 2024-10-04T17:02:41Z

datafusion/core/tests/fuzz_cases/join_fuzz.rs

@@ -134,8 +134,103 @@ async fn test_left_join_1k_filtered() {
        JoinType::Left,
        Some(Box::new(col_lt_col_filter)),
    )
-    .run_test(&[JoinTestType::NljHj], false)
+    .run_test(&[JoinTestType::HjSmj], false)


this test passes now

korowa · 2024-10-04T18:16:09Z

@comphead, I was also looking into these two issues, and found that SMJ currently contains some utility logic that is used in other join operators, but reimplements it in its own way. In addition, the data processing could be implemented in a fashion more similar to the other joins -- currently it's kind of hard to track what's going on there.

I'm currently trying to rework it here, and I've got fuzz tests passing (but there are other issues), but it still needs some cleanups, proper working with spills and comments/docs.

comphead · 2024-10-04T19:42:16Z

> @comphead, I was also looking into these two issues, and found that SMJ currently contains some utility logic that is used in other join operators, but reimplements it in its own way. In addition, the data processing could be implemented in a fashion more similar to the other joins -- currently it's kind of hard to track what's going on there.
>
> I'm currently trying to rework it here, and I've got fuzz tests passing (but there are other issues), but it still needs some cleanups, proper working with spills and comments/docs.

Thanks @korowa, the way SMJ is implemented is truly hard to understand; it took me a couple of painful weeks to dig through it.

I feel you propose to go in 2 directions?
To unblock some projects that already rely on DF SMJ, we can fix the current SMJ LeftOuter and other filtered variants with the approach above or similar, and in parallel have your new implementation reviewed and onboarded?

korowa · 2024-10-06T15:17:10Z

> I feel you propose to go into 2 directions?

@comphead sure, we can: they are not mutually exclusive and, likely, the current approach will be completed and delivered faster, as it changes fewer things. I just wanted to say that (ideally, if possible) this fix could potentially be achieved by reusing utility functions that already exist in other join implementations (like adjust_indices_by_join_type).

Regarding this fix -- it seems now there are two places doing almost the same thing (please correct me if I'm wrong)

- filtering batches and correcting filter mask while staging output batches (applying filter and get_filtered_join_mask in freeze_all)
- filtering batches and correcting mask while producing output batches (filter_joined_batch which calls get_corrected_filter_mask)

Also, get_filtered_join_mask is, likely, intended to do exactly the same thing as get_corrected_filter_mask due to its comment line:

/// This return a tuple of:
/// - corrected mask with respect to the join type

So, maybe there is a chance that the current implementation of filtering and aligning output indices according to the join type can be located in a single place for all join types, using the same functions, rather than being spread across two processing stages?

comphead · 2024-10-06T18:32:41Z

Thanks @korowa for the feedback, this is not a final PR, the goal was to discuss the direction, having LeftOuter filtered join as an example.

If we are okay with the example, I'm planning to move all the filtered logic into a new single place and make the code cleaner.

Thanks for the adjust_indices_by_join_type hint, I'll check if we can reuse it after getting the code restructured as mentioned above, most likely in a separate PR to make things manageable.

comphead · 2024-10-08T21:45:55Z

@korowa I'm planning to move all other join variants to the same approach so the filtered logic will be in a single place, test it out and make the PR ready for your review

comphead · 2024-10-16T16:33:02Z

datafusion/core/tests/fuzz_cases/join_fuzz.rs

@@ -229,6 +227,7 @@ async fn test_anti_join_1k() {
 #[tokio::test]
 // flaky for HjSmj case, giving 1 rows difference sometimes
 // https://github.com/apache/datafusion/issues/11555
+#[ignore]


will be enabled when Left Anti has moved out of join partial as well

comphead · 2024-10-16T17:56:38Z

@korowa @viirya @alamb can I please get a review on this final PR moving the filtered SMJ logic out of the join partial phase? It's done for Left/LeftSemi/Inner Join.

I'm planning to move the Right, Full, and Anti joins right after this PR is approved.

Please do not put much attention on disabled tests, as before the filtered join was unstable anyway

alamb · 2024-10-16T18:30:14Z

I will plan to review this either later today or tomorrow

comphead · 2024-10-16T19:16:34Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                        if self.filter.is_some()
+                            && matches!(
+                                self.join_type,
+                                JoinType::Left | JoinType::LeftSemi


There are multiple if statements of this kind only because we have moved just 2 join types so far.

comphead · 2024-10-16T21:01:14Z

> I will plan to review this either later today or tomorrow

Thanks @alamb

alamb

Thank you @comphead -- I think this seems like a nice improvement to me. I left some stylistic comments that I think would help encapsulation, etc.

My only real concern about this PR is the commented out tests. I noticed that you say

> Please do not put much attention on disabled tests, as before the filtered join was unstable anyway

I don't understand what this means. Do you mean that while the tests passed, the actual output of SortMergeJoin was likely not correct, and thus the code with this change will not be any less correct?

I think once that is sorted out this PR would be good to go

alamb · 2024-10-17T18:24:10Z

datafusion/sqllogictest/test_files/sort_merge_join.slt

@@ -100,13 +100,14 @@ Alice 100 Alice 2
 Alice 50 Alice 1
 Alice 50 Alice 2

+# Uncomment when filtered RIGHT moved


does this mean that if we merge this PR the answers are incorrect (aka that we will be introducing a regression for some period of time?)

Speaking of risks, SMJ is experimental and disabled by default. Even now each filtered SMJ variant has a correctness issue depending on data distribution. Although the simple tests passed, the fuzz tests are commented out for filtered variants in main for the reason above.

I understand the concern, but moving all variants to the new approach would make the PR unmanageable to review. I'll try to move the other variants as fast as possible; moreover, the most widely used variants (Inner, Left, LeftSemi) are moved in this PR.

The issues #11555 and #12359 focus on SMJ correctness issues. The fuzz tests are flaky because filtered SMJ has a problem that occurs spontaneously, depending on the incoming data distribution. Answering your question: yes, SMJ has issues, and this PR is meant to make SMJ work correctly and fix the current bugs.

alamb · 2024-10-17T18:26:07Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+    pub streamed_batch_counter: AtomicUsize,
+}
+
+struct JoinedRecordBatches {


Could you please comment on what the fields represent? In particular, how are filter_mask, row_indices and batch_ids interpreted relative to the batches (do they always have the same row count)? What does the batch represent?

alamb · 2024-10-17T18:28:54Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

                            self.freeze_streamed()?;
                            self.join_metrics.input_batches.add(1);
                            self.join_metrics.input_rows.add(batch.num_rows());
                            self.streamed_batch =
                                StreamedBatch::new(batch, &self.on_streamed);
+                            self.streamed_batch_counter


Can you please put that context into the doc comments of streamed_batch_counter?

alamb · 2024-10-17T18:29:21Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

@@ -1330,10 +1433,10 @@ impl SMJStream {

            let columns = if matches!(self.join_type, JoinType::Right) {
                buffered_columns.extend(streamed_columns.clone());
-                buffered_columns
+                buffered_columns.clone()


do you still plan to remove the clones?

alamb · 2024-10-17T18:30:57Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                                    {
+                                        self.freeze_all()?;
+
+                                        if !self.output_record_batches.batches.is_empty()


I wonder if adding some methods on JoinedRecordBatches would make this code easier to read - For example instead of directly accessing the fields, perhaps adding functions would allow

Suggested change

-if !self.output_record_batches.batches.is_empty()
+if !self.output_record_batches.is_empty()

That totally makes sense. I feel it requires adding more methods (add/is_empty/clear), so I'd prefer to do it as a follow-up; this PR is too large imho.
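
A rough sketch of what such encapsulation might look like is given below. The field names follow the thread, but everything else is an assumption: the element types are plain `Vec` placeholders rather than the Arrow types used in DataFusion, the `push`, `is_empty`, and `clear` signatures are hypothetical, and the invariant that the per-row metadata has the same length as the staged batch is assumed, not confirmed by the PR.

```rust
/// Simplified stand-in for the struct discussed in the review.
#[derive(Debug, Default)]
struct JoinedRecordBatches {
    batches: Vec<Vec<i32>>, // placeholder for joined record batches
    filter_mask: Vec<bool>, // one entry per staged row
    row_indices: Vec<u64>,  // streamed row index per staged row
    batch_ids: Vec<usize>,  // streamed batch id per staged row
}

impl JoinedRecordBatches {
    /// Hypothetical helper: stage one batch together with its per-row metadata.
    fn push(&mut self, batch: Vec<i32>, mask: &[bool], rows: &[u64], batch_id: usize) {
        // Assumed invariant: metadata is row-aligned with the batch.
        assert_eq!(batch.len(), mask.len());
        assert_eq!(batch.len(), rows.len());
        self.filter_mask.extend_from_slice(mask);
        self.row_indices.extend_from_slice(rows);
        self.batch_ids
            .extend(std::iter::repeat(batch_id).take(batch.len()));
        self.batches.push(batch);
    }

    /// Callers no longer need to reach into `batches` directly.
    fn is_empty(&self) -> bool {
        self.batches.is_empty()
    }

    /// Reset the batches and all row-aligned metadata together.
    fn clear(&mut self) {
        self.batches.clear();
        self.filter_mask.clear();
        self.row_indices.clear();
        self.batch_ids.clear();
    }
}
```

Keeping the mutation in `push` and `clear` means the four collections cannot drift out of sync, which is the main benefit of hiding the fields behind methods.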

alamb · 2024-10-17T18:31:53Z

datafusion/physical-plan/src/joins/sort_merge_join.rs

+                            {
+                                record_batch
+                            } else {
+                                RecordBatch::new_empty(Arc::clone(&self.schema))


Maybe you could change it to continue to avoid outputting an empty batch 🤔

comphead · 2024-10-17T21:22:53Z

Thanks @alamb for the feedback I'll address the comments asap

alamb

This seems like an improvement to me - thank you @comphead

Let's get this in and keep iterating

github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Oct 4, 2024

comphead requested review from viirya, korowa and alamb October 4, 2024 17:02

comphead changed the title ~~WIP: move filtered join out of join_output stage. LeftOuter experiment~~ WIP: move SMJ join filtered part out of join_output stage. LeftOuter experiment Oct 7, 2024

comphead force-pushed the dev branch from 2f880f2 to 14599fb Compare October 16, 2024 16:31

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 16, 2024

comphead changed the title ~~WIP: move SMJ join filtered part out of join_output stage. LeftOuter experiment~~ Move SMJ join filtered part out of join_output stage. LeftOuter, LeftSemi Oct 16, 2024

alamb reviewed Oct 17, 2024

comphead added 8 commits October 17, 2024 17:13

WIP: move filtered join out of join_output stage

a5ac189

WIP: move filtered join out of join_output stage

9d6342a

WIP: move filtered join out of join_output stage

3d9978f

cleanup

42e664c

cleanup

ea80038

Move Left/LeftAnti filtered SMJ join out of join partial stage

227ab72

Move Left/LeftAnti filtered SMJ join out of join partial stage

cad91c2

Address comments

3986741

comphead force-pushed the dev branch from b8e87ed to 3986741 Compare October 18, 2024 00:13

comphead requested a review from alamb October 18, 2024 00:36

alamb approved these changes Oct 18, 2024

alamb merged commit 3405234 into apache:main Oct 18, 2024
24 checks passed

alamb mentioned this pull request Oct 21, 2024

Oct 21, 2024: This week in DataFusion #13035

Closed

4 tasks

This was referenced Oct 22, 2024

Move SMJ filtered Right outer join out of join_partial phase comphead/arrow-datafusion#310

Closed

Move filtered SMJ right join out of join_partial phase #13053

Merged

Move filtered SMJ Left Anti filtered join out of join_partial phase #13111

Merged
