Adjust Operators to be Pausable #13694

imply-cheddar · 2023-01-19T01:15:53Z

This enables "merge" style operations that combine multiple streams.

This change includes a naive implementation of one such merge operator just to provide concrete evidence that the refactoring is effective.

The primary intent of the change can be seen by looking at the Operator interface. The change amounts to the
addition of a Signal enumeration that allows the Receiver to signal back to the Operator that it needs to pause
things. This then returns control to the caller to do something else (like call another Operator). The vast majority of
operators should never have to deal with this, but some that do merges will.
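To make the pause/resume idea concrete, here is a minimal, self-contained sketch. It is not the interface from this PR: the `run` method, the `Integer` continuation token, the generic batch type, and the `ListOperator` and `interleave` helpers are all assumptions made for illustration. Only the names `Operator`, `Receiver`, and `Signal` come from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PausableOperatorSketch
{
  /** What the receiver tells the operator after each pushed batch. */
  enum Signal { GO, PAUSE, STOP }

  interface Receiver<T>
  {
    Signal push(T batch);

    void completed();
  }

  /** Hypothetical simplified operator: returns a continuation token, or null once finished. */
  interface Operator<T>
  {
    Integer run(Integer continuation, Receiver<T> receiver);
  }

  /** Toy operator that pushes the batches of a list, resuming from the continuation index. */
  static class ListOperator<T> implements Operator<T>
  {
    private final List<T> batches;

    ListOperator(List<T> batches)
    {
      this.batches = batches;
    }

    @Override
    public Integer run(Integer continuation, Receiver<T> receiver)
    {
      int i = continuation == null ? 0 : continuation;
      while (i < batches.size()) {
        Signal signal = receiver.push(batches.get(i++));
        if (signal == Signal.PAUSE) {
          return i; // hand control back to the caller, remembering where to resume
        }
        if (signal == Signal.STOP) {
          break;
        }
      }
      receiver.completed();
      return null;
    }
  }

  /** Takes one batch at a time from each input in turn by pausing after every push. */
  static List<String> interleave(List<Operator<String>> inputs)
  {
    final List<String> out = new ArrayList<>();
    final boolean[] finished = new boolean[inputs.size()];
    Integer[] continuations = new Integer[inputs.size()];
    int remaining = inputs.size();
    while (remaining > 0) {
      for (int i = 0; i < inputs.size(); i++) {
        if (finished[i]) {
          continue;
        }
        final int idx = i;
        continuations[i] = inputs.get(i).run(continuations[i], new Receiver<String>()
        {
          @Override
          public Signal push(String batch)
          {
            out.add(batch);
            return Signal.PAUSE;
          }

          @Override
          public void completed()
          {
            finished[idx] = true;
          }
        });
        if (finished[i]) {
          remaining--;
        }
      }
    }
    return out;
  }

  static List<String> demo()
  {
    List<Operator<String>> inputs = new ArrayList<>();
    inputs.add(new ListOperator<String>(Arrays.asList("a1", "a2")));
    inputs.add(new ListOperator<String>(Arrays.asList("b1")));
    return interleave(inputs);
  }

  public static void main(String[] args)
  {
    System.out.println(demo()); // prints [a1, b1, a2]
  }
}
```

The caller in `interleave` plays the role of a merging operator: because each input returns on `PAUSE`, the caller is free to pull from a different input before resuming the first one.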

This interface change has made it possible to implement the Yielder interface in terms of an Operator again, which is a proof-point that this works. On top of that, there is a concrete implementation of such a merge operator in SortedInnerJoinOperator, which implements a multi-way inner join across operators. The class is almost entirely business logic for conducting the join rather than interactions with the Operator interface, so it is relatively large, but the intent was to prove that it is possible to create a meaningful "merging" operator on top of the interface changes, which it achieves. I would recommend starting with the SortedInnerJoinOperatorTest when reviewing just because that will show the intended usage of the operator.
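For readers unfamiliar with merge-style joins, the core of a sorted inner join can be sketched as follows. This is a two-way join over plain sorted `int` key arrays, written only to illustrate the technique; it is not the code of SortedInnerJoinOperator, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SortedInnerJoinSketch
{
  /**
   * Joins two key arrays that are each sorted ascending.
   * Returns {leftRow, rightRow} index pairs for every pair of rows with equal keys.
   */
  static List<int[]> join(int[] left, int[] right)
  {
    List<int[]> out = new ArrayList<>();
    int l = 0;
    int r = 0;
    while (l < left.length && r < right.length) {
      if (left[l] < right[r]) {
        l++; // left key is too small, it can never match later right keys
      } else if (left[l] > right[r]) {
        r++;
      } else {
        int key = left[l];
        // Find the run of rows sharing this key on each side.
        int lEnd = l;
        while (lEnd < left.length && left[lEnd] == key) {
          lEnd++;
        }
        int rEnd = r;
        while (rEnd < right.length && right[rEnd] == key) {
          rEnd++;
        }
        // Emit the cross product of the two runs.
        for (int i = l; i < lEnd; i++) {
          for (int j = r; j < rEnd; j++) {
            out.add(new int[]{i, j});
          }
        }
        l = lEnd;
        r = rEnd;
      }
    }
    return out;
  }

  public static void main(String[] args)
  {
    for (int[] pair : join(new int[]{1, 2, 2, 5}, new int[]{2, 3, 5, 5})) {
      System.out.println(pair[0] + " <-> " + pair[1]);
    }
  }
}
```

In a streaming setting the two arrays arrive in pieces from different upstream operators, which is why the join needs to be able to pause one input while it advances the other.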

Note, however, that this operator was created merely to exercise the interface change. As such, it is not wired up into
planning, nor is there a query path that can exercise it yet. Now that I am certain the interface can handle all of the
needs of the data processing pipeline, I intend to go back to fleshing out the full test suite for window functions
and then come back to this.

In terms of validation of the interface, there is likely one more validation to conduct, though I am fairly certain that it will be relatively simple and can be left as a future exercise: the implementation of a FrameProcessor based on Operator.

This PR has:

- been self-reviewed.
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

imply-cheddar · 2023-01-19T03:09:47Z

processing/src/main/java/org/apache/druid/query/rowsandcols/RearrangedRowsAndColumns.java

@@ -72,13 +85,14 @@
  @Override
  public int numRows()
  {
-    return pointers.length;
+    return end - start;


There's a validation that this is positive, so this check should be able to be ignored.

processing/src/main/java/org/apache/druid/query/operator/join/SortedInnerJoinOperator.java

clintropolis

did not review join operator as anything other than a proof of concept for the pattern (since missing some bits if viewed from a 'real' perspective like conditions other than equality, etc)

no blockers on my end and this code is all pretty well isolated and behind flags so going to go ahead and approve

clintropolis · 2023-01-23T23:40:50Z

processing/src/main/java/org/apache/druid/query/rowsandcols/column/DoubleArrayColumn.java

@@ -126,4 +74,122 @@ public <T> T as(Class<? extends T> clazz)
    }
    return null;
  }
+
+  private class MyColumnAccessor implements BinarySearchableAccessor


should this only be a binary searchable accessor if the column is sorted?

Generically speaking, yes it would be best to validate that first before returning the thing.

As a confounding factor, even if the whole column isn't sorted, it could be sorted in the context of another column. I.e. if a set of rows is sorted by (col1, col2), then col2 is not actually sorted if you look at the data on its own, but when you access it in conjunction with col1, it is. I agree that we should do more to figure this sort of thing out and fail if it's violated, but, for now, the code is counting on the query planning to do things correctly.
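A small sketch of the (col1, col2) case described above: col2 is not globally sorted, but it is sorted inside the row range that shares a single col1 value, so a binary search restricted to that range is still valid. The `findRange` helper and the `long[]` columns are hypothetical and are not the BinarySearchableAccessor code from this PR.

```java
class RangeSearchSketch
{
  /** First index in [from, to) whose value is >= val, or `to` if there is none. */
  static int lowerBound(long[] col, int from, int to, long val)
  {
    while (from < to) {
      int mid = (from + to) >>> 1;
      if (col[mid] < val) {
        from = mid + 1;
      } else {
        to = mid;
      }
    }
    return from;
  }

  /** First index in [from, to) whose value is > val, or `to` if there is none. */
  static int upperBound(long[] col, int from, int to, long val)
  {
    while (from < to) {
      int mid = (from + to) >>> 1;
      if (col[mid] <= val) {
        from = mid + 1;
      } else {
        to = mid;
      }
    }
    return from;
  }

  /** Returns {start, end} (end exclusive) of the rows in [from, to) equal to val, or null if absent. */
  static int[] findRange(long[] col, int from, int to, long val)
  {
    int start = lowerBound(col, from, to, val);
    if (start == to || col[start] != val) {
      return null;
    }
    return new int[]{start, upperBound(col, start, to, val)};
  }

  public static void main(String[] args)
  {
    // Rows sorted by (col1, col2); col2 alone is NOT sorted (30 is followed by 5).
    long[] col1 = {1, 1, 1, 2, 2, 2};
    long[] col2 = {10, 20, 30, 5, 20, 40};

    int[] outer = findRange(col1, 0, col1.length, 2);  // rows where col1 == 2
    int[] inner = findRange(col2, outer[0], outer[1], 20); // col2 == 20 within those rows
    System.out.println(outer[0] + ".." + outer[1] + " then " + inner[0] + ".." + inner[1]);
  }
}
```

Searching col2 over the full range `[0, 6)` would violate the binary search precondition, which is the failure mode the review comment is worried about.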

clintropolis · 2023-01-23T23:43:22Z

...essing/src/main/java/org/apache/druid/query/rowsandcols/column/BinarySearchableAccessor.java

+
+  FindResult findString(int startIndex, int endIndex, String val);
+
+  FindResult findComplex(int startIndex, int endIndex, Object val);


should this be findObject instead of findComplex, that is if it should also handle other object types such as ARRAY, ARRAY, ARRAY<ARRAY<...>> etc? Or do you imagine array types will have some other method?

I imagine the Array types will have their own methods.

clintropolis · 2023-01-23T23:53:31Z

processing/src/main/java/org/apache/druid/query/rowsandcols/util/FindResult.java

+
+package org.apache.druid.query.rowsandcols.util;
+
+public class FindResult


i know this makes stuff nicer, but wondering is this maybe expensive if we need to find a lot of things compared to just dealing in int?

Just thinking out loud, probably don't need to worry about it right now

"dealing in int" would actually mean dealing in int[] because I am often returning a start/end range. Once we are dealing in int[], we have the overhead of a reference to an object anyway, so this seemed okay.

If we really wanted to fix this, we'd likely need to make the finder thingie itself stateful and have getters on that. That would probably be better, but can be an activity for a later day.
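One possible shape for the stateful variant mentioned above, purely as a sketch: the finder keeps the result of the last search in primitive fields and exposes getters, so no result object is allocated per lookup. The class name, method names, and the `long[]` backing column are all hypothetical.

```java
class StatefulFinderSketch
{
  private final long[] col;
  private boolean found;
  private int start;
  private int end;

  StatefulFinderSketch(long[] col)
  {
    this.col = col;
  }

  /**
   * Searches [from, to), which must be sorted ascending, for val.
   * On a hit, start/end (end exclusive) delimit the matching rows.
   * On a miss, start == end == the index where val would be inserted.
   */
  boolean find(int from, int to, long val)
  {
    int lo = from;
    int hi = to;
    while (lo < hi) { // first index with col[i] >= val
      int mid = (lo + hi) >>> 1;
      if (col[mid] < val) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    start = lo;
    found = lo < to && col[lo] == val;
    if (!found) {
      end = lo;
      return false;
    }
    hi = to;
    while (lo < hi) { // first index with col[i] > val
      int mid = (lo + hi) >>> 1;
      if (col[mid] <= val) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    end = lo;
    return true;
  }

  boolean wasFound()
  {
    return found;
  }

  int getStart()
  {
    return start;
  }

  int getEnd()
  {
    return end;
  }

  public static void main(String[] args)
  {
    StatefulFinderSketch finder = new StatefulFinderSketch(new long[]{1, 3, 3, 7});
    System.out.println(finder.find(0, 4, 3) + " " + finder.getStart() + ".." + finder.getEnd());
  }
}
```

The trade-off is that the finder is no longer safe to share between threads, and callers must read the getters before issuing the next `find`.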

github-advanced-security bot found potential problems Jan 19, 2023

imply-cheddar force-pushed the pausable-operators branch from 5d17add to c86941a Compare January 19, 2023 03:17

Adjust Operators to be Pausable

01047ca

This enables "merge" style operations that combine multiple streams. This change includes a naive implementation of one such merge operator just to provide concrete evidence that the refactoring is effective.

imply-cheddar force-pushed the pausable-operators branch from c86941a to 01047ca Compare January 19, 2023 03:22

imply-cheddar added 2 commits January 20, 2023 11:50

Test coverage

f11a26a

Checkstyle

6cfa934

clintropolis approved these changes Jan 24, 2023

cheddar merged commit 706b8a0 into apache:master Jan 24, 2023

imply-cheddar deleted the pausable-operators branch January 24, 2023 04:52

clintropolis added the Area - Querying label Feb 2, 2023

clintropolis added this to the 26.0 milestone Apr 10, 2023

techdocsmith mentioned this pull request Apr 12, 2023

[DRAFT] 26.0.0 release notes #14064

Closed
