Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust Operators to be Pausable #13694

Merged
merged 3 commits into from
Jan 24, 2023
Merged

Conversation

imply-cheddar
Copy link
Contributor

@imply-cheddar imply-cheddar commented Jan 19, 2023

This enables "merge" style operations that combine multiple streams.

This change includes a naive implementation of one such merge operator just to provide concrete evidence that the refactoring is effective.

The primary intent of the change can be seen by looking at the Operator interface. The change amounts to the
addition of a Signal enumeration that allows the Receiver to signal back to the Operator that it needs to pause
things. This then returns control to the caller to do something else (like call another Operator). The vast majority of
operators should never have to deal with this, but some that do merges will.

This interface change has made it possible to implement the Yielder interface in terms of an Operator again, which is a proof-point that this works. On top of that, there is a concrete implementation of such a merge operator in SortedInnerJoinOperator, which implements a multi-way inner join across operators. The class is almost entirely business logic for conducting the join rather than interactions with the Operator interface, so it is relatively large, but the intent was to prove that it is possible to create a meaningful "merging" operator on top of the interface changes, which it achieves. I would recommend starting with the SortedInnerJoinOperatorTest when reviewing just because that will show the intended usage of the operator.

Note, however, that this operator was created merely to exercise the interface change, as such it is not wired up into
planning no is there a query path that can exercise it yet. Now that I am certain the interface can handle all of the
needs of the data processing pipeline, I intend to go back to fleshing out the full test suite for window functions
and then can come back to this.

In terms of validation of the interface, there is likely one more validation to conduct, though I am fairly certain that it will be relatively simple and leave as a future exercise: the implementation of a FrameProcessor based on Operator.

This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@@ -72,13 +85,14 @@
@Override
public int numRows()
{
return pointers.length;
return end - start;

Check failure

Code scanning / CodeQL

User-controlled data in arithmetic expression

This arithmetic expression depends on a [user-provided value](1), potentially causing an underflow. This arithmetic expression depends on a [user-provided value](2), potentially causing an underflow. This arithmetic expression depends on a [user-provided value](3), potentially causing an underflow. This arithmetic expression depends on a [user-provided value](4), potentially causing an underflow. This arithmetic expression depends on a [user-provided value](5), potentially causing an underflow.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a validation that this is positive, so this check should be able to be ignored.

This enables "merge" style operations that
combine multiple streams.

This change includes a naive implementation
of one such merge operator just to provide
concrete evidence that the refactoring is
effective.
Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did not review join operator as anything other than a proof of concept for the pattern (since missing some bits if viewed from a 'real' perspective like conditions other than equality, etc)

no blockers on my end and this code is all pretty well isolated and behind flags so going to go ahead and approve

@@ -126,4 +74,122 @@ public <T> T as(Class<? extends T> clazz)
}
return null;
}

private class MyColumnAccessor implements BinarySearchableAccessor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this only be a binary searchable accessor if the column is sorted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generically speaking, yes it would be best to validate that first before returning the thing.

As a confounding factor, even if the whole column isn't sorted, it could be sorted in the context of another column. I.e. if a set of rows are sorted by (col1, col2), then col2 is not actually sorted if you look at the data, but when you access it in conjunction with col1, it is. I agree that we should do more to figure this sort of thing out and fail if it's violated, but, for now, the code is counting on the query planning to do things correctly


FindResult findString(int startIndex, int endIndex, String val);

FindResult findComplex(int startIndex, int endIndex, Object val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be findObject instead of findComplex, that is if it should also handle other object types such as ARRAY, ARRAY, ARRAY<ARRAY<...>> etc? Or do you imagine array types will have some other method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine the Array types will have their own methods.


package org.apache.druid.query.rowsandcols.util;

public class FindResult
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know this makes stuff nicer, but wondering is this maybe expensive if we need to find a lot of things compared to just dealing in int?

Just thinking out loud, probably don't need to worry about it right now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"dealing in int" would actually mean dealing in int[] because I am often returning a start/end range. Once we are dealing in int[], we have the overhead of a reference to an object anyway, so this seemed okay.

If we really wanted to fix this, we'd likely need to make the finder thingie itself stateful and have getters on that. That would probably be better, but can be an activity for a later day.

@cheddar cheddar merged commit 706b8a0 into apache:master Jan 24, 2023
@imply-cheddar imply-cheddar deleted the pausable-operators branch January 24, 2023 04:52
abhagraw pushed a commit to abhagraw/druid that referenced this pull request Feb 8, 2023
* Adjust Operators to be Pausable

This enables "merge" style operations that
combine multiple streams.

This change includes a naive implementation
of one such merge operator just to provide
concrete evidence that the refactoring is
effective.
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants