
Refactor exchange interfaces #13968

Merged 6 commits into trinodb:master on Sep 14, 2022

Conversation

arhimondr (Contributor)

Description

See commit messages for more details

Is this change a fix, improvement, new feature, refactoring, or other?

Refactoring

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Core engine

How would you describe this change to a non-technical end user or system administrator?

N/A

Related issues, pull requests, and links

Documentation

(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(X) No release notes entries required.
( ) Release notes entries required with the following suggested text:

@arhimondr (Contributor Author)

On top of #13945

noMoreSplitsTracker.noMoreOperators();
if (noMoreSplitsTracker.isNoMoreSplits()) {
    if (exchangeDataSource != null) {
        exchangeDataSource.noMoreInputs();

Member

This can be called multiple times due to races. Is that a problem?
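
For context, a common way to make this kind of notification safe to deliver more than once is to guard it with an atomic flag so that racing callers collapse into a single state transition. A minimal sketch, assuming a hypothetical data source class rather than the actual Trino implementation:

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a data source whose noMoreInputs() notification is
// idempotent, so concurrent or repeated calls are harmless.
class IdempotentExchangeDataSource
{
    private final AtomicBoolean noMoreInputs = new AtomicBoolean();

    public void noMoreInputs()
    {
        // Only the first caller performs the transition; racing callers become no-ops.
        if (noMoreInputs.compareAndSet(false, true)) {
            // ... stop accepting new inputs, unblock pending readers, etc.
        }
    }
}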


for (DriverFactory driverFactory : localExecutionPlan.getDriverFactories()) {
    Optional<PlanNodeId> sourceId = driverFactory.getSourceId();
    if (sourceId.isPresent() && partitionedSources.contains(sourceId.get())) {
        driverRunnerFactoriesWithSplitLifeCycle.put(sourceId.get(), new DriverSplitRunnerFactory(driverFactory, true));
    }
    else {
        driverRunnerFactoriesWithTaskLifeCycle.add(new DriverSplitRunnerFactory(driverFactory, false));
        DriverSplitRunnerFactory runnerFactory = new DriverSplitRunnerFactory(driverFactory, false);
        sourceId.ifPresent(planNodeId -> driverRunnerFactoriesWithRemoteSource.put(planNodeId, runnerFactory));

Member

I don't think I fully understand this. Does it mean that if sourceId is in partitionedSources, we are not reading from a remote source?

Contributor Author

Yeah, this is the logic today. Basically, "partitionedSources" is what we call table scans. Other sources are remote sources that are run with FIXED distribution.

@@ -249,6 +249,13 @@ public OutputBuffers withNoMoreBufferIds()
return new OutputBuffers(type, version + 1, true, buffers, exchangeSinkInstanceHandle);
}

public OutputBuffers withExchangeSinkInstanceHandle(ExchangeSinkInstanceHandle updatedExchangeSinkInstanceHandle)

Member

withUpdatedExchangeSinkInstanceHandle?

Member

This is not called yet. To be added separately?

Contributor Author

Yeah, let me drop it for now.

@losipiuk (Member) left a comment

Mostly good. Some minor comments, and questions where I do not fully understand.

@arhimondr force-pushed the refactor-exchange-interfaces branch from 5949398 to 1b801c7 on September 6, 2022 at 20:42

@arhimondr (Contributor Author)

On top of #13978

@sopel39 (Member) commented Sep 6, 2022

What's the goal of the refactor? Does it have a chance of introducing regressions/congestion? Are benchmarks required?

Please mind that #13463 should probably land first (as it was started earlier).

@arhimondr (Contributor Author)

@sopel39 This refactor makes the Exchange interfaces more flexible, allowing integration with more advanced exchange implementations.

return (getUtilization() > 0.5) && stateMachine.getState().canAddPages();
// do not grab lock to acquire outputBuffers to avoid delaying TaskStatus response
return OutputBufferStatus.builder(outputBuffers.getVersion())
        .setOverutilized(memoryManager.getUtilization() >= 0.5 || !stateMachine.getState().canAddPages())

Member

This changes the logic; the new version looks OK. Was there a bug? It looks like the fix itself should be a separate commit.

Contributor Author

Good catch. Let me revert it to what it was. Changing the logic wasn't my intention.
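
For readers following along, a minimal sketch of how the two conditions differ, assuming a simplified stand-alone class (the 0.5 threshold and the overall shape come from the snippet above; the class itself is illustrative, not Trino code):

// Illustrative comparison of the original and rewritten overutilization checks.
final class OverutilizedCheck
{
    private OverutilizedCheck() {}

    // Original condition: report overutilization only while the buffer is
    // above 50% utilized AND the state machine still accepts pages.
    static boolean original(double utilization, boolean canAddPages)
    {
        return utilization > 0.5 && canAddPages;
    }

    // Condition from the diff: report overutilization when the buffer is at or
    // above 50% utilized OR the state machine no longer accepts pages.
    static boolean rewritten(double utilization, boolean canAddPages)
    {
        return utilization >= 0.5 || !canAddPages;
    }

    public static void main(String[] args)
    {
        // One case where the two versions disagree: empty buffer, pages no longer accepted.
        System.out.println(original(0.0, false));  // false
        System.out.println(rewritten(0.0, false)); // true
    }
}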

@losipiuk (Member) left a comment

Note errorprone failures

@arhimondr force-pushed the refactor-exchange-interfaces branch from 1b801c7 to 753c034 on September 7, 2022 at 18:05
@arhimondr force-pushed the refactor-exchange-interfaces branch 2 times, most recently from b407f58 to a208556 on September 8, 2022 at 17:59
long readerFileSize = 0;
while (!files.isEmpty()) {
    ExchangeSourceFile file = files.peek();
    if (readerFileSize == 0 || readerFileSize + file.getFileSize() <= maxPageStorageSize + exchangeStorage.getWriteBufferSize()) {

Member

readerFileSize + file.getFileSize() <= maxPageStorageSize + exchangeStorage.getWriteBufferSize()

Why do we need this? I don't get it.

Contributor Author

To parallelize reads. Basically, I don't want to add more files to a reader than necessary, so that the other readers can do the processing in parallel.

Member

I don't quite understand. Shouldn't we let all readers try to poll from the queue, like we used to do?

Contributor Author

I was trying to implement it the way it was (with the pull-based model), but then the flow becomes significantly more complex, as the engine delivers splits through a push model. I looked at the reader implementation and thought that there's little reason to reuse the readers. A new buffer is allocated anyway, and as long as we provide enough files to fill the entire buffer, parallelism shouldn't be impacted. Do you see any potential issues with the new approach?
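
A hedged sketch of that assignment idea, with illustrative names (the SourceFile record and the budget parameter are stand-ins, not the actual Trino types): each reader takes at least one file and then only as many more as fit its buffer budget, leaving the remaining files for other readers to process in parallel.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class ReaderAssignmentSketch
{
    record SourceFile(String name, long fileSize) {}

    static List<SourceFile> assignFiles(Queue<SourceFile> files, long maxBufferSize)
    {
        List<SourceFile> assigned = new ArrayList<>();
        long assignedSize = 0;
        while (!files.isEmpty()) {
            SourceFile file = files.peek();
            // Always take at least one file; otherwise stop before exceeding the buffer budget.
            if (assignedSize == 0 || assignedSize + file.fileSize() <= maxBufferSize) {
                assigned.add(files.poll());
                assignedSize += file.fileSize();
            }
            else {
                break;
            }
        }
        return assigned;
    }

    public static void main(String[] args)
    {
        Queue<SourceFile> files = new ArrayDeque<>(List.of(
                new SourceFile("a", 40), new SourceFile("b", 40), new SourceFile("c", 40)));
        // With a budget of 100 the first reader gets "a" and "b"; "c" stays queued for another reader.
        System.out.println(assignFiles(files, 100));
    }
}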

Comment on lines +591 to +593
catch (TimeoutException e) {
    updateSinkInstanceHandleIfNecessary();
}

Member

Can you explain the logic here?

Contributor Author

When an ExchangeSink is blocked there's a chance that the handle needs to be updated. This waits for some amount of time, and if the ExchangeSink is still not unblocked, we check whether the handle needs to be updated.

Member

When we refer to a handle update, does it mean that we switch to a different buffer node to write to? But we only switch to a different buffer node when the old buffer node is offline or is being drained, right? How does that relate to how long we are blocked?

Contributor Author

It doesn't. The idea is that when an ExchangeSink needs a handle update, it will get blocked as it will no longer be able to accept new data. Do you see any use case where we would want to refresh the handle while the ExchangeSink is still able to accept some data?
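
A hedged sketch of the pattern being discussed, with illustrative method names rather than the exact Trino code: wait on the sink's blocked future with a timeout, and treat the timeout as a cue to check whether the sink instance handle needs refreshing, since a long-blocked sink may be waiting for an updated handle.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SinkHandleRefreshSketch
{
    static void waitForSink(CompletableFuture<Void> sinkBlocked, long timeoutMillis)
            throws InterruptedException, ExecutionException
    {
        try {
            sinkBlocked.get(timeoutMillis, TimeUnit.MILLISECONDS);
        }
        catch (TimeoutException e) {
            // Still blocked after the timeout: the sink may be waiting for an
            // updated ExchangeSinkInstanceHandle, so check for one now.
            updateSinkInstanceHandleIfNecessary();
        }
    }

    private static void updateSinkInstanceHandleIfNecessary()
    {
        // Hypothetical placeholder: fetch an updated handle from the exchange
        // and hand it to the sink if one is available.
    }
}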

Comment on lines +190 to +199
Closer closer = Closer.create();
for (ExchangeStorageReader reader : readers.getAndSet(ImmutableList.of())) {
    closer.register(reader);
}
try {
    closer.close();
}
catch (IOException e) {
    throw new UncheckedIOException(e);
}

Member

Does using Closer make a difference at all?

Contributor Author

It's safer, as it makes sure every object is closed even if one of the close methods throws an exception.
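
To illustrate the point, a minimal sketch contrasting a naive close loop with Guava's Closer (the helper class is illustrative): Closer.close() still closes every registered resource even if an earlier close() throws, and then rethrows the first failure.

import com.google.common.io.Closer;
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

final class CloseAllSketch
{
    // Naive version: if the first reader's close() throws, the remaining readers are never closed.
    static void closeAllNaive(List<? extends Closeable> readers) throws IOException
    {
        for (Closeable reader : readers) {
            reader.close();
        }
    }

    // Closer-based version mirroring the snippet above: every reader is
    // registered, and Closer.close() closes all of them before propagating
    // the first exception it encountered.
    static void closeAllWithCloser(List<? extends Closeable> readers)
    {
        Closer closer = Closer.create();
        for (Closeable reader : readers) {
            closer.register(reader);
        }
        try {
            closer.close();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}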

@arhimondr force-pushed the refactor-exchange-interfaces branch from a208556 to 6907088 on September 9, 2022 at 18:59
Preparation for introduction of ExchangeSinkInstanceHandle refresh.
When ExchangeSinkInstanceHandle can be refreshed it is no longer clear
what instance of the ExchangeSinkInstanceHandle should be passed to the
finish method.
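
A hypothetical illustration of the ambiguity described in this commit message (these interfaces are not the actual Trino SPI): once the instance handle can be refreshed during a sink's lifetime, identifying the finished sink by stable coordinates rather than by a handle instance removes the question of which instance to pass.

// Hypothetical interfaces only; the real SPI shapes differ.
interface ExchangeBeforeSketch
{
    // Ambiguous once the handle can be refreshed mid-flight: should the caller
    // pass the original instance or the most recently refreshed one?
    void sinkFinished(Object exchangeSinkInstanceHandle);
}

interface ExchangeAfterSketch
{
    // Identifying the sink by stable coordinates sidesteps the question.
    void sinkFinished(int taskPartitionId, int taskAttemptId);
}
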
@arhimondr force-pushed the refactor-exchange-interfaces branch from 6907088 to 0accd89 on September 9, 2022 at 19:38

@arhimondr (Contributor Author) left a comment

@linzebing updated

@arhimondr (Contributor Author) commented Sep 14, 2022

Benchmark results:

Overall:

+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+
| base_cpu_time_millis | base_wall_time_millis | test_cpu_time_millis | test_wall_time_millis | cpu_diff | wall_diff |
+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+
|          16931434280 |              80033442 |          16947868026 |              80453499 |  1.00097 |   1.00525 |
+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+

Suites:

+-------------------------------+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+
| suite                         | base_cpu_time_millis | base_wall_time_millis | test_cpu_time_millis | test_wall_time_millis | cpu_diff | wall_diff |
+-------------------------------+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+
| tpcds_sf10000_partitioned     |           2120577324 |              12409639 |           2118528584 |              12529306 |  0.99903 |   1.00964 |
| tpcds_sf10000_partitioned_etl |          11426752912 |              42445241 |          11444948041 |              42784597 |  1.00159 |   1.00800 |
| tpcds_sf100_partitioned       |             25010870 |               3544815 |             24669936 |               3622792 |  0.98637 |   1.02200 |
| tpcds_sf100_partitioned_etl   |            113428024 |               7388384 |            113435708 |               7671437 |  1.00007 |   1.03831 |
| tpch_sf10000_bucketed         |            816035318 |               3238805 |            811307727 |               3249298 |  0.99421 |   1.00324 |
| tpch_sf10000_bucketed_etl     |           2402861533 |               9420345 |           2408826569 |               9087724 |  1.00248 |   0.96469 |
| tpch_sf100_bucketed           |              6338395 |                198803 |              6410055 |                198844 |  1.01131 |   1.00021 |
| tpch_sf100_bucketed_etl       |             20429904 |               1387410 |             19741406 |               1309501 |  0.96630 |   0.94385 |
+-------------------------------+----------------------+-----------------------+----------------------+-----------------------+----------+-----------+

Detailed: https://gist.github.com/arhimondr/7a116dc95ce63a7b62ec82b0df6123c5

No major difference in CPU and wall timings.

@arhimondr merged commit a992729 into trinodb:master on Sep 14, 2022
@arhimondr deleted the refactor-exchange-interfaces branch on September 14, 2022 at 18:14
The github-actions bot added this to the 396 milestone on Sep 14, 2022