Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New output buffer implementation #5002

Merged
merged 19 commits into from
Jul 7, 2016
Merged

Conversation

dain
Copy link
Contributor

@dain dain commented Apr 12, 2016

No description provided.

@dain
Copy link
Contributor Author

dain commented Apr 14, 2016

Note: Github, in its infinite wisdom, has shuffled the commit order.

@dain
Copy link
Contributor Author

dain commented Apr 22, 2016

@haozhun can you review "Add new PartitionedBuffer" and "Add new BroadcastBuffer"

@dain dain assigned haozhun and unassigned nileema Apr 22, 2016

public static OutputBuffers createInitialEmptyOutputBuffers()
{
return new OutputBuffers(0, false, ImmutableMap.<TaskId, Integer>of());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ImmutableMap.of() will just work

@dain
Copy link
Contributor Author

dain commented Jun 23, 2016

I renamed the classes to make it clear which ones are specific to the old SharedBuffer code and which ones are generic. I also named OutputBuffer implementations to end with "OutputBuffer" for clarity.

  • Rename SharedBufferInfo to OutputBufferInfo
  • Rename SharedBuffer to SharedOutputBuffer
  • Rename SharedBufferMemoryManager to OutputBufferMemoryManager

I reworked the URI and BufferId management logic in the query scheduler. The key insight was that I could force the partition number and the taskId to be the same for partitioned outputs, with some minor changes. With that, I didn't need most of my changes, so the old code is restored:

  • Guarantee task id and output partition are the same
  • Remove unused partition from OutputBufferManager
  • Remove unused partition from RemoteTask
  • Declare out buffers immediately for non-broadcast stages

Finally, when working on the new BroadcastOutputBuffer, I discovered that I could reuse the internal PartitionedOutputBuffer.Partition object in BroadcastOutputBuffer by adding reference counting. I pulled out this class into a ClientBuffer and added tests for the client protocol. I'll squash the first two commits:

  • Extract PartitionedOutputBuffer.Partition into ClientBuffer
  • Add reference counting to ClientBuffer
  • Add new BroadcastOutputBuffer

@haozhun
Copy link
Contributor

haozhun commented Jun 28, 2016

  • Looks good
    • Move SharedBuffer and related classes to new package
    • Add OutputBuffers.createInitialEmptyOutputBuffers
    • Rename SharedBufferInfo to OutputBufferInfo
    • Rename SharedBuffer to SharedOutputBuffer
    • Rename SharedBufferMemoryManager to OutputBufferMemoryManager
    • Fix thread safety issue in BroadcastOutputBufferManager
    • Change TaskId.id to be an int
    • Replace TaskId in output buffers with a generic OutputBufferId
    • Remove unused partition from OutputBufferManager
    • Remove unused partition from RemoteTask
    • Register output buffers before adding splits
    • Add LazyOutputBuffer to delay creation until plan is sent
    • Add buffer type to OutputBuffers to allow for different buffer implementations
    • Extract PartitionedOutputBuffer.Partition into ClientBuffer
  • Extract interface OutputBuffer from SharedBuffer
    • Looks good, except comments below
    • OutputBuffer.setNoMorePages javadoc: ignored
  • Guarantee task id and output partition are the same
    • Looks good
    • I would like to understand callers of SqlStageExecution.scheduleTask/Splits better
  • Declare out buffers immediately for non-broadcast stages
    • Looks good, except minor comments below
    • PartitionedOutputBufferManager.addOutputBuffers
      • Validate that noMoreBuffer does not change from true to false, unless there's a valid use case to do that.
      • Calling withBuffer a bunch of times will allocate a bunch of objects and create the version a lot. I guess these don't really matter. But I would create a map and then pass the map in just to be on the safe side.
  • Move shared buffer not-full notification to a new thread
    • SharedOutputBuffer: duplicate requireNonNull for maxBufferSize and systemMemoryUsageListener
    • Question about OutputBufferMemoryManager:
      • Is there any particular reason updateMemoryUsage submit the job outside the sync block and setNoBlockOnFull submit the job inside the sync block?
  • Add new PartitionedOutputBuffer
    • Why is calling checkFlushComplete necessary/useful in setNoMorePages or the constructor?
  • Add reference counting to ClientBuffer
    • In destroy, pendingRead should be completed with emptyResults(taskInstanceId, sequenceId, true)
    • In PageReference.addReference, checkState(referenceCount.getAndIncrement() == 0, "...")
    • I didn't know about FieldAccessNotGuarded, that's super nice. Let's add GuardedBy to currentSequenceId and destroyed
  • Add new BroadcastOutputBuffer
    • enqueue not thread safe: gap before safeGetBuffersSnapshot can mean duplicate pages.
    • Discussion item: noMoreBuffers: put dereferencePage in finally. Other usages of dereferencePage may also benefit if they get put in a finally block. Should we do that? Or should we care at all?

Note: Please squash "Extract PartitionedOutputBuffer.Partition into ClientBuffer". For these multi-threaded code, I find it easier to reason about correctness when I have the full picture.

@dain
Copy link
Contributor Author

dain commented Jul 6, 2016

* Declare out buffers immediately for non-broadcast stages
  * PartitionedOutputBufferManager.addOutputBuffers
    * Validate that noMoreBuffer does not change from `true` to `false`, unless there's a valid use case to do that.
    * Calling `withBuffer` a bunch of times will allocate a bunch of objects and create the `version` a lot. I guess these don't really matter. But I would create a map and then pass the map in just to be on the safe side.

The noMoreBuffer check is handled by the validate code, and the object allocation is not really a problem when you consider that Map will allocate an object for each entry anyway (also partitions is typically small).

The real issues is with the version number which the checkValidTransition code validates. I just rewrote the code to check the buffers directly and avoided the checkValidTransition code.

* Move shared buffer not-full notification to a new thread
  * Question about `OutputBufferMemoryManager`:
    * Is there any particular reason `updateMemoryUsage` submit the job outside the sync block and `setNoBlockOnFull` submit the job inside the sync block?

No, I simplified the code.

* Add new PartitionedOutputBuffer
  * Why is calling `checkFlushComplete` necessary/useful in `setNoMorePages` or the constructor?

setNoMorePages can free a reader which will see the finished flag. Then there is a race to destroy the buffers. It is simpler to just double check at the end of the method, then try to determine if this is a benign race. As for the constructor, it is likely not needed, but doesn't hurt anything (potentially, you could have zero partitions, but that seems weird).

* Add reference counting to ClientBuffer
  * In `destroy`, `pendingRead` should be completed with `emptyResults(taskInstanceId, sequenceId, true)`

This doesn't matter. There is only one client and the destroy comes from the client. This has the benefit of keeping the API surface area smaller.

@dain
Copy link
Contributor Author

dain commented Jul 6, 2016

I will merge once the current release goes out

@dain dain added the accepted label Jul 6, 2016
@dain dain unassigned haozhun Jul 6, 2016
@dain dain merged commit 800df76 into prestodb:master Jul 7, 2016
@dain dain deleted the new-output-buffers branch July 7, 2016 03:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants