
Implement event batching in the miniframework #143

Open
wants to merge 9 commits into master

Conversation

fwyzard
Contributor

@fwyzard fwyzard commented Dec 3, 2020

Introduce a new way of passing information from acquire to produce:

  • add a template type parameter to the EDProducerExternalWork base class, and make it available to inherited classes as AsyncState;
  • acquire should use an AsyncState object to hold all the information that needs to be passed to produce: acquire returns an AsyncState object that is later passed to produce; it is also marked as a const method, to guarantee that the information is not passed through some internal state;
  • produce takes an extra AsyncState parameter to receive the state information from acquire;
  • if a Producer does not need to pass any state from acquire to produce, it can use a void or empty template parameter; in this case acquire returns void and produce does not take any new parameters (as before); acquire is still marked as const. A sketch of the resulting interface follows this list.
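
As a rough usage sketch (the module and state names are hypothetical, the constructor and product registration are omitted, and the signatures follow the description above rather than the final code):

#include <vector>

#include "Framework/EDProducer.h"  // assumed header providing edm::EDProducerExternalWork

struct MyState {  // hypothetical per-event state
  std::vector<float> hostBuffer;
};

class MyWorker : public edm::EDProducerExternalWork<MyState> {
public:
  // acquire() is const and returns the state that is later handed to produce()
  AsyncState acquire(edm::Event const& event,
                     edm::EventSetup const& eventSetup,
                     edm::WaitingTaskWithArenaHolder holder) const override {
    AsyncState state;
    // ... launch the asynchronous work, fill state.hostBuffer, release the holder ...
    return state;
  }

  // produce() receives the state returned by acquire() for the same event
  void produce(edm::Event& event, edm::EventSetup const& eventSetup, AsyncState& state) override {
    // ... build the product from state.hostBuffer and put it into the event ...
  }
};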

This change is propaedeutic to the event batching support, but can be reviewed and adopted independently of it.

Add a doneWaiting overload that does not take an exception pointer argument

This is just a simplification for the cases where doneWaiting(...) was being called with an empty exception pointer.

This change is independent of the event batching support, and can be reviewed and adopted separately from it.
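
A minimal sketch of what this overload amounts to, assuming the existing doneWaiting(std::exception_ptr) member of WaitingTaskWithArenaHolder (the rest of the class is elided):

#include <exception>

class WaitingTaskWithArenaHolder {
public:
  void doneWaiting(std::exception_ptr excptr);               // existing: signal completion, possibly with an error
  void doneWaiting() { doneWaiting(std::exception_ptr{}); }  // new overload: completion without an exception
  // ... rest of the class unchanged ...
};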

Implement basic support for event batching, trying to have minimal impact on the existing modules:

  • the Source is responsible for reading (up to) N events at a time;
  • the Events are stored in an EventBatch and passed to each Worker as an EventRange (for produce()) or a ConstEventRange (for acquire()); the (Const)EventRange class allows (const) iteration and a (const) range-based loop over the events, without any possibility of adding or dropping events;
  • the Worker passes the events to the various EDProducer-like base classes in the same way;
  • the existing EDProducer and EDProducerExternalWork base classes loop over the events in the range, and call acquire and produce one event at a time; modules that derive from them can be used without any changes;
  • introduce two new EDBatchingProducer and EDBatchingProducerExternalWork base classes that pass the range of Events to the module's acquire and produce methods; the module's implementation can then loop over the events, process them in parallel, etc. (a sketch follows this list).
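
A sketch of a batching module (the module name is hypothetical, constructor and product registration are omitted, and the produce() signature follows the description above):

class BatchedDigiProducer : public edm::EDBatchingProducer {  // hypothetical module
public:
  void produce(edm::EventRange events, edm::EventSetup const& eventSetup) override {
    // the module sees the whole batch at once, and is free to process the
    // events one by one, in parallel, or with a single batched operation
    for (edm::Event& event : events) {
      // ... process this event and put its products ...
    }
  }
};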

The framework tests have been reorganised and extended to test all four EDProducer-like base classes:

  • EDProducer;
  • EDProducerExternalWork;
  • EDBatchingProducer;
  • EDBatchingProducerExternalWork.

Notes

About edm::EventRange and edm::ConstEventRange

The current edm::EventRange implementation is suitable only for the case where all modules process all events; once the possibility of filtering events or re-aggregating the ranges is considered, a different implementation will be needed.
The interface can mostly stay unchanged, but these objects might better be passed by reference rather than by value.

Alternative approaches for acquire()

The approach used by src/fwtest/Framework/EDProducer.h requires that the AsyncState type be moveable or copyable.

A different approach is used by src/cuda/Framework/EDProducer.h: instead of acquire returning a new AsyncState object by value, the AsyncState object is default-constructed by the EDProducerExternalWork base class and passed to acquire by AsyncState &, making the interface closer to that of produce().
This waives the requirement that AsyncState is moveable or copyable, but adds the requirement that AsyncState be default-constructible.

A further possibility would be for the base class to only hold an std::unique_ptr<AsyncState> or std::optional<AsyncState> and pass it by reference to acquire().
This would also waive the requirement that AsyncState is default-constructible; it would be the responsibility of the client code to construct an object during acquire() and assign it to the unique_ptr or optional passed in.
Another downside is that the interfaces of acquire() and produce() would diverge again.
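
For a concrete comparison, the three variants differ only in the acquire() signature (a sketch using the type names from this PR; this is not the actual base class):

struct AcquireVariants {  // illustration only, assuming the framework's Event, EventSetup, etc.
  // 1. src/fwtest/Framework/EDProducer.h: return the state by value;
  //    requires AsyncState to be moveable or copyable.
  AsyncState acquire(Event const& event, EventSetup const& eventSetup,
                     WaitingTaskWithArenaHolder holder) const;

  // 2. src/cuda/Framework/EDProducer.h: the base class default-constructs the
  //    state and passes it by reference; requires AsyncState to be default-constructible.
  void acquire(Event const& event, EventSetup const& eventSetup,
               WaitingTaskWithArenaHolder holder, AsyncState& state) const;

  // 3. Possible alternative (not implemented): the base class holds an empty
  //    std::optional<AsyncState> that the module fills in place; AsyncState
  //    needs neither copy/move nor default construction.
  void acquire(Event const& event, EventSetup const& eventSetup,
               WaitingTaskWithArenaHolder holder, std::optional<AsyncState>& state) const;
};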

@fwyzard fwyzard requested a review from makortel December 3, 2020 10:19
@fwyzard
Contributor Author

fwyzard commented Dec 4, 2020

OK, this looks usable now.

The next step will be to try and make the CUDA implementation use it ...

@makortel
Collaborator

makortel commented Dec 4, 2020

The next step will be to try and make the CUDA implementation use it ...

It will take me some time to digest this. Organizationally I'd prefer if this version of fwtest were made into a separate program (it's kind of a branching point), because I'm expecting eventually to have multiple attempts at batching, and it could be useful to have many/all of them around. I'll anyway go through this PR as it is.

@fwyzard
Contributor Author

fwyzard commented Dec 4, 2020

Organizationally I'd prefer if this version of fwtest would be made as a separate program

Sure. Making the PR on top of the existing fwtest makes it easier to see all the changes, but if/when we decide to merge it, it can go into a different directory.

@fwyzard
Contributor Author

fwyzard commented Dec 5, 2020

I'm now looking at how to apply this to the cuda code base.

The first problem I ran into is that the AsyncState I need to define for some external work producers is not copyable or movable, because it includes a cms::cuda::ContextState.

Just to show how much I didn't know about C++:

  • even when return value optimisation is mandated by the standard, a non-deleted move or copy constructor is still required
  • std::vector<T>::resize() and std::vector<T>::reserve() require that T is copyable or movable (even when no reallocation is performed)

So I'm using a third approach:

  • the AsyncStates are held in a std::optional<AsyncState> or a std::unique_ptr<AsyncState[]> (std::vector<std::optional<AsyncState>> could also work)
  • they are recreated for every event or batch of events
  • they are passed by AsyncState & both to acquire() and produce()
  • they are reset after each event or batch of events

With this approach the only requirement on AsyncState is that it can be default constructed; I think this should be fine, since using a data member to hold some state between acquire() and produce() probably has the same requirement.
If this requirement is also a problem, we could pass an empty std::optional<AsyncState> & to acquire() and let the client code construct a value in place.
This would likely make the acquire() and produce() interfaces diverge, though.

@fwyzard
Contributor Author

fwyzard commented Dec 6, 2020

The previous implementation incurred a huge performance penalty, because the AsyncState variables were now constructed and destructed for every event.

This avoids it:

diff --git a/src/cuda/Framework/EDProducer.h b/src/cuda/Framework/EDProducer.h
index 17e7e1be1bdb..76e39e93bf62 100644
--- a/src/cuda/Framework/EDProducer.h
+++ b/src/cuda/Framework/EDProducer.h
@@ -38,27 +38,25 @@ namespace edm {
     bool hasAcquire() const { return true; }
 
     void doAcquire(Event const& event, EventSetup const& eventSetup, WaitingTaskWithArenaHolder holder) {
-      state_.reset();
-      state_.emplace();
-      acquire(event, eventSetup, std::move(holder), *state_);
+      acquire(event, eventSetup, std::move(holder), state_);
     }
 
     void doProduce(Event& event, EventSetup const& eventSetup) {
-      produce(event, eventSetup, *state_);
-      state_.reset();
+      produce(event, eventSetup, state_);
     }
 
     virtual void acquire(Event const& event,
                          EventSetup const& eventSetup,
                          WaitingTaskWithArenaHolder holder,
                          AsyncState& state) const = 0;
+
     virtual void produce(Event& event, EventSetup const& eventSetup, AsyncState& state) = 0;
 
     void doEndJob() { endJob(); }
     virtual void endJob() {}
 
   private:
-    std::optional<AsyncState> state_;
+    AsyncState state_;
   };
 
   template <>

I've propagated this change to the branch and pull request, though I'm still not sure that's the best approach.

Add a template type parameter to the edm::EDProducerExternalWork class.

The acquire() method is now const, and it communicates its asynchronous
state to the produce() method via an object of this type, rather
than using data members.

Implementations that do not need to pass any state between acquire() and
produce() can use "void" or leave the template parameter empty.

Reorganise the existing EDProducer tests, and add tests for
  - EDBatchingProducer
  - EDBatchingProducerExternalWork

Introduce three new edm classes:

  - edm::EventBatch
    implements the ownership of a batch of events, implemented as an
    std::vector<edm::Event>; used by the Source to provide events to the
    StreamSchedule.

  - edm::EventRange
    implements non-owning, non-const access to a batch of events;
    wraps a pair of Event* to the begin and end of an event batch;
    passed (by value) to doProduce() and produce().

  - edm::ConstEventRange
    implements non-owning const access to a batch of events;
    wraps a pair of Event* to the begin and end of an event batch;
    passed (by value) to doAcquire() and acquire().
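
A simplified sketch of what such a pointer-pair range can look like (illustrative only; the details may differ from the actual classes in this PR, and ConstEventRange would be the same with Event const*):

namespace edm {
  class Event;  // the framework's event class (assumed to be defined elsewhere)

  // non-owning, mutable view over a contiguous block of Events
  class EventRange {
  public:
    EventRange(Event* begin, Event* end) : begin_(begin), end_(end) {}

    Event* begin() const { return begin_; }
    Event* end() const { return end_; }

  private:
    Event* begin_;
    Event* end_;
  };
}  // namespace edm

// usage inside a batching producer:
//   for (edm::Event& event : events) { ... }
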
Collaborator

@makortel makortel left a comment

I think this is a reasonable approach for fixed-size batching, with the total number of concurrent events being "batch size * number of streams".

I found only two rather nitpicking-level comments (which may mean that I didn't digest it all in the end). I would not worry too much about the interfaces (as long as they are reasonable) before performance is compared against the current non-batched approach.


void doProduce(EventRange events, EventSetup const& eventSetup) { produce(events, eventSetup, state_); }

virtual void produce(EventRange events, EventSetup const& eventSetup, AsyncState& states) = 0;
Collaborator

Really minor, but is this a typo?

Suggested change
virtual void produce(EventRange events, EventSetup const& eventSetup, AsyncState& states) = 0;
virtual void produce(EventRange events, EventSetup const& eventSetup, AsyncState& state) = 0;

Contributor Author

I've used both spellings while prototyping, so I don't mind either way: this should capture the "state" or "states" of multiple events:

  • state is more consistent with AsyncState
  • states is more consistent with events

Re-thinking a bit more... maybe state is more correct, since it is one object, not e.g. a vector<AsyncState>.

Comment on lines +26 to +28
EventRange range() { return EventRange(&events_.front(), &events_.back() + 1); }

ConstEventRange range() const { return ConstEventRange(&events_.front(), &events_.back() + 1); }
Collaborator

Not very relevant for the prototyping itself, but I think being able to use something along the lines of

Suggested change
EventRange range() { return EventRange(&events_.front(), &events_.back() + 1); }
ConstEventRange range() const { return ConstEventRange(&events_.front(), &events_.back() + 1); }
EventRange range() { return EventRange(events_.begin(), events_.end()); }
ConstEventRange range() const { return ConstEventRange(events_.begin(), events_.end()); }

would be clearer to read.

Contributor Author

The reason I didn't do that was to avoid making EventRange and ConstEventRange use std::vector<Event>::iterator and const_iterator: they would then only work for a range of events stored in a vector, while with plain pointers they can work with any contiguous range of events.

An alternative could be to add a constructor like

template <typename IT>
EventRange(IT begin, IT end) : begin_(std::addressof(*begin)), end_(std::addressof(*end)) { ... }

but then one could be tempted to pass non-contiguous iterators.
Maybe one could add something like

  assert((end - begin) == (end_ - begin_));

or, once we move to C++20 with concepts, we could use std::contiguous_iterator?

Collaborator

For the purposes of the prototype the current pointers are fine for me. With C++20 maybe even passing the range as std::span could be an option.
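
For illustration, with C++20 the hand-written range classes could indeed reduce to simple aliases (a sketch, assuming the events remain in contiguous storage):

#include <span>

namespace edm {
  class Event;  // the framework's event class (assumed to be defined elsewhere)

  // hypothetical C++20 replacement for the hand-written range classes
  using EventRange = std::span<Event>;
  using ConstEventRange = std::span<const Event>;
}  // namespace edm

// EventBatch::range() could then simply return
//   std::span<edm::Event>(events_.data(), events_.size());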

@makortel
Collaborator

The previous implementation incurred a huge performance penalty, because the AsyncState variables were now constructed and destructed for every event.

Interesting. The AsyncState variables are essentially a bunch of smart pointers and numbers (and one map<unsigned, vector<SiPixelRawDataError>>); is their construction and destruction really that expensive? (I'm probably missing something)

@makortel
Collaborator

The first problem I ran into is that the AsyncState I need to define for some external work producers is not copyable or movable, because it includes a cms::cuda::ContextState.

For prototyping purposes I think making cms::cuda::ContextState e.g. movable is perfectly fine. The only reason for denying moving is to catch unintended usage early.

@fwyzard
Contributor Author

fwyzard commented Dec 15, 2020

The previous implementation incurred a huge performance penalty, because the AsyncState variables were now constructed and destructed for every event.

Interesting. The AsyncState variables are essentially a bunch of smart pointers and numbers (and one map<unsigned, vector<SiPixelRawDataError>>); is their construction and destruction really that expensive? (I'm probably missing something)

I was surprised as well, and digging into the cause I found it to be SiPixelRawToClusterCUDA, which needs to pass a pixelgpudetails::SiPixelRawToClusterGPUKernel::WordFedAppender from acquire() to produce().
WordFedAppender has two data members, a cms::cuda::host::noncached::unique_ptr<unsigned int[]> and a cms::cuda::host::noncached::unique_ptr<unsigned char[]>, so re-creating it for every event becomes prohibitively expensive.

I guess we could find a workaround - but avoiding constructing and destructing AsyncState on every event seemed the simplest solution.

@makortel
Collaborator

I was surprised as well, and digging into the cause I found it to be SiPixelRawToClusterCUDA, which needs to pass a pixelgpudetails::SiPixelRawToClusterGPUKernel::WordFedAppender from acquire() to produce().
WordFedAppender has two data members, a cms::cuda::host::noncached::unique_ptr<unsigned int[]> and a cms::cuda::host::noncached::unique_ptr<unsigned char[]>, so re-creating it for every event becomes prohibitively expensive.

Ah, WordFedAppender, missed that. Right, that one has been intended to be "allocate once per stream and re-use". On the other hand, it would also be straightforward to change it (back) to use the caching allocator given the tiny impact of cudaHostAllocWriteCombined. Counter-OTOH, it could also be useful to keep it around as it is to have a use case for a stream-object that should not be constructed for each event.

@makortel
Collaborator

Just to repeat here what I mentioned in the core sw meeting today. I think an object that should live longer than "from acquire() to produce()" (e.g. WordFedAppender currently) should be stored as a stream module's member variable (or in StreamCache of a global module).

And I think I finally realized the issue with that (within this PR):

  • EDProducerExternalWork<T> calls first the acquire() for all events in the batch, and only then the produce() for all events in the batch
  • EDBatchingProducerExternalWork<T> calls the acquire() once for the full batch, and then the produce() for the full batch

Therefore EDBatchingProducerExternalWork<T> can use the stream member variable/cache to pass something from acquire() to produce(), because it has to deal with the batching explicitly anyway. But EDProducerExternalWork<T> can not, because the values written in the acquire() call for the first event of the batch would be overwritten by the call for the second event of the batch, etc. That would effectively necessitate the concept of an "event cache" in addition to the "stream cache", because a stream would no longer be a stream of events but a stream of event batches, even if we were to solve cms-sw/cmssw#24186.

Would it be fair to say that the AsyncState in this PR is effectively such an "event cache"?

@fwyzard
Contributor Author

fwyzard commented Dec 15, 2020

Would it be fair to say that the AsyncState in this PR is effectively such an "event cache"?

Even if that is not the goal of the interface, in practice yes, because it is available (and mutable) in both acquire() and produce() - which are the only event processing methods.

@makortel
Collaborator

Given that this batching approach needs an "event cache" in any case, how important do you think a separate mechanism to pass data from acquire() to produce() (cms-sw/cmssw#24186) would be?

@fwyzard
Contributor Author

fwyzard commented Dec 17, 2020

I have the feeling that an "event cache" and a "pass data from acquire to produce" mechanism could be to some extent orthogonal:

  • if we have acquire-to-produce, most existing ExternalWork modules would not need an event cache to support batching
  • even without ExternalWork, modules may need some kind of event cache to support batching - be it something done explicitly by the module or provided by the framework

If we introduce an event cache, and if the interface for the event cache and the acquire-to-produce data were the same (i.e. an additional template parameter, an additional parameter passed to acquire and produce), then of course having both at the same time would only add complexity.
