
Implement event batching in the miniframework #143

Open
wants to merge 9 commits into master

Conversation

fwyzard
Contributor

@fwyzard fwyzard commented Dec 3, 2020

Introduce a new way of passing information from acquire to produce:

  • add a template type parameter to the EDProducerExternalWork base class, and make it available to inherited classes as AsyncState;
  • acquire should use an AsyncState object to hold all the information that needs to be passed to produce: acquire returns an AsyncState object that is later passed to produce; it is also marked as a const method, to guarantee that the information is not passed through some internal state;
  • produce takes an extra AsyncState parameter to receive the state information from acquire;
  • if a Producer does not need to pass any state from acquire to produce, it can use a void or empty template parameter; in this case acquire returns void and produce does not take any new parameters (as before); acquire is still marked as const. A sketch of the resulting interface follows this list.
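
As a rough usage sketch (the module and state names are hypothetical, the constructor and product registration are omitted, and the signatures follow the description above rather than the final code):

#include <vector>

#include "Framework/EDProducer.h"  // assumed header providing edm::EDProducerExternalWork

struct MyState {  // hypothetical per-event state
  std::vector<float> hostBuffer;
};

class MyWorker : public edm::EDProducerExternalWork<MyState> {
public:
  // acquire() is const and returns the state that is later handed to produce()
  AsyncState acquire(edm::Event const& event,
                     edm::EventSetup const& eventSetup,
                     edm::WaitingTaskWithArenaHolder holder) const override {
    AsyncState state;
    // ... launch the asynchronous work, fill state.hostBuffer, release the holder ...
    return state;
  }

  // produce() receives the state returned by acquire() for the same event
  void produce(edm::Event& event, edm::EventSetup const& eventSetup, AsyncState& state) override {
    // ... build the product from state.hostBuffer and put it into the event ...
  }
};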

This change is propaedeutic to the event batching support, but can be reviewed and adopted independently of it.

Add a doneWaiting overload that does not take an exception pointer argument

This is just a simplification for the cases where doneWaiting(...) was being called with an empty exception pointer.

This change is independent of the event batching support, and can be reviewed and adopted separately from it.
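
A minimal sketch of what this overload amounts to, assuming the existing doneWaiting(std::exception_ptr) member of WaitingTaskWithArenaHolder (the rest of the class is elided):

#include <exception>

class WaitingTaskWithArenaHolder {
public:
  void doneWaiting(std::exception_ptr excptr);               // existing: signal completion, possibly with an error
  void doneWaiting() { doneWaiting(std::exception_ptr{}); }  // new overload: completion without an exception
  // ... rest of the class unchanged ...
};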

Implement basic support for event batching, trying to have minimal impact on the existing modules:

  • the Source is responsible for reading (up to) N events at a time;
  • the Events are stored in an EventBatch and passed to each Worker as an EventRange (for produce()) or a ConstEventRange (for acquire()); the (Const)EventRange class allows (const) iteration and a (const) range-based loop over the events, without any possibility of adding or dropping events;
  • the Worker passes the events to the various EDProducer-like base classes in the same way;
  • the existing EDProducer and EDProducerExternalWork base classes loop over the events in the range, and call acquire and produce one event at a time; modules that derive from them can be used without any changes;
  • introduce two new EDBatchingProducer and EDBatchingProducerExternalWork base classes that pass the range of Events to the module's acquire and produce methods; the module's implementation can then loop over the events, process them in parallel, etc. (a sketch follows this list).
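
A sketch of a batching module (the module name is hypothetical, constructor and product registration are omitted, and the produce() signature follows the description above):

class BatchedDigiProducer : public edm::EDBatchingProducer {  // hypothetical module
public:
  void produce(edm::EventRange events, edm::EventSetup const& eventSetup) override {
    // the module sees the whole batch at once, and is free to process the
    // events one by one, in parallel, or with a single batched operation
    for (edm::Event& event : events) {
      // ... process this event and put its products ...
    }
  }
};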

The framework tests have been reorganised and extended to test all four EDProducer-like base classes:

  • EDProducer;
  • EDProducerExternalWork;
  • EDBatchingProducer;
  • EDBatchingProducerExternalWork.

Notes

About edm::EventRange and edm::ConstEventRange

The current edm::EventRange implementation is suitable only for the case where all modules process all events; once the possibility of filtering events or re-aggregating the ranges is considered, a different implementation will be needed.
The interface can mostly stay unchanged, but these objects might better be passed by reference rather than by value.

Alternative approaches for acquire()

The approach used by src/fwtest/Framework/EDProducer.h requires that the AsyncState type be moveable or copyable.

A different approach is used by src/cuda/Framework/EDProducer.h: instead of acquire returning a new AsyncState object by value, the AsyncState object is default-constructed by the EDProducerExternalWork base class and passed to acquire by AsyncState &, making the interface closer to that of produce().
This waives the requirement that AsyncState is moveable or copyable, but adds the requirement that AsyncState be default-constructible.

A further possibility would be for the base class to only hold an std::unique_ptr<AsyncState> or std::optional<AsyncState> and pass it by reference to acquire().
This would also waive the requirement that AsyncState is default-constructible; it would be the responsibility of the client code to construct an object during acquire() and assign it to the unique_ptr or optional passed in.
Another downside is that the interfaces of acquire() and produce() would diverge again.
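
For a concrete comparison, the three variants differ only in the acquire() signature (a sketch using the type names from this PR; this is not the actual base class):

struct AcquireVariants {  // illustration only, assuming the framework's Event, EventSetup, etc.
  // 1. src/fwtest/Framework/EDProducer.h: return the state by value;
  //    requires AsyncState to be moveable or copyable.
  AsyncState acquire(Event const& event, EventSetup const& eventSetup,
                     WaitingTaskWithArenaHolder holder) const;

  // 2. src/cuda/Framework/EDProducer.h: the base class default-constructs the
  //    state and passes it by reference; requires AsyncState to be default-constructible.
  void acquire(Event const& event, EventSetup const& eventSetup,
               WaitingTaskWithArenaHolder holder, AsyncState& state) const;

  // 3. Possible alternative (not implemented): the base class holds an empty
  //    std::optional<AsyncState> that the module fills in place; AsyncState
  //    needs neither copy/move nor default construction.
  void acquire(Event const& event, EventSetup const& eventSetup,
               WaitingTaskWithArenaHolder holder, std::optional<AsyncState>& state) const;
};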

@fwyzard fwyzard requested a review from makortel December 3, 2020 10:19
@fwyzard
Contributor Author

fwyzard commented Dec 4, 2020

OK, this looks usable now.

The next step will be to try and make the CUDA implementation use it ...

@makortel
Collaborator

makortel commented Dec 4, 2020

The next step will be to try and make the CUDA implementation use it ...

It will take me some time to digest this. Organizationally I'd prefer if this version of fwtest were made into a separate program (it's kind of a branching point), because I'm expecting eventually to have multiple attempts at batching, and it could be useful to have many/all of them around. I'll anyway go through this PR as it is.

@fwyzard
Contributor Author

fwyzard commented Dec 4, 2020

Organizationally I'd prefer if this version of fwtest would be made as a separate program

Sure. Making the PR on top of the existing fwtest makes it easier to see all the changes, but if/when we decide to merge it, it can go into a different directory.

@fwyzard
Contributor Author

fwyzard commented Dec 5, 2020

I'm now looking at how to apply this to the cuda code base.

The first problem I ran into is that the AsyncState I need to define for some external work producers is not copyable or movable, because it includes a cms::cuda::ContextState.

Just to show how much I didn't know about C++:

  • even when return value optimisation is mandated by the standard, a non-deleted move or copy constructor is still required
  • std::vector<T>::resize() and std::vector<T>::reserve() require that T is copyable or movable (even when no reallocation is performed)

So I'm using a third approach:

  • the AsyncStates are held in a std::optional<AsyncState> or a std::unique_ptr<AsyncState[]> (std::vector<std::optional<AsyncState>> could also work)
  • they are recreated for every event or batch of events
  • they are passed by AsyncState & both to acquire() and produce()
  • they are reset after each event or batch of events

With this approach the only requirement on AsyncState is that it can be default constructed; I think this should be fine, since using a data member to hold some state between acquire() and produce() probably has the same requirement.
If this requirement is also a problem, we could pass an empty std::optional<AsyncState> & to acquire() and let the client code construct a value in place.
This would likely make the acquire() and produce() interfaces diverge, though.

@fwyzard
Contributor Author

fwyzard commented Dec 6, 2020

The previous implementation incurred a huge performance penalty, because the AsyncState variables were now constructed and destructed for every event.

This avoids it:

diff --git a/src/cuda/Framework/EDProducer.h b/src/cuda/Framework/EDProducer.h
index 17e7e1be1bdb..76e39e93bf62 100644
--- a/src/cuda/Framework/EDProducer.h
+++ b/src/cuda/Framework/EDProducer.h
@@ -38,27 +38,25 @@ namespace edm {
     bool hasAcquire() const { return true; }
 
     void doAcquire(Event const& event, EventSetup const& eventSetup, WaitingTaskWithArenaHolder holder) {
-      state_.reset();
-      state_.emplace();
-      acquire(event, eventSetup, std::move(holder), *state_);
+      acquire(event, eventSetup, std::move(holder), state_);
     }
 
     void doProduce(Event& event, EventSetup const& eventSetup) {
-      produce(event, eventSetup, *state_);
-      state_.reset();
+      produce(event, eventSetup, state_);
     }
 
     virtual void acquire(Event const& event,
                          EventSetup const& eventSetup,
                          WaitingTaskWithArenaHolder holder,
                          AsyncState& state) const = 0;
+
     virtual void produce(Event& event, EventSetup const& eventSetup, AsyncState& state) = 0;
 
     void doEndJob() { endJob(); }
     virtual void endJob() {}
 
   private:
-    std::optional<AsyncState> state_;
+    AsyncState state_;
   };
 
   template <>

I've propagated this change to the branch and pull request, though I'm still not sure that's the best approach.

Add a template type parameter to the edm::EDProducerExternalWork class.

The acquire() method is now const, and it communicates its asynchronous
state to the produce() method via an object of this type, rather
than using data members.

Implementations that do not need to pass any state between acquire() and
produce() can use "void" or leave the template parameter empty.

Reorganise the existing EDProducer tests, and add tests for
  - EDBatchingProducer
  - EDBatchingProducerExternalWork

Introduce three new edm classes:

  - edm::EventBatch
    implements the ownership of a batch of events, implemented as an
    std::vector<edm::Event>; used by the Source to provide events to the
    StreamSchedule.

  - edm::EventRange
    implements non-owning, non-const access to a batch of events;
    wraps a pair of Event* to the begin and end of an event batch;
    passed (by value) to doProduce() and produce().

  - edm::ConstEventRange
    implements non-owning const access to a batch of events;
    wraps a pair of Event* to the begin and end of an event batch;
    passed (by value) to doAcquire() and acquire().
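
A simplified sketch of what such a pointer-pair range can look like (illustrative only; the details may differ from the actual classes in this PR, and ConstEventRange would be the same with Event const*):

namespace edm {
  class Event;  // the framework's event class (assumed to be defined elsewhere)

  // non-owning, mutable view over a contiguous block of Events
  class EventRange {
  public:
    EventRange(Event* begin, Event* end) : begin_(begin), end_(end) {}

    Event* begin() const { return begin_; }
    Event* end() const { return end_; }

  private:
    Event* begin_;
    Event* end_;
  };
}  // namespace edm

// usage inside a batching producer:
//   for (edm::Event& event : events) { ... }
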
Collaborator

@makortel makortel left a comment

I think this is a reasonable approach for fixed-size batching, with the total number of concurrent events being "batch size * number of streams".

I found only two rather nitpicking-level comments (which may mean that I didn't digest it all in the end). I would not worry too much about the interfaces (as long as they are reasonable) before performance is compared against the current non-batched approach.


void doProduce(EventRange events, EventSetup const& eventSetup) { produce(events, eventSetup, state_); }

virtual void produce(EventRange events, EventSetup const& eventSetup, AsyncState& states) = 0;
Collaborator

Really minor, but is this a typo?

Suggested change
virtual void produce(EventRange events, EventSetup const& eventSetup, AsyncState& states) = 0;
virtual void produce(EventRange events, EventSetup const& eventSetup, AsyncState& state) = 0;

Contributor Author

I've used both spellings while prototyping, so I don't mind either way: this should capture the "state" or "states" of multiple events:

  • state is more consistent with AsyncState
  • states is more consistent with events

Re-thinking a bit more... maybe state is more correct, since it is one object, not e.g. a vector<AsyncState>.

Comment on lines +26 to +28
EventRange range() { return EventRange(&events_.front(), &events_.back() + 1); }

ConstEventRange range() const { return ConstEventRange(&events_.front(), &events_.back() + 1); }
Collaborator

Not very relevant for the prototyping itself, but I think being able to use something along the lines of

Suggested change
EventRange range() { return EventRange(&events_.front(), &events_.back() + 1); }
ConstEventRange range() const { return ConstEventRange(&events_.front(), &events_.back() + 1); }
EventRange range() { return EventRange(events_.begin(), events_.end()); }
ConstEventRange range() const { return ConstEventRange(events_.begin(), events_.end()); }

would be clearer to read.

Contributor Author

The reason I didn't do that was to avoid making EventRange and ConstEventRange use std::vector<Event>::iterator and const_iterator: they would then only work for a range of events stored in a vector, while with plain pointers they can work with any contiguous range of events.

An alternative could be to add a constructor like

template <typename IT>
EventRange(IT begin, IT end) : begin_(std::addressof(*begin)), end_(std::addressof(*end)) { ... }

but then one could be tempted to pass non-contiguous iterators.
Maybe one could add something like

  assert((end - begin) == (end_ - begin_));

or, once we move to C++20 with concepts, we could use std::contiguous_iterator?

Collaborator

For the purposes of the prototype the current pointers are fine for me. With C++20 maybe even passing the range as std::span could be an option.
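
For illustration, with C++20 the hand-written range classes could indeed reduce to simple aliases (a sketch, assuming the events remain in contiguous storage):

#include <span>

namespace edm {
  class Event;  // the framework's event class (assumed to be defined elsewhere)

  // hypothetical C++20 replacement for the hand-written range classes
  using EventRange = std::span<Event>;
  using ConstEventRange = std::span<const Event>;
}  // namespace edm

// EventBatch::range() could then simply return
//   std::span<edm::Event>(events_.data(), events_.size());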

@makortel
Collaborator

The previous implementation incurred a huge performance penalty, because the AsyncState variables were now constructed and destructed for every event.

Interesting. The AsyncState variables are essentially a bunch of smart pointers and numbers (and one map<unsigned, vector<SiPixelRawDataError>>); is their construction and destruction really that expensive? (I'm probably missing something)

@makortel
Collaborator

The first problem I ran into is that the AsyncState I need to define for some external work producers is not copyable or movable, because it includes a cms::cuda::ContextState.

For prototyping purposes I think making cms::cuda::ContextState e.g. movable is perfectly fine. The only reason for denying moving is to catch unintended usage early.

@fwyzard
Contributor Author

fwyzard commented Dec 15, 2020

The previous implementation incurred a huge performance penalty, because the AsyncState variables were now constructed and destructed for every event.

Interesting. The AsyncState variables are essentially a bunch of smart pointers and numbers (and one map<unsigned, vector<SiPixelRawDataError>>); is their construction and destruction really that expensive? (I'm probably missing something)

I was surprised as well, and digging into the cause I found it to be SiPixelRawToClusterCUDA, which needs to pass a pixelgpudetails::SiPixelRawToClusterGPUKernel::WordFedAppender from acquire() to produce().
WordFedAppender has two data members, a cms::cuda::host::noncached::unique_ptr<unsigned int[]> and a cms::cuda::host::noncached::unique_ptr<unsigned char[]>, so re-creating it for every event becomes prohibitively expensive.

I guess we could find a workaround - but avoiding constructing and destructing AsyncState on every event seemed the simplest solution.

@makortel
Collaborator

I was surprised as well, and digging into the cause I found it to be SiPixelRawToClusterCUDA, which needs to pass a pixelgpudetails::SiPixelRawToClusterGPUKernel::WordFedAppender from acquire() to produce().
WordFedAppender has two data members, a cms::cuda::host::noncached::unique_ptr<unsigned int[]> and a cms::cuda::host::noncached::unique_ptr<unsigned char[]>, so re-creating it for every event becomes prohibitively expensive.

Ah, WordFedAppender, missed that. Right, that one has been intended to be "allocate once per stream and re-use". On the other hand, it would also be straightforward to change it (back) to use the caching allocator given the tiny impact of cudaHostAllocWriteCombined. Counter-OTOH, it could also be useful to keep it around as it is to have a use case for a stream-object that should not be constructed for each event.

@makortel
Collaborator

Just to repeat here what I mentioned in the core sw meeting today. I think an object that should live longer than "from acquire() to produce()" (e.g. WordFedAppender currently) should be stored as a stream module's member variable (or in StreamCache of a global module).

And I think I finally realized the issue with that (within this PR):

  • EDProducerExternalWork<T> calls first the acquire() for all events in the batch, and only then the produce() for all events in the batch
  • EDBatchingProducerExternalWork<T> calls the acquire() once for the full batch, and then the produce() for the full batch

Therefore EDBatchingProducerExternalWork<T> can use the stream member variable/cache to pass something from acquire() to produce(), because it has to deal with the batching explicitly anyway. But EDProducerExternalWork<T> can not, because the values written in the acquire() call for the first event of the batch would be overwritten by the call for the second event of the batch, etc. That would effectively necessitate the concept of an "event cache" in addition to the "stream cache", because a stream would no longer be a stream of events but a stream of event batches, even if we were to solve cms-sw/cmssw#24186.

Would it be fair to say that the AsyncState in this PR is effectively such an "event cache"?

@fwyzard
Contributor Author

fwyzard commented Dec 15, 2020

Would it be fair to say that the AsyncState in this PR is effectively such an "event cache"?

Even if that is not the goal of the interface, in practice yes, because it is available (and mutable) in both acquire() and produce() - which are the only event processing methods.

@makortel
Collaborator

Given that this batching approach needs an "event cache" in any case, how important do you think a separate mechanism to pass data from acquire() to produce() (cms-sw/cmssw#24186) would be?

@fwyzard
Contributor Author

fwyzard commented Dec 17, 2020

I have the feeling that an "event cache" and a "pass data from acquire to produce" mechanism could be to some extent orthogonal:

  • if we have acquire-to-produce, most existing ExternalWork modules would not need an event cache to support batching
  • even without ExternalWork, modules may need some kind of event cache to support batching - be it something done explicitly by the module or provided by the framework

If we introduce an event cache, and if the interface for the event cache and the acquire-to-produce data were the same (i.e. an additional template parameter, an additional parameter passed to acquire and produce), then of course having both at the same time would only add complexity.
