Fix slowdown of Cheetah 0.7.1 compared to 0.6.3 #367
Conversation
…en.set_read_beam`
Some speed benchmarks:

Env benchmark:

    %%timeit
    observation, info = env.reset()
    done = False
    while not done:
        observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
        done = terminated or truncated
Raw Cheetah benchmark:

    segment.AREABSCR1.is_active = True

    %%timeit
    outgoing = segment.track(incoming)
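For anyone trying to reproduce the raw benchmark: the `segment` and `incoming` objects above come from the ARES example and are not shown here. A rough stand-in, assuming the current `cheetah` API (the lattice, element names and beam parameters below are made up for illustration, not the actual ARES setup), could look something like this:

```python
import torch
from cheetah import Drift, ParticleBeam, Screen, Segment

# Hypothetical stand-in lattice; the real benchmark uses the ARES lattice,
# where AREABSCR1 is the screen that gets activated before tracking.
segment = Segment(
    elements=[
        Drift(length=torch.tensor(1.0)),
        Screen(name="AREABSCR1"),
    ]
)
segment.AREABSCR1.is_active = True

# Hypothetical incoming beam; the real benchmark uses the ARES design beam.
incoming = ParticleBeam.from_parameters(num_particles=10_000)

outgoing = segment.track(incoming)
```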
It appears there are two main causes: … I just ran some tests on how this compares, for example, against setting … The last commit therefore sets …
…ed ... and all that was needed for that
I just made a modification to the … It would also have to be checked with @Hespe why the original order of buffer registration was introduced. Also, if we choose to keep the modification, it should be made in all …
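For concreteness, the registration pattern discussed here (and in the PR description below) is roughly: register tensors as buffers once in `__init__` instead of routing them through attribute or property assignments on the `nn.Module`. A minimal sketch with a made-up element name, not the actual Cheetah code:

```python
import torch
from torch import nn


class MyElement(nn.Module):
    """Hypothetical element illustrating the pattern from this PR: the tensor is
    registered as a buffer once in __init__, rather than being assigned through
    a property setter, which would go through nn.Module's attribute bookkeeping
    on every assignment."""

    def __init__(self, length: torch.Tensor) -> None:
        super().__init__()
        # Registered directly; the buffer shows up in state_dict() and moves
        # with .to(device) just like a parameter would.
        self.register_buffer("length", torch.as_tensor(length))
```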
Another realisation: the main expense in … For reference, here are a few experiments I did on the cost of running broadcasting in different ways by itself.
It would be interesting to see how much we pay for broadcasting if all tensors are scalars.
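Something along these lines could answer that question. A rough micro-benchmark sketch (the number of tensors and the batch size are arbitrary, not taken from Cheetah):

```python
import timeit

import torch

# All-scalar case: every input is a 0-dimensional tensor.
scalars = [torch.tensor(float(i)) for i in range(4)]

# Already-batched case: every input carries the same vector dimension.
batched = [torch.rand(100) for _ in range(4)]

# Time only the broadcasting itself, not the downstream computation.
print(timeit.timeit(lambda: torch.broadcast_tensors(*scalars), number=10_000))
print(timeit.timeit(lambda: torch.broadcast_tensors(*batched), number=10_000))
```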
This reverts commit a768df8.
…t related and are needed
Ok, so it's roughly a 20% increase that we have to pay in these two cases if we want to be able to assign …
Hmm ... from a user experience point of view, I never liked that idea. It invites programming errors if your incoming beam is different after tracking. On the other hand, reserving new memory does introduce a non-negligible overhead. I'm wondering right now whether introducing an …
I also just added the profiling outputs for both cases. What still confuses me there is that the profiler output shows more than a 2x difference with the separated buffer registration.
With the idea being …
One more note: in the profiler output it seems that the more recent ones are about 10x to 20x slower ... this is simply because I was using a tenth of the sampling interval at one point.
It could probably even be implemented in the …

    def track(self, incoming: Beam, inplace: bool = True) -> Beam:
        outgoing = incoming if inplace else incoming.clone()
        outgoing.particles = ...  # Do computations

On the other hand, …
Oh yes, you are right. In-place tensor operations, on the other hand, would be troublesome here, right? They would be incompatible with automatic broadcasting within the elements, and could lead to problems with differentiation?
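Both concerns can be reproduced outside of Cheetah. A small illustration (plain PyTorch, not Cheetah code) of why in-place tensor operations are awkward here:

```python
import torch

# 1. In-place ops cannot grow a tensor to a broadcast shape: the result must fit
#    into the existing storage, so e.g. a scalar beam attribute could not be
#    broadcast in place against a vectorised element setting.
a = torch.tensor(1.0)
b = torch.tensor([1.0, 2.0, 3.0])
try:
    a.mul_(b)
except RuntimeError as e:
    print(e)  # output shape doesn't match the broadcast shape

# 2. In-place ops can invalidate tensors that autograd saved for backward.
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.exp(x)  # exp saves its output for the backward pass
y.mul_(2.0)       # in-place modification bumps y's version counter
try:
    y.sum().backward()
except RuntimeError as e:
    print(e)  # "... has been modified by an inplace operation"
```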
According to the benchmark I just posted, the in-place operation would be faster. We should consider this for the future, but for the specific ARES RL example that I need to speed up right now, the code actually combines all transfer maps and then does only one multiplication. Even in the suggested case there would have to be at least one new beam copy created, and here it only creates that one. So this optimisation would not give an advantage in this case. Similarly, the copy for … The only other source of slowdown, as far as I can tell, is the broadcasting in …
The example I'm running uses a vectorised version of Cheetah (i.e. >=0.7.0), but none of the inputs have a vector dimension.
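As a rough illustration of the "combine all transfer maps first" point above (a sketch with made-up shapes and element count, not the actual `Segment.track` code): reducing the maps to a single matrix means only one multiplication over the particle cloud and only one new particle tensor.

```python
import torch

# Hypothetical 7x7 first-order transfer matrices of consecutive elements and a
# particle cloud in 7-dimensional phase-space coordinates.
transfer_maps = [torch.eye(7) + 0.01 * torch.rand(7, 7) for _ in range(20)]
particles = torch.rand(10_000, 7)

# Naive approach: apply every element's map to the full particle cloud in turn,
# creating a new particle tensor per element.
out = particles
for tm in transfer_maps:
    out = out @ tm.T

# Combined approach: first reduce the maps to a single 7x7 matrix, then do one
# multiplication over the particles.
combined = torch.eye(7)
for tm in transfer_maps:
    combined = tm @ combined
out_combined = particles @ combined.T

print(torch.allclose(out, out_combined, atol=1e-5))
```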
I was referring to your micro benchmark above, which just does some broadcasting of tensors.
@Hespe and I just went through everything and optimised every last bit on the ARES example that we think can be optimised. The current numbers are:
This is around a 2x improvement over the times we had at the beginning of this PR. The rest of the slowdown is simply needed for vectorised Cheetah, and, by the way, is also paid for as soon as you run at least two samples.
The question came up whether … To be discussed with @Hespe.
…all beams and elements
…ersion of Cheetah
I just reran everything with the most recent commit, to make sure that the changes didn't negatively affect anything. The current numbers are:
These were all run on an Apple M1 Pro. We also ran a test earlier on a Dell Windows laptop and saw that compute times were about 1.7x longer. Funnily enough, the results of …
For reference: The RL training went from 1,450 fps with v0.6.1 to 1,343 fps with v0.7.1 ... so less than 8% worse, which I'm pretty happy with. It was around 400 fps when this PR was opened.
@jank324 @Hespe On a side note, it would be helpful to gather the insights & experience here into a written document (e.g. to …
Description
Introduces a couple of changes that improve the speed of Cheetah by about 2x (measured on the ARES RL example):

- … `Screen`.
- A change to `Segment.track` that avoids the use of `Segment.is_skippable`, which is more expensive than expected.
- Use of `register_buffer` and `register_parameter` in the `__init__`s of beams and elements, because these are significantly more efficient than property assignments on `nn.Module`s, which are actually very slow.

Motivation and Context
Running the ARES RL code again with a prototype of Cheetah 0.7.1, I found that the samples per second were reduced by a factor of about 3.5x when compared to 0.6.1 (which this code ran with originally), but also compared to 0.6.3 (the last non-vectorised version).
This matches observations from the vectorisation PR #116, where at one point I wrote:
The goal of Cheetah is also to be fast, especially for the purpose of RL, so we should check if this can be fixed.
Types of changes
Checklist
- … `flake8` (required).
- … `pytest` tests pass (required).
- … `pytest` on a machine with a CUDA GPU and made sure all tests pass (required).

Note: We are using a maximum length of 88 characters per line.