Long simulations exhaust GPU resources #178
This is kind of an unavoidable aspect of how TensorFlow/Keras are set up. It is designed for situations where you pass a bunch of data to the device, run some computations, and then fetch the results back to the host. The solution you describe (running the simulation in a loop) is the recommended solution, allowing you to customize the simulation length based on your performance/memory needs.
The overhead of repeatedly launching simulations is primarily within Keras. One thing to keep in mind is that if you don't need the Probe output from every timestep, you can use the sample_every argument on your Probes to reduce how much data has to be stored and transferred.
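For reference, a minimal sketch of the sample_every idea (the network and sizes here are made up for illustration; sample_every is specified in seconds, so 0.1 with the default 1 ms timestep keeps every 100th step):

import nengo

with nengo.Network() as net:
    ens = nengo.Ensemble(1000, 1)
    # record every 100th timestep instead of all of them
    probe = nengo.Probe(ens.neurons, sample_every=0.1)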
Oh, I should also mention that some of the suggestions here https://www.nengo.ai/nengo-dl/tips#speed-improvements should help with running the simulation in a loop (in particular, if you can fit an unrolled model in memory, then unrolling the simulation should give a noticeable speedup).
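A minimal sketch of that unrolling tip (the network and the unroll value are arbitrary; unroll_simulation builds that many timesteps into the graph at once, trading memory for less per-step overhead, and the number of steps run should be a multiple of it):

import nengo
import nengo_dl

with nengo.Network() as net:
    ens = nengo.Ensemble(100, 1)
    nengo.Probe(ens.neurons)

with nengo_dl.Simulator(net, unroll_simulation=10) as sim:
    sim.run_steps(100)  # 100 is a multiple of unroll_simulation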
You can also use the keep_history=False config option, which keeps only the Probe value from the most recent timestep instead of the full history.
An on-device version of sample_every? That might be the way to go to keep device memory at a minimum. The keep_history suggestion is kind of the opposite of on-device:
nengo_dl.configure_settings(keep_history=False)
for _ in range(n_steps // like_sample_every):
sim.run_steps(like_sample_every)
I did some profiling on this. Most of the overhead occurs in Keras (like you said), unless the number of steps is 16 or less, in which case Nengo DL becomes the bottleneck. The change in behavior happens inside Keras. Here's the test:

Experiment
This command runs 500k neurons ten times, then picks the methods of interest out of the profile report.

python -m cProfile -s tottime benchmark_wattsstrogatz.py dl 500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000 | egrep "(just_the_keras_call|_call_keras|predict_on_batch|run_steps|cumtime)"

I put that "getattr" line into its own method (just_the_keras_call) so that it shows up in the report.

Output with one iteration of 1000 steps
Output with 100 iterations of 10 steps
Some more data points:
When using the loop approach, 60% of that time is spent in Nengo DL, outside keras, compared to <0.3% when not looping. Here's a zoom in on that threshold, using a different number of total time steps for divisibility:
16 steps
17 steps
Keras is significantly faster for 16 steps compared to 17. Perhaps there is a cost to coordinating multiple registers or something like that. When Keras is at that 16-step sweet spot, Nengo DL accounts for the majority of the overhead, which has the effect of dulling the threshold unless you are specifically looking for it. I also think it motivates developing a version of run_steps that does its setup checks only once.
I think this is just an artifact of the way you are printing the profiling data. I'm guessing you've got something like

def just_the_keras_call():
    getattr(self.keras_model, func_type)(**func_args)

When you look at the tottime of that function, it only counts the time spent in the wrapper itself, not in the Keras code it dispatches to (cProfile attributes that time to the callees, which your grep doesn't match). If you look at the cumtime instead, you'll see that essentially all of the time is inside the Keras call.
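A tiny self-contained illustration of that tottime/cumtime distinction (the function names here are made-up stand-ins, not Nengo DL code):

import cProfile
import pstats

def heavy_work():
    # stands in for the compiled Keras/TensorFlow execution
    return sum(i * i for i in range(10**6))

def thin_wrapper():
    # analogous to just_the_keras_call: almost no work of its own
    return heavy_work()

with cProfile.Profile() as pr:
    thin_wrapper()

# sorted by tottime, thin_wrapper shows ~0; sorted by cumtime, it shows the
# full cost of the call tree beneath it
pstats.Stats(pr).sort_stats("cumtime").print_stats(5)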
Yes, I think you are right. I was thinking of Keras as an opaque call that gets lumped in with the calling function by cProfile. Of course, that's not the case, because Keras has some non-compiled Python before it gets into primitive and/or compiled calls; cProfile records those, but they're missed by my grep. Thanks for pointing that out. There is still an odd and statistically significant threshold between 16 and 17 steps, but we're talking about much smaller differences now, 10% instead of 60%. What I'm getting from this discussion of overhead is that Nengo DL has a lot of interpreted overhead, but it's not a big deal because Keras' is even worse. Is that fair?
I think the 16 vs 17 steps thing is just an environmental quirk on your machine (or at least, it doesn't show up on my machine). There is also a lot less overhead in general on my machine. Here is what it looks like for me with 16 steps
vs 17 steps
Note that that is ~0.7% overhead in both cases. Here is how I am timing things (just doing it in code without any grepping, for one 500000-neuron run), in case that makes a difference in your results:

import cProfile
import pstats
import time

with sim_class(model, **sim_kwargs) as sim:
    t_sim = time.time()

    # -- warmup
    sim.run(0.01)
    t_warm = time.time()

    # -- long-term timing
    with cProfile.Profile() as pr:
        # sim.run(simtime)
        steps_per_iteration = 16
        for _ in range(n_steps // steps_per_iteration):
            sim.run_steps(steps_per_iteration)
    t_run = time.time()

    pstats.Stats(pr).sort_stats("cumtime").print_stats(20)

So based on my results anyway, I don't think I'd say it's the case that Nengo DL has a lot of interpreted overhead, but I'd be interested in why your environment seems to have more (although still <2% overhead in the worst case).
Thank you for doing your own tests! Sorry for the delayed response. I agree, based on the cumtime measurement, that Nengo DL is not introducing significant additional overhead.

I also learned how to use the NVIDIA Visual Profiler. Not sure how it compares to the TF profiler, but it gives some pretty pictures that correspond to the results above. This is one iteration of the loop, showing kernel dead time between launches.

Suppose you did eliminate that kernel dead time; then the Keras Python overhead associated with looping would be significant, although it amortizes with the number of steps per loop. I just found this interesting. I think we can close out the original issue, though. Perhaps a simple try/except block would be helpful for future Nengo DL users (see the sketch below). Thanks again.
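A minimal sketch of what such a guard could look like (not existing Nengo DL code; it assumes the out-of-memory condition surfaces as TensorFlow's tf.errors.ResourceExhaustedError and that sim and n_steps are already defined):

import tensorflow as tf

try:
    sim.run_steps(n_steps)
except tf.errors.ResourceExhaustedError as err:
    raise RuntimeError(
        "The GPU ran out of memory while storing probe data; try running "
        "the simulation in shorter chunks, probing less often with "
        "sample_every, or setting keep_history=False."
    ) from err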
Just an update: we do plan to add some documentation to help users diagnose and fix memory issues like this; it'll just be a little while before we have a chance to get to that, I suspect.
Probe data is transferred from the GPU to the host only at the end of a simulation's run_steps, so the device has to allocate space for all of it. For example, a Probe on 1M neurons for 8k timesteps would require 1M x 8k x 4 bytes (float32) = 32 GB, a full GPU's worth.

This can be solved by changing a single long run_steps call into a loop of shorter calls (see the sketch below).
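For concreteness, here is a self-contained sketch of the two variants (the network, n_steps, and steps_per_chunk values are placeholders for illustration):

import nengo
import nengo_dl

with nengo.Network() as net:
    ens = nengo.Ensemble(1000, 1)
    nengo.Probe(ens.neurons)

n_steps, steps_per_chunk = 8000, 100  # placeholder values

with nengo_dl.Simulator(net) as sim:
    # variant 1 (single call): the device must hold probe data for all
    # n_steps at once
    # sim.run_steps(n_steps)

    # variant 2 (loop): probe data returns to the host after each chunk,
    # so device memory stays bounded
    for _ in range(n_steps // steps_per_chunk):
        sim.run_steps(steps_per_chunk)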
The first one gives an error (the GPU's resources are exhausted). I think you can reproduce that error by taking any stateful network and increasing the simulation steps. It would be helpful to catch that and give the user some more insight.
The second one takes a long time because control keeps returning to the Python interpreter. You always have to return at some point to append the probe data; however, I currently don't see an option to avoid redoing all the checks, standardization, and progress-bar launching inside sim.run_steps, sim.predict_on_batch, and sim._call_keras on each iteration.

Am I missing something that allows you to efficiently break up the simulation steps? I thought the unroll_simulation argument was it, but that seems to be different. Is there an option in Keras to break up the steps, accumulate the data, and return the full data at the end? I'm a novice with Keras.

If there is no option to do that, then one idea would be a new method that does those checks once, then has a loop containing only this line:
nengo-dl/nengo_dl/simulator.py, line 1050 (commit 9fb7854)

this block:

nengo-dl/nengo_dl/simulator.py, line 1176 (commit 9fb7854)

and something like the sketch below.
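Purely as an illustration of the intended structure (none of these names exist in Nengo DL, and predict_on_batch here is only a stand-in for the internal Keras call, so this sketch does not actually avoid the per-call overhead):

def run_steps_chunked(sim, n_steps, steps_per_chunk):
    # hypothetical helper: do the argument checking / standardization /
    # progress-bar setup once, up front (not shown), then loop
    chunks = []
    for _ in range(n_steps // steps_per_chunk):
        # the loop body would contain only the underlying Keras call
        # (simulator.py line 1050) ...
        chunk = sim.predict_on_batch(n_steps=steps_per_chunk)
        # ... plus the probe-data bookkeeping (the block near line 1176)
        chunks.append(chunk)
    return chunks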
Nengo OCL has an example of that loop here. It can run full speed basically until virtual memory is exhausted
https://github.com/nengo-labs/nengo-ocl/blob/5be4f5416ea4b2564fca1f5ec75bf28bfcc03829/nengo_ocl/simulator.py#L620