Realising a factor 20-30 speedup on GPU #1803
Comments
This looks fantastic! Just one question: what effect does it have on CPU?
Good question, I haven't tested as much, but it seems point 1 by itself actually slows things down by a factor of 2 for a single replica. Points 3 and 4, and probably 2 as well, can only speed it up. I saw some comments, I think by you, on einsum not being as efficient on the CPU; not sure if that's still the case, but that may be it. So for the clean implementation it may be necessary to put in some branching, perhaps reverting to the old version in key places when the number of replicas is 1 (assuming it only makes sense to run with 1 replica on the CPU).
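The branching suggested above might look something like this minimal numpy sketch. All names, shapes, and the `contract` function itself are illustrative, not the actual n3fit code; the point is only that the single-replica case can keep the original-style contraction while multiple replicas go through one batched einsum:

```python
import numpy as np

def contract(fk, pdf_replicas):
    """Hypothetical FK-table contraction.

    fk: (ndata, nx, nflav); pdf_replicas: (nreplicas, nx, nflav).
    """
    if pdf_replicas.shape[0] == 1:
        # legacy-style path: a plain tensordot for the single replica,
        # which may be faster on CPU than einsum
        return np.tensordot(fk, pdf_replicas[0], axes=([1, 2], [0, 1]))[None, :]
    # multi-replica path: one batched contraction over all replicas
    return np.einsum('nxf,rxf->rn', fk, pdf_replicas)

fk = np.random.rand(4, 3, 2)
pdf1 = np.random.rand(1, 3, 2)
pdf5 = np.random.rand(5, 3, 2)
```

Both paths compute the same thing; the branch only chooses which primitive to use based on the replica count.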
I think this is a good assumption. I don't know what computers people have access to, but in my experience it is more convenient to run many small jobs rather than one big one on clusters (mainly due to queues, and not only thinking about nnpdf).
Last week @goord and I started looking at tensorboard profiles of the code running on a GPU. We found and resolved several bottlenecks in the performance, resulting in a total speedup of a factor of 20-30 compared to the current state of the trvl-mask-layers branch of #1788.
As a result, we are able to do a full 17k epoch run of the NNPDF40_nnlo_as_01180_100 runcard with 100 replicas within half an hour.
We have this running, so the time quoted is the actual start-to-end wall time (to be precise, it took 19 minutes, 9 of which were spent loading the data and building the model, etc.).
Most of it still requires a lot of cleanup to integrate properly, though. Currently it crashes just after the fit, simply because the appropriate changes haven't been made yet.
Factors contributing to speedup
In no particular order, the factors contributing to the speedup are:
We changed the masking from tf.boolean_mask, where the output shape depends on the values (i.e. the number of Trues), to a matrix multiplication, and precomputed this. So the FK table for a DY experiment is now of shape (n, x, f, x, f).

Steps remaining
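The mask-to-matmul replacement can be illustrated with a small numpy sketch. This is a hypothetical illustration of the idea, not the actual nnpdf code: a boolean mask with k True entries is equivalent to multiplying by a fixed (k, n) selection matrix, so the output shape is known ahead of time instead of depending on the mask values:

```python
import numpy as np

mask = np.array([True, False, True, True, False])
x = np.arange(5.0)

# Dynamic-shape version (what tf.boolean_mask does): the result's
# length depends on how many entries of the mask are True.
masked = x[mask]

# Static-shape version: precompute a selection matrix once; its rows
# are the rows of the identity matrix that correspond to True entries.
sel = np.eye(len(mask))[mask]   # shape (3, 5), fixed at build time
masked_via_matmul = sel @ x     # shape (3,), a plain matmul
```

Because the selection is now an ordinary matrix multiplication, it can be precomputed and fused into the FK-table contraction itself, which is presumably where the (n, x, f, x, f) shape above comes from.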
Unfortunately I'll have very little time in the next month to work on this (holidays and other projects). Below I'll list the steps necessary to integrate it, and where help could be useful.
Avoid duplicated computations (perhaps via a keras.Loss or something, not sure if that's efficient or not, just to not repeat the computations); I think this is also relatively independent of the rest. If anyone wants to do this that'd be great. It doesn't have the highest payout/effort ratio of all of these, so I can also do it myself after the last point. UPDATE: WIP in Avoiding duplicated computations by having a single observable model #1855

Tensorboard profile
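The single-observable-model idea referenced in #1855 could be sketched roughly as follows. All names and the placeholder computation here are hypothetical; the only point being illustrated is that the observable is evaluated a single time and both the training and validation losses reuse that shared result instead of rebuilding the computation twice:

```python
import numpy as np

def observables(pdf):
    # stand-in for the full observable model (FK convolution etc.)
    return pdf ** 2 + 1.0

pdf = np.linspace(0.0, 1.0, 6)
preds = observables(pdf)          # evaluated once, shared below

# Training/validation split as complementary masks over the same data.
tr_mask = np.array([1, 0, 1, 1, 0, 1], dtype=bool)
vl_mask = ~tr_mask

target = 1.2                      # placeholder pseudodata
tr_loss = np.sum((preds[tr_mask] - target) ** 2)
vl_loss = np.sum((preds[vl_mask] - target) ** 2)
```

Since the two masks partition the data, the two losses together cover every point exactly once, with the expensive forward pass done only a single time.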
Here is the tensorboard profile with all these improvements, which may be nice to see: