TensorFlow training/inference optimization #7605
Conversation
If that's all it takes, that's fantastic! Did you manage to obtain the performance improvements that were initially mentioned thanks to this?
Also, I'm realizing now that we don't have integration testing for our TensorFlow models, and this seems like a situation where having some would be needed. Could we work on adding these tests first for the models modified here, and then extend them to the rest of the models?
Something like what is done in tests/test_modeling_roberta.py, using tiny models.
I can help you work on it if you're lacking time!
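For reference, a minimal sketch of what such an integration test could look like, in the style of tests/test_modeling_roberta.py (the checkpoint name, input ids, and the commented expected values are placeholders, not real reference numbers; a tiny test checkpoint would be substituted in practice):

```python
import unittest

import tensorflow as tf

from transformers import TFBertModel


class TFBertModelIntegrationTest(unittest.TestCase):
    def test_inference_no_head(self):
        # A tiny test checkpoint would be used here to keep the test fast.
        model = TFBertModel.from_pretrained("bert-base-uncased")
        input_ids = tf.constant([[0, 345, 232, 328, 740, 140, 1695, 69, 6078, 2]])
        output = model(input_ids)[0]

        # Check the output shape first: [batch, sequence, hidden].
        self.assertEqual(list(output.shape), [1, 10, 768])

        # An expected slice would be recorded once from a trusted run of the
        # current implementation, then asserted against after each change:
        # expected_slice = tf.constant([...])
        # tf.debugging.assert_near(output[:, :3, :3], expected_slice, atol=1e-4)
```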
On my machine with my GPU, yes.
Sure! It is a good idea!
I would appreciate it if you have time, yes 😃
Okay, I will take a look at doing the integration tests sometime tonight. Will let you know!
For learning purposes, I am wondering which operations were done on CPU instead of GPU. I saw you changed
If you take a look at #6771, it is quite well detailed. The issue was coming from a transpose+matmul that was done on CPU. EinsumDense allows you to do all these computations directly in the layer, but at the cost of changing the shapes of the original layers; that's why we have modified the way we load the TF models. To do this PR I basically took the original BERT implementation as an example, right here.
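To illustrate the idea (a rough sketch of the pattern, not the exact diff of this PR): instead of a Dense projection followed by a reshape/transpose and a separate matmul, EinsumDense fuses the projection and the head split into a single einsum:

```python
import tensorflow as tf

hidden_size, num_heads, head_size = 768, 12, 64

# Before: Dense to [batch, seq, hidden], then reshape + transpose into
# [batch, heads, seq, head_size] before the attention matmul.
query_dense = tf.keras.layers.Dense(hidden_size)

# After: one einsum going straight from [batch, seq, hidden] to
# [batch, seq, heads, head_size], with no intermediate transpose.
query_einsum = tf.keras.layers.experimental.EinsumDense(
    equation="abc,cde->abde",
    output_shape=(None, num_heads, head_size),
    bias_axes="de",
)

x = tf.random.uniform((2, 16, hidden_size))  # [batch, seq, hidden]
print(query_einsum(x).shape)  # (2, 16, 12, 64)
```

Note that the einsum kernel has shape [hidden, heads, head_size] instead of [hidden, hidden], which is why the weight-loading change was a prerequisite for this PR.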
Thanks a lot @LysandreJik!! As I'm currently working on from-scratch LM training for TF models, I don't have much time to really focus on this.
@jplu Thanks. I am surprised by this
@jplu You also work on LM training for TF models? I plan to go back to a pending PR #6955 I created once the
@chiapas This is exactly what I'm doing, and the models need some rework; that's why I'm mostly focused on BERT, to have at least one model working. Just yesterday I finished the data pipeline with random masking generation.
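For readers following along, a minimal sketch of BERT-style random masking for such a pipeline, under the standard 80/10/10 scheme (the mask token id and vocab size below are placeholders for the tokenizer's actual values):

```python
import tensorflow as tf


def random_mask(input_ids, mask_token_id=103, vocab_size=30522, mlm_prob=0.15):
    labels = tf.identity(input_ids)
    # Select ~15% of the tokens as prediction targets.
    masked = tf.random.uniform(tf.shape(input_ids)) < mlm_prob
    # Non-target positions get label -100 so the loss ignores them.
    labels = tf.where(masked, labels, -100 * tf.ones_like(labels))

    # 80% of the targets are replaced with [MASK].
    replace = masked & (tf.random.uniform(tf.shape(input_ids)) < 0.8)
    input_ids = tf.where(replace, mask_token_id * tf.ones_like(input_ids), input_ids)

    # 10% of the targets (half of the remaining 20%) get a random token.
    randomize = masked & ~replace & (tf.random.uniform(tf.shape(input_ids)) < 0.5)
    random_ids = tf.random.uniform(tf.shape(input_ids), 0, vocab_size, dtype=input_ids.dtype)
    input_ids = tf.where(randomize, random_ids, input_ids)
    # The remaining 10% keep the original token.
    return input_ids, labels
```

Applying this per batch via tf.data's map means every epoch sees different masks (dynamic masking).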
Ah, ok. I guess my PR was pending too long, and it is my bad for not communicating with you first. I planned to do this after I finished a notebook on Kaggle, Masked, My Dear Watson - MLM with TPU, which also works on MLM. Since you already have more progress (and you are also an HF member), it is better for you to continue. However, if there is something I can contribute to this TF LM task, I would love to do it.
Thanks! I will let you know.
That's awesome! I will see what results the TF benchmark scripts give before/after this PR. Strongly agree with @LysandreJik that we should add integration tests before merging this PR.
I ran the benchmarks:
Currently, on master:
In this PR:
=> So the speed results are more or less identical with the way the benchmarks are run. I don't compile the model with Keras, but just add the @tf.function decorator to transform the function into graph mode. So not sure what to think of that... => @jplu - could you maybe check the benchmark script and see if you can get a speed-up there? Or if the benchmark script is wrong?
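For context, that measurement boils down to something like the following sketch (the model, batch size, sequence length, and iteration count here are illustrative):

```python
import time

import tensorflow as tf

from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-uncased")
input_ids = tf.random.uniform(
    (8, 128), minval=0, maxval=model.config.vocab_size, dtype=tf.int32
)

# Wrap the forward pass in tf.function so it runs in graph mode,
# without compiling the model through Keras.
@tf.function
def forward(ids):
    return model(ids, training=False)

forward(input_ids)  # warm-up call so tracing is not timed
start = time.perf_counter()
for _ in range(30):
    forward(input_ids)
print(f"{(time.perf_counter() - start) / 30:.4f}s per forward pass")
```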
The benchmark script is ok, but to see the difference you have to create a saved_model and run the model in TF Serving. Your benchmark doesn't take into account all the optimizations TF Serving does for inference. We should update the benchmark script to include:
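A hedged sketch of that workflow: export a SavedModel with an explicit serving signature, then point TF Serving at it (the paths and the signature layout are illustrative, not the exact ones used here):

```python
import tensorflow as tf

from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-uncased")

# Trace an explicit serving signature so the SavedModel exposes a
# concrete function for TF Serving to call.
@tf.function(input_signature=[tf.TensorSpec((None, None), tf.int32, name="input_ids")])
def serving_fn(input_ids):
    outputs = model(input_ids)
    return {"last_hidden_state": outputs[0]}

# The version subdirectory ("1") follows TF Serving's expected layout.
tf.saved_model.save(model, "saved_model/bert/1", signatures={"serving_default": serving_fn})
```

The saved_model/bert directory can then be mounted into a tensorflow/serving container and queried over its REST or gRPC API, which is where the serving-side inference optimizations jplu refers to come into play.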
Will be integrated into PR #7753.
What does this PR do?
This PR fixes a performance issue where some operations were done on CPU instead of GPU, which would leave the GPU sitting idle. This optimization is made possible by the recent update to the way we load the TF weights.
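One way to observe the kind of misplacement this PR fixes is to enable TensorFlow's device placement logging and watch where the ops land (a diagnostic sketch, not part of the PR itself):

```python
import tensorflow as tf

# Log the device each op is placed on; ops pinned to /device:CPU:0
# in the output are candidates for this kind of optimization.
tf.debugging.set_log_device_placement(True)

from transformers import TFBertModel

model = TFBertModel.from_pretrained("bert-base-uncased")
_ = model(tf.constant([[101, 2023, 2003, 1037, 3231, 102]]))
```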
@patrickvonplaten I have made a few changes in the TFLongformer model, but I'm sure it can be further optimized in the same way (see TFLongformerSelfAttention). As I don't know much about how this model works, can you take a look at whether the same optimization can be applied?

Fixes #6771