
Benchmarking Prediction Speed #126

Closed · jaderabbit opened this issue Dec 18, 2018 · 17 comments
Labels: Discussion, wontfix

@jaderabbit
Contributor

In reference to the following tweet:

Would it be possible to do a benchmark on the speed of prediction? I was working with the TensorFlow version of BERT, but it uses the new Estimators and I'm struggling to find a straightforward way to benchmark it since everything gets hidden in layers of computation graph. I'd imagine PyTorch is more forgiving in this regard.

@thomwolf
Member

Do you have a dataset in mind for the benchmark?
We can do a simple benchmark by timing the duration of evaluation on the SQuAD dev set for example.

@jaderabbit
Contributor Author

Yes, that would be perfect! Ideally, it would exclude loading and setting up the model (something that the tf implementation literally does not allow for :P)

@thomwolf
Member

Hi Jade,

I did some benchmarking on a V100 GPU. You can check the script I used on the benchmark branch (mostly added timing to run_squad).

Here are the results:
[chart: prediction_speed_bert_1]

| max_seq_length | fp32 | fp16 |
| --- | --- | --- |
| 384 | 140 | 352 |
| 256 | 230 | 751 |
| 128 | 488 | 1600 |
| 64 | 1030 | 3663 |

I will take a look at an older K80 (without fp16 support) when I have time.
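
For readers who want to reproduce this, here is a minimal sketch of the kind of timing that could be wrapped around the prediction loop (illustrative only, not the actual benchmark-branch code; `model`, `eval_dataloader` and `device` are assumed to be set up as in run_squad.py):

```python
import time

import torch

# Illustrative timing wrapper around a prediction loop. Model loading and
# feature conversion happen before the clock starts, so only inference is timed.
model.eval()
num_examples = 0
start = time.time()
with torch.no_grad():
    for input_ids, input_mask, segment_ids, example_indices in eval_dataloader:
        input_ids = input_ids.to(device)
        input_mask = input_mask.to(device)
        segment_ids = segment_ids.to(device)
        start_logits, end_logits = model(input_ids, segment_ids, input_mask)
        num_examples += input_ids.size(0)
if torch.cuda.is_available():
    torch.cuda.synchronize()  # make sure all queued GPU work is finished before stopping the clock
elapsed = time.time() - start
print("%d examples in %.1f s (%.1f examples/s)" % (num_examples, elapsed, num_examples / elapsed))
```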

@jaderabbit
Contributor Author

This is fantastic! Thank you so so so so much!

If you get a chance to do the K80, that would be brilliant. I'll try to run it when I get time. I'm currently doing a cost-versus-speed comparison just to get a feel.

@thomwolf
Member

thomwolf commented Dec 19, 2018

You can run it like this for fp32 (just remove --do_train):

python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

And like this for fp16 (add --predict_fp16):

python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --predict_fp16 \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

Adjust --predict_batch_size 128 to fill your GPU to at least around 50%, and adjust --max_seq_length 384 to test various sequence lengths. For small sequences (under 64 tokens) we should deactivate the windowing (related to --doc_stride). I didn't take the time to do that, so the dataset reading didn't work (hence the missing data point).
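
A hedged sketch of driving that sweep from Python instead of editing the command by hand (the flag names mirror the commands above; run_squad.py and the SQUAD_DIR environment variable are assumed to be set up as shown):

```python
import os
import subprocess

# Hypothetical sweep over --max_seq_length, reusing the prediction command above.
SQUAD_DIR = os.environ["SQUAD_DIR"]

for max_seq_length in (384, 256, 128, 64):
    subprocess.run(
        [
            "python", "run_squad.py",
            "--bert_model", "bert-base-uncased",
            "--do_predict",
            "--do_lower_case",
            "--train_file", os.path.join(SQUAD_DIR, "train-v1.1.json"),
            "--predict_file", os.path.join(SQUAD_DIR, "dev-v1.1.json"),
            "--predict_batch_size", "128",
            "--max_seq_length", str(max_seq_length),
            # As noted above, short sequences may need the windowing disabled
            # or a --doc_stride smaller than 128.
            "--doc_stride", "128",
            "--output_dir", "/tmp/debug_squad_%d/" % max_seq_length,
        ],
        check=True,
    )
```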

@jaderabbit
Contributor Author

Fantastic. Tomorrow I'm going to run it for some smaller max sequence lengths (useful for my use case) and on some other GPUs: the Tesla M60 and then the K80.

@jaderabbit
Contributor Author

jaderabbit commented Jan 2, 2019

Managed to replicate your results on the V100. :)

Also, I've done the experiments below for sequences of length 64 on different GPUs. Will do the other sequence lengths when I get a chance.

| GPU | max_seq_length | fp32 | fp16 |
| --- | --- | --- | --- |
| Tesla M60 | 64 | 210 | N/A |
| Tesla K80 | 64 | 143 | N/A |

@rodgzilla
Contributor

@thomwolf @jaderabbit Thank you for the experiments.

I think these results deserve more visibility, maybe a dedicated markdown page or a section in the README.md?

@thomwolf
Member

thomwolf commented Jan 7, 2019

You are right, Gregory.
The README is getting too big, in my opinion.
I will try to set up a Sphinx/ReadTheDocs online doc later this month (feel free to start a PR if you have experience with this kind of thing).

@rodgzilla
Contributor

I'm more or less new to Sphinx, but I would be happy to work on it with you.

@thomwolf
Member

thomwolf commented Jan 7, 2019

Sure, if you want to help, that could definitely speed up the process.

The first step would be to create a new branch to work on, with a doc folder, and then generate the doc in that folder using Sphinx.

Good introductions to Sphinx and ReadTheDocs are here: http://www.ericholscher.com/blog/2016/jul/1/sphinx-and-rtd-for-writers/
and here: https://docs.readthedocs.io/en/latest/intro/getting-started-with-sphinx.html

We will need to add some dependencies for the docs, but we should strive to keep them as light as possible.
Here is an example of a repo I've worked on recently (still a draft, but the doc is functional): https://github.com/huggingface/adversarialnlp

@apurvaasf

apurvaasf commented Jan 9, 2019

Hi @thomwolf ,
I am looking to deploy a pre-trained squad-bert model to make predictions in real-time.
Right now when I run:
python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/test.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
it takes 22 seconds to generate the prediction. Is there a way to reduce the amount of time taken to less than a second?

The "test.json" has one context and 1 question on the same. It looks like this:
{ "data": [ { "title": "Arjun", "paragraphs": [ { "context": "Arjun died in 1920. The American Football Club (AFC) celebrated this death. Arjun now haunts NFC. He used to love playing football. But nobody liked him.", "qas": [ { "question": "When did Arjun die?", "id": "56be4db0acb8001400a502ed" } ] } ] } ] }

Please help me with this. I switched to using the PyTorch implementation hoping that getting a saved model and making predictions using the saved model will be easier in PyTorch.

@jaderabbit
Contributor Author

@apurvaasf It might be worth opening another ticket, since that's slightly different from this one. It shouldn't be too hard to write your own code for deployment. The trick is to make sure it does all the loading once, and just calls predict each time you need a prediction.
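
For illustration, a minimal sketch of that load-once / predict-many pattern, assuming the pytorch_pretrained_bert API used in this repo at the time; the model path is a placeholder and the feature construction is heavily simplified compared to run_squad.py:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForQuestionAnswering

# Load the tokenizer and fine-tuned model once at startup: this is the slow part.
# "/path/to/finetuned_squad_model" is a placeholder for your own output_dir.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertForQuestionAnswering.from_pretrained("/path/to/finetuned_squad_model")
model.to(device)
model.eval()

def predict(question, context, max_seq_length=384):
    # Heavily simplified feature construction: [CLS] question [SEP] context [SEP].
    # A real deployment should reuse the windowing/feature code from run_squad.py.
    q_tokens = tokenizer.tokenize(question)
    c_tokens = tokenizer.tokenize(context)
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)
    tokens, segment_ids = tokens[:max_seq_length], segment_ids[:max_seq_length]
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)], device=device)
    segment_ids = torch.tensor([segment_ids], device=device)
    with torch.no_grad():
        start_logits, end_logits = model(input_ids, segment_ids)
    start = start_logits.argmax(dim=1).item()
    end = end_logits.argmax(dim=1).item()
    return " ".join(tokens[start:end + 1])  # crude word-piece join, for illustration only

# Each call now only pays for tokenization and a single forward pass.
print(predict("When did Arjun die?", "Arjun died in 1920. The American Football Club ..."))
```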

@stale

stale bot commented May 5, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 5, 2019
@stale stale bot closed this as completed May 12, 2019
@hamediramin

Hi @thomwolf and thanks for the amazing implementation. I wonder what is the inference speed with a 512 batch size. It seems to take a lot of time to convert to GPU (1000msec for a batch size of 32) and I wonder if there is any quick speedup/fix. I am concerned with the latency rather than the throughput.

@mitsuix

mitsuix commented Aug 6, 2019

> Hi @thomwolf and thanks for the amazing implementation. I wonder what is the inference speed with a 512 batch size. It seems to take a lot of time to convert to GPU (1000msec for a batch size of 32) and I wonder if there is any quick speedup/fix. I am concerned with the latency rather than the throughput.

Have you found any solutions? I've run into the same problem:
the inference time is fast, but it takes a lot of time to copy data to the GPU and copy the result back to the CPU for post-processing.

@CaesarWWK

CaesarWWK commented Jun 3, 2020

> Hi @thomwolf and thanks for the amazing implementation. I wonder what is the inference speed with a 512 batch size. It seems to take a lot of time to convert to GPU (1000msec for a batch size of 32) and I wonder if there is any quick speedup/fix. I am concerned with the latency rather than the throughput.
>
> Have you found any solutions? I've run into the same problem:
> the inference time is fast, but it takes a lot of time to copy data to the GPU and copy the result back to the CPU for post-processing.

albanD commented on 25 Mar
Hi,

We use GitHub issues only for bugs or feature requests.
Please use the forum to ask questions: https://discuss.pytorch.org/, as mentioned in the template you used.

Note that in your case, you are most likely missing torch.cuda.synchronize() when timing your GPU code, which makes the copy look much slower than it is, because it has to wait for the rest of the work to be done.

(pytorch/pytorch#35292)
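
To make that concrete, a small hedged sketch of latency timing; `model` and `batch` are placeholders, and the key detail is the torch.cuda.synchronize() calls around the timed region:

```python
import time

import torch

# CUDA kernels launch asynchronously, so without a synchronize the time shows up
# in whatever call forces a sync (often the copy back to the CPU) rather than
# where the compute actually happens.
def timed_forward(model, batch, device):
    model.eval()
    with torch.no_grad():
        torch.cuda.synchronize()   # wait for any pending GPU work before starting the clock
        start = time.time()
        batch = batch.to(device, non_blocking=True)  # non_blocking only helps with pinned host memory
        output = model(batch)
        torch.cuda.synchronize()   # make sure the forward pass has actually finished
        elapsed = time.time() - start
    return output, elapsed
```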
