
Benchmarking Prediction Speed #126

Closed · jaderabbit opened this issue Dec 18, 2018 · 17 comments
Labels: Discussion, wontfix

@jaderabbit
Contributor

In reference to the following tweet:

Would it be possible to do a benchmark on the speed of prediction? I was working with the TensorFlow version of BERT, but it uses the new Estimators and I'm struggling to find a straightforward way to benchmark it since everything gets hidden in layers of computation graph. I'd imagine PyTorch is more forgiving in this regard.

@thomwolf
Member

Do you have a dataset in mind for the benchmark?
We can do a simple benchmark by timing the duration of evaluation on the SQuAD dev set for example.

@jaderabbit
Contributor Author

Yes, that would be perfect! Ideally, it would exclude loading and setting up the model (something that the tf implementation literally does not allow for :P)

@thomwolf
Member

Hi Jade,

I did some benchmarking on a V100 GPU. You can check the script I used on the benchmark branch (mostly added timing to run_squad).

Here are the results:
[chart: prediction_speed_bert_1]

| max_seq_length | fp32 | fp16 |
| --- | --- | --- |
| 384 | 140 | 352 |
| 256 | 230 | 751 |
| 128 | 488 | 1600 |
| 64 | 1030 | 3663 |

I will take a look at an older K80 (without fp16 support) when I have time.
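
For readers who want to reproduce this, here is a minimal sketch of the kind of timing that could be wrapped around the prediction loop (illustrative only, not the actual benchmark-branch code; `model`, `eval_dataloader` and `device` are assumed to be set up as in run_squad.py):

```python
import time

import torch

# Illustrative timing wrapper around a prediction loop. Model loading and
# feature conversion happen before the clock starts, so only inference is timed.
model.eval()
num_examples = 0
start = time.time()
with torch.no_grad():
    for input_ids, input_mask, segment_ids, example_indices in eval_dataloader:
        input_ids = input_ids.to(device)
        input_mask = input_mask.to(device)
        segment_ids = segment_ids.to(device)
        start_logits, end_logits = model(input_ids, segment_ids, input_mask)
        num_examples += input_ids.size(0)
if torch.cuda.is_available():
    torch.cuda.synchronize()  # make sure all queued GPU work is finished before stopping the clock
elapsed = time.time() - start
print("%d examples in %.1f s (%.1f examples/s)" % (num_examples, elapsed, num_examples / elapsed))
```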

@jaderabbit
Contributor Author

This is fantastic! Thank you so so so so much!

If you get a chance to do the K80, that would be brilliant. I'll try to run it when I get time. I'm currently doing a cost-versus-speed comparison just to get a feel.

@thomwolf
Member

thomwolf commented Dec 19, 2018

You can run it like this for fp32 (just remove --do_train):

python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

And like this for fp16 (add --predict_fp16):

python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --predict_fp16 \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/

Adjust --predict_batch_size 128 to fill your GPU to at least around 50%, and adjust --max_seq_length 384 to test various sequence lengths. For small sequences (under 64 tokens) we should deactivate the windowing (related to --doc_stride). I didn't take the time to do that, so the dataset reading didn't work (hence the missing data point).
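
A hedged sketch of driving that sweep from Python instead of editing the command by hand (the flag names mirror the commands above; run_squad.py and the SQUAD_DIR environment variable are assumed to be set up as shown):

```python
import os
import subprocess

# Hypothetical sweep over --max_seq_length, reusing the prediction command above.
SQUAD_DIR = os.environ["SQUAD_DIR"]

for max_seq_length in (384, 256, 128, 64):
    subprocess.run(
        [
            "python", "run_squad.py",
            "--bert_model", "bert-base-uncased",
            "--do_predict",
            "--do_lower_case",
            "--train_file", os.path.join(SQUAD_DIR, "train-v1.1.json"),
            "--predict_file", os.path.join(SQUAD_DIR, "dev-v1.1.json"),
            "--predict_batch_size", "128",
            "--max_seq_length", str(max_seq_length),
            # As noted above, short sequences may need the windowing disabled
            # or a --doc_stride smaller than 128.
            "--doc_stride", "128",
            "--output_dir", "/tmp/debug_squad_%d/" % max_seq_length,
        ],
        check=True,
    )
```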

@jaderabbit
Contributor Author

Fantastic. Tomorrow I'm going to run it for some smaller max sequence lengths (useful for my use case) and on some other GPUs: the Tesla M60 and then the K80.

@jaderabbit
Contributor Author

jaderabbit commented Jan 2, 2019

Managed to replicate your results on the V100. :)

Also, I've done the experiments below for sequences of length 64 on different GPUs. Will do the other sequence lengths when I get a chance.

| GPU | max_seq_length | fp32 | fp16 |
| --- | --- | --- | --- |
| Tesla M60 | 64 | 210 | N/A |
| Tesla K80 | 64 | 143 | N/A |

@rodgzilla
Contributor

@thomwolf @jaderabbit Thank you for the experiments.

I think these results deserve more visibility, maybe a dedicated markdown page or a section in the README.md?

@thomwolf
Member

thomwolf commented Jan 7, 2019

You are right, Gregory.
The README is getting too big, in my opinion.
I will try to set up a Sphinx/ReadTheDocs online doc later this month (feel free to start a PR if you have experience with this kind of thing).

@rodgzilla
Contributor

I'm more or less new to Sphinx, but I would be happy to work on it with you.

@thomwolf
Member

thomwolf commented Jan 7, 2019

Sure, if you want to help, that could definitely speed up the process.

The first step would be to create a new branch to work on, with a doc folder, and then generate the doc in that folder using Sphinx.

Good introductions to Sphinx and ReadTheDocs are here: http://www.ericholscher.com/blog/2016/jul/1/sphinx-and-rtd-for-writers/
and here: https://docs.readthedocs.io/en/latest/intro/getting-started-with-sphinx.html

We will need to add some dependencies for the docs, but we should strive to keep them as light as possible.
Here is an example of a repo I've worked on recently (still a draft, but the doc is functional): https://github.com/huggingface/adversarialnlp

@apurvaasf

apurvaasf commented Jan 9, 2019

Hi @thomwolf ,
I am looking to deploy a pre-trained squad-bert model to make predictions in real-time.
Right now when I run:
python run_squad.py \
  --bert_model bert-base-uncased \
  --do_predict \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/test.json \
  --predict_batch_size 128 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/
it takes 22 seconds to generate the prediction. Is there a way to reduce the amount of time taken to less than a second?

The "test.json" has one context and 1 question on the same. It looks like this:
{ "data": [ { "title": "Arjun", "paragraphs": [ { "context": "Arjun died in 1920. The American Football Club (AFC) celebrated this death. Arjun now haunts NFC. He used to love playing football. But nobody liked him.", "qas": [ { "question": "When did Arjun die?", "id": "56be4db0acb8001400a502ed" } ] } ] } ] }

Please help me with this. I switched to using the PyTorch implementation hoping that getting a saved model and making predictions using the saved model will be easier in PyTorch.

@jaderabbit
Contributor Author

@apurvaasf It might be worth opening another ticket, since that's slightly different from this one. It shouldn't be too hard to write your own code for deployment. The trick is to make sure it does all the loading once, and just calls predict each time you need a prediction.
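
For illustration, a minimal sketch of that load-once / predict-many pattern, assuming the pytorch_pretrained_bert API used in this repo at the time; the model path is a placeholder and the feature construction is heavily simplified compared to run_squad.py:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForQuestionAnswering

# Load the tokenizer and fine-tuned model once at startup: this is the slow part.
# "/path/to/finetuned_squad_model" is a placeholder for your own output_dir.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
model = BertForQuestionAnswering.from_pretrained("/path/to/finetuned_squad_model")
model.to(device)
model.eval()

def predict(question, context, max_seq_length=384):
    # Heavily simplified feature construction: [CLS] question [SEP] context [SEP].
    # A real deployment should reuse the windowing/feature code from run_squad.py.
    q_tokens = tokenizer.tokenize(question)
    c_tokens = tokenizer.tokenize(context)
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)
    tokens, segment_ids = tokens[:max_seq_length], segment_ids[:max_seq_length]
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)], device=device)
    segment_ids = torch.tensor([segment_ids], device=device)
    with torch.no_grad():
        start_logits, end_logits = model(input_ids, segment_ids)
    start = start_logits.argmax(dim=1).item()
    end = end_logits.argmax(dim=1).item()
    return " ".join(tokens[start:end + 1])  # crude word-piece join, for illustration only

# Each call now only pays for tokenization and a single forward pass.
print(predict("When did Arjun die?", "Arjun died in 1920. The American Football Club ..."))
```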

@stale

stale bot commented May 5, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 5, 2019
@stale stale bot closed this as completed May 12, 2019
@hamediramin

Hi @thomwolf and thanks for the amazing implementation. I wonder what is the inference speed with a 512 batch size. It seems to take a lot of time to convert to GPU (1000msec for a batch size of 32) and I wonder if there is any quick speedup/fix. I am concerned with the latency rather than the throughput.

@mitsuix

mitsuix commented Aug 6, 2019

> Hi @thomwolf and thanks for the amazing implementation. I wonder what is the inference speed with a 512 batch size. It seems to take a lot of time to convert to GPU (1000msec for a batch size of 32) and I wonder if there is any quick speedup/fix. I am concerned with the latency rather than the throughput.

Have you found any solutions? I've run into the same problem:
the inference time is fast, but it takes a lot of time to copy data to the GPU and copy the result back to the CPU for post-processing.

@CaesarWWK

CaesarWWK commented Jun 3, 2020

> Hi @thomwolf and thanks for the amazing implementation. I wonder what is the inference speed with a 512 batch size. It seems to take a lot of time to convert to GPU (1000msec for a batch size of 32) and I wonder if there is any quick speedup/fix. I am concerned with the latency rather than the throughput.
>
> Have you found any solutions? I've run into the same problem:
> the inference time is fast, but it takes a lot of time to copy data to the GPU and copy the result back to the CPU for post-processing.

albanD commented on 25 Mar
Hi,

We use GitHub issues only for bugs or feature requests.
Please use the forum to ask questions: https://discuss.pytorch.org/, as mentioned in the template you used.

Note that in your case, you are most likely missing torch.cuda.synchronize() when timing your GPU code, which makes the copy look much slower than it is, because it has to wait for the rest of the work to be done.

(pytorch/pytorch#35292)
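
To make that concrete, a small hedged sketch of latency timing; `model` and `batch` are placeholders, and the key detail is the torch.cuda.synchronize() calls around the timed region:

```python
import time

import torch

# CUDA kernels launch asynchronously, so without a synchronize the time shows up
# in whatever call forces a sync (often the copy back to the CPU) rather than
# where the compute actually happens.
def timed_forward(model, batch, device):
    model.eval()
    with torch.no_grad():
        torch.cuda.synchronize()   # wait for any pending GPU work before starting the clock
        start = time.time()
        batch = batch.to(device, non_blocking=True)  # non_blocking only helps with pinned host memory
        output = model(batch)
        torch.cuda.synchronize()   # make sure the forward pass has actually finished
        elapsed = time.time() - start
    return output, elapsed
```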
