
Problem when speeding up fine-tuning bert-base-uncased on ReCoRD #1099

Open
jeswan opened this issue Sep 17, 2020 · 1 comment

Comments


jeswan commented Sep 17, 2020

Issue by ThangPM
Saturday Jun 27, 2020 at 16:53 GMT
Originally opened as nyu-mll/jiant#1099


Hello,

I am trying to reproduce the result for the ReCoRD task by fine-tuning the bert-base-uncased model, but it takes days on a single GPU (Tesla V100) because the training set is quite large (~1.13M examples).

python main.py --config jiant/config/superglue_bert.conf --overrides random_seed = 42, cuda = 0, run_name = record, pretrain_tasks = "record", target_tasks = "record", do_pretrain = 1, do_target_task_training = 0, do_full_eval = 1, batch_size = 8, val_interval = 10000, val_data_limit = -1

06/26 12:38:54 PM: Update 340556: task record, steps since last val 556 (total steps = 340556): f1: 0.5516, em: 0.5403, avg: 0.5459, record_loss: 0.1962
06/26 12:39:04 PM: Update 340603: task record, steps since last val 603 (total steps = 340603): f1: 0.5577, em: 0.5446, avg: 0.5512, record_loss: 0.1899

It takes 10 seconds for (340603 - 340556) = 47 steps.

I decided to speed up this process by using 8 GPUs (still Tesla V100) and increasing batch_size from 8 to 128, but it seems to take even longer than with 1 GPU.

python main.py --config jiant/config/superglue_bert.conf --overrides random_seed = 42, cuda = auto, run_name = record, pretrain_tasks = "record", target_tasks = "record", do_pretrain = 1, do_target_task_training = 0, do_full_eval = 1, batch_size = 128, val_interval = 10000, val_data_limit = -1

06/27 12:50:54 PM: Update 452155: task record, steps since last val 2155 (total steps = 452155): f1: 0.2494, em: 0.2414, avg: 0.2454, record_loss: 0.3956
06/27 12:51:08 PM: Update 452170: task record, steps since last val 2170 (total steps = 452170): f1: 0.2492, em: 0.2413, avg: 0.2453, record_loss: 0.3953

Now it takes around 14 seconds for only 15 steps. Am I doing something wrong, or is this an issue?
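(For reference, a back-of-the-envelope check using the step counts, timings, and batch sizes from the log excerpts above. Note that steps per second is not directly comparable across the two runs, since each step at batch_size=128 processes 16x as many examples as a step at batch_size=8; this converts both to examples per second. The helper function below is just for illustration, not part of jiant.)

```python
# Rough throughput comparison from the two log excerpts above.
def throughput(steps, seconds, batch_size):
    """Examples processed per second = (steps / seconds) * batch_size."""
    return steps / seconds * batch_size

# 1 GPU run: 340603 - 340556 = 47 steps in ~10 s at batch_size = 8
single_gpu = throughput(340603 - 340556, 10, 8)

# 8 GPU run: 452170 - 452155 = 15 steps in ~14 s at batch_size = 128
multi_gpu = throughput(452170 - 452155, 14, 128)

print(f"1 GPU:  {single_gpu:.1f} examples/sec")   # ~37.6 examples/sec
print(f"8 GPUs: {multi_gpu:.1f} examples/sec")    # ~137.1 examples/sec
```

So in examples per second the 8-GPU run is actually ~3.6x faster, even though its steps per second is lower; the fewer-steps-per-second numbers alone can be misleading.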

Any comments would be appreciated.


jeswan commented Sep 17, 2020

Comment by sleepinyourhat
Tuesday Jun 30, 2020 at 19:09 GMT


@phu-pmh, @pruksmhc - Any guess what's up here?
