
Problem when speeding up fine-tuning bert-base-uncased on ReCoRD #1099

Closed
ThangPM opened this issue Jun 27, 2020 · 2 comments
Labels
jiant-v1-legacy Relevant to versions <= v1.3.2

Comments

ThangPM commented Jun 27, 2020

Hello,

I am trying to reproduce the results for the ReCoRD task by fine-tuning the bert-base-uncased model, but it takes days on a single GPU (Tesla V100) because the training set is quite large (~1.13M examples).

python main.py --config jiant/config/superglue_bert.conf --overrides random_seed = 42, cuda = 0, run_name = record, pretrain_tasks = "record", target_tasks = "record", do_pretrain = 1, do_target_task_training = 0, do_full_eval = 1, batch_size = 8, val_interval = 10000, val_data_limit = -1

06/26 12:38:54 PM: Update 340556: task record, steps since last val 556 (total steps = 340556): f1: 0.5516, em: 0.5403, avg: 0.5459, record_loss: 0.1962
06/26 12:39:04 PM: Update 340603: task record, steps since last val 603 (total steps = 340603): f1: 0.5577, em: 0.5446, avg: 0.5512, record_loss: 0.1899

It takes about 10 seconds for (340603 - 340556) = 47 steps.

To speed this up, I switched to 8 GPUs (also Tesla V100s) and increased batch_size from 8 to 128, but training now seems even slower than on 1 GPU.

python main.py --config jiant/config/superglue_bert.conf --overrides random_seed = 42, cuda = auto, run_name = record, pretrain_tasks = "record", target_tasks = "record", do_pretrain = 1, do_target_task_training = 0, do_full_eval = 1, batch_size = 128, val_interval = 10000, val_data_limit = -1

06/27 12:50:54 PM: Update 452155: task record, steps since last val 2155 (total steps = 452155): f1: 0.2494, em: 0.2414, avg: 0.2454, record_loss: 0.3956
06/27 12:51:08 PM: Update 452170: task record, steps since last val 2170 (total steps = 452170): f1: 0.2492, em: 0.2413, avg: 0.2453, record_loss: 0.3953

Now it takes around 14 seconds for only 15 steps. Am I doing something wrong, or is this a bug?
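
For reference, here is a rough back-of-the-envelope throughput comparison based on the two log excerpts above. It assumes that batch_size counts the examples consumed per logged step, regardless of how many GPUs are used; I have not verified how jiant actually splits the batch across GPUs, so treat this only as a sanity check.

# Rough throughput from the log excerpts above.
# Assumption (not verified against jiant internals): batch_size is the
# number of examples consumed per logged step on any number of GPUs.

runs = {
    "1 GPU,  batch_size=8":   {"steps": 340603 - 340556, "seconds": 10, "batch_size": 8},
    "8 GPUs, batch_size=128": {"steps": 452170 - 452155, "seconds": 14, "batch_size": 128},
}

for name, run in runs.items():
    steps_per_sec = run["steps"] / run["seconds"]
    examples_per_sec = steps_per_sec * run["batch_size"]
    print(f"{name}: {steps_per_sec:.2f} steps/s, ~{examples_per_sec:.0f} examples/s")

# 1 GPU,  batch_size=8:   4.70 steps/s, ~38 examples/s
# 8 GPUs, batch_size=128: 1.07 steps/s, ~137 examples/s

Under that assumption the 8-GPU run processes more examples per second but far fewer optimizer steps per second, so I am not sure how to interpret the slowdown.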

Any comments would be appreciated.

sleepinyourhat (Contributor) commented:

@phu-pmh, @pruksmhc - Any guess what's up here?


zphang (Collaborator) commented Oct 16, 2020

This is an automatically generated comment.

As we update jiant to v2.x, jiant v1.x has been migrated to https://github.com/nyu-mll/jiant-v1-legacy. As such, we are closing all issues relating to jiant v1.x in this repository.

If this issue is still affecting you in jiant v1.x, please follow up at nyu-mll/jiant-v1-legacy#1099.

If this issue is still affecting you in jiant v2.x, reopen this issue or create a new one.

zphang closed this as completed on Oct 16, 2020