
[QUESTION] How to run and aggregate validation during training? #3043

HongtaoYang opened this issue Mar 17, 2023 · 1 comment

HongtaoYang commented Mar 17, 2023

I'm using ZeRO-3 for multi-GPU training. One thing I'm struggling with is how to periodically run validation during training. Related issue: #1863

I've made several attempts, but none work as intended.

Attempt 1: use multiple GPUs

Following a suggestion on how to run inference with multiple GPUs (huggingface/transformers#16616 (comment)), I tried this code:

ds_engine, _ = deepspeed.initialize(...)

for step, batch_data in enumerate(train_dataiter):
    # do training stuff
    # ......

    if step % 100 == 0:
        val_batch = next(val_dataiter)
        rank = torch.distributed.get_rank()
        val_batch_per_device = val_batch[rank*4:(rank+1)*4, ...]  # assume each GPU processes 4 input samples
        val_batch_per_device = val_batch_per_device.to(device=rank)

        ds_engine.module.eval()
        with torch.no_grad():
            outputs = ds_engine.module(val_batch_per_device, return_dict=True)
        ds_engine.module.train()

This way I can split the validation batch into chunks and feed each GPU a different chunk. However, I don't know how to average the outputs from each GPU.
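For reference, this is roughly the averaging step I imagine is missing (a minimal sketch only, assuming the quantity I want to aggregate is a scalar loss taken from outputs; note the all_reduce call has to be reached by every rank):

# Each rank holds the loss for its own chunk of the validation batch.
per_device_loss = outputs.loss.detach().clone()

# Sum the per-rank losses, then divide by the world size to get the
# mean over the whole validation batch.
torch.distributed.all_reduce(per_device_loss, op=torch.distributed.ReduceOp.SUM)
per_device_loss /= torch.distributed.get_world_size()

if torch.distributed.get_rank() == 0:
    print(f"validation loss: {per_device_loss.item()}")

Is something like this the intended pattern, or is there a DeepSpeed-native way to do it?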

Attempt 2: use a single GPU for validation

ds_engine, _ = deepspeed.initialize(...)

for step, batch_data in enumerate(train_dataiter):
    # do training stuff
    # ......

    rank = torch.distributed.get_rank()
    if step % 100 == 0 and rank == 0:
        val_batch = next(val_dataiter)
        val_batch = val_batch.to(device=rank)

        ds_engine.module.eval()
        with torch.no_grad():
            outputs = ds_engine.module(val_batch, return_dict=True)
        ds_engine.module.train()

In this case, I was hoping that all validation computation would happen on GPU 0, and all the other GPUs would wait for validation to finish and then proceed to the next training iteration. However, this code gets stuck indefinitely at the line outputs = ds_engine.module(val_batch, return_dict=True).
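I suspect the hang happens because ZeRO-3 partitions the parameters across ranks, so the forward pass involves collective all-gathers that every rank must join; if only rank 0 calls forward, it waits forever for the others. If that's right, a sketch of the loop body I'd have to use instead (every rank runs the same forward pass, only rank 0 reports; the broadcast is my own assumption about keeping the inputs identical on all ranks):

if step % 100 == 0:
    rank = torch.distributed.get_rank()
    val_batch = next(val_dataiter).to(device=rank)
    # Make sure every rank evaluates the same batch.
    torch.distributed.broadcast(val_batch, src=0)

    ds_engine.module.eval()
    with torch.no_grad():
        # Every rank must execute this forward pass, otherwise the
        # ZeRO-3 parameter all-gathers never complete.
        outputs = ds_engine.module(val_batch, return_dict=True)
    ds_engine.module.train()

    if rank == 0:
        print(outputs.loss.item())

Please correct me if my understanding of why attempt 2 hangs is wrong.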

Attempt 3: just use the DeepSpeed model engine

If the model engine can do training, I should be able to use the same engine for inference, right? Sadly, no.

ds_engine, _ = deepspeed.initialize(...)

for step, batch_data in enumerate(train_dataiter):
    # do training stuff
    # ......

    if step % 100 == 0:
        rank = torch.distributed.get_rank()
        ds_engine.module.eval()
        
        val_batch = next(iter(val_dataloader))
        val_batch = val_batch.to(device=rank)
        val_loss = ds_engine(val_batch).loss

It turns out that after setting the model to eval mode (ds_engine.module.eval()), I get a CUDA OOM when running the forward pass with the engine. If I don't set eval mode, this code runs with no problem. However, I still don't know how to average the loss across GPUs.

Attempt 4: use the DeepSpeed inference engine

ds_engine, _ = deepspeed.initialize(...)
ds_inference_engine = deepspeed.init_inference(...)

for step, batch_data in enumerate(train_dataiter):
    # do training stuff
    # ......

    rank = torch.distributed.get_rank()
    if step % 100 == 0:
        val_batch = next(iter(val_dataloader))
        val_batch = val_batch.to(device=rank)
        val_loss = ds_inference_engine(val_batch).loss

Again, in this case I don't know how to average val_loss across all GPUs. Also, using ds_inference_engine together with ds_engine gives a CUDA OOM error. I guess using two engines just takes double the GPU memory?
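To put the question differently, this is the kind of helper I'd like to end up with (only a sketch: it reuses the training engine rather than a separate inference engine, the name run_validation is mine, and it assumes val_dataloader uses a DistributedSampler so each rank sees a different shard and that the forward pass returns an object with a .loss):

import torch
import torch.distributed as dist

def run_validation(ds_engine, val_dataloader, device):
    ds_engine.module.eval()
    total_loss = torch.zeros(1, device=device)
    num_batches = torch.zeros(1, device=device)

    with torch.no_grad():
        for val_batch in val_dataloader:
            val_batch = val_batch.to(device)
            outputs = ds_engine.module(val_batch, return_dict=True)
            total_loss += outputs.loss
            num_batches += 1

    # Aggregate the per-rank sums, then divide to get the global mean loss.
    dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
    dist.all_reduce(num_batches, op=dist.ReduceOp.SUM)

    ds_engine.module.train()
    return (total_loss / num_batches).item()

The two all_reduce calls at the end are what I couldn't figure out in attempts 3 and 4; if this is the intended pattern, a documented example would help a lot.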

Any guidance on how to run validation with DeepSpeed? Thanks!

tjruwase (Contributor) commented Mar 21, 2023

Regarding your attempt 1, please see here for an example of how to average outputs across GPUs.
