[BUG]: assert grad_chunk.l2_norm is not None #6102

Closed
1 task done
liangzz1991 opened this issue Oct 25, 2024 · 12 comments

@liangzz1991

Is there an existing issue for this bug?

  • I have searched the existing issues

🐛 Describe the bug

I adapted my training code to Qwen2-VL (`transformers.Qwen2VLForConditionalGeneration`). The loss is computed, but for some chunks `grad_chunk.l2_norm` is `None`, so the assertion fails. The same setup works for a text-only LLM.
image

I modified the source code to print more information:
image
Result:
image

Environment

No response

@liangzz1991 liangzz1991 added the bug Something isn't working label Oct 25, 2024
@Edenzzzz
Contributor

Edenzzzz commented Oct 25, 2024

For this kind of issue, it makes more sense to share what modifications you made, since we don't support individual changes.

@liangzz1991
Author

For this kind of issue, it makes more sense to share what modifications you made, since we don't support individual changes.

The problem is as I described. I did not change ColossalAI's logic; I am using ColossalAI to train Qwen2-VL. The only source change is the two lines shown in the second picture, which just print the error details.

The following is part of the training code. The error is raised at `optimizer.step()`, as shown in the first picture.

```
for step, batch in enumerate(prefetcher, start=st_step):
    collect_metric("time", "dataloader", time_metric)
    # print(batch)
    if step == args.train_steps:
        logger.info(f"rank-{rank} -> max train step reached ({step}/{args.train_steps}), stop training")
        break

    if not args.only_forward and one_rank_done.item() > 0:
        step -= 1  # to avoid inconsistent step between ranks
        break
    if not args.only_forward and prefetcher.next is None:
        logger.info(f"rank-{rank} -> no next batch in prefetcher, stop training")
        one_rank_done.add_(1.0)
    distributed.all_sum(one_rank_done)

    cur_global_tokens = batch["n_tokens"]
    distributed.all_sum(cur_global_tokens)

    batch_token_pass = cur_global_tokens.item() / args.tp_size * args.extra_dp_size
    global_token_pass += batch_token_pass
    cost_track_acc_tokens += batch_token_pass
    cost_track_acc_batches += args.batch_size
    extra_states["global_token_pass"] = global_token_pass

    cur_global_samples = batch["n_samples"]
    distributed.all_sum(cur_global_samples)
    batch_sample_pass = cur_global_samples.item() / args.tp_size * args.extra_dp_size
    global_sample_pass += batch_sample_pass
    extra_states["global_sample_pass"] = global_sample_pass

    # core logic
    outputs = model(**{key: batch[key] for key in forward_keys})
    dataset.track(batch["stats"])
    collect_metric("time", "forward", time_metric)

    reduction = "mean" if not args.only_forward else "sum"
    loss = lm_cross_entropy(outputs.logits, batch["labels"], reduction=reduction)

    if args.z_loss_coef > 0.0:
        loss += args.z_loss_coef * lm_z_loss(outputs.logits).to(loss.device)

    global_aux_loss = 0.0
    if hasattr(outputs, "aux_loss") and outputs.aux_loss is not None:
        aux_loss = outputs.aux_loss.to(loss.device)
        distributed.all_mean(aux_loss)
        global_aux_loss = aux_loss.item()

        if args.add_aux_loss:
            loss += args.router_aux_loss_coef * outputs.aux_loss.to(loss.device)

    loss /= args.grad_accum_steps
    print(loss)
    if not args.only_forward:
        booster.backward(loss, optimizer)
    collect_metric("time", "backward", time_metric)

    if not args.only_forward and step % args.grad_accum_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```

@liangzz1991
Author

The third picture shows that loss can be printed
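
A loss that prints correctly only shows that the forward pass ran; it does not show that every parameter took part in it. The sketch below uses plain PyTorch (no ColossalAI, no Gemini) and a hypothetical two-branch toy model, `TwoBranch`, to show how the parameters of a branch that is skipped in the forward pass end up with no gradient after `backward()`. Whether this is what leaves `grad_chunk.l2_norm` unset under Gemini is an assumption, not something confirmed in this thread.

```python
import torch
import torch.nn as nn


class TwoBranch(nn.Module):
    """Hypothetical toy model: a text branch plus an optional vision branch."""

    def __init__(self):
        super().__init__()
        self.text = nn.Linear(4, 4)
        self.vision = nn.Linear(4, 4)

    def forward(self, x, image=None):
        out = self.text(x)
        # The vision branch only runs when an image tensor is supplied.
        if image is not None:
            out = out + self.vision(image)
        return out


def params_without_grad(model):
    """Names of trainable parameters whose .grad is still None."""
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


model = TwoBranch()
loss = model(torch.randn(2, 4)).sum()  # text-only call, no image
loss.backward()

unused = params_without_grad(model)
print(unused)  # ['vision.weight', 'vision.bias']
```

Running a check like `params_without_grad` on a small non-sharded copy of the model, right after `backward()`, is one way to see whether a whole submodule was bypassed by the inputs.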

@Edenzzzz
Contributor

Seems that your code is not the newest version. Could you pull the newest main branch and try again?

@liangzz1991
Author

Seems that your code is not the newest version. Could you pull the newest main branch and try again?

torch 2.4.0 + colossalai 0.4.5 : same error

image

image

@liangzz1991
Author

This is a simple piece of code that will give the same error:

```
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel,GPT2Config
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

from colossalai.nn.optimizer import HybridAdam
import colossalai
from colossalai.booster import Booster
from colossalai.lazy import LazyInitContext
from colossalai.booster.plugin import GeminiPlugin
import os
# os.environ["RANK"] = "0"
# os.environ["LOCAL_RANK"] = "0"

# class GPTLMModel(nn.Module):

#     def __init__(self,
#                  hidden_size=768,
#                  num_layers=12,
#                  num_attention_heads=12,
#                  max_seq_len=1024,
#                  vocab_size=50257,
#                  checkpoint=False):
#         super().__init__()
#         self.checkpoint = checkpoint
#         self.model = GPT2LMHeadModel(
#             GPT2Config(n_embd=hidden_size,
#                        n_layer=num_layers,
#                        n_head=num_attention_heads,
#                        n_positions=max_seq_len,
#                        n_ctx=max_seq_len,
#                        vocab_size=vocab_size))
#         if checkpoint:
#             self.model.gradient_checkpointing_enable()

#     def forward(self, input_ids, attention_mask):
#         return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]

# def gpt2_medium(checkpoint=False):
#     return GPTLMModel(hidden_size=1024, num_layers=24, num_attention_heads=16, checkpoint=checkpoint)

class GPTLMLoss(nn.Module):

    def __init__(self):
        super().__init__()
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, logits, labels):
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    
def get_data(batch_size, seq_len, vocab_size):
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=torch.cuda.current_device())
    attention_mask = torch.ones_like(input_ids)
    return input_ids, attention_mask

def main():
    # args = parse_args()
    BATCH_SIZE = 8
    SEQ_LEN = 1024
    VOCAB_SIZE = 50257
    NUM_STEPS = 10
    colossalai.launch_from_torch()
    
    # build the model (Qwen2-VL in place of the commented-out GPT-2)
    with LazyInitContext(default_device=torch.device('cuda')):
        # model = gpt2_medium(checkpoint=True)
        model = Qwen2VLForConditionalGeneration.from_pretrained(
            "/data/liuxiaoyu/liuxiaoyu/models/Qwen2-VL-2B-Instruct", device_map="auto"
        )
    # build criterion
    criterion = GPTLMLoss()
    optimizer = HybridAdam(model.parameters(), lr=0.001)

    torch.manual_seed(123)
    


    # Gemini + ZeRO DP
    plugin = GeminiPlugin(max_norm=1.0, initial_scale=2**5)
    booster = Booster(plugin=plugin)
    model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)

    torch.cuda.synchronize()
    model.train()
    for n in range(NUM_STEPS):
        print(n)
        # we just use randomly generated data here
        input_ids, attn_mask = get_data(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)
        optimizer.zero_grad()
        # text-only call: no pixel_values or image_grid_thw are passed
        outputs = model(input_ids, attn_mask)
        # print(outputs)
        loss = criterion(outputs.logits, input_ids)
        booster.backward(loss, optimizer)
        optimizer.step()

    torch.cuda.synchronize()
    
if __name__ == '__main__':
    main()
```

@liangzz1991
Author

@Edenzzzz

@liangzz1991
Author

@wangbluo @gothicx

@Edenzzzz
Contributor

@botbw Any insights? Thanks

@liangzz1991
Author

@botbw Any insights? Thanks

It was my mistake. `Qwen2VLForConditionalGeneration` needs the complete set of inputs: `input_ids`, `attention_mask`, `pixel_values`, and `image_grid_thw`. The forward pass still runs with only `input_ids` and `attention_mask`, but then the error I reported above appears.
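
The requirement described above can be checked explicitly before the forward call, so that an incomplete batch fails early instead of surfacing later at `optimizer.step()`. This is a minimal sketch in plain Python; the helper names `missing_inputs` and `check_batch` are hypothetical, and the key list is taken from this comment rather than from the transformers documentation.

```python
# Keys named in this thread as required for Qwen2-VL training batches.
REQUIRED_KEYS = ("input_ids", "attention_mask", "pixel_values", "image_grid_thw")


def missing_inputs(batch, required=REQUIRED_KEYS):
    """Return the required keys that are absent from `batch` or set to None."""
    return [key for key in required if batch.get(key) is None]


def check_batch(batch, required=REQUIRED_KEYS):
    """Raise early if the batch lacks any input the model is expected to receive."""
    missing = missing_inputs(batch, required)
    if missing:
        raise ValueError(f"batch is missing model inputs: {missing}")
    return batch


# A text-only batch, like the one in the reproduction script above.
text_only = {"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]}
print(missing_inputs(text_only))  # ['pixel_values', 'image_grid_thw']
```

Calling `check_batch(batch)` just before `model(**batch)` would turn the silent text-only path into an immediate, readable error.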

@botbw botbw closed this as completed Nov 11, 2024