[BUG] Error in llama3.1 resizing embedding with ZeRO 3 #26

Closed · Gaiejj opened this issue Jul 25, 2024 · 1 comment
Labels: bug (Something isn't working)

Gaiejj (Member) commented Jul 25, 2024

Required prerequisites

What version of align-anything are you using?

0.1.0-dev

System information

  • transformers version: 4.43.1
  • Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.9
  • Huggingface_hub version: 0.24.1
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:

Problem description

Llama 3.1 is naturally supported by the training, evaluation, and deployment modules of Align-Anything. However, according to our tests, an issue in the current transformers release prevents it from being trained with DeepSpeed ZeRO-3 for now. Our developers have reported the issue to the transformers community, have received a clear response, and will continue to follow up.

This bug may also affect the training of other model types. If you need a stable setup for training in the meantime, you can temporarily pin transformers to version 4.41.2.

If you want to fine-tune Llama 3.1, we have verified that ZeRO-2 training runs without errors on the latest 4.43.0 version of transformers.
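
For reference, a guard along the following lines can catch the incompatible combination before training starts. This is only a sketch, not part of the original report: the config path is a placeholder, it assumes a standard DeepSpeed JSON with a zero_optimization.stage field, and the flagged version range is limited to the 4.43 series reported in this issue.

import json

import transformers
from packaging import version

DS_CFGS_PATH = 'PATH/TO/ds_config.json'  # placeholder path

with open(DS_CFGS_PATH) as f:
    zero_stage = json.load(f).get('zero_optimization', {}).get('stage', 0)

tf_version = version.parse(transformers.__version__)
# Assumption: only the 4.43.x releases reported in this issue are flagged here.
if zero_stage == 3 and version.parse('4.43.0') <= tf_version < version.parse('4.44.0'):
    raise RuntimeError(
        f'transformers {transformers.__version__} cannot resize embeddings under '
        f'DeepSpeed ZeRO-3 (see this issue); pin transformers==4.41.2 or switch to ZeRO-2.'
    )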

Reproducible example code

import contextlib
import json

import torch
import deepspeed

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)

from transformers.integrations.deepspeed import (
    HfDeepSpeedConfig,
    is_deepspeed_zero3_enabled,
)


DEFAULT_BOS_TOKEN: str = '<s>'
DEFAULT_EOS_TOKEN: str = '</s>'
DEFAULT_PAD_TOKEN: str = '<pad>'
DEFAULT_UNK_TOKEN: str = '<unk>'

model_name_or_path = 'PATHTO/Llama-3.1'
ds_cfgs_path = 'PATH'

deepspeed.init_distributed()

with open(ds_cfgs_path) as f:
    ds_cfgs = json.load(f)
    ds_cfgs['bf16']['enabled'] = True

dstchf = HfDeepSpeedConfig(ds_cfgs)

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    model_max_length=2048,
    padding_side='right',
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Reference: https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py
def resize_tokenizer_embedding(tokenizer, model) -> None:
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    def init_new_embeddings(
        embeddings,
        new_num_embeddings: int,
        num_new_embeddings: int,
    ) -> None:
        if embeddings is None:
            return

        params = [embeddings.weight]
        # Under transformers 4.43.1 this prints True (the weight is already a
        # ZeRO-3 partitioned parameter carrying `ds_id`); under 4.41.2 it prints False.
        print(hasattr(embeddings.weight, 'ds_id'))
        context = (
            deepspeed.zero.GatheredParameters(params, modifier_rank=0)
            if is_deepspeed_zero3_enabled()
            else contextlib.nullcontext()
        )
        with context:
            for param in params:
                if param is None:
                    continue
                assert param.size(0) == new_num_embeddings, f'{param.size(0)}, {new_num_embeddings}'
                # Bug: under ZeRO-3 with transformers 4.43.1 this assertion fails;
                # the param size is 32000 while new_num_embeddings is 32001.
                param_data = param.data
                param_mean = param_data[:-num_new_embeddings].mean(dim=0, keepdim=True)
                param_data[-num_new_embeddings:] = param_mean

    special_tokens_dict = {}
    if tokenizer.pad_token is None:
        special_tokens_dict['pad_token'] = DEFAULT_PAD_TOKEN
    if tokenizer.eos_token is None:
        special_tokens_dict['eos_token'] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens_dict['bos_token'] = DEFAULT_BOS_TOKEN
    if tokenizer.unk_token is None:
        special_tokens_dict['unk_token'] = DEFAULT_UNK_TOKEN

    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    new_num_embeddings = len(tokenizer)

    model.config.bos_token_id = tokenizer.bos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id

    if num_new_tokens > 0:
        hf_device_map = getattr(model, 'hf_device_map', {})
        devices = {
            torch.device(device)
            for device in hf_device_map.values()
            if device not in {'cpu', 'disk'}
        }
        is_model_parallel = len(devices) > 1

        if not is_model_parallel:
            model.resize_token_embeddings(new_num_embeddings)

            init_new_embeddings(
                model.get_input_embeddings(),
                new_num_embeddings=new_num_embeddings,
                num_new_embeddings=num_new_tokens,
            )
            init_new_embeddings(
                model.get_output_embeddings(),
                new_num_embeddings=new_num_embeddings,
                num_new_embeddings=num_new_tokens,
            )
            
resize_tokenizer_embedding(tokenizer=tokenizer, model=model)
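
For completeness, the size mismatch can also be observed directly instead of through the assertion above. The helper below is a hypothetical diagnostic, not part of the original report; it assumes the model and tokenizer from the snippet above are already in scope and that DeepSpeed has been initialized.

def check_resize(model, tokenizer) -> None:
    """Gather the (possibly ZeRO-3 partitioned) input embedding and compare its
    row count with the tokenizer's vocabulary size after the resize."""
    embedding = model.get_input_embeddings()
    # Under ZeRO-3 the weight is partitioned; GatheredParameters materializes it
    # for read-only access. Without ZeRO-3 the context is effectively a no-op.
    with deepspeed.zero.GatheredParameters([embedding.weight], modifier_rank=None):
        rows = embedding.weight.size(0)
    print(f'embedding rows: {rows}, len(tokenizer): {len(tokenizer)}')
    # Under transformers 4.41.2 the two numbers match after resize_token_embeddings;
    # under 4.43.1 with ZeRO-3 they disagree, which is what trips the assertion above.

check_resize(model, tokenizer)

Separately, recent transformers releases also accept model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64), which addresses the divisibility note in the docstring above but does not affect the ZeRO-3 problem itself.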

Traceback

No response

Expected behavior

No response

Additional context

No response

Gaiejj added the bug label and self-assigned this issue on Jul 25, 2024

Gaiejj (Member, Author) commented Jul 26, 2024

This issue is resolved by huggingface/transformers#32214! Thanks to the transformers contributors!

Gaiejj closed this as completed on Sep 12, 2024