
[hotfix] fix parameter shape checking #6124

Open
wants to merge 1 commit into base: main

Conversation

flybird11111
Contributor

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@flybird11111 flybird11111 requested a review from a team as a code owner November 11, 2024 10:02
@@ -111,11 +111,6 @@ def search_tp_partition_dim(current_shape: torch.Size, original_shape: torch.Size
         if length > current_shape[dim]:
             partition_dim = dim
             break
-    if partition_dim is not None:
-        assert (
-            original_shape[partition_dim] == tp_size * current_shape[partition_dim]


original_shape = [128256, 4096]
current_shape = [16064, 4096]
tp_size = 8

16064 * 8 = 128512 != 128256

When I delete lines 114-118, the shape of model.embed_tokens.weight in the saved model is [16064, 4096].

    if partition_dim is not None:
        assert (
            original_shape[partition_dim] == tp_size * current_shape[partition_dim]
        ), f"The parameter isn't evenly distributed among tensor parallel group: \


current_shape should be [16032, 4096]
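
For reference, a small arithmetic sketch (not part of the PR, just the numbers from this thread) that reconciles the 16032 vs. 16064 shard sizes:

vocab_size = 128256                  # original (unpadded) vocabulary size
tp_size = 8
make_vocab_size_divisible_by = 64

unpadded_shard = vocab_size // tp_size                          # 16032, what the comment above expects
multiple = make_vocab_size_divisible_by * tp_size               # 512
padded_vocab = vocab_size + multiple - (vocab_size % multiple)  # 128512
padded_shard = padded_vocab // tp_size                          # 16064, what is actually saved

print(padded_shard - unpadded_shard)  # 32 padding rows per rank, 8 * 32 = 256 in total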

flybird11111 (Contributor, Author)

original_shape refers to the shape before padding, while tp_size * current_shape refers to the shape after padding. Therefore, the assert will throw an error.
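
To make this concrete, here is a minimal sketch of the deleted check with the shapes from this thread (partition_dim = 0 is assumed; this is not the actual checkpoint-io code):

import torch

tp_size = 8
original_shape = torch.Size([128256, 4096])  # shape recorded before padding
current_shape = torch.Size([16064, 4096])    # padded per-rank shard held in memory
partition_dim = 0                            # the dimension split across the TP group

# The removed assert compares an unpadded total against tp_size * (padded shard),
# so it raises AssertionError (128256 != 128512) whenever padding was applied.
assert original_shape[partition_dim] == tp_size * current_shape[partition_dim], (
    f"The parameter isn't evenly distributed among tensor parallel group: "
    f"shape before sharding {original_shape}, shape after sharding {current_shape}"
)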

@cingtiye

Why is make_vocab_size_divisible_by 64?
When tensor_parallel_size = 8 and num_embeddings = 128256, make_vocab_size_divisible_by * tensor_parallel_size = 64 * 8 = 512, and
num_embeddings % multiple = 128256 % 512 != 0, so the num_embeddings ends up wrong!

class VocabParallelEmbedding1D(PaddingParallelModule):
    r"""Embedding parallelized in the vocabulary dimension.

    Args:
        num_embeddings (int): number of embeddings.
        embedding_dim (int): dimension of embedding.
        padding_idx (int, optional): If specified, the entries at padding_idx do not contribute to the gradient;
            therefore, the embedding vector at padding_idx is not updated during training,
            i.e. it remains as a fixed “pad”, defaults to None.
        dtype (:class:`torch.dtype`, optional): The dtype of parameters, defaults to None.
        weight_initializer (:class:`typing.Callable`, optional):
            The initializer of weight, defaults to normal initializer.

    The ``args`` and ``kwargs`` used in :class:``torch.nn.functional.embedding`` should contain:
    ::
        max_norm (float, optional): If given, each embedding vector with norm larger than max_norm is
                    renormalized to have norm max_norm. Note: this will modify weight in-place.
        norm_type (float, optional): The p of the p-norm to compute for the max_norm option. Default 2.
        scale_grad_by_freq (bool, optional): If given, this will scale gradients by the inverse
                    of frequency of the words in the mini-batch. Default False.
        sparse (bool, optional): If True, gradient w.r.t. weight will be a sparse tensor. Default False.

    More details about ``args`` and ``kwargs`` could be found in
    `Embedding <https://pytorch.org/docs/stable/generated/torch.nn.functional.embedding.html#torch.nn.functional.embedding>`_.

    More details about initializer please refer to
    `init <https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/nn/init.py>`_.
    """

    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        padding_idx: int = None,
        dtype: torch.dtype = None,
        device: torch.device = None,
        process_group: ProcessGroup = None,
        weight: Optional[nn.Parameter] = None,
        weight_initializer: Callable = init.normal_(),
        make_vocab_size_divisible_by: int = 64,
        fp8_communication: bool = False,
        *args,
        **kwargs,
    ):
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.embed_args = args
        self.embed_kwargs = kwargs
        self.process_group = process_group
        self.fp8_communication = fp8_communication

        tensor_parallel_size = dist.get_world_size(group=process_group)
        tensor_parallel_rank = dist.get_rank(group=process_group)

        # generate weight and bias
        if weight is None:
            factory_kwargs = {"device": device, "dtype": dtype}
            weight = nn.Parameter(torch.empty((num_embeddings, self.embedding_dim), **factory_kwargs))
        else:
            weight.data = weight.data.to(device=device, dtype=dtype)

        # calculate new padding size
        multiple = make_vocab_size_divisible_by * tensor_parallel_size
        if num_embeddings % multiple != 0:
            self.num_embeddings = num_embeddings + multiple - (num_embeddings % multiple)
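
Following the quoted __init__, the padding branch above answers the question: num_embeddings % multiple != 0 triggers padding, not an error. A small sketch applying that rule (padded_vocab_size is a hypothetical helper, not part of Colossal-AI):

def padded_vocab_size(num_embeddings: int, make_vocab_size_divisible_by: int, tensor_parallel_size: int) -> int:
    # Same rounding rule as the quoted __init__: pad up to the next multiple of
    # make_vocab_size_divisible_by * tensor_parallel_size.
    multiple = make_vocab_size_divisible_by * tensor_parallel_size
    if num_embeddings % multiple != 0:
        num_embeddings = num_embeddings + multiple - (num_embeddings % multiple)
    return num_embeddings

print(padded_vocab_size(128256, 64, 8))  # 128512 -> 16064 rows per rank
print(padded_vocab_size(128000, 64, 8))  # 128000 is already a multiple of 512, so unchanged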

@cingtiye

Propose to revise the following code:
https://github.com/hpcaitech/ColossalAI/blob/a2596519fd8f25da13da622e5188b9a18024f3c0/colossalai/shardformer/layer/embedding.py
Line 186 and Line 304

Insert raise ValueError and delete the self.num_embeddings = num_embeddings + make_vocab_size_divisible_by - (num_embeddings % make_vocab_size_divisible_by) assignment:

if num_embeddings % make_vocab_size_divisible_by != 0:
    raise ValueError
    # self.num_embeddings = (
    #     num_embeddings + make_vocab_size_divisible_by - (num_embeddings % make_vocab_size_divisible_by)
    # )

@flybird11111 (Contributor, Author) commented Nov 18, 2024

make_vocab_size_divisible_by

The make_vocab_size_divisible_by is used to ensure that the vocabulary size is divisible by 64. This is because torch.mm will select a faster operator when the size is a multiple of 64. Later, the result will be unpadded.
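
As a rough illustration of the pad-then-unpad flow described here (a hand-written sketch with made-up tensors, not the actual Colossal-AI code path; the hidden size is shrunk to keep it lightweight):

import torch

vocab_size = 128256
padded_vocab = 128512   # rounded up so each of the 8 TP ranks holds a multiple of 64 rows
hidden = 64             # reduced from 4096 for this sketch

hidden_states = torch.randn(2, 16, hidden)
lm_head_weight = torch.randn(padded_vocab, hidden)  # padded weight (as if gathered across ranks)

# The matmul runs on the padded, multiple-of-64 dimension ...
logits = hidden_states @ lm_head_weight.T   # shape [2, 16, 128512]

# ... and the padded tail is dropped afterwards, so the visible output keeps the original vocab size.
logits = logits[..., :vocab_size]           # shape [2, 16, 128256]
print(logits.shape)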

@cingtiye

make_vocab_size_divisible_by

The make_vocab_size_divisible_by is used to ensure that the vocabulary size is divisible by 64. This is because torch.mm will select a faster operator when the size is a multiple of 64. Later, the result will be unpadded.

The issue has been closed. Could you please take a look at #6110?
