
Replace tokenizer with processor #955

Merged 12 commits into main on Dec 17, 2024

Conversation

kylesayrs (Collaborator)

Purpose

  • Prepare to support processors and vision datasets
  • Rename and retype the tokenizer variable to better reflect its widened definition

Prerequisites

Postrequisites

Changes

  • Rename and retype instances of tokenizer to processor
  • Add a processor argument pathway, to which tokenizer is internally reassigned
  • Add typing definitions in src/llmcompressor/typing.py
  • Special handling of tokenizer in src/llmcompressor/transformers/finetune/data/base.py, src/llmcompressor/transformers/finetune/data/ultrachat_200k.py, src/llmcompressor/transformers/finetune/session_mixin.py
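As a minimal sketch of what the rename looks like at a call site (the function and variable names here are illustrative, not the exact diff), pathways that previously took a `tokenizer` argument now take the wider `processor`:

```python
from typing import Any, Callable, List

# Illustrative only: a data pathway that previously accepted `tokenizer`
# is renamed and retyped to accept any `processor` (a tokenizer being
# treated as one kind of processor).
def preprocess_dataset(samples: List[Any], processor: Callable[[Any], Any]) -> List[Any]:
    # apply the processor (tokenizer, image processor, ...) to each sample
    return [processor(sample) for sample in samples]

print(preprocess_dataset(["hello", "world"], str.upper))  # ['HELLO', 'WORLD']
```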

Testing

  • No new functionality is added; existing CI tests should pass

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

github-actions bot commented Dec 5, 2024

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

@kylesayrs kylesayrs self-assigned this Dec 5, 2024
@rahul-tuli (Collaborator) left a comment:

LGTM! Thanks for this.

Review thread on src/llmcompressor/typing.py (resolved)
@dsikka (Collaborator) left a comment:

I think we need to make it very clear:

  1. What a processor is vs a tokenizer
  2. If either/or can be provided and in what cases

"""
Loads datasets for each flow based on data_args, stores a Dataset for each
enabled flow in self.datasets

:param tokenizer: tokenizer to use for dataset tokenization
"""
if self._data_args.dataset is None:
self.tokenizer = self._model_args.tokenizer
self.processor = self._model_args.processor
Collaborator:

Seems like we're keeping the tokenizer in the model_args as well? What if both are specified? Or only tokenizer?

Collaborator Author:

See the newly added model args handling logic

def initialize_processor_from_path(
    model_args: ModelArguments, model: PreTrainedModel, teacher: PreTrainedModel
) -> Processor:
    processor_src = model_args.processor
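The hunk above is truncated. As a rough sketch of the general pattern only (the helper name `resolve_processor_src` and the fallback-to-model-path behavior are assumptions for illustration, not the PR's actual implementation), the processor source might prefer an explicit argument and otherwise fall back to the model path:

```python
from typing import Optional

def resolve_processor_src(
    processor_arg: Optional[str], model_name_or_path: str
) -> str:
    # Hypothetical helper: prefer an explicitly supplied processor path;
    # otherwise assume the processor loads from the same path as the model.
    if processor_arg is not None:
        return processor_arg
    return model_name_or_path

print(resolve_processor_src(None, "my-org/my-model"))       # my-org/my-model
print(resolve_processor_src("my-org/my-proc", "my-model"))  # my-org/my-proc
```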
Collaborator:

same, what if a tokenizer is provided?

Collaborator Author:

See the newly added model args handling logic


kylesayrs commented Dec 16, 2024

@dsikka The current strategy is to treat all possible tokenizers as a subset of all possible processors, as type-defed here

Processor = Union[
    PreTrainedTokenizer, BaseImageProcessor, FeatureExtractionMixin, ProcessorMixin
]

We should continue to support the tokenizer model arg, but internally reassign it to the processor variable name for code simplicity.

# silently assign tokenizer to processor
if model_args.tokenizer:
    if model_args.processor:
        raise ValueError("Cannot use both a tokenizer and processor")
    model_args.processor = model_args.tokenizer
model_args.tokenizer = None
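A runnable sketch of this handling (the `ModelArguments` dataclass and the helper name `resolve_processor_args` here are simplified stand-ins for the real argument dataclass and inline logic):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelArguments:
    # simplified stand-in for the real ModelArguments dataclass
    tokenizer: Optional[str] = None
    processor: Optional[str] = None

def resolve_processor_args(model_args: ModelArguments) -> ModelArguments:
    # silently reassign tokenizer to processor, erroring if both are set
    if model_args.tokenizer:
        if model_args.processor:
            raise ValueError("Cannot use both a tokenizer and processor")
        model_args.processor = model_args.tokenizer
    model_args.tokenizer = None
    return model_args

args = resolve_processor_args(ModelArguments(tokenizer="gpt2"))
print(args.processor)  # gpt2
print(args.tokenizer)  # None
```

Passing only `tokenizer` therefore keeps working, while the rest of the codebase only ever reads `processor`.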


dsikka commented Dec 16, 2024

I think this is fine. My two comments about clarity were specific to being clear to users, either in the model_args or through the text_generation.py script.

@kylesayrs (Collaborator Author)

@dsikka

  1. There is help text attached to the newly added processor arg which users can read
  2. We throw an error if both are passed

I think this should be clear enough messaging without being annoying/verbose

@kylesayrs kylesayrs requested a review from dsikka December 17, 2024 04:05

dsikka commented Dec 17, 2024


Oh sorry, missed the help text.
Sounds good

@dsikka dsikka merged commit ad972c2 into main Dec 17, 2024
6 of 7 checks passed
@dsikka dsikka deleted the kylesayrs/processor-replaces-tokenizer branch December 17, 2024 15:50
horheynm pushed a commit that referenced this pull request Dec 20, 2024
* remove sparseml utilities
* use in model_load
* remove use of RECIPE FILE NAME
* rename to RECIPE_FILE_NAME, avoid circular import
* remove qa ignore
* replace tokenizer with processor
* defer data collator changes

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>