Replace tokenizer with processor #955
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.
LGTM! Thanks for this.
I think we need to make it very clear:
- What a processor is vs. a tokenizer
- Whether either or both can be provided, and in what cases
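
For context, a quick illustration of the distinction (the model IDs below are generic examples, not taken from this PR):

```python
from transformers import AutoProcessor, AutoTokenizer

# Text-only model: the "processor" is effectively just a tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Multimodal model: the processor bundles a tokenizer with an image processor
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
print(type(processor.tokenizer), type(processor.image_processor))
```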
""" | ||
Loads datasets for each flow based on data_args, stores a Dataset for each | ||
enabled flow in self.datasets | ||
|
||
:param tokenizer: tokenizer to use for dataset tokenization | ||
""" | ||
if self._data_args.dataset is None: | ||
self.tokenizer = self._model_args.tokenizer | ||
self.processor = self._model_args.processor |
Seems like we're keeping the tokenizer in the model_args as well? What if both are specified? Or only tokenizer?
See the newly added model args handling logic
```python
def initialize_processor_from_path(
    model_args: ModelArguments, model: PreTrainedModel, teacher: PreTrainedModel
) -> Processor:
    processor_src = model_args.processor
```
Same question here: what if a tokenizer is provided?
See the newly added model args handling logic
@dsikka The current strategy is to treat all possible tokenizers as a subset of all possible processors, as defined by the type alias:

```python
Processor = Union[
    PreTrainedTokenizer, BaseImageProcessor, FeatureExtractionMixin, ProcessorMixin
]
```

We should continue to support the `tokenizer` argument, which is silently reassigned to `processor`:

```python
# silently assign tokenizer to processor
if model_args.tokenizer:
    if model_args.processor:
        raise ValueError("Cannot use both a tokenizer and processor")
    model_args.processor = model_args.tokenizer
    model_args.tokenizer = None
```
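
As a rough sketch of how that subset relationship plays out when loading (illustrative only; `load_processor` and its fallback behavior are assumptions, not this PR's implementation):

```python
from transformers import AutoProcessor, AutoTokenizer

def load_processor(processor_src: str):
    """Illustrative helper: return whatever AutoProcessor resolves to,
    falling back to a plain tokenizer for text-only models."""
    try:
        # For multimodal models this returns a ProcessorMixin; for text-only
        # models it typically resolves to the model's tokenizer.
        return AutoProcessor.from_pretrained(processor_src)
    except Exception:
        # Fall back to a tokenizer if no processor config is available.
        return AutoTokenizer.from_pretrained(processor_src)
```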
I think this is fine. My two comments about clarity were specific to being clear towards users, either in the model_args or through the text_generation.py script.
I think this should be clear enough messaging without being annoying/verbose.
Oh sorry, missed the help text.
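
For reference, a minimal sketch of the kind of help text being discussed (field layout and wording are assumed, not copied from the PR):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    # Sketch only: illustrates the style of help text, not the PR's exact fields.
    processor: Optional[str] = field(
        default=None,
        metadata={"help": "Pretrained processor name or path; a tokenizer may "
                          "also be passed here, since tokenizers are a subset "
                          "of processors"},
    )
    tokenizer: Optional[str] = field(
        default=None,
        metadata={"help": "Pretrained tokenizer name or path; if set, it is "
                          "reassigned to `processor` internally"},
    )
```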
* remove sparseml utilities
* use in model_load
* remove use of RECIPE FILE NAME
* rename to RECIPE_FILE_NAME, avoid circular import
* remove qa ignore
* replace tokenizer with processor
* defer data collator changes

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
Purpose
Prerequisites
Postrequisites
Changes
- Add a `processor` pathway argument, to which `tokenizer` is internally reassigned (see the usage sketch below)
- Add the `Processor` type alias in `src/llmcompressor/typing.py`
- Replace tokenizer usage with the processor in `src/llmcompressor/transformers/finetune/data/base.py`, `src/llmcompressor/transformers/finetune/data/ultrachat_200k.py`, and `src/llmcompressor/transformers/finetune/session_mixin.py`
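
A hedged usage sketch of the new pathway (argument names and the `oneshot` call shape are assumptions based on the description above, not verified against the final API):

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from llmcompressor.transformers import oneshot

model_id = "path/to/model"  # placeholder model path
model = AutoModelForCausalLM.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# `processor=` replaces the old `tokenizer=` pathway; passing a tokenizer
# still works because tokenizers are treated as a subset of processors.
oneshot(
    model=model,
    dataset="ultrachat_200k",
    processor=processor,
    recipe="recipe.yaml",  # placeholder recipe path
)
```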
Testing