
Dataset Processing Args #1006

Draft · wants to merge 56 commits into base: main

Changes from all commits — 56 commits
a62617c
clean up CustomDataset
kylesayrs Nov 28, 2024
57b5e02
chchchchanges
kylesayrs Nov 29, 2024
fa317fd
wip: use rename to processor, going through tests
kylesayrs Dec 2, 2024
f3f5875
remove labels from calibration dataset rather than assuming that all …
kylesayrs Dec 2, 2024
58c3afe
cleanup
kylesayrs Dec 2, 2024
72aecfc
cleanup, etc
kylesayrs Dec 2, 2024
77217fb
Merge remote-tracking branch 'origin' into kylesayrs/cleanup-custom-d…
kylesayrs Dec 2, 2024
4461a3e
fix typehinting
kylesayrs Dec 2, 2024
fb33001
add typechecking imports
kylesayrs Dec 2, 2024
bf4744a
remove sparseml utilities
kylesayrs Dec 3, 2024
62ae31d
Merge branch 'kylesayrs/remove-sparseml-utilities' into kylesayrs/cle…
kylesayrs Dec 3, 2024
7e516c1
use in model_load
kylesayrs Dec 3, 2024
9e33641
remove use of RECIPE FILE NAME
kylesayrs Dec 3, 2024
58c0fba
rename to RECIPE_FILE_NAME, avoid circular import
kylesayrs Dec 3, 2024
b28aaae
Merge branch 'kylesayrs/remove-sparseml-utilities' into kylesayrs/cle…
kylesayrs Dec 3, 2024
8d13013
image dataset collation
kylesayrs Dec 3, 2024
163ee8f
cleanup, do not handle case where processor is None
kylesayrs Dec 3, 2024
1180b34
remove qa ignore
kylesayrs Dec 3, 2024
ad20ae7
Merge branch 'kylesayrs/remove-sparseml-utilities' into kylesayrs/cle…
kylesayrs Dec 3, 2024
c431958
add documentation
kylesayrs Dec 3, 2024
b48d55d
add data collator arg
kylesayrs Dec 3, 2024
0ed5c2c
use default factor
kylesayrs Dec 3, 2024
4576712
validate flickr
kylesayrs Dec 4, 2024
5276c58
discover bug, tests and multimodal working
kylesayrs Dec 4, 2024
dffcbc3
dataset split fallbacks
kylesayrs Dec 4, 2024
779c9a2
Merge branch 'kylesayrs/dataset-split-fallbacks' into kylesayrs/clean…
kylesayrs Dec 4, 2024
d061567
cleanup, depreciate remove_columns argument
kylesayrs Dec 4, 2024
55a31ca
silently assign tokenizer to processor
kylesayrs Dec 5, 2024
1aba16d
replace tokenizer with processor
kylesayrs Dec 5, 2024
135e459
Merge branch 'kylesayrs/processor-replaces-tokenizer' into kylesayrs/…
kylesayrs Dec 5, 2024
bc505bf
typehinting, add not-implemented error
kylesayrs Dec 5, 2024
c91ba77
remove todos
kylesayrs Dec 5, 2024
0a573a1
update dataset manager api in tests
kylesayrs Dec 5, 2024
acb1a18
Delete examples/multimodal_vision/qwen_vl2.py
kylesayrs Dec 5, 2024
56b5d12
Delete examples/multimodal_vision/mllama.py
kylesayrs Dec 5, 2024
c1f5cb2
Merge remote-tracking branch 'origin' into kylesayrs/cleanup-custom-d…
kylesayrs Dec 9, 2024
7667998
handle columns better
kylesayrs Dec 17, 2024
af86f45
filter_tokenizer_args
kylesayrs Dec 18, 2024
0438e17
Merge remote-tracking branch 'origin' into kylesayrs/cleanup-custom-d…
kylesayrs Dec 18, 2024
f4fa9c3
more tests
kylesayrs Dec 18, 2024
6bd1721
Merge remote-tracking branch 'origin' into kylesayrs/cleanup-custom-d…
kylesayrs Dec 18, 2024
e757e61
remove duplicate file
kylesayrs Dec 18, 2024
bdfa3d4
better help texts
kylesayrs Dec 18, 2024
601cb0e
rvert data split fallbacks
kylesayrs Dec 19, 2024
7f6e8cd
handle non-fast tokenizers
kylesayrs Dec 20, 2024
3a9816c
address nits, add logging
kylesayrs Dec 20, 2024
7be0c88
add back copyrights
kylesayrs Dec 20, 2024
bedbf8c
correctly update helptext
kylesayrs Dec 20, 2024
7c54bed
Merge remote-tracking branch 'origin' into kylesayrs/cleanup-custom-d…
kylesayrs Dec 20, 2024
f2b3c99
use rename_columns and processor_kwargs args
kylesayrs Dec 20, 2024
5b4ee22
ensure prompt key
kylesayrs Dec 20, 2024
bbe3eb6
syntax
kylesayrs Dec 20, 2024
6f2a246
Merge remote-tracking branch 'origin' into kylesayrs/tokenizer-kwargs…
kylesayrs Dec 27, 2024
148e617
account for models which improperly do not override the abstract methods
kylesayrs Dec 27, 2024
ca9757d
Merge branch 'kylesayrs/patch-mal-models' into kylesayrs/tokenizer-kw…
kylesayrs Dec 27, 2024
ddc7d54
Merge branch 'main' into kylesayrs/tokenizer-kwargs-argument
kylesayrs Jan 25, 2025
35 changes: 20 additions & 15 deletions src/llmcompressor/transformers/finetune/data/base.py
@@ -217,27 +217,32 @@ def dataset_template(self) -> Union[Callable[[Any], Any], None]:
def rename_columns(self, dataset: DatasetType) -> DatasetType:
# rename columns to match processor/tokenizer kwargs
column_names = get_columns(dataset)
if self.data_args.text_column in column_names and "text" not in column_names:
logger.debug(f"Renaming column `{self.data_args.text_column}` to `text`")
dataset = dataset.rename_column(self.data_args.text_column, "text")
for from_, to_ in self.data_args.rename_columns.items():
if from_ not in column_names:
raise ValueError(
f"Cannot rename `{from_}` to `{to_}` from columns {column_names}"
)
dataset = dataset.rename_column(from_, to_)
column_names.remove(from_)
column_names.append(to_)

return dataset
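The rename-and-validate loop above reduces to a small standalone helper; a minimal sketch, assuming the mapping is a plain `{old_name: new_name}` dict (function and variable names here are illustrative, not the PR's API):

```python
# Illustrative re-creation of the validation in rename_columns:
# every source column must exist before its rename is applied.
def apply_renames(column_names, rename_map):
    """Return a new column list with each old name replaced in place."""
    columns = list(column_names)
    for from_, to_ in rename_map.items():
        if from_ not in columns:
            raise ValueError(
                f"Cannot rename `{from_}` to `{to_}` from columns {columns}"
            )
        columns[columns.index(from_)] = to_
    return columns
```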

def filter_tokenizer_args(self, dataset: DatasetType) -> DatasetType:
# assumes that inputs are not passed via self.processor.__call__ args and kwargs
signature = inspect.signature(self.processor.__call__)
tokenizer_args = set(
key
for key, param in signature.parameters.items()
if param.kind not in (Kind.VAR_POSITIONAL, Kind.VAR_KEYWORD)
)
logger.debug(
f"Found processor args `{tokenizer_args}`. Removing all other columns"
)
def filter_processor_args(self, dataset: DatasetType) -> DatasetType:
processor_kwargs = self.data_args.processor_kwargs
if processor_kwargs is None:
# assumes that inputs are not passed via args and kwargs
signature = inspect.signature(self.processor.__call__)
processor_kwargs = set(
key
for key, param in signature.parameters.items()
if param.kind not in (Kind.VAR_POSITIONAL, Kind.VAR_KEYWORD)
)
logger.debug(f"Found processor args `{processor_kwargs}`")

column_names = get_columns(dataset)
return dataset.remove_columns(
list(set(column_names) - set(tokenizer_args) - set([self.PROMPT_KEY]))
list(set(column_names) - set(processor_kwargs) - set([self.PROMPT_KEY]))
)
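The signature inspection above is a general pattern: enumerate a callable's named parameters, excluding the `*args`/`**kwargs` catch-alls, and drop every column the callable would not accept. A minimal standalone sketch (`fake_tokenizer` below is a stand-in, not a real tokenizer):

```python
import inspect

def accepted_kwargs(func):
    """Named parameters of `func`, excluding *args and **kwargs catch-alls."""
    Kind = inspect.Parameter
    return {
        name
        for name, param in inspect.signature(func).parameters.items()
        if param.kind not in (Kind.VAR_POSITIONAL, Kind.VAR_KEYWORD)
    }

def filter_columns(column_names, func, keep=()):
    """Keep only columns that `func` accepts by name, plus an allowlist."""
    allowed = accepted_kwargs(func) | set(keep)
    return [c for c in column_names if c in allowed]

def fake_tokenizer(text, max_length=None, **kwargs):
    return text
```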

def tokenize(self, data: LazyRow) -> Dict[str, Any]:
25 changes: 23 additions & 2 deletions src/llmcompressor/transformers/finetune/data/data_args.py
@@ -38,14 +38,35 @@ class CustomDataTrainingArguments(DVCDatasetTrainingArguments):
metadata={
"help": (
"Optional key to be used as the `text` input to tokenizer/processor "
"after dataset preprocessing"
"after dataset preprocessing (deprecated, please use "
"`rename_columns` instead)"
)
},
)

remove_columns: Union[None, str, List] = field(
default=None,
metadata={"help": "Column names to remove after preprocessing (deprecated)"},
metadata={
"help": (
"Column names to remove after preprocessing (deprecated, please use "
"`rename_columns` instead)"
)
},
)

rename_columns: Dict[str, str] = field(
default_factory=dict,
metadata={
"help": "Optional mapping to rename dataset columns after preprocessing"
},
)

tokenizer_kwargs: Optional[List[str]] = field(
default=None, metadata={"help": "Alias for `processor_kwargs`"}
)

processor_kwargs: Optional[List[str]] = field(
default=None, metadata={"help": "Optional list of processor argument names"}
)

preprocessing_func: Union[None, str, Callable] = field(
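The `default_factory=dict` on `rename_columns` matters: a mutable default like `{}` must be created per instance rather than shared. A minimal sketch with a hypothetical stand-in class (not the PR's actual dataclass):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class DataArgsSketch:
    # per-instance empty dict; a bare `= {}` default is rejected by dataclasses
    rename_columns: Dict[str, str] = field(default_factory=dict)
    processor_kwargs: Optional[List[str]] = None
    tokenizer_kwargs: Optional[List[str]] = None

a = DataArgsSketch()
b = DataArgsSketch()
a.rename_columns["question"] = "text"  # mutating one instance leaves the other empty
```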
16 changes: 14 additions & 2 deletions src/llmcompressor/transformers/finetune/text_generation.py
@@ -145,8 +145,14 @@ def parse_args(**kwargs):
# raise deprecation warnings
if data_args.remove_columns is not None:
warnings.warn(
"`remove_columns` argument is deprecated. When tokenizing datasets, all "
"columns which are invalid inputs to the tokenizer will be removed",
"`remove_columns` is deprecated, please use `rename_columns` and "
"`processor_kwargs` instead",
DeprecationWarning,
)

if data_args.text_column is not None:
warnings.warn(
"`text_column` is deprecated, please use `rename_columns` instead",
DeprecationWarning,
)

@@ -157,6 +163,12 @@
model_args.processor = model_args.tokenizer
model_args.tokenizer = None

if data_args.tokenizer_kwargs:
if data_args.processor_kwargs:
raise ValueError("Cannot use both `tokenizer_kwargs` and `processor_kwargs`")
data_args.processor_kwargs = data_args.tokenizer_kwargs
data_args.tokenizer_kwargs = None

return model_args, data_args, training_args
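The `tokenizer_kwargs`-to-`processor_kwargs` handoff above is a standard deprecated-alias fold: if the alias is set, it replaces the canonical value, and setting both is an error. A minimal sketch of the pattern (function name is illustrative):

```python
def resolve_alias(canonical, alias):
    """Fold a deprecated alias into its canonical value, rejecting conflicts."""
    if alias is not None:
        if canonical is not None:
            raise ValueError(
                "Cannot use both `tokenizer_kwargs` and `processor_kwargs`"
            )
        return alias
    return canonical
```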

