chore: merge set of changes for v2.3.0 #428

aluu317 · 2024-12-23T14:54:14Z

Description of the change

Related issue number

How to verify the PR

Was the PR tested

I have added >=1 unit test(s) for every new method I have added.
I have ensured all unit tests pass

#399)

* fix: set legacy behavior to false, enable new behavior
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: Resolve push_to_hub_token warning
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: Remove max_seq_length and dataset_text_field from SFTTrainer
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fmt
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: Resolve tokenizer.padding_side warning
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* nit: restructure warning fixes
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fix: Add packing directly to SFTConfig
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* fmt
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* Removed dataset_kwargs from SFTTrainer
  Removed the argument dataset_kwargs from the invocation of SFTTrainer() because it will be deprecated in V1.0.0. Instead, dataset_kwargs has been added as a key to the training_args variable, following the example provided by HF here: https://huggingface.co/docs/trl/en/sft_trainer#training-the-vision-language-model
  Signed-off-by: Luka Dojcinovic <56648891+Luka-D@users.noreply.github.com>
* fix: Added max_seq_length back to SFTConfig()
  Signed-off-by: Luka Dojcinovic <56648891+Luka-D@users.noreply.github.com>
* Removed legacy and padding_side args
  Removed these args because they were based on changes from @willmj that haven't been approved yet.
  Signed-off-by: Luka Dojcinovic <56648891+Luka-D@users.noreply.github.com>
* Moved all args to additional_args
  Following @kmehant's suggestion.
  Signed-off-by: Luka Dojcinovic <56648891+Luka-D@users.noreply.github.com>
* Removed packing and max_seq_length
  Removed the packing and max_seq_length variables from additional_args.
  Signed-off-by: Luka Dojcinovic <56648891+Luka-D@users.noreply.github.com>
* Removed check is_pretokenized_dataset
  Co-authored-by: Mehant Kammakomati <kmehant@gmail.com>
  Signed-off-by: Luka-D <56648891+Luka-D@users.noreply.github.com>
* Removed max_seq_length from additional_args
  Signed-off-by: Luka Dojcinovic <56648891+Luka-D@users.noreply.github.com>
* Removed error.log
  Signed-off-by: Luka Dojcinovic <56648891+Luka-D@users.noreply.github.com>
* fix: move packing to SFTConfig as well
  Co-authored-by: Luka-D <56648891+Luka-D@users.noreply.github.com>
  Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

---------

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Luka Dojcinovic <56648891+Luka-D@users.noreply.github.com>
Signed-off-by: Luka-D <56648891+Luka-D@users.noreply.github.com>
Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Co-authored-by: Will Johnson <mwjohnson728@gmail.com>
Co-authored-by: Mehant Kammakomati <kmehant@gmail.com>
Co-authored-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

…ts (#412)

* test: Add unit tests to test multiple files in single/multiple datasets
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* e2e testing unit test for multiple datasets with multiple files
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* test: multiple datasets with multiple datafiles column names
  Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
* PR changes
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR Changes
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: fmt
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* Merge test_process_dataconfig_multiple_files_varied_data_formats
  Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

---------

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Co-authored-by: Will Johnson <mwjohnson728@gmail.com>

github-actions · 2024-12-23T14:54:25Z

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Abhishek-TAMU · 2024-12-23T15:41:01Z

The commits look good to me. Once this one additional PR is included, this looks good to merge.

Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com> Signed-off-by: Will Johnson <mwjohnson728@gmail.com> Signed-off-by: Abhishek <maurya.abhishek@ibm.com> Co-authored-by: Will Johnson <mwjohnson728@gmail.com> Co-authored-by: Abhishek <maurya.abhishek@ibm.com>

Abhishek-TAMU and others added 11 commits December 7, 2024 13:45

feat: Add support to handle Parquet Dataset files via data config (#401)

fbe6064

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

test: add arrow datasets and arrow unit tests (#403)

e6f7a22

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

Perform dataset mixing via sampling probabilities in data config (#408)

4168c87

Code to perform dataset sampling via sampling probabilities in data Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
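For readers unfamiliar with the idea, here is a minimal sketch of what mixing datasets by sampling probabilities generally means. It is not the code from #408, which is not shown in this conversation. The function name `mix_by_sampling` and its parameters are hypothetical, and it assumes plain in-memory lists rather than the repository's data config. Hugging Face `datasets.interleave_datasets` accepts a `probabilities` argument for a similar purpose, but whether #408 uses it is not stated here.

```python
import random
from typing import List, Sequence


def mix_by_sampling(
    datasets: Sequence[Sequence[str]],
    probabilities: Sequence[float],
    num_samples: int,
    seed: int = 42,
) -> List[str]:
    """Hypothetical helper: draw num_samples items, choosing the source
    dataset for each draw according to the given probabilities."""
    if len(datasets) != len(probabilities):
        raise ValueError("need exactly one probability per dataset")
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    if any(len(d) == 0 for d in datasets):
        raise ValueError("datasets must be non-empty")

    rng = random.Random(seed)        # seeded so the mix is reproducible
    cursors = [0] * len(datasets)    # next unread position in each dataset
    mixed = []
    for _ in range(num_samples):
        # Pick which dataset supplies the next example.
        idx = rng.choices(range(len(datasets)), weights=probabilities, k=1)[0]
        source = datasets[idx]
        # Wrap around when a dataset runs out (simple oversampling).
        mixed.append(source[cursors[idx] % len(source)])
        cursors[idx] += 1
    return mixed
```

With probabilities `[1.0, 0.0]` every draw comes from the first dataset, so `mix_by_sampling([["a", "b"], ["x"]], [1.0, 0.0], 3)` returns `["a", "b", "a"]`. Wrapping around an exhausted dataset is one possible design choice; stopping at the first exhausted dataset is another.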

feat: Expose additional data handlers as an argument in train (#409)

689ee41

* Expose additional data handlers as an argument to the train function. Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>

fix: update dataclass objects directly instead of creating new variab…

a89f76b

…les (#418) Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

feat: Add multi and single turn chat support (#415)

42e3077

Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>

Add mlflow tracker and unit testing for the same.

cba85de

Also add mlflow docs and add mlflow to docker file and as optional requirement Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>

Merge pull request #425 from dushyantbehl/mlflow-integration

003041f

feat: Integrate MLflow tracker

feat: Handle passing of multiple files, multiple folders, path with p…

d7f06f5

…atterns, HF Dataset and combination (#424) Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

aluu317 requested review from anhuong, Ssukriti, fabianlim and kmehant as code owners December 23, 2024 14:54

aluu317 changed the title ~~release: merge set of changes for v2.3.0~~ chore: merge set of changes for v2.3.0 Dec 23, 2024

github-actions bot added the chore label Dec 23, 2024

aluu317 merged commit 3ec30a0 into release Dec 23, 2024
14 of 15 checks passed

aluu317 deleted the new_release_2.3.0 branch December 23, 2024 16:27

aluu317 had a problem deploying to pypi December 23, 2024 16:51 — with GitHub Actions Failure
