feat: Add support to handle Parquet Dataset files via data config #401

Merged

Abhishek-TAMU (Collaborator) commented Dec 3, 2024

Description of the change

Adds support for loading Parquet dataset files through datasets.load_dataset, so a Parquet file can be passed as training data.
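As a rough illustration of the idea (not the code in this PR), the loader name passed to datasets.load_dataset can be chosen from the file extension. The names EXTENSION_TO_LOADER, infer_loader and load_training_data below are hypothetical, and the set of supported extensions is an assumption. Only load_dataset("parquet", data_files=...) and load_dataset("json", data_files=...) are real Hugging Face datasets calls.

```python
import os

# Hypothetical mapping from file extension to the Hugging Face `datasets`
# builder name. This is an assumption for illustration, not the PR's code.
EXTENSION_TO_LOADER = {
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
}


def infer_loader(data_path: str) -> str:
    """Return the `datasets` builder name for a data file, based on its extension."""
    ext = os.path.splitext(data_path)[1].lower()
    if ext not in EXTENSION_TO_LOADER:
        raise ValueError(f"Unsupported dataset file extension: {ext!r}")
    return EXTENSION_TO_LOADER[ext]


def load_training_data(data_path: str):
    """Load a single data file as the "train" split (hypothetical helper)."""
    # Imported lazily so infer_loader works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(infer_loader(data_path), data_files=data_path, split="train")
```

With this sketch, load_training_data("tests/artifacts/testdata/twitter_complaints_input_output.parquet") would resolve to the "parquet" builder, assuming the datasets package and a Parquet backend such as pyarrow are installed.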

Related issue number

#1469

How to verify the PR

Testing with Parquet file

python tuning/sft_trainer.py  \
--model_name_or_path Maykeye/TinyLLama-v0  \
--training_data_path tests/artifacts/testdata/twitter_complaints_input_output.parquet \
--output_dir outputs/full-tuning  \
--num_train_epochs 5  \
--per_device_train_batch_size 2  \
--gradient_accumulation_steps 1  \
--learning_rate 1e-5  \
--use_flash_attn false \
--torch_dtype "float32"

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>

github-actions bot commented Dec 3, 2024

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.


Abhishek-TAMU changed the title from "IN PROGRESS: Changes for Parquet Dataset files" to "in progress: Changes for Parquet Dataset files" Dec 3, 2024
Abhishek-TAMU changed the title from "in progress: Changes for Parquet Dataset files" to "feat: Add support to handle Parquet Dataset files via data config" Dec 6, 2024
github-actions bot added the feat label Dec 6, 2024
Abhishek-TAMU marked this pull request as ready for review December 6, 2024 20:21
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
dushyantbehl (Contributor) commented Dec 7, 2024

LGTM thanks @Abhishek-TAMU

ashokponkumar merged commit fbe6064 into foundation-model-stack:main Dec 7, 2024
8 checks passed
Abhishek-TAMU deleted the data_format branch December 9, 2024 15:37
3 participants