feat: Packing for pretokenised #468

Open
wants to merge 1 commit into
base: main

dushyantbehl
Collaborator

Description of the change

This PR adds support for packing pretokenized datasets, a capability that was added in transformers>=4.46.

This is based on PR #448.

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass.

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@kmehant
Collaborator

kmehant commented Feb 13, 2025

@dushyantbehl

  1. max_seq_len in your test case is too long: packing could not generate even one sample, which is why the test was failing.
  2. When the data is pretokenized you should pass the seq2seq collator with padding=False, but the code seems to pick up the completion-only collator instead.
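
To illustrate point 1, here is a minimal sketch of why an oversized max_seq_len can leave a small test dataset with zero packed samples. It assumes that packing concatenates the token ids of all examples and keeps only full blocks of max_seq_len tokens, dropping the incomplete tail. `pack_pretokenized` and `tiny_dataset` are hypothetical names made up for this illustration; this is not the packing code used in this repository or in the underlying trainer.

```python
from typing import Dict, List

Example = Dict[str, List[int]]


def pack_pretokenized(examples: List[Example], max_seq_len: int) -> List[Example]:
    """Concatenate pretokenized examples and cut them into full blocks.

    Hypothetical helper for illustration only.
    """
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")

    input_ids: List[int] = []
    labels: List[int] = []
    for ex in examples:
        input_ids.extend(ex["input_ids"])
        # Assumption: fall back to input_ids when an example has no labels.
        labels.extend(ex.get("labels", ex["input_ids"]))

    packed: List[Example] = []
    # Only full blocks are kept; the incomplete tail is dropped.
    for start in range(0, len(input_ids) - max_seq_len + 1, max_seq_len):
        end = start + max_seq_len
        packed.append(
            {"input_ids": input_ids[start:end], "labels": labels[start:end]}
        )
    return packed


# Two tiny pretokenized examples, 7 tokens in total.
tiny_dataset = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4, 5, 6, 7], "labels": [-100, -100, 6, 7]},
]

small = pack_pretokenized(tiny_dataset, max_seq_len=3)
too_long = pack_pretokenized(tiny_dataset, max_seq_len=4096)

print(len(small))     # number of packed samples with a short block size
print(len(too_long))  # number of packed samples with an oversized block size
```

Under this assumption, a test dataset needs at least max_seq_len tokens in total before a single packed sample exists, so a unit test on a tiny fixture should use a correspondingly small max_seq_len.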

@dushyantbehl dushyantbehl force-pushed the packing-for-pretokenised branch 4 times, most recently from 0c589bd to 6d4e771 Compare March 3, 2025 11:26
Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
@dushyantbehl dushyantbehl force-pushed the packing-for-pretokenised branch from 6d4e771 to 73b81c1 Compare March 3, 2025 11:27
2 participants