
[P0] Make make_last_position_supervised_data_module parallelizable to speed up processing! #85

Open
truskovskiyk opened this issue May 13, 2024 · 2 comments
Labels: enhancement (New feature or request)

@truskovskiyk

Hey team,

I am running into a performance issue with large datasets (~10k samples or more).

Calling the make_last_position_supervised_data_module function is slower than the training itself. The root cause is that the function uses a for loop to process each sample individually: link.

Instead of processing samples individually, we could perform this operation in batch mode. For example, we could use "batch mapping" as described here: Hugging Face Documentation.
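Roughly, I mean something like the sketch below. It is illustrative only: the column names, tokenizer, and last-position labeling are placeholders, not the actual `make_last_position_supervised_data_module` logic; the point is the batched `map()` call with `num_proc`.

```python
# Illustrative sketch only: shows batched datasets.map() with num_proc,
# not the actual make_last_position_supervised_data_module implementation.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Placeholder columns; the real data module uses its own field names.
raw = Dataset.from_dict({
    "prompt": ["Translate to French: hello", "Summarize: a long article"],
    "completion": ["bonjour", "a short summary"],
})

def tokenize_batch(batch):
    # Tokenize whole batches at once instead of one sample per loop iteration.
    texts = [p + c for p, c in zip(batch["prompt"], batch["completion"])]
    model_inputs = tokenizer(texts, truncation=True, max_length=512)
    # Supervise only the last position: mask every token except the final one.
    labels = []
    for input_ids in model_inputs["input_ids"]:
        row = [-100] * len(input_ids)
        row[-1] = input_ids[-1]
        labels.append(row)
    model_inputs["labels"] = labels
    return model_inputs

# batched=True processes many rows per call; num_proc spreads the work
# across worker processes, which is where the big speedup comes from.
tokenized = raw.map(
    tokenize_batch,
    batched=True,
    num_proc=2,
    remove_columns=raw.column_names,
)
```

Exposing something like a `batched`/`num_proc` option on the data module would let users opt in without changing the default behavior.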

Could we add an option to perform this operation in batch mode?

I am happy to send a PR with this change.

@frankaging (Collaborator)

@truskovskiyk thanks! feel free to submit a PR for that --- that would be great!

@frankaging added the enhancement (New feature or request) label on May 13, 2024
@frankaging changed the title from "Make make_last_position_supervised_data_module parallelizable to speed up processing!" to "[P0] Make make_last_position_supervised_data_module parallelizable to speed up processing!" on May 13, 2024
@frankaging (Collaborator)

Priority set to P0, and assigned to @truskovskiyk for the PR.
