
[P0] Make make_last_position_supervised_data_module parallelizable to speed up processing! #85

Open
truskovskiyk opened this issue May 13, 2024 · 2 comments
Labels: enhancement (New feature or request)

@truskovskiyk

Hey team,

I am running into a performance issue with large datasets (~10k samples or more).

Calling the make_last_position_supervised_data_module function is slower than the training itself. The root cause is that the function uses a for loop to process each sample individually: link.

Instead of processing samples individually, we could perform this operation in batch mode. For example, we could use "batch mapping" as described here: Hugging Face Documentation.
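Roughly, I mean something like the sketch below. It is illustrative only: the column names, tokenizer, and last-position labeling are placeholders, not the actual `make_last_position_supervised_data_module` logic; the point is the batched `map()` call with `num_proc`.

```python
# Illustrative sketch only: shows batched datasets.map() with num_proc,
# not the actual make_last_position_supervised_data_module implementation.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Placeholder columns; the real data module uses its own field names.
raw = Dataset.from_dict({
    "prompt": ["Translate to French: hello", "Summarize: a long article"],
    "completion": ["bonjour", "a short summary"],
})

def tokenize_batch(batch):
    # Tokenize whole batches at once instead of one sample per loop iteration.
    texts = [p + c for p, c in zip(batch["prompt"], batch["completion"])]
    model_inputs = tokenizer(texts, truncation=True, max_length=512)
    # Supervise only the last position: mask every token except the final one.
    labels = []
    for input_ids in model_inputs["input_ids"]:
        row = [-100] * len(input_ids)
        row[-1] = input_ids[-1]
        labels.append(row)
    model_inputs["labels"] = labels
    return model_inputs

# batched=True processes many rows per call; num_proc spreads the work
# across worker processes, which is where the big speedup comes from.
tokenized = raw.map(
    tokenize_batch,
    batched=True,
    num_proc=2,
    remove_columns=raw.column_names,
)
```

Exposing something like a `batched`/`num_proc` option on the data module would let users opt in without changing the default behavior.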

Could we add an option to perform this operation in batch mode?

I am happy to send a PR with this change.

@frankaging (Collaborator)

@truskovskiyk thanks! feel free to submit a PR for that --- that would be great!

@frankaging added the enhancement (New feature or request) label on May 13, 2024
@frankaging changed the title from "Make make_last_position_supervised_data_module parallelizable to speed up processing!" to "[P0] Make make_last_position_supervised_data_module parallelizable to speed up processing!" on May 13, 2024
@frankaging (Collaborator)

Priority set to P0, and assigned to @truskovskiyk for the PR.
