[P0] Make make_last_position_supervised_data_module parallelizable to speed up processing!
#85
Labels
enhancement
New feature or request
Hey team,
I am having issues with large datasets (~10k samples or more).
Calling the make_last_position_supervised_data_module function is slower than the training itself. The root cause is that the function uses a for loop to process each sample individually: link.

Instead of processing samples individually, we could perform this operation in batch mode. For example, we could use "batch mapping" as described here: Hugging Face Documentation.
Could we add an option to perform this operation in batch mode?
I am happy to send a PR with this change.
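Here is a rough sketch of what I have in mind, using datasets.Dataset.map with batched=True and num_proc. The field names ("prompt", "completion"), the label-masking logic, and the helper name make_batched_supervised_dataset are placeholders for illustration and may not match the function's actual schema:

```python
from datasets import Dataset

IGNORE_INDEX = -100  # standard value ignored by torch cross-entropy loss


def make_batched_supervised_dataset(tokenizer, prompts, completions, num_proc=4):
    """Sketch of a batched/parallel version of the per-sample loop."""
    raw = Dataset.from_dict({"prompt": prompts, "completion": completions})

    def tokenize_batch(batch):
        full_texts = [p + c for p, c in zip(batch["prompt"], batch["completion"])]
        # Fast tokenizers accept a list of strings in one call, which is much
        # faster than tokenizing sample by sample.
        prompt_ids = tokenizer(batch["prompt"], add_special_tokens=False)["input_ids"]
        full_ids = tokenizer(full_texts, add_special_tokens=False)["input_ids"]
        # Supervise only the completion tokens; mask out the prompt positions.
        labels = [
            [IGNORE_INDEX] * len(p) + f[len(p):]
            for p, f in zip(prompt_ids, full_ids)
        ]
        return {"input_ids": full_ids, "labels": labels}

    # batched=True hands map() many samples per call, and num_proc spreads the
    # work across processes, which is where most of the speedup comes from.
    return raw.map(
        tokenize_batch,
        batched=True,
        batch_size=1000,
        num_proc=num_proc,
        remove_columns=raw.column_names,
    )
```

Happy to adapt this to the function's actual field names and labeling logic in the PR.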