Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multithreading issue with TensorFlow and PyTorch dataloader #44

Merged
merged 16 commits into from
Mar 17, 2023

Conversation

hariharan-devarajan
Copy link
Collaborator

@hariharan-devarajan hariharan-devarajan commented Mar 10, 2023

For both data loaders, multi-threading needed more implementation.

  • for TensorFlow, we need to include reader threads and then divide samples between ranks and reader threads. Fixes Multi-threaded reading for Tensorflow #42
  • Same for PyTorch, we need to init the worker correctly otherwise, it uses just one worker. Fixes PyTorch no profiling data from other I/O threads. #43
  • Additionally, fixed support to have shared or multi-file work till we have total samples more than comm size * threads * batch size Fixes Fix single file sharing case Bug. #40
  • Distribution of files and samples was happening in the reader and data loader, which adds a lot of overhead to reading. Instead, this is done as a part of the epoch loop before the reading starts. This reduces the cost of operations during reading to be only I/O. This simplified several pieces of code in the data loader and readers. Fixes Error on running dlio_benchmark #46

@hariharan-devarajan
Copy link
Collaborator Author

Pytorch
Tensorflow

@hariharan-devarajan hariharan-devarajan force-pushed the main branch 2 times, most recently from d31593c to 6b655aa Compare March 10, 2023 09:29
@hariharan-devarajan hariharan-devarajan changed the title Multithreading issue with TF and PyT data loader Multithreading issue with TensorFlow and PyTorch dataloader Mar 10, 2023
@hariharan-devarajan hariharan-devarajan force-pushed the main branch 2 times, most recently from 05dbf87 to 66f5d5e Compare March 10, 2023 11:33
replace glob with pathlib as it is not able to see recent files.
# Conflicts:
#	src/utils/config.py
Copy link
Member

@zhenghh04 zhenghh04 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll merge this PR with some changes. Still need to think about the flie map to different threads

Comment on lines 69 to 73
self.output_folder = self.args.output_folder
self.logfile = os.path.join(self.output_folder, self.args.log_file)
self.data_folder = self.args.data_folder
os.makedirs(self.output_folder, exist_ok=True)
self.storage_root = self.args.storage_root
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this after line 81

@zhenghh04 zhenghh04 merged commit b89eb3d into argonne-lcf:main Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants