Don't export to collections for all workers with unsupported distrib training #263
Changes from all commits
@@ -436,12 +436,13 @@ def _initialize_writers(self, only_initialize_if_missing=False) -> None:
             self.first_process = True
             self.logger.info(f"Hook is writing from the hook with pid: {os.getpid()}\n")
         else:
-            self.logger.warn(
-                f"Unsupported Distributed Training Strategy Detected.\n\
-                Sagemaker-Debugger will only write from one process.\n\
-                The process with pid: {os.getpid()} will not be writing any data. \n"
-            )
+            if self.first_process is None:
+                self.logger.warn(
+                    f"Unsupported Distributed Training Strategy Detected. \
+                    Sagemaker-Debugger will only write from one process. \
+                    The process with pid: {os.getpid()} will not be writing any data. \n"
+                )
+                self.first_process = False
Comment on lines -440 to -444:
We should probably move this log to some other location where it is called fewer times, instead of completely removing it?

Reply: Fixed this in the commit I just pushed.
             return

         if self.save_all_workers is False:
@@ -546,6 +547,13 @@ def set_mode(self, mode):

     def export_collections(self):
         num_workers = self._get_num_workers()
+        if num_workers == 1 and self.first_process is False:
+            self.logger.warn(
+                f"Unsupported Distributed Training Strategy Detected. \
+                Sagemaker-Debugger will only write from one process. \
+                The process with pid: {os.getpid()} will not be writing any data. \n"
+            )
+            return
         if self.save_all_workers is False:
             if self.chief_worker != self.worker:
                 return

NihalHarish marked this conversation as resolved.

Comment: Similar log statement here.

Reply: Added in the latest commit.
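The hunk above makes export_collections an early-return no-op on processes that are not the designated writer. A minimal runnable sketch of that gating logic, assuming illustrative stand-ins: the class, its constructor arguments, and the `exported` flag are hypothetical scaffolding added so the behavior is observable, not part of the real hook.

```python
import logging
import os

logger = logging.getLogger("smdebug-export-sketch")
logging.basicConfig(level=logging.WARNING)


class ExportSketch:
    """Illustrative stand-in for the hook; attribute names mirror the diff."""

    def __init__(self, num_workers, first_process, save_all_workers,
                 chief_worker, worker):
        self._num_workers = num_workers
        self.first_process = first_process  # None / True / False, as in the diff
        self.save_all_workers = save_all_workers
        self.chief_worker = chief_worker
        self.worker = worker
        self.exported = False  # hypothetical flag standing in for the real export

    def _get_num_workers(self):
        return self._num_workers

    def export_collections(self):
        num_workers = self._get_num_workers()
        # New early return: under an unsupported distributed strategy the
        # framework reports a single worker, and only the first process
        # (first_process is True) is allowed to write.
        if num_workers == 1 and self.first_process is False:
            logger.warning(
                "Unsupported Distributed Training Strategy Detected. "
                "Sagemaker-Debugger will only write from one process. "
                f"The process with pid: {os.getpid()} will not be writing any data."
            )
            return
        # Pre-existing gating: when not saving all workers, only the chief exports.
        if self.save_all_workers is False:
            if self.chief_worker != self.worker:
                return
        self.exported = True  # stand-in for the real collection export
```

With these stand-ins, the writer process exports while a non-first process under an unsupported strategy returns early without writing.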
Comment:
Move this log to line 439 and then we will be logging only once:

    if self.first_process is None:
        log
        self.first_process = False

Reply: Added in the latest commit.
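The reviewer's suggestion is a log-once guard: use the tri-state `first_process` attribute (None = undecided, True/False = decided) so the warning fires only on the first call. A minimal sketch, assuming hypothetical names (`HookSketch`, `is_writer_process`) that are not in the real codebase:

```python
import logging
import os

logger = logging.getLogger("smdebug-guard-sketch")
logging.basicConfig(level=logging.WARNING)


class HookSketch:
    """Minimal sketch of the reviewer's log-once guard; names are illustrative."""

    def __init__(self):
        # None means "not yet determined"; once True or False is assigned,
        # the warning branch is never entered again.
        self.first_process = None

    def _initialize_writers(self, is_writer_process):
        if is_writer_process:
            self.first_process = True
            logger.info(f"Hook is writing from the hook with pid: {os.getpid()}")
        elif self.first_process is None:
            # Warn only on the first call; subsequent calls skip this branch
            # because first_process is no longer None.
            logger.warning(
                "Unsupported Distributed Training Strategy Detected. "
                "Sagemaker-Debugger will only write from one process. "
                f"The process with pid: {os.getpid()} will not be writing any data."
            )
            self.first_process = False


hook = HookSketch()
hook._initialize_writers(is_writer_process=False)  # logs the warning once
hook._initialize_writers(is_writer_process=False)  # silent: already decided
```

The same guard is what the diff's `if self.first_process is None:` block implements in `_initialize_writers`, which is why `export_collections` can later test `self.first_process is False` without re-logging on every call.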