[SDK] Add information about TrainingClient logging #1973
Conversation
@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13. Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Pull Request Test Coverage Report for Build 7423527274. Warning: This coverage report may be inaccurate. We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
💛 - Coveralls
Won't this be a bad experience for a user who is trying
I would like to ask what @droctothorpe feels about this. It might be better to output these logs at the debug level.
@johnugeorge Why is it a bad experience? Users can use this parameter to get logs in StdOut and use it in their program to track the Job until it is complete.
With this PR, the user won't see any output when they call
Why won't users see the output? I just tested it and it worked for me.
I mentioned above in #1973 (comment) that if the user wants to also see
Here's the general principle directly from the official Python docs:
For example, imagine a training operator SDK consumer sets their log level to

That being said, the above criticism applies to using

HOWEVER (so many twists and turns), this specific logger is used in multiple other places within this module, e.g. here. That's a problem.

IMO, you can add another handler just for

Here's another link from the Python docs with additional guidelines:

Sorry for the wall of text! Take the above with a grain of salt. I'm not an expert in this area.
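For reference, the library-side pattern the Python docs recommend looks roughly like the sketch below; it is illustrative only, not the SDK's actual code.

```python
import logging

# Library module: create a logger but configure no levels or real handlers;
# a NullHandler only suppresses the "no handlers could be found" warning.
logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())


def do_work():
    # Emitted only if the *application* configured logging (e.g. basicConfig).
    logger.debug("internal detail useful for debugging")
```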
@droctothorpe Is it actually true? I guess the user can override the logging level in their application as follows:
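(The original snippet from this comment isn't preserved in this extract; a minimal sketch of such an override is shown below. The logger name `kubeflow.training` is an assumption for illustration, not taken from the PR.)

```python
import logging

# Raise the global level so library loggers that inherit from the root
# logger also emit DEBUG records.
logging.basicConfig(level=logging.DEBUG)

# Or target the SDK's logger directly; the name below may differ from the
# SDK's actual module path.
logging.getLogger("kubeflow.training").setLevel(logging.DEBUG)
```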
I guess, in the future, our Python SDK could perform some more application-level actions. For example, building a training image on the fly before creating the PyTorchJob, as discussed here: #1878
I think so, though we can always test to confirm. It's why the Python docs say this:
This is an oddly contentious topic. Python purists will argue that libraries should never include handlers. But SDKs like the training operator SDK are often meant to be used interactively inside of Jupyter notebooks in a way that's not so different from how CLIs operate, i.e. the library IS the application.
I think one option is to have a dedicated logger with a dedicated handler that (this part is key) isn't used in any other context. Another option is to use
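A rough sketch of that first option, with hypothetical names and handler setup (not the SDK's actual code): the module-wide logger stays unconfigured, while a separate logger is used only for the follow-logs path.

```python
import logging

# Module-wide logger: no handler, so it simply inherits whatever the
# application configures (basicConfig, dictConfig, etc.).
logger = logging.getLogger(__name__)

# Dedicated logger used ONLY for streaming pod logs to stdout when follow=True.
follow_logger = logging.getLogger(__name__ + ".follow")
follow_logger.setLevel(logging.INFO)
follow_logger.addHandler(logging.StreamHandler())
follow_logger.propagate = False  # keep this output independent of the app's config
```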
I've tested it, and I can override the SDK logging level in my application as I mentioned above. When you say that we specify handlers for the logger, could you please explain what exactly you mean?
```python
import logging

logging.basicConfig(level=logging.DEBUG)

logger = logging.getLogger(__name__)

logger_info = logging.getLogger(__name__ + "info")
logger_info.setLevel(logging.INFO)

logger.debug("This debug log should print")
logger_info.debug("This debug log should not print because we're overriding the application level handler level")
logger_info.info("This info log should print")
```

```
>>> DEBUG:__main__:This debug log should print
>>> INFO:__main__info:This info log should print
```

You want the application developer to be able to control the logging level globally, except in the very specific case of

```python
import logging

logging.basicConfig(level=logging.DEBUG)

logger = logging.getLogger(__name__)

logger_info = logging.getLogger(__name__ + "info")
logger_info.setLevel(logging.INFO)

logger.debug("This debug log should print")
logger.debug("This debug log should print because this logger responds to the logging level set in basicConfig.")
logger_info.info("This info log should print")
```

```
>>> DEBUG:__main__:This debug log should print
>>> DEBUG:__main__:This debug log should print because this logger responds to the logging level set in basicConfig.
>>> INFO:__main__info:This info log should print
```

The second to last line prints after all, i.e. application level control is respected. All of which is to say that if you're going to set the logging level in the handler, just make sure to use it only where it's strictly necessary / intentional, and use a logger that responds to global logging levels (via inheritance) everywhere else.
That sounds good, thanks for the clarification @droctothorpe. Please take a look at the docs updates.
/lgtm
/assign @droctothorpe
"""Get the logs for every Training Job pod. By default it returns logs from | ||
the `master` pod. Logs are returned in this format: { "pod-name": "Log data" }. | ||
|
||
If follow = True, this function prints logs to StdOut and returns None. |
What if a user wants to follow and interact with the response once the job completes? Any reason not to return the `logs_dict` even if `follow` is set to `True`?
> What if a user wants to follow and interact with the response once the job completes?

The pods will be garbage collected based on the Job's runPolicy, such as the cleanup policy.
Additionally, we cannot guarantee that we can get logs from completed pods, since the pods may be removed for reasons such as node disruption.
I agree, let's return logs if `follow=True` as well.
Regardless of the runPolicy, it would be useful for the user to get the final results after running `get_job_logs`.
I made that change @tenzen-y @droctothorpe.
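With that change, the caller-side behavior would look roughly like the sketch below. This is a hedged example: the client construction and the `name` value are assumptions, and the returned shape follows the docstring quoted above (`{ "pod-name": "Log data" }`).

```python
from kubeflow.training import TrainingClient

client = TrainingClient()

# Streams pod logs to stdout while the job runs and, after this change,
# still returns the accumulated logs once the job completes.
logs = client.get_job_logs(name="pytorch-dist", follow=True)
print(logs.keys())  # e.g. dict_keys(['pytorch-dist-master-0'])
```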
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, droctothorpe

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Co-authored-by: Alex <mythicalsunlight@gmail.com>
/assign @tenzen-y @droctothorpe
/hold cancel
/lgtm
Fixes: #1946.
As we discussed in the issue, for the `get_job_logs` API the user will see pod logs only if `follow=True`, and we use `logger.info()` to print them. If users want to see all messages from TrainingClient APIs, they can override the default logger config in their program:
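(The original snippet from the PR description isn't preserved in this extract; a minimal example of such an override would be:)

```python
import logging

# Show DEBUG and INFO messages from the SDK in addition to the default output.
logging.basicConfig(level=logging.DEBUG)
```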
Also, I fixed the `get_pytorchjob_template` func, since we just need to check if `num_procs_per_worker` and `num_worker_replicas` are set before using them.

UPDATE: After discussion on this PR, we decided to use `print()` for the messages that users are required to see while using the SDK.

/assign @droctothorpe @johnugeorge @tenzen-y @deepanker13
/hold for the review