[SDK] Add information about TrainingClient logging #1973

andreyvelich · 2024-01-03T19:32:24Z

Fixes: #1946.

~~As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.~~
If users want to see all messages from TrainingClient APIs, they can override the default logger config in their program:

log = logging.getLogger("kubeflow.training.api.training_client")
log.setLevel(logging.DEBUG)

Also, I fixed get_pytorchjob_template func, since we just need to check if num_procs_per_worker and num_worker_replicas is set before using it.

UPDATE: After discussion on this PR, we decided to use print() for the messages that users are required to see while using SDK.

/assign @droctothorpe @johnugeorge @tenzen-y @deepanker13

/hold for the review

google-oss-prow · 2024-01-03T19:32:28Z

@andreyvelich: GitHub didn't allow me to assign the following users: droctothorpe, deepanker13.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

Fixes: #1946.

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.
If users want to see all messages from TrainingClient APIs, they can override the default logger config in their program:
log = logging.getLogger("kubeflow.training.api.training_client")
log.setLevel(logging.DEBUG)
Also, I fixed get_pytorchjob_template func, since we just need to check if num_procs_per_worker and num_worker_replicas is set before using it.

/assign @droctothorpe @johnugeorge @tenzen-y @deepanker13

/hold for the review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

coveralls · 2024-01-03T19:36:54Z

Pull Request Test Coverage Report for Build 7423527274

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.07%) to 42.828%

Totals
Change from base Build 7285460954:	-0.07%
Covered Lines:	3750
Relevant Lines:	8756

💛 - Coveralls

johnugeorge · 2024-01-04T11:19:30Z

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.

Won't this be a bad experience for a user who is trying get_job_logs when follow is True? Is there a need to remove it?

tenzen-y · 2024-01-04T13:16:55Z

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.

Won't this be a bad experience for a user who is trying get_job_logs when follow is True? Is there a need to remove it?

I would like to ask what @droctothorpe feels about this. It might be output logs at a debug level.

andreyvelich · 2024-01-04T15:52:30Z

As we discussed, in the issue for get_job_logs API, user will see pod logs only if follow=True and we use logger.info() to print it.

Won't this be a bad experience for a user who is trying get_job_logs when follow is True? Is there a need to remove it?

@johnugeorge Why it is bad experience ? Users can use this parameter to get logs in StdOut and use it in their program to track until Job is complete.
If they just want to get the current logs they can always use get_job_logs without follow parameter.

johnugeorge · 2024-01-04T17:06:12Z

With this PR, user won't see any output when they call get_job_logs with follow parameter. How will they get notified to use logger.info() in such a case ?

andreyvelich · 2024-01-04T17:26:22Z

With this PR, user won't see any output when they call get_job_logs with follow parameter. How will they get notified to use logger.info() in such a case ?

Why users won't see the output ? I just test it and it worked for me.
As you can see in my PR, I modified the default logging level for our logger:

logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

I mentioned above: #1973 (comment) that if user wants to also see debug logs, they can modify the kubeflow.training.api.training_client logger.

droctothorpe · 2024-01-04T18:04:10Z

Here's the general principle directly from the official Python docs:

Configuring logging by adding handlers, formatters and filters is the responsibility of the application developer, not the library developer. If you are maintaining a library, ensure that you don’t add handlers to any of your loggers other than a NullHandler instance.

-- https://docs.python.org/3/howto/logging-cookbook.html#adding-handlers-other-than-nullhandler-to-a-logger-in-a-library

For example, imagine a training operator SDK consumer sets their log level to WARNING. That configuration will effectively be ignored by this logging handler and INFO statements will nevertheless print because log level inheritance is overridden by the handler instantiation and call to setLevel.

That being said, the above criticism applies to using print as well! When follow is set to True, the library kind of temporarily becomes a CLI rather than an SDK, and since the CLI is the application, not just the consumed library, it makes sense for it to set the logging level.

HOWEVER (so many twists and turns), this specific logger is used in multiple other places within this module, e.g. here. That's a problem. The debug logs will never print, even if the consumer sets the logging level to debug, because the call to setLevel in this module overrides that.

IMO, you can add another handle just for follow=True to this module, but sticking with print is probably simpler.

Here's another link from the Python docs with additional guidelines:
https://docs.python.org/3/howto/logging.html#library-config

Sorry for the wall of text! Take the above with a grain of salt. I'm not an expert in this area.

andreyvelich · 2024-01-04T20:40:59Z

The debug logs will never print, even if the consumer sets the logging level to debug, because the call to setLevel in this module overrides that.

@droctothorpe Is it actually true? I guess, user can override logging level in their application as follows:

from kubeflow.training import TrainingClient
import logging

def test_job():
   ....
log = logging.getLogger("kubeflow.training.api.training_client")
log.setLevel(logging.DEBUG) 
cl.create_job(name="kubeflow-job", train_func=test_job, num_worker_replicas=1)

...
$ DEBUG:kubeflow.training.api.training_client: PyTorchJob default/kubeflow-job has been created

That being said, the above criticism applies to using print as well! When follow is set to True, the library kind of temporarily becomes a CLI rather than an SDK, and since the CLI is the application, not just the consumed library, it makes sense for it to set the logging level.

I guess, in the future our Python SDK can perform some more application-level actions. For example, build training image on a fly before creating PyTorchJob as discussed here: #1878
How we should send user useful information in that case.

droctothorpe · 2024-01-04T21:18:36Z

@droctothorpe Is it actually true?

I think so, though we can always test to confirm. It's why the Python docs say this:

Note It is strongly advised that you do not add any handlers other than NullHandler to your library’s loggers. This is because the configuration of handlers is the prerogative of the application developer who uses your library. The application developer knows their target audience and what handlers are most appropriate for their application: if you add handlers ‘under the hood’, you might well interfere with their ability to carry out unit tests and deliver logs which suit their requirements.

This is an oddly contentious topic. Python purists will argue that libraries should never include handlers. But SDKs like the training operator SDK are often meant to be used interactively inside of Jupyter notebooks in a way that's not so different from how CLIs operate, i.e. the library IS the application.

How we should send user useful information in that case.

I think one option is to have a dedicated logger that (this part is key) isn't used in any other context with a dedicated handler. Another option is to use print. That's what KFP does in a couple of places, such as here. You can override a handler's level like you demonstrated, which I think might give the dedicated handler pattern the edge.

andreyvelich · 2024-01-04T21:48:24Z

I think so, though we can always test to confirm. It's why the Python docs say this:

I've tested it, and I can override the SDK logging level in my application as I messaged above.

When you say that we specify handlers for the logger, please can you explain what exactly do you mean ?
e.g. the following code returns empty list for me:

log = logging.getLogger("kubeflow.training.api.training_client")
print(log.handlers)

$ []

droctothorpe · 2024-01-04T22:04:10Z

import logging

logging.basicConfig(level=logging.DEBUG)

logger = logging.getLogger(__name__)
logger_info = logging.getLogger(__name__ + "info")
logger_info.setLevel(logging.INFO)

logger.debug("This debug log should print")
logger_info.debug("This debug log should not print because we're overriding the application level handler level")
logger_info.info("This info log should print")

>>> DEBUG:__main__:This debug log should print
>>> INFO:__main__info:This info log should print

You want the application developer to be able to control the logging level globally, except in the very specific case of logger_info, which is analogous to print. In the above example, the second to last line (logger_info.debug) doesn't print to stdout, despite the fact that logging.basicConfig(level=logging.DEBUG) is set. If you change the code as follows:

import logging

logging.basicConfig(level=logging.DEBUG)

logger = logging.getLogger(__name__)
logger_info = logging.getLogger(__name__ + "info")
logger_info.setLevel(logging.INFO)

logger.debug("This debug log should print")
logger.debug("This debug log should print because this logger responds to the logging level set in basicConfig.")
logger_info.info("This info log should print")

>>> DEBUG:__main__:This debug log should print
>>> DEBUG:__main__:This debug log should print because this logger responds to the logging level set in basicConfig.
>>> INFO:__main__info:This info log should print

The second to last line prints after all, i.e. application level control is respected. All of which is to say that if you're going to set logging level in the handler, just make sure to use it only where it's strictly necessary / intentional, and use a logger that responds to global logging levels (via inheritance) everywhere else.

andreyvelich · 2024-01-04T22:42:18Z

That's sound good, thanks for the clarification @droctothorpe.
Let's keep print() to log application-level logs for now. If users want to run SDK in debug mode, they can always configure their logging as I described in the TrainingClient constructor.
We can evaluate with our users to see if we need to change/improve it.

Please take a look at the docs updates.

tenzen-y

/lgtm
/assign @droctothorpe

sdk/python/kubeflow/training/api/training_client.py

droctothorpe · 2024-01-05T04:48:01Z

sdk/python/kubeflow/training/api/training_client.py

+        """Get the logs for every Training Job pod. By default it returns logs from
+        the `master` pod. Logs are returned in this format: { "pod-name": "Log data" }.
+
+        If follow = True, this function prints logs to StdOut and returns None.


What if a user wants to follow and interact with the response once the job completes? Any reason not to return the logs_dict even if follow is set to True?

What if a user wants to follow and interact with the response once the job completes?

The pods will be garbage collected based on the job's some runPolicy such as cleanup policy.
Additionally, we cannot guarantee to get logs from completed pods since the pods are removed for some reasons such as node disruption.

I agree, let's return logs if follow=True as well.
Despite on the runPolicy if would be useful for user to get the final results after running get_job_logs.
I made that change @tenzen-y @droctothorpe.

google-oss-prow · 2024-01-05T05:41:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, droctothorpe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~sdk/python/OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Co-authored-by: Alex <mythicalsunlight@gmail.com>

andreyvelich · 2024-01-05T15:18:11Z

/assign @tenzen-y @droctothorpe

andreyvelich · 2024-01-05T15:24:59Z

/hold cancel

tenzen-y

/lgtm

andreyvelich added 2 commits January 3, 2024 17:30

[SDK] Use logging to print Job logs

ae532cc

Update func description

07dfc84

google-oss-prow bot assigned johnugeorge Jan 3, 2024

google-oss-prow bot added the do-not-merge/hold label Jan 3, 2024

google-oss-prow bot assigned tenzen-y Jan 3, 2024

google-oss-prow bot requested review from jinchihe and kuizhiqing January 3, 2024 19:32

google-oss-prow bot added size/S approved labels Jan 3, 2024

Add info to configure logging in debug mode

804e8b7

andreyvelich changed the title ~~[SDK] Use logging to print Job logs~~ [SDK] Add information about TrainingClient logging Jan 4, 2024

tenzen-y reviewed Jan 5, 2024

View reviewed changes

google-oss-prow bot assigned droctothorpe Jan 5, 2024

google-oss-prow bot added the lgtm label Jan 5, 2024

droctothorpe reviewed Jan 5, 2024

View reviewed changes

sdk/python/kubeflow/training/api/training_client.py Outdated Show resolved Hide resolved

droctothorpe reviewed Jan 5, 2024

View reviewed changes

droctothorpe approved these changes Jan 5, 2024

View reviewed changes

Update sdk/python/kubeflow/training/api/training_client.py

1e66250

Co-authored-by: Alex <mythicalsunlight@gmail.com>

google-oss-prow bot removed the lgtm label Jan 5, 2024

Return logs if follow is True

29cdf51

google-oss-prow bot removed the do-not-merge/hold label Jan 5, 2024

tenzen-y reviewed Jan 5, 2024

View reviewed changes

google-oss-prow bot added the lgtm label Jan 5, 2024

google-oss-prow bot merged commit ed3168a into kubeflow:master Jan 5, 2024
33 checks passed

andreyvelich deleted the issue-1946-use-logging-sdk branch January 5, 2024 15:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SDK] Add information about TrainingClient logging #1973

[SDK] Add information about TrainingClient logging #1973

andreyvelich commented Jan 3, 2024 •

edited

Loading

google-oss-prow bot commented Jan 3, 2024

coveralls commented Jan 3, 2024 •

edited

Loading

johnugeorge commented Jan 4, 2024

tenzen-y commented Jan 4, 2024

andreyvelich commented Jan 4, 2024

johnugeorge commented Jan 4, 2024

andreyvelich commented Jan 4, 2024

droctothorpe commented Jan 4, 2024 •

edited

Loading

andreyvelich commented Jan 4, 2024 •

edited

Loading

droctothorpe commented Jan 4, 2024

andreyvelich commented Jan 4, 2024

droctothorpe commented Jan 4, 2024 •

edited

Loading

andreyvelich commented Jan 4, 2024

tenzen-y left a comment

droctothorpe Jan 5, 2024

tenzen-y Jan 5, 2024 •

edited

Loading

andreyvelich Jan 5, 2024

google-oss-prow bot commented Jan 5, 2024

andreyvelich commented Jan 5, 2024

andreyvelich commented Jan 5, 2024

tenzen-y left a comment

[SDK] Add information about TrainingClient logging #1973

[SDK] Add information about TrainingClient logging #1973

Conversation

andreyvelich commented Jan 3, 2024 • edited Loading

google-oss-prow bot commented Jan 3, 2024

coveralls commented Jan 3, 2024 • edited Loading

Pull Request Test Coverage Report for Build 7423527274

Warning: This coverage report may be inaccurate.

💛 - Coveralls

johnugeorge commented Jan 4, 2024

tenzen-y commented Jan 4, 2024

andreyvelich commented Jan 4, 2024

johnugeorge commented Jan 4, 2024

andreyvelich commented Jan 4, 2024

droctothorpe commented Jan 4, 2024 • edited Loading

andreyvelich commented Jan 4, 2024 • edited Loading

droctothorpe commented Jan 4, 2024

andreyvelich commented Jan 4, 2024

droctothorpe commented Jan 4, 2024 • edited Loading

andreyvelich commented Jan 4, 2024

tenzen-y left a comment

Choose a reason for hiding this comment

droctothorpe Jan 5, 2024

Choose a reason for hiding this comment

tenzen-y Jan 5, 2024 • edited Loading

Choose a reason for hiding this comment

andreyvelich Jan 5, 2024

Choose a reason for hiding this comment

google-oss-prow bot commented Jan 5, 2024

andreyvelich commented Jan 5, 2024

andreyvelich commented Jan 5, 2024

tenzen-y left a comment

Choose a reason for hiding this comment

andreyvelich commented Jan 3, 2024 •

edited

Loading

coveralls commented Jan 3, 2024 •

edited

Loading

droctothorpe commented Jan 4, 2024 •

edited

Loading

andreyvelich commented Jan 4, 2024 •

edited

Loading

droctothorpe commented Jan 4, 2024 •

edited

Loading

tenzen-y Jan 5, 2024 •

edited

Loading