Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleaner log messages. #3647

Open
hjoliver opened this issue Jun 4, 2020 · 31 comments
Open

Cleaner log messages. #3647

hjoliver opened this issue Jun 4, 2020 · 31 comments
Milestone

Comments

@hjoliver
Copy link
Member

hjoliver commented Jun 4, 2020

At INFO level, we should just log the timestamped (optionally: #3646) sequence of events in the simplest way possible, unless something goes wrong. There's too much usually-irrelevant crap in there.

E.g. (with two timestamps removed):

INFO - [zz.1] status=running: (received)succeeded for job(01)

This should be something like:

INFO - zz.1 (job 01) succeeded

No need to log current status unless it is not the expected status.

And no need to log "received" either, just modify slightly for the polled case:

INFO - zz.1 (job 01) polled succeeded

Pull requests welcome!

@hjoliver hjoliver added this to the some-day milestone Jun 4, 2020
@hjoliver
Copy link
Member Author

hjoliver commented Jun 4, 2020

With timestamps, these log lines are really verbose:

2020-05-29T12:24:34+12:00 INFO - [y.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(02)

We could at least put the second timestamp ("recieved at"), which is normally the least relevant bit of info, at the end of the line:

2020-05-29T12:24:34+12:00 INFO - foo.1 (job 01) succeeded at 2020-05-29T12:24:33+12:00 

And maybe even omit it if it is sufficiently similar to the log timestamp (within a few seconds, say):

2020-05-29T12:24:34+12:00 INFO - foo.1 (job 01) succeeded

@oliver-sanders
Copy link
Member

oliver-sanders commented Jun 4, 2020

We should definitely clean up log messages, the only real barrier to doing so has been the functional tests which make exact comparisons to log messages and often do nasty things like awk and sed which makes tests hard to fix.

Here are my thoughts on cleaning the log files:

1) State change messages

  • remove time information
    • It's already in the timestamp
    • It's also recorded in the database
  • Remove pervious state
    • Unless changed unexpectedly
  • Remove task communication method
    • Unless state changed due to poll and task communication method != poll
  • Use same #NN job number format as we use in the GUI & TUI.
- 2020-05-29T12:24:34+12:00 INFO - [foo.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(01)
+ 2020-05-29T12:24:34+12:00 INFO - foo.1 #01 succeeded

2) Make submitting messages comprehendible

  • Remove <user> from user@host
    • We will no longer support this functionality and the user that Cylc reports might not be the same as the remote user anyway (due to SSH config).
- 2020-06-04T09:54:10Z INFO - [foo.1] -submit-num=01, owner@host=<user>@<host>
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 submitting on <host>

3) Only log client commands if they are actually informational

  • At the moment we log all client commands, this system is up for a re-write with the move from zmq endpoints to GraphQL schema anyway.
  • Only log mutations? We don't really care about:
    • put_messages
    • GraphQL queries (e.g. from the UI Server or Tui)
- 2020-06-04T10:14:21Z INFO - [client-command] put_messages <user>@<host>:cylc-message
- 2020-06-04T10:14:26Z INFO - [client-command] graphql <user>@<host>.local:cylc-tui

4) Consider demoting triggering items to DEBUG

  • Users don't care about triggering.
  • For the most part neither do we!
- 2020-06-04T10:18:15Z INFO - [foot.20191209T1200Z] -triggered off []
+ 2020-06-04T10:18:15Z DEBUG - foot.20191209T1200Z - triggered off []

5) Demote task failure messages from CRITICAL to INFO

  • Failed is no more important than succeeded if the failure is handled correctly.
  • If the failure is not handled we will get a suite stalled warning anyway.
  • Remove the duplicate CRITICAL message whilst we are at it.
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] -job(01) failed
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 failed (EXIT)

6) Consider demoting task messages to a new log level

  • Create a new logging level between DEBUG and INFO for task messages.
  • Thi will help to separate scheduler and task logging.
- 2020-06-04T09:54:10Z INFO - [foo.1] status=running: (received)succeeded at 2020-05-29T12:24:33+12:00 for job(01)
+ 2020-06-04T09:54:10Z TASK - foo.1 #01 succeeded
- 2020-06-04T09:54:10Z INFO - [bar.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
+ 2020-06-04T09:54:10Z TASK - bar.1 #01 failed (EXIT)

7) Consider tidying health check messages

  • This information is quite useful for debugging flows when polling issues occur so i'm cautious to remove.
  • Could the info be written somewhere else, e.g. the job activity log.
- 2020-06-04T10:16:36Z INFO - [foo.1] -health check settings: execution timeout=None

Or:

- 2020-06-04T10:16:36Z INFO - [foo.1] -health check settings: execution timeout=None
+ 2020-06-04T10:16:36Z DEBUG - foo.1 health check settings: execution timeout=None

8) Change DEBUG to DEBG

- 2020-06-04T10:45:45Z DEBUG - 4 task(s) de-queued
+ 2020-06-04T10:45:45Z DEBG - 4 task(s) de-queued
  2020-06-04T10:45:45Z INFO - [foo_cold.20191209T0900Z] -submit-num=01,
	<owner>@<host>

Example

Before

2020-06-04T10:16:33Z INFO - Run: (re)start=0 log=1                                   
2020-06-04T10:16:33Z INFO - Cylc version: 8.0a1
2020-06-04T10:16:33Z INFO - Run mode: live
2020-06-04T10:16:33Z INFO - Initial point: 20191209T0900Z
2020-06-04T10:16:33Z INFO - Final point: None
2020-06-04T10:16:33Z INFO - Cold Start 20191209T0900Z
2020-06-04T10:16:34Z INFO - [foo_cold.20191209T0900Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foo.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foot.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [fool.20191209T1200Z] -submit-num=01, owner@host=<host>
2020-06-04T10:16:34Z INFO - [foo_cold.20191209T0900Z] -triggered off []
2020-06-04T10:16:34Z INFO - [foo.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [foot.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [fool.20191209T1200Z] -triggered off []
2020-06-04T10:16:34Z INFO - [client-command] graphql <user>@<host>:cylc-tui
2020-06-04T10:16:35Z INFO - [foo.20191209T1200Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)
2020-06-04T10:16:35Z INFO - [foo.20191209T1200Z] -health check settings: submission timeout=None
2020-06-04T10:16:35Z INFO - [foo_cold.20191209T0900Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)
2020-06-04T10:16:35Z INFO - [foo_cold.20191209T0900Z] -health check settings: submission timeout=None
2020-06-04T10:16:35Z INFO - [fool.20191209T1200Z] status=ready: (internal)submitted at 2020-06-04T10:16:35Z for job(01)

After

2020-06-04T10:16:33Z INFO - Run: (re)start=0 log=1                                   
2020-06-04T10:16:33Z INFO - Cylc version: 8.0a1
2020-06-04T10:16:33Z INFO - Run mode: live
2020-06-04T10:16:33Z INFO - Initial point: 20191209T0900Z
2020-06-04T10:16:33Z INFO - Final point: None
2020-06-04T10:16:33Z INFO - Cold Start 20191209T0900Z
2020-06-04T10:16:34Z TASK - foo_cold.20191209T0900Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - foo.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - foot.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z TASK - fool.20191209T1200Z #01 submitting on <host>
2020-06-04T10:16:34Z DEBG - foo_cold.20191209T0900Z - triggered off []
2020-06-04T10:16:34Z DEBG - foo.20191209T1200Z - triggered off []
2020-06-04T10:16:34Z DEBG - foot.20191209T1200Z - triggered off []
2020-06-04T10:16:34Z DEBG - fool.20191209T1200Z - triggered off []
2020-06-04T10:16:35Z TASK - foo.20191209T1200Z submitted
2020-06-04T10:16:35Z TASK - foo_cold.20191209T0900Z submitted
2020-06-04T10:16:35Z TASK - fool.20191209T1200Z submitted

@hjoliver
Copy link
Member Author

hjoliver commented Jun 4, 2020

@oliver-sanders, I agree with all suggestions above. I like the idea of using a TASK logging level for task messages, so long as they are not "demoted" in the sense that a deliberate action is required to turn that level on, because task state changes are pretty much the whole purpose of the scheduler.

@hjoliver
Copy link
Member Author

hjoliver commented Jun 4, 2020

Also, not very difficult to implement. It will break a bunch of functional tests, but they'll be relatively easy to fix. I'll optimistically attach a "small" label 😁

@hjoliver hjoliver added the small label Jun 4, 2020
@oliver-sanders
Copy link
Member

oliver-sanders commented Jun 5, 2020

Agreed, it is a small task, it's just that there are lots of loose ends to tidy up!

I like the idea of using a TASK logging level for task messages, so long as they are not "demoted

When we create the suite loggers we would set the level to TASK so they would be visible by default.

A curious side-effect would be the ability to create pure suite or task logs.

@oliver-sanders
Copy link
Member

oliver-sanders commented Jun 11, 2020

Oh, and another one:

9) Remove the debug line which accompanies state changes.

  • Duplicate information, not needed
- 2020-06-11T15:24:27+01:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-06-11T15:24:26+01:00 for job(01)
- 2020-06-11T15:24:27+01:00 DEBUG - [foo.1] -ready => submitted
+ 2020-06-11T15:24:27+01:00 TASK - foo.1 #01 submitted

@oliver-sanders
Copy link
Member

Post SoD I'll add another one:

10) Remove flow labels if only one flow is active.

  • Reflow is a niche feature which most users won't understand or use.
  • When there is only one active flow label there is no need to log it with every task event.
  • We should log when flow labels are created, stopped or merged.
- 2020-06-11T15:24:27+01:00 INFO - [foo.1] status=ready: (internal)submitted at 2020-06-11T15:24:26+01:00 for job(01) flow(X)
+ 2020-06-11T15:24:27+01:00 TASK - foo.1 #01 submitted

@hjoliver
Copy link
Member Author

hjoliver commented Aug 5, 2020

  1. Remove flow labels if only one flow is active.

So long as we're careful with the definition of "active" here. If we hold the main flow, then trigger a reflow, then resume the main flow later after the reflow is done, it should not look like (in the log) it was all the same flow because there was only ever one of them "active" at a time.

@oliver-sanders
Copy link
Member

So long as we're careful with the definition of "active" here

Would have thought that FlowLabelMgr.inuse should be sufficient?

@hjoliver
Copy link
Member Author

hjoliver commented Aug 6, 2020

Yes, that will be sufficient 👍 (Just saying, "active" has to mean "there are tasks present with this flow label" - but those tasks don't themselves have to be "active"!)

@MetRonnie
Copy link
Member

MetRonnie commented Sep 6, 2021

A new logging level between INFO and DEBUG would be useful for cylc clean -v. That would tell you what exactly files are being deleted without any of the other debug level stuff. This level would be more general than TASK; not sure what it would be called. Maybe VERBOSE, or NOTE? It might also be nice to have it in the same grey colour as debug messages. (Edit: actually NOTE sounds to me like something that would go in between INFO and WARNING)

@hjoliver
Copy link
Member Author

hjoliver commented Sep 6, 2021

(In case anyone plans to work on this in the near future, note there are some cosmetic changes to task proxy logging already coming in #4300).

@wxtim wxtim removed their assignment Sep 6, 2021
@oliver-sanders oliver-sanders modified the milestones: some-day, cylc-8.0.0 Dec 14, 2021
@oliver-sanders
Copy link
Member

Tagging this against 8.0.0, it would be good to tackle some of this during the 8.0.0 stabilisation period.

@wxtim wxtim self-assigned this Dec 14, 2021
@wxtim
Copy link
Member

wxtim commented Dec 14, 2021

I'll assign myself - I'd intended to poke this anyways

@wxtim
Copy link
Member

wxtim commented Dec 21, 2021

Thoughts on updates to @oliver-sanders proposal in the light of looking at the current code

1) State change messages

  • remove time information
    • It's already in the timestamp
    • It's also recorded in the database
  • Remove previous state
    • Unless changed unexpectedly
  • Remove task communication method
    • Unless state changed due to poll and task communication method != poll
  • Use same #NN job number format as we use in the GUI & TUI.
  • Change the task proxy __str__ method:
    • To reflect universal ID if it hasn't been done.
    • To reflect the changeability of components in a similar way to ISODatetime: i.e. - Flow, ID (multiple tasks in a flow), submit (multiple submits per reflow), state (multiple state per submit)
- [foo.1 job:01 flows:1 preparing] (internal)submitted 2020-05-29T12:24:34+12:00
+ 2020-05-29T12:24:34+12:00 INFO - [flow:1 1/foo #01 submitted]
  • This should be accompanied by a slide in the course about interpreting this.

2) Make submitting messages comprehensible

  • Remove <user> from user@host
    • We will no longer support this functionality and the user that Cylc reports might not be the same as the remote user anyway (due to SSH config).
- 2020-06-04T09:54:10Z INFO - [foo.1] -submit-num=01, owner@host=<user>@<host>
+ 2020-06-04T09:54:10Z INFO - [flow:1 foo.1 #01] submitting on <host>

5) Demote task failure messages from CRITICAL to INFO

@wxtim Votes for WARNING**

  • Failed is no more important than succeeded if the failure is handled correctly.
  • If the failure is not handled we will get a suite stalled warning anyway.
  • Remove the duplicate CRITICAL message whilst we are at it.
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] status=running: (received)failed/EXIT at 2020-06-04T10:18:18Z for job(01)
- 2020-06-04T09:54:10Z CRITICAL - [foo.1] -job(01) failed
+ 2020-06-04T09:54:10Z INFO - foo.1 #01 failed (EXIT)

@oliver-sanders
Copy link
Member

Ideas:

  • Don't log the flow numbers if no additional flows have been created (useless info).
  • Don't log the task state inside the [square brackets] (duplicate info).
  • Log task failures to INFO because if any compulsory outputs are not satisfied they will be logged separately.
  • Post Universal ID consider using the full Job ID e.g. [cycle/task/job] rather than [task.cycle #job].
  • The timestamp on state changes is only useful in the event of polling (when the event time may differ from the log time).

@hjoliver
Copy link
Member Author

Updated terminology (for logging of failures):

Failed is no more important than succeeded if the failure is handled correctly.

  • ... if failure is expected or success is optional

If the failure is not handled we will get a suite stalled warning anyway

  • ... if success is expected we will get an incomplete output warning on failure

@hjoliver
Copy link
Member Author

hjoliver commented Dec 21, 2021

Ideas:

  • Don't log the flow numbers if no additional flows have been created (useless info).

True if the only flow number in the whole run is 1, but as soon as another flow gets created we need to start logging it even if only one flow is present, because two flows can cover the same parts of the graph at entirely separate wall-clock times. So I think it's best to log it all the time for consistency, which makes log parsing easier. So let's just make it really concise.

  • Don't log the task state inside the [square brackets] (duplicate info).

It's not really duplicate info, because scanning back up the log to find the most recent previous state change is painful. And at state transitions we log [current state] => new state. However, I think we can drop it on the basis that the state can generally be inferred from the logged event.

  • Log task failures to INFO because if any compulsory outputs are not satisfied they will be logged separately

👍

  • Post Universal ID consider using the full Job ID e.g. [cycle/task/job] rather than [task.cycle #job].

👍

  • The timestamp on state changes is only useful in the event of polling (when the event time may differ from the log time).

👍

@hjoliver
Copy link
Member Author

hjoliver commented Dec 21, 2021

flow number logging proposal

On current master, flow numbers are not part of the task/job ID so should probably go outside of the square brackets.

[cycle/task/job] (flows) MESSAGE  # e.g. [2025/foo/01] (1)  or [2025/foo/01] (1,2,4) 

Note (1) takes up almost no space, so we might as well log it for consistency even if no other flow was started. (We can document what it means, of course, just like job submit number etc.).

@wxtim wxtim assigned wxtim and unassigned wxtim Jan 13, 2022
@wxtim
Copy link
Member

wxtim commented Jan 13, 2022

  1. Demote task failure messages from CRITICAL to INFO
    Failed is no more important than succeeded if the failure is handled correctly.

Can we go for warning? I'm not sure I think that every path is equal, and would like to give this more weight...

@hjoliver
Copy link
Member Author

hjoliver commented Jan 13, 2022

I'd say, failure is no more important than success for scheduling (for the graph), if handled.

But for logging purposes failure should still be highlighted. It shows, by definition, that a task failed to do what it is supposed to do, and even if the failure is handled we might still want to know that handling was required. E.g. if an occasional failure becomes a regular event maybe the failing task should be modified rather than "handled".

@wxtim
Copy link
Member

wxtim commented Jan 14, 2022

@wxtim wxtim removed their assignment Jan 14, 2022
@hjoliver
Copy link
Member Author

[ ] Demote task failure messages from CRITICAL to INFO

In light of my comment above, this needs arguing. IMO it stays CRITICAL or at least WARNING in the log.

@MetRonnie
Copy link
Member

If we shorten DEBUG I would prefer DBUG

@MetRonnie MetRonnie mentioned this issue Aug 17, 2022
8 tasks
@oliver-sanders oliver-sanders mentioned this issue Sep 29, 2022
8 tasks
@oliver-sanders oliver-sanders modified the milestones: cylc-8.1.0, cylc-8.2.0 Oct 18, 2022
@MetRonnie MetRonnie removed their assignment Nov 11, 2022
@oliver-sanders oliver-sanders modified the milestones: cylc-8.2.0, cylc-8.x Jun 29, 2023
@oliver-sanders
Copy link
Member

This pattern of log messages is somewhat confusing:

INFO - [1/a:waiting] => waiting(queued)
INFO - [1/a:waiting(queued)] => waiting
INFO - [1/a:waiting] => preparing

This means that the task went through the queue and was then prepared for submission. But the => waiting makes it look like it went back to the unqueued state. This is simply reflecting what the internal implementation does.

But we should be able to go straight from waiting(queued) to preparing without going back to waiting in the mean time.

@hjoliver
Copy link
Member Author

hjoliver commented Jan 23, 2024

Agreed, that's annoying, but it might be somewhat tricky to solve because tasks don't necessarily go direct from queued to preparing. A task waiting on other prerequisites, or xtriggers, may not be queued.

It might be easier, and still an improvement to skip the initial plain waiting message:

INFO - [1/a:waiting(queued)] => waiting
INFO - [1/a:waiting] => preparing

... because (as I recall) all tasks go to queued at first, and are then immediately released to waiting if not currently queue-limited.

@hjoliver
Copy link
Member Author

[from offline chat] it would be really nice to have at least one "verbose" logging level between INFO and DEBUG.

[This was already suggested somewhere above on this issue by @MetRonnie so here's my take on it].

Note 3rd party package: https://pypi.org/project/verboselogs and https://github.com/xolox/python-verboselogs It is a trivial extension of Python logging, which we could easily do ourselves to avoid the dependency.

Then I'd suggest, something like this:

  • INFO (for users): minimal trace of normal task events and state changes
  • VERBOSE (for users): plus additional informative details such as tasks entering and exiting the scheduler task pool, "health" messages detailing next polling times, etc.
  • DEBUG (for developers): plus the extreme verbosity, line numbers, and tracebacks, needed for literally debugging the scheduler (which I find pretty painful to read under most circumstances - it shouldn't be needed for just following events in more detail)
  • (WARNING and CRITICAL as now)

@hjoliver
Copy link
Member Author

When we get a chance, this issue needs reviewing:

  • there's a lot of suggestions for log changes
  • some changes have been made already
  • move any remaining good suggestions out to self-contained issues and close this as superseded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants