cylc message severity levels #2505

ColemanTom · 2017-12-07T06:22:02Z

Hi,

I thought it would be good for cylc message to follow the standard syslog severity levels. At the moment it appears to allow NORMAL, WARNING and CRITICAL. The standard syslog levels are: DEBUG, INFO, NOTICE, WARNING, ERR, CRIT, ALERT, EMERG.

See: https://docs.python.org/3/library/syslog.html and https://en.wikipedia.org/wiki/Syslog

matthewrmshin · 2017-12-07T07:16:50Z

Python's logging module doesn't do all these either, the last time I looked. What are we trying to support that requires all these levels?

hjoliver · 2017-12-07T07:21:07Z

Note we have a "CUSTOM" level too, and this functionality overlaps with event-handling somewhat in Cylc.

I'm of two minds about this. I suppose we could allow additional levels that could be used with custom messages in user job scripting, just for logging - user or site-defined meaning. On the other hand, Matt makes a good point.

matthewrmshin · 2017-12-07T07:51:05Z

Not really against, but would be interested to understand the requirements here.

ColemanTom · 2017-12-07T22:45:53Z

Fair point that I didn't really justify anything. I listed all the severity levels, but yes, you are probably correct that it would be overkill. I had looked at the Python stdlib syslog module rather than the logging module, so I thought all were included. I'm not entirely sure how CUSTOM works, so I can't comment on it. The reasoning behind the request isn't written above. Basically, I think having a bit more control over the log levels is useful.

For example, with Python's logging module, you can tell it to only print messages at or above a specific level. This would allow people, during development, to put in a bunch of extra log information (e.g. log_debug), but have it turned off by configuration in operations to avoid polluting log files. This allows a smoother transition to operations and, if set up properly, would allow people to do an edit run on a failure to turn the debug messages back on to help figure out a problem.
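The threshold behaviour described above can be sketched with Python's standard logging module. This is only an illustration of the idea, not Cylc code; the logger name `task_script`, the `run` helper and the message texts are all hypothetical.

```python
import io
import logging


def run(threshold: int) -> str:
    """Emit one message per level and return whatever passed the threshold."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    logger = logging.getLogger("task_script")  # hypothetical logger name
    logger.handlers.clear()   # start from a clean logger on every call
    logger.propagate = False  # keep output out of the root logger
    logger.addHandler(handler)
    logger.setLevel(threshold)

    logger.debug("intermediate file written")
    logger.info("conversion finished")
    logger.warning("retrying after a transient error")
    return stream.getvalue()


operations_output = run(logging.INFO)    # debug messages are suppressed
development_output = run(logging.DEBUG)  # debug messages are shown
```

Only the threshold changes between the two calls; the logging statements in the script stay the same, which is the property being asked for here.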

The other aspect of this would relate to the alerting downstream. I don't know the details exactly, but I do know cylc is being configured to work with message brokers to deliver messages to alerting and monitoring systems. The granularity in levels would provide a direct link to the alerting mechanism to help prioritise resolution (when combined with some priority ranking of the system in the organisational context). Perhaps this is already figured out though and I am going down a weird path? But, for example, say a task is running, it is doing some data format conversion on lustre (netcdf to/from grib2 for example), and you find evidence that the file is corrupted. Perhaps that should be raised as an emergency level requiring immediate escalation to 2nd level support rather than first level trying to triage it because there is most likely something wrong with one or more of the lustre OSTs.

tl;dr: the real idea is more along the lines of:

- It would be nice to have a couple more levels, such as DEBUG, so that you can have cylc message -p "DEBUG" ... peppered through scripts, with some configuration setting, accessible via the edit run interface, that lets you turn them off in operations but on when trying to resolve a previously unseen problem.
- I imagine that more levels may provide better granularity for triage support and for how 1st level support should act in the case of failures (but this may already be sorted out via however the message broker integration is being done).

Does the above make a bit more sense? Sorry for the brevity and for not fully fleshing out my thoughts initially.

ivorblockley · 2018-02-01T04:41:05Z

To weigh in on the use-cases:

I think differentiating between error and fatal/critical diagnostic/alerting messages could be useful. For example, it is conceivable for a task to encounter real errors when invoking commands, interfacing with databases, etc. Sometimes this might cause the suite's progress to halt (let's call this scenario a fatal error). In other cases the task may have fall-back logic programmed in to work around the errors (e.g. ... if the system-wide open file-handle limit is hit, wait a while and retry on the assumption that this condition is sporadic). These could be reported as errors (warranting serious and timely investigation) even though they did not cause a critical error for the task or suite (and hence operations).

Supporting a debug severity level would also be nice for reasons Tom has mentioned.

There is a discussion about severity levels and what distinguishes them at https://stackoverflow.com/questions/2031163/when-to-use-the-different-log-levels which I found informative. Ultimately I think this issue comes down to determining if these distinguishing characteristics are useful to downstream applications/customers/operators. I think there is a case, although out of the standard syslog severity levels (IETF RFC5424), I have to admit I don't see a need for ALERT.

A TRACE level to support finer-grained debugging could be useful (this goes beyond the syslog convention).

matthewrmshin · 2018-02-01T08:53:33Z

#386 is related.

Annoyingly, Python's logging module maps severity levels from 10 (debug) to 50 (critical) - the number increases with severity level, whereas syslog maps the main severity levels from 7 (debug) to 2 (critical) - the number decreases with severity level.
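To make the opposite orderings concrete, here is a small sketch comparing the two numbering schemes. The syslog numbers are hard-coded from RFC 5424 rather than read from the `syslog` module so the snippet runs on any platform; the dictionary names are just for illustration. syslog's NOTICE (5), ALERT (1) and EMERG (0) are left out because Python's logging module has no built-in equivalents.

```python
import logging

# Python logging: the number increases with severity.
PYTHON_LEVELS = {
    "DEBUG": logging.DEBUG,        # 10
    "INFO": logging.INFO,          # 20
    "WARNING": logging.WARNING,    # 30
    "ERROR": logging.ERROR,        # 40
    "CRITICAL": logging.CRITICAL,  # 50
}

# syslog (RFC 5424) severities: the number decreases with severity.
SYSLOG_SEVERITIES = {
    "DEBUG": 7,
    "INFO": 6,
    "WARNING": 4,
    "ERROR": 3,     # syslog name: ERR
    "CRITICAL": 2,  # syslog name: CRIT
}

# Sorting each mapping by its numeric value gives opposite orderings.
by_python = sorted(PYTHON_LEVELS, key=PYTHON_LEVELS.get)
by_syslog = sorted(SYSLOG_SEVERITIES, key=SYSLOG_SEVERITIES.get)
```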

What we can do... Pick either logging or syslog as a basis. (The former is more likely, given that it is already used in the logic.) Modify cylc message to allow any severity level. If the specified level is recognised, the reporting system will respect the level in the normal way. Otherwise, the level is considered custom - and the reporting system will act according to any custom event handlers (but can probably default to e.g. logging.INFO).
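A minimal sketch of that proposal, assuming Python's logging levels as the basis. The function name `resolve_severity` and the constants are hypothetical and are not taken from the Cylc code base; the fallback to `logging.INFO` follows the default suggested above.

```python
import logging
from typing import Tuple

# Severity names recognised by the reporting system (assumed set).
RECOGNISED_LEVELS = {
    name: getattr(logging, name)
    for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")
}

# Assumed default for any level name that is not recognised.
DEFAULT_CUSTOM_LEVEL = logging.INFO


def resolve_severity(name: str) -> Tuple[int, bool]:
    """Return (numeric level, is_custom) for a severity name.

    Recognised names map to their logging level. Anything else is treated
    as a custom severity, to be passed to custom event handlers and logged
    at the default level.
    """
    key = name.upper()
    if key in RECOGNISED_LEVELS:
        return RECOGNISED_LEVELS[key], False
    return DEFAULT_CUSTOM_LEVEL, True
```

For example, `resolve_severity("warning")` yields the WARNING level with the custom flag unset, while an unrecognised name such as `"CUSTOM"` falls back to INFO with the flag set.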

matthewrmshin · 2018-02-22T15:15:06Z

#2582 should solve the cylc message part of this issue.

Still need to figure out the following:

- How to deal with logging levels on the suite side. I think we need to rationalise how we configure logging for the running suite. My normal instinct is to introduce a setting to configure the logging level of the suite (as opposed to having the verbose and debug flags). We should also consider whether we need to duplicate log entries in both log/suite/log and log/suite/err (see "rationalize use of suite stdout, stderr, and the log", #386). Done by "Improve logging", #2781.
- A job failure currently has a CRITICAL severity. Should this be an ERROR instead? (And should a job failure be a WARNING for tasks that have retries lined up?) Or perhaps this should be configurable per task? (New runtime config item: "priority"? #2289?)

hjoliver · 2018-02-25T06:17:16Z

@matthewrmshin - responding to the previous comments:

- The original intention for the debug flag was to print Python tracebacks, and otherwise just a simple error message for users who should not be expected to understand Python tracebacks. Not sure that's the best approach though, not least because it may be inconvenient to re-run a failed suite in order to get a traceback. Aside from debug, a multi-level verbosity flag seems sensible to me. Also, I'd be happy to not duplicate suite err messages in the suite log (we don't for job.err after all).
- This is a tricky one! A job failure is typically critical for the job, but not the suite. Maybe we need two categories of CRITICAL (one for job, one for suite). But as you note, a job failure when there are retries lined up is presumably less critical. I'd prefer not to make it configurable unless we really have to, as I doubt many would resort to that. This might be a good one to discuss in June...

matthewrmshin · 2018-12-17T15:07:28Z

With #2582 and #2781, we should now be aligned with Python's logging module.

Things left to do before closing this issue:

- Agree on the default logging level of a failed job with and without retries lined up.
  - CRITICAL - as now.
  - ERROR, or WARNING if job is expected to fail from time to time (e.g. has follow-on retries, or where failed output is a prerequisite of a downstream task).
- Fully expose suite logging via configuration. (Requires Python 3 for easy implementation.)
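For discussion purposes, the second option in the list above could look something like the following. This is a hypothetical policy sketch, not existing Cylc behaviour; the function name and both parameters are made up for illustration. The first option ("as now") would simply return `logging.CRITICAL` in every case.

```python
import logging


def failed_job_level(retries_remaining: int, failure_expected: bool = False) -> int:
    """Pick a logging level for a job failure (hypothetical policy).

    WARNING if the job is expected to fail from time to time, i.e. it has
    retries lined up or its failed output is a prerequisite of a downstream
    task; ERROR otherwise.
    """
    if retries_remaining > 0 or failure_expected:
        return logging.WARNING
    return logging.ERROR
```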

matthewrmshin · 2019-03-11T21:34:27Z

Tentatively re-targeting this for Cylc 8. Now that the code is in Python 3, we can implement configurable logging.

oliver-sanders · 2020-12-11T09:59:53Z

#3647

oliver-sanders · 2023-05-04T14:47:33Z

#2582 and #2781 solved the cylc message side of things (which covers the OP).

> Agree on the default logging level of a failed job with and without retries lined up.

This isn't related to cylc message, but is now covered by #3647.

> Fully expose suite logging via configuration.

We have yet to encounter a use case for this. However, with Cylc 8 it is now possible to add your own log handlers via a Cylc configuration plugin. If there's any interest in this, let us know.

matthewrmshin added this to the later milestone Feb 1, 2018

matthewrmshin mentioned this issue Feb 1, 2018

cylc message should have submit number info #2528

Closed

matthewrmshin mentioned this issue Feb 21, 2018

Improve comm #2582

Merged

matthewrmshin self-assigned this Feb 21, 2018

matthewrmshin modified the milestones: later, next release, soon Feb 21, 2018

matthewrmshin mentioned this issue Mar 28, 2018

cat-log: remote cylc sub-command #2503

Merged

matthewrmshin mentioned this issue Oct 3, 2018

Improve logging #2781

Merged

hjoliver mentioned this issue Dec 17, 2018

Port cylc to Python 3 #1874

Closed

6 tasks

matthewrmshin modified the milestones: soon, cylc-8.0.0 Mar 11, 2019

This was referenced Sep 10, 2019

Add logging cylc/cylc-uiserver#73

Merged

Unify global and local configs. #3348

Closed

oliver-sanders modified the milestones: cylc-8.0.0, some-day Dec 11, 2020

oliver-sanders closed this as completed May 4, 2023

oliver-sanders removed this from the some-day milestone May 4, 2023

matthewrmshin removed their assignment May 4, 2023
