Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make exception handling explicit #5202

Closed

Conversation

holmanb
Copy link
Member

@holmanb holmanb commented Apr 23, 2024

Impetus

Following the transition to the new exit code (2) for easier user introspection of cloud-init, there was a request to audit current callsites for error conditions that are logged at too high or low of a logging level. This PR starts that effort, but only scratches the surface. Much more work needs to be done auditing log levels.

Objective

This PR is the result of an audit of exceptions of type Exception. This audit had a simple goal: prevent cloud-init from swallowing exceptions. We don't want exceptions such as AttributeError, IndexError, KeyError, TypeError, ZeroDivisionError to be ignored when the exception that we expect to handle is a failed filesystem operation - this would mask important errors at runtime.

Rules

To complete the objective, a set of rules was defined and implemented. They are described in one of the commit messages:

Require a minimum of one of the following in each Exception handler:

  • Log a warning at level WARN or higher containing the exception type and message
  • Re-raise the exception without redefining its type.

By following these rules, we can expect that every Exception handler will either issue its own warning or raise to another exception which will raise a warning. By doing this, silently ignoring important classes of exceptions will not happen.

Any log of WARN would suffice for ensuring that these exceptions are logged, however standardizing this message across the codebase simplifies code auditing and makes it clear to developers when they see this message what has happened.

Methodology

Inspect the code for sources of expected exceptions and add them above the Exception handler, and adding a proper WARN log to the existing Exception exception. This allows us to flesh out any unexpected changes easily in integration testing without risking unintentional logic changes.

Reviewer notes

I recommend reading the individual commit messages while reviewing. This PR carries some risk of changing cloud-init logic, but it attempts to mitigate the risk by retaining Exception (in most cases) while making all Exception handling into a recoverable error. I believe that the risk that this poses will (after thoroughly reviewed) be outweighed by the current risk we have of unintentionally silencing unexpected exceptions.

A few things to keep in mind while reviewing:

  1. subp.subp() throws OSError
  2. OSError is thrown by many filesystem and operating system operations. It is used liberally in this PR.
  3. I don't like the repetitive LOG.warning("Unhandled..."). Ideas are welcome for better messaging and maybe a helper function would make sense - maybe something in log.py?
  4. Perhaps the rules above should probably go in a dev doc?

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

@holmanb holmanb force-pushed the holmanb/stricter-exception-handling branch 2 times, most recently from 7cee29f to 6a845c4 Compare April 23, 2024 03:04
@holmanb holmanb changed the title Stricter exception handling Make exception handling explicit Apr 23, 2024
@holmanb holmanb force-pushed the holmanb/stricter-exception-handling branch 2 times, most recently from dfaeb56 to 0888048 Compare April 23, 2024 18:44
@holmanb holmanb marked this pull request as draft April 26, 2024 00:50
@holmanb holmanb force-pushed the holmanb/stricter-exception-handling branch from 0888048 to 12ffead Compare April 26, 2024 01:34
@holmanb holmanb marked this pull request as ready for review April 26, 2024 01:36
@holmanb holmanb force-pushed the holmanb/stricter-exception-handling branch 2 times, most recently from fe33823 to 8e41a0e Compare April 27, 2024 00:13
@holmanb holmanb force-pushed the holmanb/stricter-exception-handling branch 4 times, most recently from 15ac27c to 2864b51 Compare April 30, 2024 21:48
@TheRealFalcon TheRealFalcon self-assigned this May 1, 2024
Copy link
Member

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spearheading this effort! I left some comments inline.

Beyond those, I have a few more broad concerns.

Your LOG.warning("Unhandled exception: %s", e) rule is a bit too rigid to me. There are a bunch of places that had something like:

try:
    do_something()
except Exception as e:
    LOG.warning("oh no! %s", e)
    recover()

that turned into something like:

try:
    do_something
except ValueError as e:
    LOG.warning("oh no! %s", e)
    recover()
except Exception as e:
    LOG.warning("Unhandled exception: %s", e)
    LOG.warning("oh no! %s", e)
    recover()

I'm not sure how this is an improvement. If we're logging the exception with the old message, then why is the broad exception a problem? Sure, best practices says you should only catch exceptions as narrowly as you need to, but there are certainly use cases for catching Exception, and as long as we're not actively swallowing exceptions, I don't see the problem...especially if the exception handling is sitting at the top of a giant call stack. In those cases it makes sense to be overly broad in our handling.

To take an example from a recent change that I approved:

except Exception as e:

I saw the Exception here, thought about it for a second, and then decided that it was the best course of action. For two reasons:

  1. We're logging the exception, so any unexpected issues with still get logged.
  2. I quite literally don't care what happens for the purposes of setting up cloud-init. The netplan API could raise a YourServerIsLiterallyOnFireException. We'll log it, bail, and then do our own v2 rendering.

In this case, what benefit is there to narrowing the exception type and having duplicate exception handlings with one having an additional warning? Is narrowing the exceptions worth the risk that we missed one or that a new one may pop up in the future?

My other major concern is logging new warnings that will cause cloud-init to exit 2. There are a number of places (random example) where we'll get a new warning where we didn't before. In that example, it even seems there was some effort in ensuring the exception can be silenced when we want it to.

Also for future reference, the bigger a PR gets, the more that is going to fall through the cracks during review. For PRs this big, I think it'd be best to move anything not strictly part of the goal to separate PRs. There's a few refactoring-type commits (json decoding, ModuleDetails) that don't really seem to belong here. Now that they're reviewed, I'm not asking for them to be moved...more just a future thing.

cloudinit/analyze/show.py Show resolved Hide resolved
@@ -149,6 +151,7 @@ def remove_artifacts(remove_logs, remove_seed=False, remove_config=None):
try:
runparts(settings.CLEAN_RUNPARTS_DIR)
except Exception as e:
LOG.warning("Unhandled exception: %s", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree with these unhandled exception rules when it comes to CLI commands. The exception is being printed/handled using the error() call below this one, and does it in a more graceful way. I think a log message on the CLI like the following will just add more noise:

2024-05-01 18:10:39,200 - clean.py[WARNING]: Unhandled exception: ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 see comment below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No response needed, but leaving this open because I have changes to make here.

@@ -745,8 +745,12 @@ def status_wrapper(name, args):
else:
try:
status = json.loads(util.load_text_file(status_path))
except Exception:
pass
except OSError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually want to warn if a file isn't found here?

Regardless, I would probably downgrade these all to info.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this doesn't happen, then either a previous stage failed to create a status.json, or something removed/modified it/corrupted it, or a cloud-init stage is getting manually run out of order(basically the same thing as this PR).

All of these sound like "unexpected / bad things" to me - wouldn't a warning make sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheRealFalcon waiting for response on this comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that other PR already contains a warning, right? I guess it seems like fairly large shift from "it doesn't matter if this fails because we'll create it anyway" to "we need to warn you about this even though it has no bearing on the resulting artifact".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair, I'll remove it.

cloudinit/config/cc_mounts.py Outdated Show resolved Hide resolved
cloudinit/config/cc_mounts.py Outdated Show resolved Hide resolved
@@ -32,7 +32,7 @@ def get_metadata(
connectivity_url_data={"url": url},
):
# Fetch the metadata
v1 = read_metadata(url, timeout, retries, sec_between, agent)
v1 = _read_metadata(url, timeout, retries, sec_between, agent)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I prefer to convert from bytes to strings ASAP and I'm not sure why we needed to rename the function. Neither change is bad but the changes in the file feel superfluous to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to convert from bytes to strings ASAP

generally no objections to this philosophy, however in cases like this co-locating the .decode() with the json.loads() could make the exception handling simpler by including the exception handler that's already required for the json.loads() with the UnicodeDecodeError handler. That said, we aren't actually handling this exception here, so that's not an argument for keeping it this way. I can drop the changes to this file if you want.

Copy link
Member Author

@holmanb holmanb May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can drop the changes to this file if you want.

@TheRealFalcon ping - that whole commit might be unwanted given the "convert from bytes asap" philosophy. Thoughts on how to proceed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a nit, so it's not a strong opinion. I'm fine keeping it this way.

cloudinit/util.py Outdated Show resolved Hide resolved
cloudinit/util.py Outdated Show resolved Hide resolved
cloudinit/stages.py Outdated Show resolved Hide resolved
cloudinit/stages.py Outdated Show resolved Hide resolved
@holmanb holmanb force-pushed the holmanb/stricter-exception-handling branch from 4a24b71 to 3376535 Compare May 3, 2024 23:18
@holmanb
Copy link
Member Author

holmanb commented May 4, 2024

Thanks for the review @TheRealFalcon! Great feedback all around.

Your LOG.warning("Unhandled exception: %s", e) rule is a bit too rigid to me. There are a bunch of places that had something like:

That's fair. It made auditing the code easier to implement this way, but we don't need to be so strict. I've updated the rules in the description above to say:

Require a minimum of one of the following in each Exception handler:

  • Log a warning at level WARN or higher containing the exception type and message
  • Re-raise the exception without redefine its type.

I want to keep the rule list short so that it is easy to remember, while being obvious and clear enough that it doesn't require much interpretation when trying to write or review code. Let me know what you think, suggestions are welcome.

My other major concern is logging new warnings that will cause cloud-init to exit 2. There are a number of places (random example) where we'll get a new warning where we didn't before.

I have some concerns about that too. In the past we've allowed some behavior changes that are bug fixes. I would consider properly warning (and exit 2) about a previously unreported error condition to be a bug fix. There is risk of big unknowns that crop up from this change, but I'm happy to be slow and meticulous with this PR if that will help us build confidence about its direction and consequences.

I'm not sure I agree with these unhandled exception rules when it comes to CLI commands. The exception is being printed/handled using the error() call below this one, and does it in a more graceful way. I think a log message on the CLI like the following will just add more noise:

2024-05-01 18:10:39,200 - clean.py[WARNING]: Unhandled exception: ...

Agreed, applying the rule feels clunky here and doesn't improve the end user's experience. Unfortunately, writing to stderr wouldn't be an appropriate way to log unhandled warnings outside of CLI commands. The short/obvious/clear thing breaks down when trying to draw a box around where cli commands start and end, primarily for the following two reasons:

  1. cloud-init imports code back from cloudinit.cmd into the rest of the codebase (read_cfg_paths)
  2. a big chunk of cloud-init's core runtime logic is implemented in cloudinit.cmd.main

These are just symptoms of the codebase's haphazardous structure. Both of these points are undesirable for several reasons, but while 1) would be trivial to fix 2) would probably take some forethought and planning. I have some thoughts on this topic for how we can better structure the core codebase and how this would benefit the project in different ways, but I'll save that discussion for later.

For now maybe we add a rule that says something like: > - for code in cloudinit/cmd/ (excluding main.py and __init__.py): Print a message to stderr containing the exception type and message.?

Maybe someday if we feel the urge to address 1) and 2) we can change the rule to say > - Code in cloudinit/cmd/: Print a message to stderr containing the exception type and message.

Thoughts?

@TheRealFalcon
Copy link
Member

From a high level, I think everything you said makes sense. Thanks for laying it all out. I just want to make sure I understand. By Log a warning at level WARN or higher containing the exception type and message, you mean that we don't need a separate handler, correct? Something like this could be ok?

try:
    do_something()
except Exception as e:
    LOG.warning("oh no! %s", e)
    recover()

If so, +1. If not, let me know what I'm missing.

If we want to have hard rules around this behavior, I think:

  1. Let's document it. Maybe our docs could use a section for such coding guidelines? (doesn't have to be this PR)
  2. I also want to be flexible where needed. I think as long as we have a comment explaining why such a rule doesn't apply (and a reviewer agrees), we should be ok with occasionally breaking the rules.

@holmanb
Copy link
Member Author

holmanb commented May 8, 2024

From a high level, I think everything you said makes sense. Thanks for laying it all out. I just want to make sure I understand. By Log a warning at level WARN or higher containing the exception type and message, you mean that we don't need a separate handler, correct? Something like this could be ok?

try:
    do_something()
except Exception as e:
    LOG.warning("oh no! %s", e)
    recover()

If so, +1. If not, let me know what I'm missing.

Yes, that would be just fine. I think we're on the same page about this.

If we want to have hard rules around this behavior, I think:

1. Let's document it. Maybe our docs could use a section for such coding guidelines? (doesn't have to be this PR)

+1 Agreed. I can do that in a follow-up PR

2. I also want to be flexible where needed. I think as long as we have a comment explaining why such a rule doesn't apply (and a reviewer agrees), we should be ok with occasionally breaking the rules.

+1 I think the rules are best followed in most cases, but exceptions should be allowable. Even in this PR I break this rule in hotplug_hook.py. In that case it follows the spirit of the rule by eventually raising something if an unknown exception is caught, and it's probably not safe to reasonably assume that we know all possible exceptions that might be raised from that call site. Even if we got the exceptions right in this PR, future changes to any code called in that site might break things.

I do think that this PR a larger hurdle in risk of behavior changes than future code which follows these rules will have. That is because in future PRs, we won't have to worry so much about existing runtime exceptions which silently behave in a way that is currently expected by the user since any of those that exist in theory should get fleshed out by this PR.

@holmanb
Copy link
Member Author

holmanb commented May 14, 2024

Pre-merge requirements:

  1. Eliminate unnecessary LOG.warning("Unhandled exception",) -> @holmanb
  2. Wait until after the 24.2 release conflicts again -> @holmanb

Follow-up:

  1. Document the rules

Copy link

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label May 29, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label May 29, 2024
@holmanb
Copy link
Member Author

holmanb commented May 29, 2024

@TheRealFalcon I forgot to mention that I'd also be okay with patching this out of older releases to mitigate the risk of this change. If we do see new bugs shake out of cloud-init in Oracular as a result of this, the resulting bug fixes will still benefit old releases.

Copy link
Member

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made another pass here. I know you still have the Eliminate unnecessary LOG.warning("Unhandled exception",) piece left to do but figured this was looking at anyway.

I forgot to mention that I'd also be okay with patching this out of older releases to mitigate the risk of this change

I think that's overkill for the entire PR. These changes are spread far enough that I can foresee hitting quilt conflicts all the time if we patched it all out. That said, it might be worth separating the risky ones and patching those. In this review, I added a Rule exception or quilt patch. comment to lines where I think we either need to make an exception to the rule or add a quilt patch for existing releases for that particular line.

cloudinit/config/cc_growpart.py Outdated Show resolved Hide resolved
cloudinit/config/cc_growpart.py Outdated Show resolved Hide resolved
cloudinit/cmd/main.py Outdated Show resolved Hide resolved
LOG.debug(
"User %s is not correctly configured for ssh_import_id", user
)
util.logexc(LOG, "Unhandled configuration for user %s", user)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid warning, but we should patch it as to not accidentally break deployments that have an invalid import id but otherwise work.

cloudinit/net/__init__.py Outdated Show resolved Hide resolved
except IndexError:
continue
except Exception as e:
LOG.warning("Unhandled exception: %s", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rule exception or quilt patch.

except IndexError:
continue
except Exception as e:
LOG.warning("Unhandled exception: %s", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rule exception or quilt patch.

except ValueError:
ts = "??"
except Exception as e:
LOG.warning("Unhandled exception: %s", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rule exception or quilt patch.

except Exception:
pass
except Exception as e:
LOG.warning("Unhandled exception: %s", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quilt patch

@@ -100,24 +100,6 @@ def test_handler_creates_and_runs_bootcmd_script_with_instance_id(self):
my_id + " iid-datasource-none\n", util.load_text_file(out_file)
)

def test_handler_runs_bootcmd_script_with_error(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this no longer needed? The log is gone but we should still get the exception.

Copy link

github-actions bot commented Jul 3, 2024

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@holmanb holmanb added the wip Work in progress, do not land label Jul 31, 2024
@holmanb
Copy link
Member Author

holmanb commented Jul 31, 2024

setting to wip label, blocked on #5573

Copy link

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Aug 15, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label Aug 15, 2024
Copy link

github-actions bot commented Sep 4, 2024

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Sep 4, 2024
@github-actions github-actions bot closed this Sep 12, 2024
@holmanb holmanb reopened this Sep 12, 2024
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label Sep 12, 2024
From this helper's docstring:

    Ubuntu cloud images previously included a 'eth0.cfg' that had
    hard coded content.  That file would interfere with the rendered
    configuration if it was present.

This code is no longer relevant due to the following:

1) Ubuntu uses netplan
2) This appears like an Ubuntu-specific workaround
3) Debian cloud images use netplan
Since Python 3.6, json.loads() is capable of receiving arguments of type
str, bytes, and bytearray. Cloud-init's code historically manually handled type
conversion. This boilerplate is no longer necessary.

Add docstring and typing to util.load_json().
Cloud-init's codebase makes extensive use of exception handlers which
catch Exception and then do something based on the assumed exception
type. This is bad practice and makes it possible for exceptions to be
unexpectedly ignored. To remedy this, this commit updates cloud-init to
make `Exception` always be an unknown exception which therefore should
always produce a log of level `WARN` or higher.

Require a minimum of one of the following in each Exception handler:

- LOG.warning("Unhandled exception: %s", e)
- util.logexc() - with a log level of WARN or higher
- re-raise the exception without setting exception type when
  re-raising[1]

Make cloud-init's code adhere to the above set of rules with the
following changes:

- Limit the scope where Exception is handled to only unexpected error paths.
- Add new handlers for more specific exception types.
- Add warning logs or increase log level to WARN in some cases.
- Add typing where exceptions are returned.
- Replace various Exception handlers with ProcessExecutionError.
- Remove handling that does nothing.
- Remove unused code which handles Exception.

[1] This would likely reduce the scope of the exception and make it
    possible to mask an Exception.
In Python 3.3+, IOError is an alias of OSError[1]. Some call sites use
one or the other or both. Standardize on the OSError.

[1] https://docs.python.org/3/library/exceptions.html#IOError
Copy link

github-actions bot commented Oct 5, 2024

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Oct 5, 2024
@holmanb holmanb force-pushed the holmanb/stricter-exception-handling branch from d68eb04 to 1786f9d Compare October 7, 2024 14:08
@holmanb holmanb removed the stale-pr Pull request is stale; will be auto-closed soon label Oct 7, 2024
Copy link

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr Pull request is stale; will be auto-closed soon label Oct 22, 2024
@github-actions github-actions bot closed this Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale-pr Pull request is stale; will be auto-closed soon wip Work in progress, do not land
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants