Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bubble up session loop's exceptions #736

Merged
merged 4 commits into from
Mar 16, 2023

Conversation

andsel
Copy link
Collaborator

@andsel andsel commented Mar 3, 2023

When a session loop the exception is simply logged and lost, so the error status is not reported to the broker.
This PR interrupt the session loop on an exception, saving the error in a local variable.

When the PostOffice terminates it re-throw the exception that was reached during session loop
execution, while in the meantime it notifies the user implemented API InterceptHandler.onSessionLoopError.
The firing of the newly added event type InterceptExceptionMessage containing the error stacktrace, is done asychronously from the launching thread (the session loop thread that terminating because of the error).

What it does?#

  • create a new interception hook in the Moquette interception framework.
  • modify the SessionEventLoop to be itself a thread (and not a Runnable task)
  • set an uncatchedExceptionHandler in the newly ``SessionEventLoop` thread to scatter the notification.
  • collect the error, if any, for each SessionEventLoop and log on termination of PostOffice.

@andsel andsel self-assigned this Mar 3, 2023
@andsel andsel force-pushed the fix/bubble_session_loop_exceptions branch from bb5a7db to 9ac604d Compare March 4, 2023 08:31
@andsel
Copy link
Collaborator Author

andsel commented Mar 4, 2023

Hi @hylkevds
please may I ask your review on this?

@andsel andsel requested a review from hylkevds March 4, 2023 08:57
When a session loop the execption is simply logged and lost, so the error status is not reported to the broker.
This PR interrupt the session loop on an exception, saving the error in a local variable.

When the PostOffice terminates it re-throw the first (and last) exception that was reached during session loop
execution.
@andsel andsel force-pushed the fix/bubble_session_loop_exceptions branch from 9ac604d to e8001d6 Compare March 4, 2023 17:49
@hylkevds
Copy link
Collaborator

hylkevds commented Mar 6, 2023

If I understand the changes correctly, when an exception happens in a task, this will store the exception and then end the SessionEventLoop that had the exception. Then, when the server exits, it will re-throw the exception.

So between the exception happening and the server being shut down, all of the Sessions that are assigned to the event loop that had the exception will hang?

That's both good and bad :)
On one hand, trying to continue when (for instance) the queue store is corrupt is a recipe for disaster, but on the other hand, we have to make sure we only throw exceptions for things that are really fatal.

And when something is fatal, like the queue store being corrupted, maybe we should shut down the entire server?

Maybe we should make two types of exception? Fatal ones that shut down the server to prevent further corruption, and non-fatal ones that just kill the session that had the exception?

hylkevds
hylkevds previously approved these changes Mar 6, 2023
Copy link
Collaborator

@hylkevds hylkevds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does what it is supposed to do. 👍

@andsel
Copy link
Collaborator Author

andsel commented Mar 7, 2023

This requirement come out to me during debugging of a failed test, where the expectation didn't match.
The problem was that an error happened in the SessionEventLoop and wans't logged because the log, in this case, is at info level.
So I thought that also logging at WARN or ERROR without creating an impact on the execution wouldn't help to be aware of which error happened and where it happened. So if for any reason, a SessionEventLoop dies in tests, the test should fail.
This is contrary to what Netty event loop does, where it prints the error and continue.
But on the case of session event loop, if something gone bad, it's useless to continue and just reporting on the logs the error.

@hylkevds
Copy link
Collaborator

hylkevds commented Mar 7, 2023

I agree that just continuing like nothing happened is a bad idea. The question is, is just killing the affected SessionEventLoop enough? I think, if an SessionEventLoop is killed, the entire server should exit.

@andsel
Copy link
Collaborator Author

andsel commented Mar 7, 2023

I think, if an SessionEventLoop is killed, the entire server should exit.

Yes maybe this proposal is more sensible. I think that the exception handler of every SessionEventLoop should have a reference to the Server and command an emergency shutdown, without consuming any session command queue, and avoiding to send DISCONNECT messages to all the live sessions.
WDYT?

@hylkevds
Copy link
Collaborator

hylkevds commented Mar 7, 2023

Yes, I agree.
We'll also need a hook, so that an embedding application can catch the state and decide what to do.

@andsel andsel force-pushed the fix/bubble_session_loop_exceptions branch from 41e9d8e to 331dc87 Compare March 13, 2023 19:09
@andsel andsel requested a review from hylkevds March 13, 2023 19:34
@andsel
Copy link
Collaborator Author

andsel commented Mar 13, 2023

Hi @hylkevds
I've pushed some changes, extending the interception framework to notify the library's users of errors in event loops.

@andsel andsel force-pushed the fix/bubble_session_loop_exceptions branch from e1357c5 to ac8ea0b Compare March 13, 2023 19:38
hylkevds
hylkevds previously approved these changes Mar 16, 2023
Copy link
Collaborator

@hylkevds hylkevds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@andsel andsel merged commit 2566a1b into moquette-io:main Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants