Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(store): Explicitly time the client out if the connection hung #4039

Merged
merged 1 commit into from
Oct 13, 2023

Conversation

whisperity
Copy link
Contributor

Related to #3672, but does not implement it.

This is a partial solution that attempts to gracefully kill an ongoing CodeChecker store client if the reply to the massStoreRun() API call does not arrive in a timely manner. After reaching the timeout, the client throws an exception onto itself and presents an understandable error message about the situation. This implementation is temporary; thus, the timeout is a hard-coded limit.

The client still exits with a failure in these cases because there are no guarantees (and no deterministic ways to check) whether the timeout happened because the TCP connection failed and stuck (which is observed in a lot of cases, unfortunately) or whether the server indeed took more time than allotted to process the stored reports. A successful exit signal would result in CI scripts thinking the store was successful and then potentially breaking later, unreliably, on the fact that the stored run does not, in fact, exist or is in an outdated state, etc.

@whisperity whisperity added CLI 💻 Related to the command-line interface, such as the cmd, store, etc. commands usability 👍 Usability-related features labels Oct 11, 2023
@whisperity whisperity added this to the release 6.23.0 milestone Oct 11, 2023
@whisperity whisperity requested a review from bruntib as a code owner October 11, 2023 12:11
@whisperity whisperity requested a review from cservakt October 11, 2023 12:12
@whisperity
Copy link
Contributor Author

Example for the error message, as rendered to the terminal:

[INFO 2023-10-11 14:10] - Compressing report zip file...
[INFO 2023-10-11 14:10] - Compressing report zip file done (386EiB / 4TiB).
[INFO 2023-10-11 14:10] - Storing results to the server...
[WARNING 2023-10-11 14:10] - Timeout watchdog hit 3600 seconds (1:00:00)
Traceback (most recent call last):
  File "/home/whisperity/Ericsson/CodeChecker/build/CodeChecker/lib/python3/codechecker_client/cmd/store.py", line 872, in main
    client.massStoreRun(args.name,
  File "/home/whisperity/Ericsson/CodeChecker/build/CodeChecker/lib/python3/codechecker_client/thrift_call.py", line 45, in wrapper
    return func(*args, **kwargs)
  File "/home/whisperity/Ericsson/CodeChecker/venv_dev/lib/python3.10/site-packages/codechecker_api/codeCheckerDBAccess_v6/codeCheckerDBAccess.py", line 2325, in massStoreRun
    self.send_massStoreRun(runName, tag, version, zipfile, force, trimPathPrefixes, description)
  File "/home/whisperity/Ericsson/CodeChecker/venv_dev/lib/python3.10/site-packages/codechecker_api/codeCheckerDBAccess_v6/codeCheckerDBAccess.py", line 2340, in send_massStoreRun
    self._oprot.trans.flush()
  File "/home/whisperity/Ericsson/CodeChecker/venv_dev/lib/python3.10/site-packages/thrift/transport/THttpClient.py", line 184, in flush
    self.__http_response = self.__http.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "/home/whisperity/Ericsson/CodeChecker/build/CodeChecker/lib/python3/codechecker_client/cmd/store.py", line 760, in _signal_handler
    raise WatchdogError(timeout,
store.WatchdogError: Timeout watchdog hit 3600 seconds (1:00:00)
[ERROR 2023-10-11 14:10] - Timeout!
        The server's reply did not arrive after 3600 seconds (1:00:00) elapsed since the server-side processing began.

        This does *NOT* mean that there was an issue with the run you were storing!
        The server might still be processing the results...
        However, it is more likely that the server had already finished, but the client did not receive a response.
        Usually, this is caused by the underlying TCP connection failing to signal a low-level disconnect.
        Clients potentially hanging indefinitely in these scenarios is an unfortunate and known issue.
                See http://github.com/Ericsson/codechecker/issues/3672 for details!

        This error here is a temporary measure to ensure an infinite hang is replaced with a well-explained timeout.
        A more proper solution will be implemented in a subsequent version of CodeChecker.

@vodorok
Copy link
Contributor

vodorok commented Oct 11, 2023

I would prefer if we would not show a stacktrace to the user. It is a "planned" exceptional behaviour, the error message is enough.

@whisperity
Copy link
Contributor Author

whisperity commented Oct 11, 2023

I would prefer if we would not show a stacktrace to the user. It is a "planned" exceptional behaviour, the error message is enough.

@vodorok I'm not sure about that. Without the stack trace, we do not know whether the timeout was triggered when reading from the socket (a.k.a., where we expect this) or during the writing of the socket (where such could, at least theoretically, also happen, but where we do NOT expect the program to hang for so long).

The cardinal problem is the exception trigger is an asynchronous operation, so depending on where the normal execution was when the timeout triggered, it could have had killed it anywhere that is within the execution path for .massStoreRun().
It's sort of a shooting a cannon on a fly situation.

The example output message I triggered by putting an explicit and large enough timer.sleep() in the server-side handler (not part of the patch), just to simulate the situation. I also didn't wait for a real hour, I "photoshopped" the example message afterwards 😈

@vodorok
Copy link
Contributor

vodorok commented Oct 11, 2023

I checked the implementation in the meantime, and you only guard the rpc call with the watchdog.
In the end does the stacktrace matter when the watchdog triggers and shoots down the store command?

What I trying to say, that it will always be a server (or network) related issue, when the timeout triggers, and the user usually cannot do anything about it, apart from restarting the process, and hoping for the best.

But I might be missing some context here...

Copy link
Member

@dkrupp dkrupp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stack trace is a but ugly IMHO, but I understand why we need it.

LGTM.

@whisperity
Copy link
Contributor Author

The only difference which might be meaningful is whether the send() fails versus the recv() failing. Nevertheless, I will put the stack trace behind the debug verbosity mode and update the patch.

@vodorok
Copy link
Contributor

vodorok commented Oct 11, 2023

@whisperity Thanks that is a good compromise.

This is a partial solution that attempts to gracefully kill an ongoing
`CodeChecker store` client if the reply to the `massStoreRun()` API call
does not arrive in a timely manner. After reaching the timeout, the
client throws an exception onto itself and presents an understandable
error message about the situation. This implementation is temporary,
thus the timeout is a hard-coded limit.

The client still exits with a failure in these cases, because there are
no guarantees (and no deterministic ways to check) whether the timeout
happened because the TCP connection failed and stuck (which is observed
in a lot of cases, unfortunately) or whether the server indeed took more
time than allotted to process the stored reports. A successful exit
signal would result in CI scripts thinking the store was successfull and
then potentially breaking later, unreliably, on the fact that the stored
run does not in fact exist, or is in an outdated state, etc.
@whisperity whisperity force-pushed the feat/store/timeout-message branch from 3e06352 to 15af7d8 Compare October 12, 2023 15:20
@whisperity
Copy link
Contributor Author

@vodorok @dkrupp Meanwhile I came up with a better idea, where we still print just the two most relevant parts of the stack, but we do that outside of debug mode as well.

What do you think?

[WARNING 2023-10-12 17:16] - Timeout watchdog hit 5 seconds (0:00:05)
[INFO 2023-10-12 17:16] - Timeout was triggered during:
  File "/home/whisperity/Ericsson/CodeChecker/build/CodeChecker/lib/python3/codechecker_client/cmd/store.py", line 872, in main
    client.massStoreRun(args.name,

[INFO 2023-10-12 17:16] - Timeout interrupted this low-level operation:
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)

[ERROR 2023-10-12 17:16] - Timeout!
[... The rest of the explanation, at the bottom, stays the same as previously ...]

This ensures if users notify us about the problem we can clearly identify (I'd wager not many run the very verbose debug mode...) that it is indeed what we expected to happen (unfortunately), and give us a way out if we see something like _sock.send(b) instead, which MIGHT happen although has not (yet?) been observed in practice (I hope).

@whisperity whisperity requested review from dkrupp and removed request for bruntib, cservakt and dkrupp October 12, 2023 15:20
@dkrupp
Copy link
Member

dkrupp commented Oct 13, 2023

I think it is much better like this. It is ok to land.

@dkrupp dkrupp merged commit 0a3cc8b into Ericsson:master Oct 13, 2023
8 checks passed
@whisperity
Copy link
Contributor Author

whisperity commented Oct 17, 2023

@vodorok Could you please APPROVE this so it doesn't show as if you had not reviewed?


I think I have good news. After having run an analysis in which the (knowingly bogus!) checker I executed produced 178753 reports, I managed to test this in the real.

The store began today 10:58 Zulu, with the client emitting the following before going down to hang:

[INFO 2023-10-17 10:57] - Storing analysis results for run 'contour_v0.2.0.173_BenBlaise_readability-use-builtin-literals'
[...]
[INFO 2023-10-17 10:58] - Building report zip file (/tmp/tmpicjck0yo.zip) done.
[INFO 2023-10-17 10:58] - Compressing report zip file...
[INFO 2023-10-17 10:58] - Compressing report zip file done (174.5MiB / 5.4MiB).
[INFO 2023-10-17 10:58] - Storing results to the server...

The client is now hung on read(4, ...), FD 4 being the TCP socket, as expected. netstat on the client side does NOT show any active connection to the server, and lsof on the CodeChecker commands only show that it was a TCP socket, it does not show where it pointed to, like it does for active connections...

Compare a living SSH connection to what is seen on the store-side:

How a legit TCP connection in lsof looks like

How the hung TCP read looks like

(Note the /tmp/tmpicjck0yo.zip confirming that it is indeed the store we are discussing.)

In the server logs, there is no indication that the store of contour_... ever begun. The only indication is that a prior analysis of qtbase has killed something on the server, I have a partial stack trace from about 5 minutes earlier, but that's it:

[INFO 2023-10-17 10:52] - [qtbase_v6.2.0_BenBlaise_readability-use-builtin-literals] Unzip storage file...
[ERROR 2023-10-17 10:53] - Failed to extract received ZIP.
Traceback (most recent call last):
  File "/codechecker/lib/python3/codechecker_server/api/mass_store_run.py", line 88, in unzip

Nevertheless, and as expected, one hour after the store started executing, the timeout was triggered, explained, and the store command exited with an error status, as observed through CSA-TestBench:

[INFO 2023-10-17 10:58] - Storing results to the server...
[WARNING 2023-10-17 11:58] - Timeout watchdog hit 3600 seconds (1:00:00)
[INFO 2023-10-17 11:58] - Timeout was triggered during:
  File "[...]/codechecker/build/CodeChecker/lib/python3/codechecker_client/cmd/store.py", line 872, in main    client.massStoreRun(args.name,

[INFO 2023-10-17 11:58] - Timeout interrupted this low-level operation:
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)

[ERROR 2023-10-17 11:58] - Timeout!
	The server's reply did not arrive after 3600 seconds (1:00:00) elapsed since the server-side processing began.

	This does *NOT* mean that there was an issue with the run you were storing!
	The server might still be processing the results...
	However, it is more likely that the server had already finished, but the client did not receive a response.
	Usually, this is caused by the underlying TCP connection failing to signal a low-level disconnect.
	Clients potentially hanging indefinitely in these scenarios is an unfortunate and known issue.
		See http://github.com/Ericsson/codechecker/issues/3672 for details!

	This error here is a temporary measure to ensure an infinite hang is replaced with a well-explained timeout.
	A more proper solution will be implemented in a subsequent version of CodeChecker.
11:58:32 (INFO) <<--==--==--==--<<

11:58:32 (INFO) Execution finished, return code: 1

11:58:32 (INFO) [contour_v0.2.0.173_BenBlaise_readability-use-builtin-literals] Results stored.
11:58:35 (INFO) [contour] Postprocessed.
11:58:35 (INFO) [acid] Checking out project... 

Given the fact that this specific CI job is configured to not hard-error on store errors, it seems that we do the best of both worlds: store kills itself appropriately, but if the overarching job chooses to ignore it, it can continue appropriately.

@whisperity whisperity deleted the feat/store/timeout-message branch October 18, 2023 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLI 💻 Related to the command-line interface, such as the cmd, store, etc. commands usability 👍 Usability-related features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants