-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(log) log error on ssl handshake failure #28
Conversation
I think a TCP failure is fine. It does trigger me however on the absence of |
also wondering about the log line. It will be quite verbose in the logs on every failure. That said; having the wrong certificate is probably a hard config error, and should result in some log noise... |
@Tieske, thanks for the comments although I'm left wondering if you're in agreement on them or not so perhaps I need to try explain my point of view in a bit more detail. My, perhaps simplistic, view of health is as a binary status. An upstream is either healthy or not. Given this, a healthcheck can be succinctly defined in terms of what one considers healthy or not. This library, to my surprise, does not follow this definition. For the simple tcp checks this can be done - I can easily configure it to say that "I consider it healthy if I can make a tcp connection to a service within a given timeframe". But for http and https you need to explicitly define both healthy and unhealthy, with the effect being that this introduces a third status of "neither healthy nor unhealthy" if the check results in something the user has not explicitly defined. This effect can be seen both in the requirement to explicitly state the success/failure HTTP statuses (where the third state is any status code not explicitly defined in either of those). And, it can be seen in the fact that a TCP failure is not considered a HTTP/S failure. Under my, again simplistic(?), view of health, I am no longer (easily) able to define health as "Service returns HTTP 200 from a HTTPS request" and failure as not meeting that definition of health. Rather I need to define health (as per above), and also explicitly define every possible scenario under which I consider it a failure. I.e. if one cannot establish a HTTPS connection (ensure tcp_failures is set to something >0), and the full set of potential failure codes/HTTP statuses (500, 501, etc.). This goes against the principle of least astonishment for me at least. However, as mentioned, I guess my view could be somewhat simplistic and perhaps there are valid use cases for this 3rd health status. Thus I chose to leave this discussion as a side point initially and rather to focus on the fact that at a bare minimum I would expect to be able to reason from the logs as to what was going on with my healthchecks. In my scenario given my incorrect assumptions I had configured the checks as follows:
And here is an extract from the logs:
As you can see, this issue manifests itself over the course of several days. In the entire period, there is no indication that a healthcheck is not passing and at the point at which it ends, my upstream has been marked unhealthy and will never come back to a healthy status without manual intervention. It was only when I noticed that the TIMEOUT was incrementing over multiple days without being reset that I eventually started digging into this library to try understand why. Incidentally I had also wondered why I never saw logs relating to what HTTP status was being returned by the upstream and when I noticed the log lines relating to this in the codebase, I realised the code must have been exiting at an earlier stage. Now, I know that in my current case this was a configuration issue which yes, would make a lot of noise but I hope that you would agree that this is perfectly valid log noise. For another example that is perhaps more applicable to a non-configuration issue, please consider that this could also present in a live environment in the case of certificate failure. Regardless of the underlying reason I would definitely want to know that the healthcheck was Some final notes.
|
Also, I noticed this comment when browsing through the other open PRs #22 (review). I'd be happy to adjust my commit message to comply with https://github.com/Kong/kong/blob/master/CONTRIBUTING.md#commit-message-format (?) although I'm unclear as to what |
yes, please update the commit format; so a TCP failure is also an HTTP failure, if the check is HTTP(s) wondering how to deal with the timeouts ?? |
also can you add a section |
Yes
This is already the case. There is only one definition of success which results in all failure counters being reset.
My personal preference would be to have a single attribute, lua-resty-healthcheck/lib/resty/healthcheck.lua Lines 752 to 755 in 0250046
and lua-resty-healthcheck/lib/resty/healthcheck.lua Lines 757 to 765 in 0250046
But, given that this functionality exists I expect there are folks out there who do appreciate the ability to treat these different types of failures individually? So, to avoid a breaking change by replacing |
done and done |
I'll open a new issue regarding the counters |
Ran into a bit of a weird issue with this library. My upstream in Kong were periodically getting marked as unhealthy due to timeout issues but yet were never coming back to a healthy status. Yet, if I manually marked the upstream as healthy everything would work fine. After some digging I noticed that over the course of time, TCP timeout counter was slowly being incremented but yet never seemed to be reset. After some more digging (and the addition of this log line) I finally figured out I had SSL handshake issues as I hadn't configured
lua_ssl_trusted_certificate
. However, I never noticed this issue because I hadtcp_failures
set to 0 and SSL handshake is seen as a TCP failure rather than HTTP/S failure. I.e. in my case, my upstreams were healthy to begin with but they were never actually passing a healthcheck due to the SSL issues. On that note, I do wonder if this error (and some of the subsequent ones) really should be handled as TCP failures. I know that strictly they are so, but given that a specific TCP check has already exited by this point in time in the code, perhaps these other errors should be reported in such a way that the actual healthcheck type (http/s) fails too? Happy to raise this as an issue for further discussion if you feel this would be justified?