Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http: TLS handshake error from <promxy IP>:port: EOF #638

Open
winhvu opened this issue Feb 12, 2024 · 10 comments
Open

http: TLS handshake error from <promxy IP>:port: EOF #638

winhvu opened this issue Feb 12, 2024 · 10 comments
Labels

Comments

@winhvu
Copy link

winhvu commented Feb 12, 2024

We have a deployment as below:

image

and here is http_client we passed to Promxy:

http_client:
  dial_timeout: 10s
  tls_config:
     ca_file: <path_to_CA>
     cert_file: <path_to_cert>
     key_file: <path_to_key>
     insecure_skip_verify: false

When I perform multiple PromQL queries in parallel towards Promxy via curl command like this:

seq 1 200 | xargs -n1 -P10 curl --cert tls.crt --key tls.key --cacert ca.crt "https://promxy-endpoint:9091/api/v1/query?query=up"

I got lots of tls error messages http: TLS handshake error from <promxy IP>:port: EOF in our reverse proxy. It does not happen when the queries are sent in sequence.

I have decoded certs of both sides, client and server, they are all valid certificates.

Checking Promxy logs, there is no error messages; logs shows queries with returned successful code (200).

I would like to know if Promxy supports queries in parallel?

I have tried to test with query.max-concurrency=20 and the default one query.max-concurrency=-1; it does not help.

@winhvu
Copy link
Author

winhvu commented Feb 12, 2024

When I limit the number of the concurrencies to 1 by setting query.max-concurrency=1, there is no more tls error message.

@winhvu
Copy link
Author

winhvu commented Feb 13, 2024

and there is no issue at all if we send queries in parallel directly to reverse proxy, bypass the promxy pod.

@winhvu
Copy link
Author

winhvu commented Feb 14, 2024

I have tried with some another setups to narrow down the scope that could cause the problem:

  1. Use static server group rather than the dynamic one to confirm if the issue would cause by target discovery or not.

  2. Use one target server rather than 02 from the static group to check if there would have race condition while Promxy deals with multiple targets.

  3. Add more time to timeout and dial_timeout to see if the default times would be too short that Promxy might terminate the connection while tls handshake is not yet done.

But TLS error messages still show up in the logs.

@winhvu
Copy link
Author

winhvu commented Feb 19, 2024

@jacksontj Do you have any feedback/comments on this issue? Do you think there is race condition there in Promxy?

@jacksontj
Copy link
Owner

First off, thanks for reaching out!

I did some initial digging but your configuration seems incomplete (maybe just not included in the issue?). Specifically its missing the scheme configuration which would make all the requests downstream from promxy be http instead of https.

So in my local testing I have promxy -> nginx (with TLS) -> `demo.robustperception.io:9090

And I was able to get data working correctly and use a variation on your curl to test parallel usage:

seq 1 200 | xargs -n1 -P10 curl -k "https://localhost:8082/api/v1/query?query=up"

I have used promxy in front of HTTPs downstreams before without issue; so I don't expect you'll run into issues (other than the config; which is a bit odd because the prometheus scrape_config is a bit odd).

Hopefully that helps?

@winhvu
Copy link
Author

winhvu commented Feb 27, 2024

Thanks @jacksontj for the reply.

Yes, we do have scheme in the promxy configuration:

  - job_name: 'prometheus-pods'
    # anti-affinity for merging values in timeseries between hosts in the server_group
    anti_affinity: 15s

    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - testing-ns
    # configures the protocol scheme used for requests. Defaults to http
    scheme: https
    # options for promxy's HTTP client when talking to hosts in server_groups
    http_client:
      # dial_timeout controls how long promxy will wait for a connection to the downstream
      dial_timeout: 10s
      tls_config:
        ca_file: /run/secrets/trusted-root-cert/ca.crt
        cert_file: /run/secrets/prometheus-client-cert/tls.crt
        key_file: /run/secrets/prometheus-client-cert/tls.key
        insecure_skip_verify: false
    relabel_configs: []

The scheme http displayed in the log is misleading. However, I have enabled promxy log with trace level to see what the scheme and Prometheus endpoints promxy communicate with, and it is totally correct.

I have used promxy in front of HTTPs downstreams before without issue

The issue is not always showed up if the traffics towards promxy is low; it happens more frequently if we add more traffics like running the same curl command above from multiple terminals (e.g. I ran on 03 terminals in parallel)

@winhvu
Copy link
Author

winhvu commented Mar 4, 2024

Hi @jacksontj

Do you have a chance to reproducing the issue using the way I mentioned above?

@jacksontj
Copy link
Owner

I got lots of tls error messages http: TLS handshake error from :port: EOF in our reverse proxy. It does not happen when the queries are sent in sequence.

I just was re-reading this tonight -- and I realized I might have missed something here. Is this error you are describing in the reverse proxy you are running? If so that would indicate that the issue is between the proxy and promxy -- which could narrow things down a bit. I have been unable to reproduce issues with promxy talking to downstream HTTPs endpoints. If its just getting EOF I would actually wonder if the promxy process is OOMing or restarting (or whatever is terminating the TLS connection). Do you have any more information on the setup here and maybe some log output?

@winhvu
Copy link
Author

winhvu commented Sep 24, 2024

Thanks for looking into the issue. I will try to setup the environment and re-run the test. Can you please let me know what logs/info needed for troubleshooting? With my setup as mentioned in the description, it always shows up when running query via curl from multiple terminals in parallel (I ran on 03 terminals with my test)

@jacksontj
Copy link
Owner

@winhvu to debug more trace logs are generally all that is required. A tcpdump of the issue is also nice; but the trace log attempts to capture the relevant data :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants