
thanos+ingress-nginx+grpc: impossible setup due to missing host header #1507

Closed
danielmotaleite opened this issue Sep 10, 2019 · 61 comments

@danielmotaleite
Contributor

danielmotaleite commented Sep 10, 2019

Thanos, Prometheus and Golang version used
quay.io/thanos/thanos:v0.7.0

What happened
I set up two Kubernetes clusters: Thanos Query runs in one cluster (along with a local Prometheus + sidecar) and needs to query the remote cluster's Thanos sidecar. Everything runs in AWS (but not EKS).
I created an ingress-nginx Ingress with gRPC support using this config:

---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: prometheus-k8s-live-a.ops.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-k8s-live-a
          servicePort: 9090
  - host: prometheus-k8s-live-b.ops.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-k8s-live-b
          servicePort: 9090
  tls:
  - hosts:
    - prometheus-k8s-live-a.ops.example.com
    - prometheus-k8s-live-b.ops.example.com
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
  name: grpc-ingress
  namespace: monitoring
spec:
  rules:
  - host: sidecar-k8s-live-a.ops.example.com
    http:
      paths:
      - backend:
          serviceName: sidecar-k8s-live-a
          servicePort: 10911
  - host: sidecar-k8s-live-b.ops.example.com
    http:
      paths:
      - backend:
          serviceName: sidecar-k8s-live-b
          servicePort: 10911
  tls:
  - hosts:
      - sidecar-k8s-live-a.ops.example.com
      - sidecar-k8s-live-b.ops.example.com

thanos query is using

--store=sidecar-k8s-live-a.ops.example.com.:443
--store=sidecar-k8s-live-b.ops.example.com.:443

I can connect to the Prometheus URL, but the sidecar gRPC connection fails in Thanos Query.
Looking at the nginx logs I can see the query arriving over HTTP/2 but returning 400. With curl I get a 503, but probably just because it is not really gRPC. After changing the ingress-nginx log format to show the host header, I can see that curl sends the correct host header, but for Thanos Query the logs show only _, so it is sending either an empty header or a literal _.

What you expected to happen
I wanted to share the ingress between the HTTPS requests for Prometheus and the gRPC traffic, using the host header to route each request to the correct service. Sadly, Thanos Query fails to send the host header, so nginx can't do the virtual-host lookup and serves the request from the default site.

Full logs to relevant components

Logs

172.27.119.135 - [172.27.119.135] - - [10/Sep/2019:15:02:40 +0000] "PRI * HTTP/2.0" 400 163 "-" "-" 0 0.001 [] [] - - - - 477873c7a336618ccf06cf9c03fe8d97
172.27.119.135 - [172.27.119.135] - - [10/Sep/2019:15:02:40 +0000] "PRI * HTTP/2.0" 400 163 "-" "-" 0 0.003 [] [] - - - - c32e68975e91159a64326b55d4b72934
2019/09/10 15:02:40 [error] 1137#1137: *7155 upstream rejected request with error 2 while reading response header from upstream, client: 172.26.81.74, server: sidecar-k8s-live-a.ops.example.com, request: "PRI / HTTP/1.1", upstream: "grpc://100.96.136.200:10911", host: "sidecar-k8s-live-a.ops.example.com"
172.26.81.74 - [172.26.81.74] - - [10/Sep/2019:15:02:40 +0000] "PRI / HTTP/1.1" 502 163 "-" "curl/7.58.0" 189 0.002 [monitoring-sidecar-k8s-live-a-10911] [] 100.96.136.200:10911 0 0.004 502 4e08c4e8c6d8df148c5bc3a68d61ccf9

Here we can see that the Thanos Query requests do not trigger the virtual host, while the curl request, which carries a host header, is routed to the Thanos sidecar.

@danielmotaleite
Contributor Author

Just to make it clear:

Since there is no host header, nginx uses the default site, and the default site is a plain HTTP proxy, so the request never hits the gRPC proxy config.

If Thanos sent the host header, nginx would load the correct config and deliver the request to the correct backend with the correct protocol.
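
The behaviour described above can be sketched roughly like this (a hypothetical, simplified excerpt of what the controller generates; server and upstream names are illustrative, not the controller's literal output):

```nginx
# Requests without a matching Host/:authority fall into the default server,
# which proxies plain HTTP, so gRPC framing breaks there.
server {
    listen 443 ssl http2 default_server;
    server_name _;
    location / {
        proxy_pass http://upstream-default-backend;
    }
}

# Only requests that carry the right name ever reach the gRPC proxy config.
server {
    listen 443 ssl http2;
    server_name sidecar-k8s-live-a.ops.example.com;
    location / {
        grpc_pass grpc://monitoring-sidecar-k8s-live-a-10911;
    }
}
```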

@danielmotaleite
Contributor Author

I found a reference to this problem in a months-old issue (not directly related to this one):
#977 (comment)
It basically confirms the problem.

@bwplotka
Member

Thanks for the report. As answered on the mentioned issue: have you tried setting up a forward proxy (e.g. nginx or Envoy)? I think that might solve your issue, as it is more flexible in terms of what certs/credentials you use.

In the gRPC world there is no Host header, really; there is :authority. You can read about this here: grpc/grpc#1022

We indeed don't set the authority manually, as it should be properly derived from the TLS credentials. We might add support for proxying it through, but again, it might be best if you set up a forward proxy for that.

What does your nginx configuration look like, then, if you are willing to share? (:

@mheggeseth

A reasonable way to work around this with NGINX Ingress Controller is to use the tcp-services-configmap feature to expose ports that route directly to sidecar-k8s-live-a:10911 (e.g. 11911) and sidecar-k8s-live-b:10911 (e.g. 12911) respectively.

Then your thanos-query options would look something like:

--store=sidecar-k8s-live.ops.example.com:11911  # routes to sidecar-k8s-live-a:10911
--store=sidecar-k8s-live.ops.example.com:12911  # routes to sidecar-k8s-live-b:10911

You still have to set up TLS on your own in both thanos-query and thanos-sidecar, but it helps avoid all the HTTP routing that the ingress controller tries to do for you.
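
For reference, the mapping described above might look something like this (a sketch; the ConfigMap name and namespace depend on how your controller was deployed):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # external port: "<namespace>/<service>:<port>"
  "11911": monitoring/sidecar-k8s-live-a:10911
  "12911": monitoring/sidecar-k8s-live-b:10911
```

You also need to expose ports 11911 and 12911 on the ingress controller's Service/LoadBalancer.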

@stale

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020
@garenwen

I had the same problem

@stale stale bot removed the stale label Jan 17, 2020
@garenwen

Do you have a solution?

@cjf-fuller

Another workaround with the NGINX Ingress Controller is to use the --grpc-client-server-name flag on thanos-query. This uses Server Name Indication (SNI), which allows the ingress controller to route the request correctly.

I believe this limits each querier to one server name only. Therefore you will need multiple queriers if you have multiple clusters to communicate between.

Your thanos-query args would include:

--grpc-client-server-name=sidecar-k8s-live.ops.example.com
--grpc-client-tls-secure
--store=dns+sidecar-k8s-live.ops.example.com:443

And your ingress annotations would include:

nginx.ingress.kubernetes.io/backend-protocol: GRPC
nginx.ingress.kubernetes.io/ssl-redirect: "true"

@shane-a-orme

@cjf-fuller, this could work, but it is important to understand that:

The Prometheus StatefulSet is labeled thanos-store-api: "true" so that each pod is discovered by the headless service. This headless service is used by Thanos Query to query data across all the Prometheus instances. A replica might be up, but querying it will show a small time gap for the period during which it was down. This isn't fixed by having a second replica, because a replica could be down at any moment, for example during a rolling restart. These instances show how load balancing can fail. Be wary, as this can lead to overwriting of your initial query and loss of the host header, and with it nginx's ability to do the virtual-host search.

@martip07

martip07 commented Feb 6, 2020

I believe this limits each querier to one server name only. Therefore you will need multiple queriers if you have multiple clusters to communicate between.

Hi, are you sure that it will limit each querier to one server name only?

Regards,

@cjf-fuller

@Than0s-coder, great point, we have set up a “central” Querier to target a “leaf” Querier and not the sidecars directly. But it sounds like this risk of overwriting the initial query and loss of host_headers would still be present?

@martip07, I am still very much a beginner with Thanos, so I could be totally wrong here. But as far as I can tell, the --grpc-client-server-name argument is a string that sets ServerName in tls.Config. I am not sure how I would make this a list of server names.

I have seen that the TLS Extensions documentation mentions a ServerNameList struct, but I cannot find many examples of it being used. I have tested this with a simple comma-separated list (--grpc-client-server-name=test-1.myorg.com,test-2.myorg.com), which fails at the SSL handshake because the list is never enumerated: the wildcard certificate is valid for "*.myorg.com" and not for the literal name "test-1.myorg.com,test-2.myorg.com".
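
For what it's worth, the single-string behaviour matches Go's standard library: tls.Config.ServerName holds exactly one opaque hostname, so a comma-joined value is matched against certificate SANs as-is (a minimal stdlib sketch, not Thanos' actual code):

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// sniName mimics how a flag like --grpc-client-server-name would feed
// tls.Config.ServerName: the value is one opaque hostname used for SNI
// and certificate verification, so a comma-joined "list" is treated as
// a single (invalid) name and never matches a SAN such as *.myorg.com.
func sniName(flagValue string) string {
	cfg := &tls.Config{ServerName: flagValue}
	return cfg.ServerName
}

func main() {
	// Printed unchanged: the standard library does not split this value.
	fmt.Println(sniName("test-1.myorg.com,test-2.myorg.com"))
}
```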

@stale

stale bot commented Mar 12, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Mar 12, 2020
@stale stale bot closed this as completed Mar 19, 2020
@popsikle

/reopen

@ageekymonk

/reopen.

@kakkoyun kakkoyun removed the stale label Apr 6, 2020
@kakkoyun
Member

kakkoyun commented Apr 6, 2020

Apparently this is still needed and valid.

@kakkoyun kakkoyun reopened this Apr 6, 2020
j3p0uk added a commit to j3p0uk/thanos that referenced this issue Apr 9, 2020
To avoid needing a query per remote cluster, get the name to add to
the dial options from the dns provider when making the grpc connection.
j3p0uk added a commit to j3p0uk/thanos that referenced this issue Apr 9, 2020
To avoid needing a query per remote cluster, get the name to add to
the dial options from the dns provider when making the grpc connection.
@j3p0uk

j3p0uk commented Apr 9, 2020

Possible fix pushed that uses a flag to change behaviour, based on the workaround detailed by @cjf-fuller in #1507 (comment).

If the "grpc-client-dns-server-name" flag is specified, the DNS provider is used to return the name that was originally looked up, and the relevant dial options are added when making the gRPC connection. This allows a different SNI per store, based on the originally provided (dns+<name>:<port>) name.

j3p0uk added a commit to j3p0uk/thanos that referenced this issue Apr 10, 2020
To avoid needing a query per remote cluster, get the name to add to
the dial options from the dns provider when making the grpc connection.
j3p0uk added a commit to j3p0uk/thanos that referenced this issue Apr 10, 2020
To avoid needing a query per remote cluster, get the name to add to
the dial options from the dns provider when making the grpc connection.

Signed-off-by: JP Sullivan <jonpsull@cisco.com>
@stale

stale bot commented May 9, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label May 9, 2020
@j3p0uk

j3p0uk commented May 11, 2020

Awaiting design review for fix as per #2407 (comment)

@squat @bwplotka

@j3p0uk

j3p0uk commented Jan 29, 2021

Sure. Do check the logs and see if they match in this case. Check that you can curl between the multiple clusters, etc. That looks like it could be a connectivity issue from your central cluster to query.my-local-domain.local:443 more than the issue detailed here, but that's a guess given there isn't much in the way of debug or logs to go on. Sorry I can't help more :)

@IbraheemAlSaady

IbraheemAlSaady commented Jan 29, 2021

@j3p0uk I have tried this with grpcurl from the central cluster grpcurl -insecure query.my-local-domain.local:443 list

I'm getting this response:

grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
thanos.Rules
thanos.Store

I did a describe as well grpcurl -insecure query.my-local-domain.local:443 describe and this is the output

grpc.health.v1.Health is a service:
service Health {
  rpc Check ( .grpc.health.v1.HealthCheckRequest ) returns ( .grpc.health.v1.HealthCheckResponse );
  rpc Watch ( .grpc.health.v1.HealthCheckRequest ) returns ( stream .grpc.health.v1.HealthCheckResponse );
}
grpc.reflection.v1alpha.ServerReflection is a service:
service ServerReflection {
  rpc ServerReflectionInfo ( stream .grpc.reflection.v1alpha.ServerReflectionRequest ) returns ( stream .grpc.reflection.v1alpha.ServerReflectionResponse );
}
Failed to resolve symbol "thanos.Rules": Symbol not found: thanos.Rules

Then grpcurl -insecure query.my-local-domain.local:443 thanos.Store/Info and this is the output

Error invoking method "thanos.Store/Info": target server does not expose service "thanos.Store"

@roysha1

roysha1 commented Jan 29, 2021

You might change the ingress backend protocol to GRPCS.
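
That is, if the upstream gRPC server itself serves TLS, the annotation would be (assuming the standard ingress-nginx annotation set):

```yaml
nginx.ingress.kubernetes.io/backend-protocol: "GRPCS"
```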

@IbraheemAlSaady

IbraheemAlSaady commented Jan 29, 2021

@roysha1 sadly, that didn't do it

@IbraheemAlSaady

I have updated my comment about my configuration to add the ingress controller logs.

@Placidina

Placidina commented Jan 30, 2021

I had the same problem. My solution was to use the Bitnami charts.

Depends: bitnami/charts#5345 bitnami/charts#5344

my bitnami/kube-prometheus custom values:

prometheus:
  disableCompaction: true
  thanos:
    create: true
    objectStorageConfig:
      secretName: thanos-objstore-config
      secretKey: objstore.yml
    ingress:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx
        nginx.ingress.kubernetes.io/ssl-redirect: "true"
        nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
        nginx.ingress.kubernetes.io/auth-tls-secret: monitoring/thanos-certs
        nginx.ingress.kubernetes.io/backend-protocol: GRPC

my bitnami/thanos custom values:

existingObjstoreSecret: thanos-objstore-config
query:
  hostAliases:
  - ip: "111.11.111.1"
    hostnames:
    - thanos.earth.cluster
  - ip: "111.11.112.1"
    hostnames:
    - thanos.mars.cluster
  stores:
  - thanos.earth.cluster:443
  - thanos.mars.cluster:443
  - thanos-storegateway.default.svc.cluster.local:10901
  dnsDiscovery:
    enabled: false
  grpcTLS:
    client:
      secure: true
      cert: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      key: |-
        -----BEGIN PRIVATE KEY-----
        ...
        -----END PRIVATE KEY-----
      ca: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
compactor:
  enabled: true
storegateway:
  enabled: true
  grpc:
    tls:
      enabled: true
      cert: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      key: |-
        -----BEGIN PRIVATE KEY-----
        ...
        -----END PRIVATE KEY-----
      ca: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----


@IbraheemAlSaady

IbraheemAlSaady commented Feb 3, 2021

After a while of pulling my hair out with this one, I managed to make it work. Just a note: my ingress is on the Query instance, not the sidecar; I would assume it works the same way for the sidecar (I didn't test that part).

My architecture is as follows:

Query (central cluster) -> Query (remote cluster :: ingress on this one) -> Sidecar (remote cluster) 
                        -> Sidecar (central cluster)

I'm deploying the stack with helm, here is my config

Remote & Central Cluster Prometheus Operator

prometheus:
  prometheusSpec:
    thanos:
      image: docker.io/bitnami/thanos
      tag: 0.17.2-scratch-r2
      objectStorageConfig:
        name: thanos
        key: objstore.yml

Remote Cluster Query Config

existingObjstoreSecret: objstorage
clusterDomain: cluster.local
query:
  dnsDiscovery:
    enabled: false
  stores:
    - kube-prometheus-prometheus-thanos.monitoring:10901 ## <-- thanos-sidecar

  ingress:
    enabled: false # disabled for http
    grpc:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx-internal
        nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
        ingress.kubernetes.io/ssl-redirect: "true"

      hostname: thanos.query.domain.local
      extraTls:
        - hosts:
            - thanos.query.domain.local
          secretName: thanos-grpc-tls

Central Cluster Query Config

existingObjstoreSecret: objstorage
clusterDomain: cluster.local
query:
  dnsDiscovery:
    enabled: false
  stores:
    ## this setup requires the thanos-sidecar tls to be 
    ## enabled. If you don't want to enable thanos-sidecar tls, you can modify the central cluster config by
    ## 1. create two query instances in the central cluster
    ## 2. first query instance has tls enabled on the client and store urls should only be the remote clusters' 
    ## 3. second query instance will point to the first query by service name, and to the local thanos-sidecar 
    - kube-prometheus-prometheus-thanos.monitoring:10901 
    - thanos.query.domain.local:443
  grpcTLS:
    client:
      secure: true
      existingSecret:
        name: thanos-grpc-tls
        keyMapping:
          ca-cert: ca.crt
          tls-cert: tls.crt
          tls-key: tls.key

Notice the certificate used for query ingress and for client TLS is the same certificate. I hope this helps someone

@stale

stale bot commented Apr 7, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Apr 7, 2021
@GiedriusS GiedriusS removed the stale label May 4, 2021
@stale

stale bot commented Jul 8, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jul 8, 2021
@ssadok

ssadok commented Jul 9, 2021

Hello, for me it works when you add the extra flag --grpc-client-tls-secure, and on the observee cluster I have cert-manager activated.

@stale stale bot removed the stale label Jul 9, 2021
@stale

stale bot commented Sep 8, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 8, 2021
@stale

stale bot commented Oct 12, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Oct 12, 2021
@countablecloud

countablecloud commented Jan 13, 2022

Hello, for me it works when you add the extra flag --grpc-client-tls-secure, and on the observee cluster I have cert-manager activated.

For anyone who's bashing their heads against this, this single line fixed it; we have ingress enabled in both observer and remote.

Using Bitnami kube-prometheus and Bitnami thanos on EKS 1.21.

Here are the values for thanos:

"bucketweb":
  "enabled": true
"compactor":
  "enabled": true
"minio":
  "auth":
    "rootPassword": "password"
    "rootUser": "user"
  "defaultBuckets": "thanos"
  "enabled": true
"objstoreConfig": |
  "config":
    "access_key": "user"
    "bucket": "thanos"
    "endpoint":minio.thanos-grafana.svc.cluster.local:9000
    "insecure": true
    "secret_key": "password"
  "type": "s3"
"query":
  "stores":
  - "thanos.cool-1.foobar.io:443"
  - "thanos.cool-2.foobar.io:443"
  "extraFlags":
  - "--grpc-client-tls-secure"
  # - "--grpc-client-server-name=kube-prometheus-prometheus-thanos"
  "dnsDiscovery":
    "enabled": false
  "ingress":
    "grpc": 
      "enabled": true
      "hostname": "thanos-querier.foobar.io"
      "tls": true"
      "annotations":
        "cert-manager.io/cluster-issuer": "letsencrypt-prod"
        "kubernetes.io/ingress.class": "nginx"
        "nginx.ingress.kubernetes.io/backend-protocol": "GRPC"
        "nginx.ingress.kubernetes.io/ssl-redirect": "true"
        "nginx.ingress.kubernetes.io/grpc-backend": "true"
"ruler":
  "alertmanagers":
  - "http://prometheus-operator-alertmanager.thanos-grafana.svc.cluster.local:9093"
  "config": |
    "groups":
    - "name": "metamonitoring"
      "rules":
      - "alert": "PrometheusDown"
        "expr": absent(up{prometheus="thanos-grafana/prometheus-operator"})
  "enabled": true
"storagegateway":
  "enabled": true

and for kube-prometheus:

"prometheus":
  "externalLabels":
    "cluster": "foobar"
  "thanos":
    "create": true
    "ingress":
      "annotations":
        "cert-manager.io/cluster-issuer": "letsencrypt-prod"
        "kubernetes.io/ingress.class": "nginx"
        "nginx.ingress.kubernetes.io/backend-protocol": "GRPC"
        "nginx.ingress.kubernetes.io/force-ssl-redirect": "true"
        "nginx.ingress.kubernetes.io/grpc-backend": "true"
        "nginx.ingress.kubernetes.io/protocol": "h2c"
        "nginx.ingress.kubernetes.io/proxy-read-timeout": "160"
      "enabled": true
      "hosts":
      - "name": "thanos.cool-1.foobar.io"
      "tls":
        - "hosts":
          - "thanos.cool-1.foobar.io"
          "secretName": "foobar-thanos-tls-secret"

@sagiv-zafrani

sagiv-zafrani commented Feb 2, 2022

It seems that I have the same problem.

The topology is roughly the same as @IbraheemAlSaady's implementation:
Query (central cluster) -> Query (central cluster :: grpc server without TLS) (connects to remote environment Ingress using mTLS) -> Query (remote cluster :: ingress on this one) -> Sidecar (remote cluster)
The setup was achieved using:

  • thanos Helm chart by Bitnami
  • kube-prometheus-stack Helm chart by Prometheus-community

It seems that Thanos Query on the observer cluster fails to query remote stores (an Ingress listening on ports 80/443 whose backend is thanos-query over gRPC).

My situation is a bit different: I'm using a self-signed certificate, and the issuer in this case is thanos-query-ca (self-signed certificates generated by Helm). When configuring the remote store (the TLS listener of the Ingress deployed on the remote cluster), Thanos fails when updating the new node.

level=warn ts=2022-02-02T12:26:15.917113393Z caller=endpointset.go:500 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from <thanos_query_grpc_remote_environment_ingress_hostname>:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=<thanos_query_grpc_remote_environment_ingress_hostname>:443

When querying the Ingress using grpcurl, I receive the response below -
Failed to dial target host "<thanos_query_grpc_remote_environment_ingress_hostname>:443": x509: certificate signed by unknown authority

Is there a way to indicate to Thanos - Query to skip verifying the issuer of the certificate?

Thanks in advance

@danielmotaleite
Contributor Author

@sagiv-zafrani, it is probably better to open a new issue for your use case, with a link to this one. This one is closed, so that will limit who may see your question.

@NominalTrajectory

@sagiv-zafrani, hi, were you able to solve your issue? I'm facing the same problem.

@sagiv-zafrani

sagiv-zafrani commented Aug 22, 2022

@sagiv-zafrani, hi, were you able to solve your issue? I'm facing the same problem.

No, we used generated certificates signed by a CA instead.

@tal-ayalon

@sagiv-zafrani @countablecloud @IbraheemAlSaady
Do you have to use a valid CA, or is a self-generated one also OK?

@audunsolemdal

@sagiv-zafrani @countablecloud @IbraheemAlSaady Do you have to use a valid CA, or is a self-generated one also OK?

Self-generated works OK. If connecting to nginx ingresses, your self-signed certificate's SANs must match the hostname you use in nginx.

From what I understand, Sagiv describes this solution https://krisztianfekete.org/solving-per-store-tls-limitation-in-thanos-query/

I tried having a single querier in the observer cluster querying sidecars via ingresses in other clusters, which works fine. However, when I try to query the storage gateway services located in the observee cluster, I struggle to get it working, although they should be configured with the same certificate via --grpc-server-tls-cert.
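
If the certificates are minted with cert-manager, pinning the SANs to the ingress hostname can be sketched like this (the names, issuer, and hostname here are illustrative assumptions):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: thanos-grpc-tls
  namespace: monitoring
spec:
  secretName: thanos-grpc-tls
  dnsNames:
    - thanos.query.domain.local  # must match the host used in the nginx Ingress
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
```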

@junoriosity

I had the same problem. My solution was to use the Bitnami charts.

Depends: bitnami/charts#5345 bitnami/charts#5344

my bitnami/kube-prometheus custom values:

prometheus:
  disableCompaction: true
  thanos:
    create: true
    objectStorageConfig:
      secretName: thanos-objstore-config
      secretKey: objstore.yml
    ingress:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx
        nginx.ingress.kubernetes.io/ssl-redirect: "true"
        nginx.ingress.kubernetes.io/auth-tls-verify-client: "on"
        nginx.ingress.kubernetes.io/auth-tls-secret: monitoring/thanos-certs
        nginx.ingress.kubernetes.io/backend-protocol: GRPC

my bitnami/thanos custom values:

existingObjstoreSecret: thanos-objstore-config
query:
  hostAliases:
  - ip: "111.11.111.1"
    hostnames:
    - thanos.earth.cluster
  - ip: "111.11.112.1"
    hostnames:
    - thanos.mars.cluster
  stores:
  - thanos.earth.cluster:443
  - thanos.mars.cluster:443
  - thanos-storegateway.default.svc.cluster.local:10901
  dnsDiscovery:
    enabled: false
  grpcTLS:
    client:
      secure: true
      cert: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      key: |-
        -----BEGIN PRIVATE KEY-----
        ...
        -----END PRIVATE KEY-----
      ca: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
compactor:
  enabled: true
storegateway:
  enabled: true
  grpc:
    tls:
      enabled: true
      cert: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      key: |-
        -----BEGIN PRIVATE KEY-----
        ...
        -----END PRIVATE KEY-----
      ca: |-
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----


Hi @Placidina, many thanks for your suggestion. Could you perhaps outline your solution a little further, particularly the whole certificate process? That would be very helpful. 🙂

@junoriosity

junoriosity commented Jan 15, 2023

After a while of pulling my hair with this one, I managed to make it work. Just a note here, my ingress is on the Query instance not the sidecar, I would assume it'd work the same way for sidecar (didn't test that part)

My architecture is as follows:

Query (central cluster) -> Query (remote cluster :: ingress on this one) -> Sidecar (remote cluster) 
                        -> Sidecar (central cluster)

I'm deploying the stack with helm, here is my config

Remote & Central Cluster Prometheus Operator

prometheus:
  prometheusSpec:
    thanos:
      image: docker.io/bitnami/thanos
      tag: 0.17.2-scratch-r2
      objectStorageConfig:
        name: thanos
        key: objstore.yml

Remote Cluster Query Config

existingObjstoreSecret: objstorage
clusterDomain: cluster.local
query:
  dnsDiscovery:
    enabled: false
  stores:
    - kube-prometheus-prometheus-thanos.monitoring:10901 ## <-- thanos-sidecar

  ingress:
    enabled: false # disabled for http
    grpc:
      enabled: true
      annotations:
        kubernetes.io/ingress.class: nginx-internal
        nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
        ingress.kubernetes.io/ssl-redirect: "true"

      hostname: thanos.query.domain.local
      extraTls:
        - hosts:
            - thanos.query.domain.local
          secretName: thanos-grpc-tls

Central Cluster Query Config

existingObjstoreSecret: objstorage
clusterDomain: cluster.local
query:
  dnsDiscovery:
    enabled: false
  stores:
    ## this setup requires the thanos-sidecar tls to be 
    ## enabled. If you don't want to enable thanos-sidecar tls, you can modify the central cluster config by
    ## 1. create two query instances in the central cluster
    ## 2. first query instance has tls enabled on the client and store urls should only be the remote clusters' 
    ## 3. second query instance will point to the first query by service name, and to the local thanos-sidecar 
    - kube-prometheus-prometheus-thanos.monitoring:10901 
    - thanos.query.domain.local:443
  grpcTLS:
    client:
      secure: true
      existingSecret:
        name: thanos-grpc-tls
        keyMapping:
          ca-cert: ca.crt
          tls-cert: tls.crt
          tls-key: tls.key

Notice the certificate used for query ingress and for client TLS is the same certificate. I hope this helps someone

@IbraheemAlSaady I like your solution a lot. However, since I am using Cloudflare, I only get tls.crt and tls.key from them. Could you help me get the whole certificate setup done? That would be awesome. 🙂

@Nashluffy

Hello, for me it works when you add the extra flag --grpc-client-tls-secure, and on the observee cluster I have cert-manager activated.

This fixed it for us as well. We have ingress-nginx terminating TLS, so when querier hits it, it should expect to perform a TLS handshake.

We also needed the following annotations on our ingress resource

    nginx.ingress.kubernetes.io/backend-protocol: GRPC
    nginx.ingress.kubernetes.io/ssl-redirect: "true"

We first made sure we could hit the endpoint from our local machines using grpcurl like

$ grpcurl <host>:<port> list
grpc.health.v1.Health
grpc.reflection.v1.ServerReflection
grpc.reflection.v1alpha.ServerReflection
thanos.Exemplars
thanos.Metadata
thanos.Rules
thanos.Store
thanos.Targets
thanos.info.Info

but thanos query was still failing with errors like

ts=2024-04-10T08:35:15.380752187Z caller=endpointset.go:394 level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"error reading server preface: EOF\"" address=<host>:<port>

Once we enabled --grpc-client-tls-secure, Query was able to connect through ingress-nginx successfully.

@ArieLevs

@Nashluffy, is it possible for you to share your Thanos/nginx configs? I'm hitting a similar issue (different error).
When executing grpcurl -insecure my.thanos.prometheus.sidecar.dns:443 list, the result is

grpc.health.v1.Health
grpc.reflection.v1.ServerReflection
grpc.reflection.v1alpha.ServerReflection
thanos.Exemplars
thanos.Metadata
thanos.Rules
thanos.Store
thanos.Targets
thanos.info.Info

but Thanos query pod hits an error with

level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: tls: first record does not look like a TLS handshake\""

My query chart values (from Bitnami):

stores:
  - my.thanos.prometheus.sidecar.dns:443
extraFlags:
  - "--grpc-client-tls-secure"
  - "--grpc-client-tls-skip-verify"

I didn't create any custom certificates on the Thanos Query side, except Let's Encrypt on the Thanos sidecar ingress.
Since the grpcurl command works as expected but calls from Thanos don't, I suspect the issue is somewhere on the Thanos side rather than the ingress controller side?

  • The -insecure and --grpc-client-tls-skip-verify values are here because I'm currently testing against a non-prod env and using (STAGING) Let's Encrypt certificates.

@ArieLevs

ArieLevs commented Jul 9, 2024

An update regarding the above error: it turns out all was OK; it's just that the storegateway expected the client to authenticate with it. Once I set clientAuthEnabled: false on the storegateway, the TLS handshake error above went away.

In addition, if you are using a network load balancer to access the Thanos sidecars and implementing TLS at the LB level (rather than the ingress), make sure to set the ALPN policy to at least HTTP2Optional; for a k8s Service the annotation is

service.beta.kubernetes.io/aws-load-balancer-alpn-policy: HTTP2Optional
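
In context, a hypothetical NLB Service for the sidecar's gRPC port might look like this (names and ports are illustrative assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-sidecar-grpc
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "external"
    service.beta.kubernetes.io/aws-load-balancer-alpn-policy: "HTTP2Optional"
spec:
  type: LoadBalancer
  ports:
    - name: grpc
      port: 443
      targetPort: 10901
```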
