Loki uses wrong AdvertiseAddr for membership #5610

Open
Oloremo opened this issue Mar 12, 2022 · 20 comments

Comments

@Oloremo

Oloremo commented Mar 12, 2022

Describe the bug
I'm trying to set up a 3-node Loki 2.4.2 cluster in a Hashicorp Nomad environment using bridge networking, so inside the Loki container there is an internal Nomad network (172.26.64.x) that is unreachable from the outside.

I map and expose ports 3100, 7946, and 9096 so they're reachable if you access them via the real node IP:port.
I also configured the ring config to set the right advertise address:

common:
  ring:
    instance_addr: 172.31.23.68
    instance_id: loki-01

Full config from one node:
https://gist.github.com/Oloremo/f64be59cea85bc9e01fe262b9b158006

But in the logs I see that Loki is trying to access the internal network:

level=debug ts=2022-03-12T10:21:33.502827095Z caller=tcp_transport.go:389 msg=FinalAdvertiseAddr advertiseAddr=172.26.64.24 advertisePort=7946
[CUT]
level=warn ts=2022-03-12T10:21:39.715337336Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.26.64.21:7946 err="dial tcp 172.26.64.21:7946: connect: no route to host"

Full logs: https://gist.github.com/Oloremo/e8de36fb505b74241b59234dccdf149b

So I think Loki ignores the ring configuration and still tries to guess the network?

To Reproduce
Steps to reproduce the behavior:

  1. Start a 3-node Loki 2.4.2 cluster with bridged networking
  2. Configure the ring to advertise on a different IP

Expected behavior
Setting instance_addr should disable the network guessing.

Environment:

  • Infrastructure: Hashicorp Nomad
  • Deployment tool: Nomad HCL
@miconx

miconx commented Mar 18, 2022

Similar problem here - I'd like to see your Nomad config to compare it with mine.
Do you use Consul Connect to bridge the containers?

@Oloremo
Author

Oloremo commented Mar 18, 2022

> do you use consul connect to bridge the containers?

As far as I understand, it would be impossible to use Consul Connect here, since the membership algorithm seems to need to know all peer addresses, and Consul Connect hides them behind a single endpoint.

So right now I'm just trying to make it work with bridge networking and port mapping.

It works with host networking.

@DylanGuedes
Contributor

Are you using SSD mode or monolithic? What do you see when you access /ring? Which flags are you using?

Your configuration looks fine, except that if all three nodes are using the same instance_addr, from what I understand you'll have three different nodes trying to advertise the same address in the ring.

@ddreier

ddreier commented Mar 24, 2022

I actually just started working on setting up a test Loki cluster in a Nomad environment and I am running into the exact same issue!

Config excerpt, from the Nomad Job's template stanza:

common:
  ring:
    instance_addr: {{ env "NOMAD_IP_loki_memberlist" }}

Nomad substitutes {{ env "NOMAD_IP_loki_memberlist" }} with the IP address of the host that Loki is running on (relevant Nomad docs), and I can see that it is configured correctly when I run Loki with the -print-config-stderr flag. But in Loki's logs, it's trying to connect to the internal Docker IP address from within the Docker network on each Nomad node.
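
For reference, a minimal sketch of how such a template stanza might look in the Nomad job file (the destination path is illustrative, not taken from my actual job):

template {
  destination = "local/loki.yml"
  data        = <<EOH
common:
  ring:
    instance_addr: {{ env "NOMAD_IP_loki_memberlist" }}
EOH
}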

Excerpt from Loki's config dump (the IP address is different for each instance of Loki):

common:
<snip>
  ring:
<snip>
    instance_interface_names:
    - eth0
    - en0
    - lo
    instance_port: 0
    instance_addr: 10.x.x.21
    instance_availability_zone: ""
<snip>
    instance_interface_names:
    - eth0
    - en0
    - lo
    instance_port: 0
    instance_addr: 10.x.x.21
querier:
  query_timeout: 1m0s
  tail_max_duration: 1h0m0s
<snip>
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
<snip>
    unregister_on_shutdown: true
    readiness_check_ring_health: true
    address: 10.x.x.21
    port: 0
    id: 04b682673488

Excerpt from Loki logs:

level=warn ts=2022-03-24T01:39:45.667677533Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.16:7946 err="dial tcp 172.17.0.16:7946: connect: connection refused"
level=warn ts=2022-03-24T01:39:45.668146054Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.8:7946 err="dial tcp 172.17.0.8:7946: connect: connection refused"
level=warn ts=2022-03-24T01:39:45.868258076Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.8:7946 err="dial tcp 172.17.0.8:7946: connect: connection refused"
level=warn ts=2022-03-24T01:39:45.8687917Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.16:7946 err="dial tcp 172.17.0.16:7946: connect: connection refused"
level=warn ts=2022-03-24T01:39:45.972569324Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.16:7946 err="dial tcp 172.17.0.16:7946: connect: connection refused"

Example Nomad Job: https://gist.github.com/ddreier/3d9c93a555aa36058ae1cf907b98ca51

@Oloremo
Author

Oloremo commented Mar 24, 2022

@DylanGuedes

> are you using SSD mode or monolithic?

Monolithic, for the PoC phase.

> what do you see when you access /ring?

IPs are correct in the ring endpoint.

> which flags are you using?

What do you mean by flags?

> your configuration looks fine, except that if all the three nodes are using the same instance_addr,

This config is from one node; instance_addr and name are different and correct on the others.

The issue is that Loki is still trying to use the internal network, as per the debug logs I added.

@Oloremo
Author

Oloremo commented Mar 24, 2022

@ddreier
It works with the host network, and it seems like it won't be able to work with a bridged one at all, since every member of the ring needs to be able to reach every member, itself included.

And you can't route internal_network -> host_network -> bridge -> internal_network back to yourself, at least not without some iptables tuning.

@ddreier

ddreier commented Mar 24, 2022

@Oloremo thanks, I was eventually able to get my PoC up and running by setting network_mode to host. I'll just have to continue that practice for now, until we can configure which IP address Loki advertises.
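
For anyone following along, a minimal sketch of that Docker-driver workaround (image tag and args are illustrative):

task "loki" {
  driver = "docker"

  config {
    image = "grafana/loki:2.4.2"
    # Bypass the Docker bridge entirely so memberlist binds and advertises the host IP.
    network_mode = "host"
    args = ["-config.file=/local/loki.yml"]
  }
}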

@gassman

gassman commented Mar 31, 2022

There appears to be an undocumented instance_interface_names option in the frontend section (at least in 2.4.2). Here is a list of all the instance_interface_names options visible when you run Loki with -print-config-stderr:

common:
  ring:
    instance_interface_names:
    - eth0
    - en0
    - lo
distributor:
  ring:
    instance_interface_names:
    - eth0
    - en0
    - lo
frontend:
  instance_interface_names:
  - eth0
  - en0
  - lo
ruler:
  ring:
    instance_interface_names:
    - eth0
    - en0
    - lo

As I am not running Loki in a container or Kubernetes, this flag in frontend defaults to eth0, en0, lo if not set. The only interface from this list that I actually use is lo. Setting instance_interface_names in frontend to the actual NIC device name made the query frontend work like a charm. No more delays or timeouts in SSD mode.
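
For example (ens192 is a hypothetical device name; substitute your host's actual NIC):

frontend:
  instance_interface_names:
  - ens192   # the real NIC, instead of the eth0/en0/lo defaults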

@stale

stale bot commented May 1, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

stale bot added the stale label May 1, 2022
@DylanGuedes
Contributor

DylanGuedes commented May 1, 2022

> There appears to be an undocumented instance_interface_names option in the frontend section (at least in 2.4.2). [...] Setting instance_interface_names in frontend to the actual NIC device name made the query frontend work like a charm. No more delays or timeouts in SSD mode.

Glad to hear that you found a workaround! But FYI, we've also added a new instance_interface_names inside the common section but outside the ring, i.e. you have:

common:
  ring:
    instance_interface_names: ...

but now you could instead have:

common:
  instance_interface_names: ...

The biggest difference is that the common instance_interface_names is also applied to the frontend, which doesn't happen for the configuration inside the ring (since the frontend isn't a ring).
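
So, assuming a host NIC named ens192 (hypothetical), the common-level form would be:

common:
  instance_interface_names:
  - ens192   # applied to all rings and to the frontend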

stale bot removed the stale label May 1, 2022
@Oloremo
Author

Oloremo commented Jun 28, 2022

@kavirajk any updates? Still unsure how we could run Loki with bridged networking.

@stale

stale bot commented Aug 13, 2022

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days. [...]

@Oloremo
Author

Oloremo commented Aug 13, 2022

not stale

@Tahvok

Tahvok commented Sep 19, 2022

I would like to add that this can easily be reproduced using Docker containers with bridge networking (not k8s) when deploying a distributed setup where each service (Loki target) runs on its own instance.
instance_addr is completely ignored.
The easiest repro is to bring up 2 instances, one for the distributor and one for the ingester. Run Loki in Docker on each one, configure instance_addr to be the instance IP, and configure a DNS record for the memberlist ring with both instance IPs.
On each Loki container you will receive an error similar to: Got ping for unexpected node.
And you will notice that it's using the internal Docker bridge address instead of the configured instance_addr.
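
A sketch of the setup I mean, assuming hosts 10.0.0.1 and 10.0.0.2 and a DNS record loki-memberlist.internal resolving to both (all names and IPs hypothetical):

target: distributor            # "ingester" on the second instance

memberlist:
  join_members:
  - loki-memberlist.internal:7946

common:
  ring:
    instance_addr: 10.0.0.1    # this host's IP; 10.0.0.2 on the other instance
    kvstore:
      store: memberlist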

@ddaka

ddaka commented Oct 18, 2022

I'm having the same issue. I'm trying to run Loki under Docker on two different hosts, and Loki always advertises the internal Docker IP, which is not reachable from the other member.
instance_addr is either broken or the documentation doesn't explain well how to use it.

@corest

corest commented Nov 2, 2022

I had the same issue. It was annoying, as the same setup works fine for Mimir. Then I just copied the advertise address/port configuration from Mimir into Loki (advertise_addr and advertise_port) and now it works.

E.g.

memberlist:
  gossip_nodes: 2
  [[ $alloc_index := env "NOMAD_ALLOC_INDEX" ]]
  [[- $gossip_service := (print "loki-gossip-" $alloc_index) ]]
  [[ range service $gossip_service ]]
  advertise_addr: [[ .Address ]]
  advertise_port: [[ .Port ]]
  [[ end ]]
  join_members: [ [[ range service "loki-gossip-0" ]][[ .Address ]]:[[ .Port ]][[end]], [[ range service "loki-gossip-1" ]][[ .Address ]]:[[ .Port ]][[end]] ]

@Oloremo
Author

Oloremo commented Nov 2, 2022

wait, Loki doesn't list advertise_addr and advertise_port as available configurations.
https://grafana.com/docs/loki/latest/configuration/#memberlist_config

Docs issue?..

@corest

corest commented Nov 3, 2022

That actually didn't fix the issue :/
Those parameters are accepted, but they don't seem to have any effect.

@corest

corest commented Nov 3, 2022

OK, so here is the part of the config with all the advertise addresses and ports replaced:

memberlist:
  [[ $alloc_index := env "NOMAD_ALLOC_INDEX" ]]
  [[- $gossip_service := (print "loki-gossip-" $alloc_index) ]]
  [[ range service $gossip_service ]]
  advertise_addr: [[ .Address ]]
  advertise_port: [[ .Port ]]
  [[ end ]]
  join_members: [ [[ range service "loki-gossip-0" ]][[ .Address ]]:[[ .Port ]][[end]], [[ range service "loki-gossip-1" ]][[ .Address ]]:[[ .Port ]][[end]] ]

[[- $grpc_service := (print "loki-grpc-" $alloc_index) ]]
[[ range service $grpc_service ]]
common:
  ring:
    instance_addr: [[ .Address ]]
    instance_port: [[ .Port ]]
ruler:
  ring:
    instance_addr: [[ .Address ]]
    instance_port: [[ .Port ]]
distributor:
  ring:
    instance_addr: [[ .Address ]]
    instance_port: [[ .Port ]]
frontend:
  address: [[ .Address ]]
  port: [[ .Port ]]
compactor:
  compactor_ring:
    instance_addr: [[ .Address ]]
    instance_port: [[ .Port ]]
[[ end ]]

and for nomad network/services I have

    network {
      port "gossip" {
        to = 7946
        host_network = "private"
      }
      port "grpc" {
        to = 9095
        host_network = "private"
      }
    }
...
    service {
      name = "loki-gossip-${NOMAD_ALLOC_INDEX}"
      port = "gossip"
    }

    service {
      name = "loki-grpc-${NOMAD_ALLOC_INDEX}"
      port = "grpc"
    }

All those advertise_addr and advertise_port configs are undocumented, but they work the same way as in Mimir.
By redefining all those fields I got it working and was able to ship logs to Loki with Vector (https://vector.dev/docs/reference/configuration/sinks/loki/).
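
For illustration, with hypothetical addresses and Nomad-assigned host ports, the memberlist part of the template renders to something like:

memberlist:
  advertise_addr: 10.0.0.11                          # host-reachable IP from the loki-gossip service
  advertise_port: 24762                              # the Nomad-mapped host port, not the container's 7946
  join_members: [ 10.0.0.11:24762, 10.0.0.12:31555 ]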

@djuarezg

Can confirm that using the undocumented advertise_addr works for me as well. Otherwise, even when specifying the correct interface name, Loki still tries to use the wrong address to determine the health status for the ring.

Basically, without this advertise_addr option the node is detected as unhealthy, then retries with the correct IP and later reports as healthy.
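
So the minimal workaround that holds for me looks like this (address hypothetical):

memberlist:
  advertise_addr: 10.0.0.5   # an IP the other ring members can actually reach
  advertise_port: 7946       # the externally mapped gossip port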
