Loki uses wrong AdvertiseAddr for membership #5610

Open
Oloremo opened this issue Mar 12, 2022 · 20 comments

Comments

@Oloremo

Oloremo commented Mar 12, 2022

Describe the bug
I'm trying to set up a 3-node Loki 2.4.2 cluster in a Hashicorp Nomad environment using bridge networking, so inside the Loki container there is an internal Nomad network (172.26.64.x) that is unreachable from the outside.

I map and expose ports 3100, 7946, and 9096 so they're reachable if you access them via the real node IP:port.
I also configured the ring config to set the right advertise address:

common:
  ring:
    instance_addr: 172.31.23.68
    instance_id: loki-01

Full config from one node:
https://gist.github.com/Oloremo/f64be59cea85bc9e01fe262b9b158006

But in the logs I see that Loki is trying to access the internal network:

level=debug ts=2022-03-12T10:21:33.502827095Z caller=tcp_transport.go:389 msg=FinalAdvertiseAddr advertiseAddr=172.26.64.24 advertisePort=7946
[CUT]
level=warn ts=2022-03-12T10:21:39.715337336Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.26.64.21:7946 err="dial tcp 172.26.64.21:7946: connect: no route to host"

Full logs: https://gist.github.com/Oloremo/e8de36fb505b74241b59234dccdf149b

So I think Loki ignores the ring configuration and still tries to guess the network?

To Reproduce
Steps to reproduce the behavior:

  1. Start a 3-node Loki 2.4.2 cluster with bridged networking
  2. Configure the ring to advertise on a different IP

Expected behavior
Setting instance_addr should disable the network guessing.

Environment:

  • Infrastructure: Hashicorp Nomad
  • Deployment tool: Nomad HCL
@miconx

miconx commented Mar 18, 2022

Similar problem here - I'd like to see your Nomad config to compare it with mine.
Do you use Consul Connect to bridge the containers?

@Oloremo
Author

Oloremo commented Mar 18, 2022

> do you use consul connect to bridge the containers?

As far as I understand, it would be impossible to use Consul Connect here, since the membership algorithm seems to need to know all peer addresses, and Consul Connect hides them behind a single endpoint.

So right now I'm just trying to make it work with bridge networking and port mapping.

It works with host networking.

@DylanGuedes
Contributor

Are you using SSD mode or monolithic? What do you see when you access /ring? Which flags are you using?

Your configuration looks fine, except that if all three nodes are using the same instance_addr, from what I understand you'll have three different nodes trying to advertise the same address in the ring.

@ddreier

ddreier commented Mar 24, 2022

I actually just started working on setting up a test Loki cluster in a Nomad environment and I am running into the exact same issue!

Config excerpt, from the Nomad Job's template stanza:

common:
  ring:
    instance_addr: {{ env "NOMAD_IP_loki_memberlist" }}

Nomad substitutes {{ env "NOMAD_IP_loki_memberlist" }} with the IP address of the host that Loki is running on (relevant Nomad docs), and I can see that it is configured correctly when I run Loki with the -print-config-stderr flag. But in Loki's logs, it's trying to connect to the internal Docker IP address from within the Docker network on each Nomad node.
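
For reference, a minimal sketch of how such a template stanza might look in the Nomad job file (the destination path is illustrative, not taken from my actual job):

template {
  destination = "local/loki.yml"
  data        = <<EOH
common:
  ring:
    instance_addr: {{ env "NOMAD_IP_loki_memberlist" }}
EOH
}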

Excerpt from Loki's config dump (the IP address is different for each instance of Loki):

common:
<snip>
  ring:
<snip>
    instance_interface_names:
    - eth0
    - en0
    - lo
    instance_port: 0
    instance_addr: 10.x.x.21
    instance_availability_zone: ""
<snip>
    instance_interface_names:
    - eth0
    - en0
    - lo
    instance_port: 0
    instance_addr: 10.x.x.21
querier:
  query_timeout: 1m0s
  tail_max_duration: 1h0m0s
<snip>
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
<snip>
    unregister_on_shutdown: true
    readiness_check_ring_health: true
    address: 10.x.x.21
    port: 0
    id: 04b682673488

Excerpt from Loki logs:

level=warn ts=2022-03-24T01:39:45.667677533Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.16:7946 err="dial tcp 172.17.0.16:7946: connect: connection refused"
level=warn ts=2022-03-24T01:39:45.668146054Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.8:7946 err="dial tcp 172.17.0.8:7946: connect: connection refused"
level=warn ts=2022-03-24T01:39:45.868258076Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.8:7946 err="dial tcp 172.17.0.8:7946: connect: connection refused"
level=warn ts=2022-03-24T01:39:45.8687917Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.16:7946 err="dial tcp 172.17.0.16:7946: connect: connection refused"
level=warn ts=2022-03-24T01:39:45.972569324Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.17.0.16:7946 err="dial tcp 172.17.0.16:7946: connect: connection refused"

Example Nomad Job: https://gist.github.com/ddreier/3d9c93a555aa36058ae1cf907b98ca51

@Oloremo
Author

Oloremo commented Mar 24, 2022

@DylanGuedes

> are you using SSD mode or monolithic?

Monolithic, for the PoC phase.

> what do you see when you access /ring?

IPs are correct in the ring endpoint.

> which flags are you using?

What do you mean by flags?

> your configuration looks fine, except that if all the three nodes are using the same instance_addr,

This config is from one node; instance_addr and name are different and correct on the others.

The issue is that Loki is still trying to use the internal network, as per the debug logs I added.

@Oloremo
Author

Oloremo commented Mar 24, 2022

@ddreier
It works with the host network, and it seems like it won't be able to work with a bridged one at all, since every member of the ring needs to be able to reach every member, itself included.

And you can't route internal_network -> host_network -> bridge -> internal_network back to yourself, at least not without some iptables tuning.

@ddreier

ddreier commented Mar 24, 2022

@Oloremo thanks, I was eventually able to get my PoC up and running by setting network_mode to host. I'll just have to continue that practice for now, until we can configure which IP address Loki advertises.
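
For anyone following along, a minimal sketch of that Docker-driver workaround (image tag and args are illustrative):

task "loki" {
  driver = "docker"

  config {
    image = "grafana/loki:2.4.2"
    # Bypass the Docker bridge entirely so memberlist binds and advertises the host IP.
    network_mode = "host"
    args = ["-config.file=/local/loki.yml"]
  }
}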

@gassman

gassman commented Mar 31, 2022

There appears to be an undocumented instance_interface_names option in the frontend section (at least in 2.4.2). Here is a list of all the instance_interface_names options visible when you run Loki with -print-config-stderr:

common:
  ring:
    instance_interface_names:
    - eth0
    - en0
    - lo
distributor:
  ring:
    instance_interface_names:
    - eth0
    - en0
    - lo
frontend:
  instance_interface_names:
  - eth0
  - en0
  - lo
ruler:
  ring:
    instance_interface_names:
    - eth0
    - en0
    - lo

As I am not running Loki in a container or Kubernetes, this flag in frontend defaults to eth0, en0, lo if not set. The only interface from this list that I actually use is lo. Setting instance_interface_names in frontend to the actual NIC device name made the query frontend work like a charm. No more delays or timeouts in SSD mode.
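
For example (ens192 is a hypothetical device name; substitute your host's actual NIC):

frontend:
  instance_interface_names:
  - ens192   # the real NIC, instead of the eth0/en0/lo defaults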

@stale

stale bot commented May 1, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

stale bot added the stale label May 1, 2022
@DylanGuedes
Contributor

DylanGuedes commented May 1, 2022

> There appears to be an undocumented instance_interface_names option in the frontend section (at least in 2.4.2). [...] Setting instance_interface_names in frontend to the actual NIC device name made the query frontend work like a charm. No more delays or timeouts in SSD mode.

Glad to hear that you found a workaround! But FYI, we've also added a new instance_interface_names inside the common section but outside the ring, i.e. you have:

common:
  ring:
    instance_interface_names: ...

but now you could instead have:

common:
  instance_interface_names: ...

The biggest difference is that the common instance_interface_names is also applied to the frontend, which doesn't happen for the configuration inside the ring (since the frontend isn't a ring).
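
So, assuming a host NIC named ens192 (hypothetical), the common-level form would be:

common:
  instance_interface_names:
  - ens192   # applied to all rings and to the frontend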

stale bot removed the stale label May 1, 2022
@Oloremo
Author

Oloremo commented Jun 28, 2022

@kavirajk any updates? Still unsure how we could run Loki with bridged networking.

@stale

stale bot commented Aug 13, 2022

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days. [...]

@Oloremo
Author

Oloremo commented Aug 13, 2022

not stale

@Tahvok

Tahvok commented Sep 19, 2022

I would like to add that this can easily be reproduced using Docker containers with bridge networking (not k8s) when deploying a distributed setup where each service (Loki target) runs on its own instance.
instance_addr is completely ignored.
The easiest repro is to bring up 2 instances, one for the distributor and one for the ingester. Run Loki in Docker on each one, configure instance_addr to be the instance IP, and configure a DNS record for the memberlist ring with both instance IPs.
On each Loki container you will receive an error similar to: Got ping for unexpected node.
And you will notice that it's using the internal Docker bridge address instead of the configured instance_addr.
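
A sketch of the setup I mean, assuming hosts 10.0.0.1 and 10.0.0.2 and a DNS record loki-memberlist.internal resolving to both (all names and IPs hypothetical):

target: distributor            # "ingester" on the second instance

memberlist:
  join_members:
  - loki-memberlist.internal:7946

common:
  ring:
    instance_addr: 10.0.0.1    # this host's IP; 10.0.0.2 on the other instance
    kvstore:
      store: memberlist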

@ddaka

ddaka commented Oct 18, 2022

I'm having the same issue. I'm trying to run Loki under Docker on two different hosts, and Loki always advertises the internal Docker IP, which is not reachable from the other member.
instance_addr is either broken or the documentation doesn't explain well how to use it.

@corest

corest commented Nov 2, 2022

I had the same issue. It was annoying, as the same setup works fine for Mimir. Then I just copied the advertise address/port configuration from Mimir into Loki (advertise_addr and advertise_port) and now it works.

E.g.

memberlist:
  gossip_nodes: 2
  [[ $alloc_index := env "NOMAD_ALLOC_INDEX" ]]
  [[- $gossip_service := (print "loki-gossip-" $alloc_index) ]]
  [[ range service $gossip_service ]]
  advertise_addr: [[ .Address ]]
  advertise_port: [[ .Port ]]
  [[ end ]]
  join_members: [ [[ range service "loki-gossip-0" ]][[ .Address ]]:[[ .Port ]][[end]], [[ range service "loki-gossip-1" ]][[ .Address ]]:[[ .Port ]][[end]] ]

@Oloremo
Author

Oloremo commented Nov 2, 2022

wait, Loki doesn't list advertise_addr and advertise_port as available configurations.
https://grafana.com/docs/loki/latest/configuration/#memberlist_config

Docs issue?..

@corest

corest commented Nov 3, 2022

That actually didn't fix the issue :/
Those parameters are accepted, but they don't seem to have any effect.

@corest

corest commented Nov 3, 2022

OK, so here is the part of the config with all the advertise addresses and ports replaced:

memberlist:
  [[ $alloc_index := env "NOMAD_ALLOC_INDEX" ]]
  [[- $gossip_service := (print "loki-gossip-" $alloc_index) ]]
  [[ range service $gossip_service ]]
  advertise_addr: [[ .Address ]]
  advertise_port: [[ .Port ]]
  [[ end ]]
  join_members: [ [[ range service "loki-gossip-0" ]][[ .Address ]]:[[ .Port ]][[end]], [[ range service "loki-gossip-1" ]][[ .Address ]]:[[ .Port ]][[end]] ]

[[- $grpc_service := (print "loki-grpc-" $alloc_index) ]]
[[ range service $grpc_service ]]
common:
  ring:
    instance_addr: [[ .Address ]]
    instance_port: [[ .Port ]]
ruler:
  ring:
    instance_addr: [[ .Address ]]
    instance_port: [[ .Port ]]
distributor:
  ring:
    instance_addr: [[ .Address ]]
    instance_port: [[ .Port ]]
frontend:
  address: [[ .Address ]]
  port: [[ .Port ]]
compactor:
  compactor_ring:
    instance_addr: [[ .Address ]]
    instance_port: [[ .Port ]]
[[ end ]]

and for nomad network/services I have

    network {
      port "gossip" {
        to = 7946
        host_network = "private"
      }
      port "grpc" {
        to = 9095
        host_network = "private"
      }
    }
...
    service {
      name = "loki-gossip-${NOMAD_ALLOC_INDEX}"
      port = "gossip"
    }

    service {
      name = "loki-grpc-${NOMAD_ALLOC_INDEX}"
      port = "grpc"
    }

All those advertise_addr and advertise_port configs are undocumented, but they work the same way as in Mimir.
By redefining all those fields I got it working and was able to ship logs to Loki with Vector (https://vector.dev/docs/reference/configuration/sinks/loki/).
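
For illustration, with hypothetical addresses and Nomad-assigned host ports, the memberlist part of the template renders to something like:

memberlist:
  advertise_addr: 10.0.0.11                          # host-reachable IP from the loki-gossip service
  advertise_port: 24762                              # the Nomad-mapped host port, not the container's 7946
  join_members: [ 10.0.0.11:24762, 10.0.0.12:31555 ]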

@djuarezg

Can confirm that using the undocumented advertise_addr works for me as well. Otherwise, even when specifying the correct interface name, Loki still tries to use the wrong address to determine the health status for the ring.

Basically, without this advertise_addr option the node is detected as unhealthy, then retries with the correct IP and later reports as healthy.
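
So the minimal workaround that holds for me looks like this (address hypothetical):

memberlist:
  advertise_addr: 10.0.0.5   # an IP the other ring members can actually reach
  advertise_port: 7946       # the externally mapped gossip port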
