Duplicate addresses in view after upgrade #845

sonofsounds · 2024-10-15T09:30:31Z

Hello, we recently upgraded JGroups from version 3.6.3.Final to 4.2.30.Final and noticed some issues related to view changes. For initial membership discovery, we use TCPGOSSIP. Below are the two configuration files (old and new) used to set up the protocol stack.

New config used with JGroups 4.2.30.Final:

<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="urn:org:jgroups"
        xsi:schemaLocation="http://www.jgroups.org/schema/jgroups.xsd">
    <TCP bind_port="${jgroups.tcp.bind_port}" sock_conn_timeout="300"/>
    <TCPGOSSIP initial_hosts="${jgroups.tcpgossip.initial_host:localhost[12001]}"/>
    <MERGE3/>
    <FD_ALL timeout="12000" interval="3000" timeout_check_interval="2000"/>
    <VERIFY_SUSPECT/>
    <pbcast.NAKACK2 use_mcast_xmit="false"/>
    <UNICAST3/>
    <pbcast.STABLE/>
    <pbcast.GMS/>
    <FRAG2/>
</config>

Old config used with JGroups 3.6.3.Final:

<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="urn:org:jgroups"
        xsi:schemaLocation="http://www.jgroups.org/schema/jgroups-3.6.xsd">
    <TCP bind_port="${jgroups.tcp.bind_port}" use_send_queues="true" sock_conn_timeout="300"/>
    <TCPGOSSIP initial_hosts="${jgroups.tcpgossip.initial_host:localhost[12001]}"/>
    <MERGE3/>
    <FD/>
    <VERIFY_SUSPECT/>
    <pbcast.NAKACK2 use_mcast_xmit="false"/>
    <UNICAST3/>
    <pbcast.STABLE/>
    <pbcast.GMS/>
    <FRAG2/>
</config>

We have noticed that duplicate addresses appear in the view, and this seems to happen when certain members are disconnected consecutively. Below is a log where the issue can be observed:

2024-10-07 08:50:17,322 INFO domain.JGroupsMessageDispatcher [JGroupsChannelMediator-thread-2] - New view : [capp+OMSEC47-P1LG+10.12.2.12, capp+OMSEC57-P1LG+10.12.60.18, capp+omsec3-p1lg.myhost.com+10.2.12.137, capp+OMSEC50-P1LG+10.12.60.13, capp+omsec9-p1lg.myhost.com+10.2.12.102, capp+OMSEC37-P1LG+10.15.15.13, capp+omsec5-p1lg.myhost.com+10.2.12.77, capp+OMSEC40-P1LG+127.0.1.1, capp+OMSEC62-P1LG+10.12.60.23, capp+OMSEC56-P1LG+10.12.60.17, capp+omsec6-p1lg.myhost.com+10.2.12.67, masterapp+omsec1-p1lg+10.3.12.53, capp+OMSEC55-P1LG+10.12.60.12, capp+omsec18-p1lg.myhost.com+10.2.12.26, capp+omsec19-p1lg.myhost.com+10.2.12.27, capp+omsec42-p1lg.myhost.com+127.0.1.1, capp+omsec27-p1lg.myhost.com+10.2.12.85, capp+OMSEC59-P1LG+10.12.60.20, capp+omsec24-p1lg.myhost.com+10.2.12.110, capp+omsec33-p1lg.myhost.com+10.2.12.31, capp+omsec13-p1lg.myhost.com+10.2.12.143, capp+omsec34-p1lg.myhost.com+10.2.12.29, capp+omsec10-p1lg.myhost.com+10.2.12.63, capp+OMSEC46-P1LG+10.12.2.11, capp+omsec20-p1lg.myhost.com+10.2.12.28, capp+OMSEC39-P1LG+10.15.15.26, capp+OMSEC54-P1LG+10.12.60.11, capp+omseap-p1lg.myhost.com+10.2.12.92, capp+omsec36-p1lg.myhost.com+10.2.12.41, capp+omsec35-p1lg.myhost.com+10.2.12.33, capp+omsec8-p1lg.myhost.com+10.2.12.140, capp+OMSEC48-P1LG+10.12.2.13, capp+omsec23-p1lg.myhost.com+10.2.12.109, capp+OMSEC52-P1LG+10.12.60.15, capp+omsec30-p1lg.myhost.com+10.2.12.86, capp+omsec14-p1lg.myhost.com+10.2.12.79, capp+OMSEC43-P1LG+10.15.15.36, capp+OMSEC53-P1LG+10.12.60.10, capp+OMSEC63-P1LG+10.12.60.24, capp+omsec12-p1lg.myhost.com+10.2.12.142, capp+omsec28-p1lg.myhost.com+10.2.12.144, capp+omsec16-p1lg.myhost.com+10.2.12.81, capp+omsec2-p1lg.myhost.com+10.2.12.98, capp+omsec32-p1lg.myhost.com+10.2.12.88, capp+omsec4-p1lg.myhost.com+10.2.12.32, capp+omsec26-p1lg.myhost.com+10.2.12.84, capp+omsec31-p1lg.myhost.com+10.2.12.87, capp+OMSEC41-P1LG+10.15.15.30, capp+OMSEC64-P1LG+10.12.60.25, capp+omsec7-p1lg.myhost.com+10.2.12.78, capp+OMSEC44-P1LG+10.15.15.37, 
capp+OMSEC60-P1LG+127.0.1.1, capp+omsec25-p1lg.myhost.com+10.2.12.83, capp+omsec15-p1lg.myhost.com+10.2.12.80, capp+OMSEC58-P1LG+10.12.60.19, capp+OMSEC45-P1LG+10.12.2.10, capp+OMSEC38-P1LG+127.0.1.1, capp+omsec17-p1lg+10.2.12.82, capp+OMSEC49-P1LG+10.12.2.14, capp+omsec22-p1lg.myhost.com+10.2.12.72, capp+omsec11-p1lg.myhost.com+10.2.12.141, capp+OMSEC65-P1LG+10.12.60.26, capp+OMSEC51-P1LG+10.12.60.14, capp+omsec19-p1lg.myhost.com+10.2.12.27]

You can clearly see that capp+omsec19-p1lg.myhost.com+10.2.12.27 appears twice in the view. This happened after the server was restarted.

It is worth mentioning that, after further investigation using JMX, we observed that in VERIFY_SUSPECT the coordinator has one of these two hosts in its suspects list, and the entry is never cleared (it stays there until a new coordinator takes over).

What is the cause of this? Any input here, @belaban ?

Thanks in advance!

belaban (Maintainer) · 2024-10-15T11:32:02Z

This is possible and can happen in the following case:

No FD_SOCK{2} protocol present
FD / FD_ALL / FD_ALLX protocol present
A member is killed (no graceful leave with JChannel.disconnect())
Before the FD_ALL timeout can kick in, the same member joins again
This results in the member appearing in the view twice until the old instance is suspected by FD_ALL and excluded by GMS
Note that only the logical name is duplicated; the UUIDs of the two entries are different
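
To make the logical-name versus UUID point concrete, here is a minimal JDK-only sketch. `Member` and `ViewNames` are hypothetical stand-ins invented for illustration, not JGroups classes; the sketch only shows how two entries with different IDs can print as the same name in a log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for a cluster member (not a JGroups class):
// a unique ID plus the human-readable logical name that shows up in logs.
final class Member {
    final UUID id;
    final String logicalName;

    Member(UUID id, String logicalName) {
        this.id = id;
        this.logicalName = logicalName;
    }
}

final class ViewNames {
    // Returns the logical names that occur more than once, in first-seen order.
    static List<String> duplicateNames(List<Member> view) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Member m : view) {
            counts.merge(m.logicalName, 1, Integer::sum);
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                duplicates.add(e.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // The killed instance and its restarted successor share a name but not an ID.
        List<Member> view = Arrays.asList(
            new Member(UUID.randomUUID(), "capp+omsec19-p1lg.myhost.com+10.2.12.27"),
            new Member(UUID.randomUUID(), "capp+omsec18-p1lg.myhost.com+10.2.12.26"),
            new Member(UUID.randomUUID(), "capp+omsec19-p1lg.myhost.com+10.2.12.27"));
        System.out.println("Duplicate logical names: " + duplicateNames(view));
    }
}
```

A check like this on the logged names flags the restarted member, while comparing the IDs shows that the two entries are in fact distinct members.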

What can you do to prevent this?

Make the member leave gracefully (JChannel.disconnect()/close())
Install a shutdown hook which closes the channel (doesn't work with kill -9)
Wait before restarting the member, long enough for the old instance to be suspected and excluded first
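
The shutdown-hook option could look roughly like the sketch below. `DemoChannel` is a hypothetical stand-in for the real channel so that the example runs with the JDK alone; in an actual deployment the hook would call the channel's own `close()`. As noted above, the JVM does not run hooks on `kill -9`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the cluster channel (not a JGroups class).
final class DemoChannel implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    @Override
    public void close() {
        // Idempotent: only the first call performs the graceful leave.
        if (closed.compareAndSet(false, true)) {
            System.out.println("channel closed, leaving the cluster gracefully");
        }
    }

    boolean isClosed() {
        return closed.get();
    }
}

final class GracefulShutdown {
    // Registers a JVM shutdown hook that closes the channel and returns the hook thread.
    static Thread install(AutoCloseable channel) {
        Thread hook = new Thread(() -> {
            try {
                channel.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, "channel-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        DemoChannel channel = new DemoChannel();
        install(channel);
        System.out.println("running; the hook fires on normal exit or SIGTERM");
    }
}
```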


sonofsounds (Author) · 2024-10-15T13:02:42Z

Hello, @belaban! Thanks for your quick reply. Indeed, this is what happened (the member was killed with SIGKILL). I have two more questions, if you don't mind:

1: We are currently on Java 8 and plan to migrate soon to Java 21 with the latest JGroups version. Until then, can we use the old FD_SOCK together with FD_ALL?
2: I am fairly sure I saw the member stuck in the suspect state without GMS excluding it, so the view stayed the same (with duplicate entries) until we reset the coordinator. What would cause this?

Thanks again!


belaban (Maintainer) · 2024-10-15T13:25:34Z

Yes, but get rid of FD and use FD_ALL or FD_ALL2. The reason for members being stuck was probably FD, so using FD_ALLX should make this disappear.


sonofsounds (Author) · 2024-10-15T13:41:37Z

Hello, @belaban ! This happened with the following config:

<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="urn:org:jgroups"
        xsi:schemaLocation="http://www.jgroups.org/schema/jgroups.xsd">
    <TCP bind_port="${jgroups.tcp.bind_port}" sock_conn_timeout="300"/>
    <TCPGOSSIP initial_hosts="${jgroups.tcpgossip.initial_host:localhost[12001]}"/>
    <MERGE3/>
    <FD_ALL timeout="12000" interval="3000" timeout_check_interval="2000"/>
    <VERIFY_SUSPECT/>
    <pbcast.NAKACK2 use_mcast_xmit="false"/>
    <UNICAST3/>
    <pbcast.STABLE/>
    <pbcast.GMS/>
    <FRAG2/>
</config>

We already use FD_ALL without FD_SOCK (we will try deploying the apps with FD_SOCK as well). It seemed that VERIFY_SUSPECT was not able to pass the SUSPECT event up the protocol stack. The suspect got stuck in the DelayQueue and GMS did not exclude it, resulting in a view that never changed and still included duplicate addresses (see the example in the initial post).

protected final DelayQueue<Entry> suspects=new DelayQueue<>();
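
For context on what "stuck in the DelayQueue" could mean, the sketch below shows only the general semantics of `java.util.concurrent.DelayQueue`: an entry is handed out by `poll()` once its delay has expired, and it stays in the queue until some thread drains it. `SuspectEntry` and `DelayQueueDemo` are hypothetical names; this is not the VERIFY_SUSPECT implementation and makes no claim about how that protocol drains its queue.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical entry type: a suspected member's name plus its expiry time.
final class SuspectEntry implements Delayed {
    final String member;
    private final long expiresAtNanos;

    SuspectEntry(String member, long delayMillis) {
        this.member = member;
        this.expiresAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(expiresAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
}

final class DelayQueueDemo {
    // Returns the member of the first expired entry, or null if none has expired yet.
    static String pollExpired(DelayQueue<SuspectEntry> queue) {
        SuspectEntry entry = queue.poll();
        return entry == null ? null : entry.member;
    }

    public static void main(String[] args) {
        DelayQueue<SuspectEntry> suspects = new DelayQueue<>();
        suspects.add(new SuspectEntry("not-yet-expired", 60_000));
        suspects.add(new SuspectEntry("already-expired", 0));
        System.out.println("polled: " + pollExpired(suspects));
        System.out.println("still queued: " + suspects.size());
    }
}
```

If nothing ever polls the queue, even an expired entry remains visible in it, which is one way an entry could appear permanent when inspected from outside.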


belaban (Maintainer) · 2024-10-15T14:03:49Z

What do you mean by 'stuck' in the DelayQueue? Did you debug/stack trace this?

1: Can you reproduce this? Best with 5.x, as I don't really support 4.x anymore
2: If/when this happens, can you post a stack trace?


sonofsounds (Author) · 2024-10-15T14:08:16Z

Not really. I could not reproduce it in a debugger (I observed it via JMX), but here is a log from when this happened:

These are the timestamps from FD_ALL:

capp+OMSEC55-P1LG+10.12.60.12: 1 secs old
capp+omsec18-p1lg.myhost.com+10.2.12.26: 1 secs old
capp+omsec19-p1lg.myhost.com+10.2.12.27: 160437 secs old
capp+omsec42-p1lg.myhost.com+127.0.1.1: 2 secs old
capp+omsec7-p1lg.myhost.com+10.2.12.78: 5 secs old
capp+OMSEC58-P1LG+10.12.60.19: 0 secs old
capp+OMSEC37-P1LG+10.15.15.13: 0 secs old
masterapp+omsec1-p1lg+10.3.12.53: 1 secs old
capp+OMSEC51-P1LG+10.12.60.14: 0 secs old
capp+omsec27-p1lg.myhost.com+10.2.12.85: 4 secs old
capp+omsec6-p1lg.myhost.com+10.2.12.67: 1 secs old
capp+OMSEC59-P1LG+10.12.60.20: 0 secs old
capp+omsec24-p1lg.myhost.com+10.2.12.110: 1 secs old
capp+omsec33-p1lg.myhost.com+10.2.12.31: 0 secs old
capp+omsec13-p1lg.myhost.com+10.2.12.143: 0 secs old
capp+OMSEC64-P1LG+10.12.60.25: 0 secs old
capp+omsec34-p1lg.myhost.com+10.2.12.29: 1 secs old
capp+OMSEC40-P1LG+127.0.1.1: 1 secs old
capp+OMSEC56-P1LG+10.12.60.17: 2 secs old
capp+OMSEC45-P1LG+10.12.2.10: 4 secs old
capp+omsec10-p1lg.myhost.com+10.2.12.63: 1 secs old
capp+OMSEC46-P1LG+10.12.2.11: 0 secs old
capp+omsec20-p1lg.myhost.com+10.2.12.28: 1 secs old
capp+OMSEC44-P1LG+10.15.15.37: 1 secs old
capp+OMSEC39-P1LG+10.15.15.26: 1 secs old
capp+OMSEC54-P1LG+10.12.60.11: 1 secs old
capp+omseap-p1lg.myhost.com+10.2.12.92: 1 secs old
capp+omsec36-p1lg.myhost.com+10.2.12.41: 0 secs old
capp+omsec9-p1lg.myhost.com+10.2.12.102: 0 secs old
capp+omsec35-p1lg.myhost.com+10.2.12.33: 4 secs old
capp+omsec8-p1lg.myhost.com+10.2.12.140: 1 secs old
capp+omsec17-p1lg+10.2.12.82: 0 secs old
capp+OMSEC48-P1LG+10.12.2.13: 1 secs old
capp+omsec23-p1lg.myhost.com+10.2.12.109: 3 secs old
capp+OMSEC62-P1LG+10.12.60.23: 2 secs old
capp+OMSEC50-P1LG+10.12.60.13: 1 secs old
capp+omsec30-p1lg.myhost.com+10.2.12.86: 1 secs old
capp+OMSEC52-P1LG+10.12.60.15: 3 secs old
capp+omsec14-p1lg.myhost.com+10.2.12.79: 2 secs old
capp+OMSEC43-P1LG+10.15.15.36: 2 secs old
capp+OMSEC57-P1LG+10.12.60.18: 0 secs old
capp+omsec22-p1lg.myhost.com+10.2.12.72: 0 secs old
capp+OMSEC53-P1LG+10.12.60.10: 1 secs old
capp+omsec3-p1lg.myhost.com+10.2.12.137: 1 secs old
capp+omsec11-p1lg.myhost.com+10.2.12.141: 2 secs old
capp+OMSEC63-P1LG+10.12.60.24: 1 secs old
capp+omsec12-p1lg.myhost.com+10.2.12.142: 2 secs old
capp+omsec25-p1lg.myhost.com+10.2.12.83: 1 secs old
capp+OMSEC65-P1LG+10.12.60.26: 0 secs old
capp+OMSEC38-P1LG+127.0.1.1: 2 secs old
capp+omsec28-p1lg.myhost.com+10.2.12.144: 2 secs old
capp+omsec16-p1lg.myhost.com+10.2.12.81: 1 secs old
capp+omsec2-p1lg.myhost.com+10.2.12.98: 0 secs old
capp+omsec32-p1lg.myhost.com+10.2.12.88: 1 secs old
capp+OMSEC60-P1LG+127.0.1.1: 1 secs old
capp+omsec5-p1lg.myhost.com+10.2.12.77: 0 secs old
capp+OMSEC61-P1LG+10.12.60.22: 2 secs old
capp+omsec4-p1lg.myhost.com+10.2.12.32: 2 secs old
capp+omsec31-p1lg.myhost.com+10.2.12.87: 0 secs old
capp+OMSEC41-P1LG+10.15.15.30: 2 secs old
capp+omsec26-p1lg.myhost.com+10.2.12.84: 2 secs old
capp+omsec15-p1lg.myhost.com+10.2.12.80: 2 secs old
capp+OMSEC49-P1LG+10.12.2.14: 5 secs old

Note the entry capp+omsec19-p1lg.myhost.com+10.2.12.27: 160437 secs old. The timestamps should be cleared, that is, updated to match the members of the newly installed view, right? We will also try FD_SOCK paired with FD_ALL.


belaban (Maintainer) · 2024-10-15T14:13:25Z

Yes, FD_ALL should clear 10.2.12.27 as soon as it receives a view without 10.2.12.27.
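
The clearing behaviour described here can be pictured with a small JDK-only sketch. `HeartbeatTable` is a hypothetical class, not FD_ALL's code, and it assumes the table is keyed by member ID rather than by logical name. Under that assumption a stale entry is dropped only when a view arrives that no longer contains its ID.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical heartbeat table keyed by member ID (an assumption for illustration).
final class HeartbeatTable {
    private final Map<String, Long> timestamps = new ConcurrentHashMap<>();

    // Records the time of the latest heartbeat from a member.
    void heartbeat(String memberId, long nowMillis) {
        timestamps.put(memberId, nowMillis);
    }

    // On a new view, drops the entries of members that are no longer in it.
    void onViewChange(Set<String> members) {
        timestamps.keySet().retainAll(members);
    }

    // Returns a snapshot of the member IDs currently tracked.
    Set<String> tracked() {
        return new HashSet<>(timestamps.keySet());
    }

    public static void main(String[] args) {
        HeartbeatTable table = new HeartbeatTable();
        table.heartbeat("old-instance-id", 1L);
        table.heartbeat("new-instance-id", 2L);

        // A view that still lists the old ID keeps its stale timestamp alive.
        table.onViewChange(new HashSet<>(Arrays.asList("old-instance-id", "new-instance-id")));
        System.out.println("after duplicate view: " + table.tracked());

        // Only a view without the old ID removes the entry.
        table.onViewChange(new HashSet<>(Arrays.asList("new-instance-id")));
        System.out.println("after corrected view: " + table.tracked());
    }
}
```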


sonofsounds (Author) · 2024-10-15T14:14:26Z

Hello, @belaban! Yeah, but as you can see it was stuck for about 2 days (160437 secs). Switching to another coordinator resolved it, but that is only a workaround and does not address the underlying problem. Any idea how this could happen?

Thanks a lot! Really appreciate it.

