Duplicate addresses in view after upgrade #845
-
This is possible and can happen in the following case:
What can you do to prevent this?
-
Hello, @belaban! Thanks for your quick reply. Indeed, this is what happened (the member was killed with SIGKILL). I have two more questions, if you don't mind:
Thanks again!
-
Yes, but get rid of FD and use FD_ALL or FD_ALL2. The reason for members being stuck was probably FD, so switching to one of the FD_ALL variants should make this disappear.
-
Hello, @belaban! This happened with the following config:
We already use FD_ALL without FD_SOCK (we will try deploying the apps with FD_SOCK as well). It seemed that VERIFY_SUSPECT was not able to pass the SUSPECT event up the protocol stack. The suspect got stuck in the DelayQueue and GMS did not exclude it, so the view never changed and kept the duplicate addresses (see the example in the initial post).
-
What do you mean by 'stuck' in the DelayQueue? Did you debug this or take a stack trace? Can you reproduce it? Preferably with 5.x, as I don't really support 4.x anymore.
-
Not really. I could not reproduce it in a debugger (I observed it via JMX), but here is a log from when this happened:
These are the timestamps from FD_ALL:
You can see that capp+omsec19-p1lg.myhost.com+10.2.12.27 is 160437 secs old. The timestamps should be cleared, that is, updated with the members from the newly generated view, right? We will try using FD_SOCK paired with FD_ALL as well.
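To put the 160437 secs figure in perspective, the age can be converted to days and hours. This is only an arithmetic sketch using the standard java.time API. The class TimestampAge and the method describe are hypothetical names and are not part of JGroups.

```java
import java.time.Duration;

// Hypothetical helper (not a JGroups class): renders an age given in seconds
// as days, hours, minutes and seconds.
final class TimestampAge {

    static String describe(long ageSeconds) {
        Duration age = Duration.ofSeconds(ageSeconds);
        // toDays() is the whole-day count; the *Part() methods (Java 9+)
        // return the remainder within the next larger unit.
        return String.format("%dd %dh %dm %ds",
                age.toDays(), age.toHoursPart(), age.toMinutesPart(), age.toSecondsPart());
    }

    public static void main(String[] args) {
        // Age reported for capp+omsec19-p1lg.myhost.com+10.2.12.27
        System.out.println(describe(160437));
    }
}
```

For the value above this gives 1 day and roughly 20.5 hours, so the entry had not been refreshed for almost two days.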
-
Yes, FD_ALL should clear them.
-
Hello, @belaban! Yes, but as you can see, it was stuck for about 2 days. Switching to another coordinator cleared it, but that is only a workaround and does not solve the underlying problem. Any ideas how this could happen? Thanks a lot, I really appreciate it.
-
Hello, we recently upgraded JGroups from version 3.6.3.Final to 4.2.30.Final and noticed some issues related to view changes. For initial membership discovery, we are using TCPGOSSIP. Below are the two configuration files (old and new) used to set up the protocol stack.
New config used with JGroups 4.2.30.Final:
Old config used with JGroups 3.6.3.Final:
We have noticed that duplicate addresses appear in the view, and this seems to happen when certain members are disconnected consecutively. Below is a log where the issue can be observed:
2024-10-07 08:50:17,322 INFO domain.JGroupsMessageDispatcher [JGroupsChannelMediator-thread-2] - New view : [capp+OMSEC47-P1LG+10.12.2.12, capp+OMSEC57-P1LG+10.12.60.18, capp+omsec3-p1lg.myhost.com+10.2.12.137, capp+OMSEC50-P1LG+10.12.60.13, capp+omsec9-p1lg.myhost.com+10.2.12.102, capp+OMSEC37-P1LG+10.15.15.13, capp+omsec5-p1lg.myhost.com+10.2.12.77, capp+OMSEC40-P1LG+127.0.1.1, capp+OMSEC62-P1LG+10.12.60.23, capp+OMSEC56-P1LG+10.12.60.17, capp+omsec6-p1lg.myhost.com+10.2.12.67, masterapp+omsec1-p1lg+10.3.12.53, capp+OMSEC55-P1LG+10.12.60.12, capp+omsec18-p1lg.myhost.com+10.2.12.26, capp+omsec19-p1lg.myhost.com+10.2.12.27, capp+omsec42-p1lg.myhost.com+127.0.1.1, capp+omsec27-p1lg.myhost.com+10.2.12.85, capp+OMSEC59-P1LG+10.12.60.20, capp+omsec24-p1lg.myhost.com+10.2.12.110, capp+omsec33-p1lg.myhost.com+10.2.12.31, capp+omsec13-p1lg.myhost.com+10.2.12.143, capp+omsec34-p1lg.myhost.com+10.2.12.29, capp+omsec10-p1lg.myhost.com+10.2.12.63, capp+OMSEC46-P1LG+10.12.2.11, capp+omsec20-p1lg.myhost.com+10.2.12.28, capp+OMSEC39-P1LG+10.15.15.26, capp+OMSEC54-P1LG+10.12.60.11, capp+omseap-p1lg.myhost.com+10.2.12.92, capp+omsec36-p1lg.myhost.com+10.2.12.41, capp+omsec35-p1lg.myhost.com+10.2.12.33, capp+omsec8-p1lg.myhost.com+10.2.12.140, capp+OMSEC48-P1LG+10.12.2.13, capp+omsec23-p1lg.myhost.com+10.2.12.109, capp+OMSEC52-P1LG+10.12.60.15, capp+omsec30-p1lg.myhost.com+10.2.12.86, capp+omsec14-p1lg.myhost.com+10.2.12.79, capp+OMSEC43-P1LG+10.15.15.36, capp+OMSEC53-P1LG+10.12.60.10, capp+OMSEC63-P1LG+10.12.60.24, capp+omsec12-p1lg.myhost.com+10.2.12.142, capp+omsec28-p1lg.myhost.com+10.2.12.144, capp+omsec16-p1lg.myhost.com+10.2.12.81, capp+omsec2-p1lg.myhost.com+10.2.12.98, capp+omsec32-p1lg.myhost.com+10.2.12.88, capp+omsec4-p1lg.myhost.com+10.2.12.32, capp+omsec26-p1lg.myhost.com+10.2.12.84, capp+omsec31-p1lg.myhost.com+10.2.12.87, capp+OMSEC41-P1LG+10.15.15.30, capp+OMSEC64-P1LG+10.12.60.25, capp+omsec7-p1lg.myhost.com+10.2.12.78, capp+OMSEC44-P1LG+10.15.15.37, 
capp+OMSEC60-P1LG+127.0.1.1, capp+omsec25-p1lg.myhost.com+10.2.12.83, capp+omsec15-p1lg.myhost.com+10.2.12.80, capp+OMSEC58-P1LG+10.12.60.19, capp+OMSEC45-P1LG+10.12.2.10, capp+OMSEC38-P1LG+127.0.1.1, capp+omsec17-p1lg+10.2.12.82, capp+OMSEC49-P1LG+10.12.2.14, capp+omsec22-p1lg.myhost.com+10.2.12.72, capp+omsec11-p1lg.myhost.com+10.2.12.141, capp+OMSEC65-P1LG+10.12.60.26, capp+OMSEC51-P1LG+10.12.60.14, capp+omsec19-p1lg.myhost.com+10.2.12.27]
You can clearly see that capp+omsec19-p1lg.myhost.com+10.2.12.27 appears twice in the view. This happened after the server was restarted. It is worth mentioning that, after further investigation using JMX, we observed that in VERIFY_SUSPECT the coordinator has one of these two hosts in its suspects list, and it is never cleared (it stays there until a new coordinator takes over).
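With more than sixty members in one log line, a repeated entry is easy to miss by eye. A quick check is to split the view line and look for names that occur more than once. The following is a minimal sketch, not part of JGroups: the class ViewDuplicates and the method findDuplicates are hypothetical names, and it assumes the "New view : [a, b, ...]" log format shown in this post. It compares the printed names only, so it cannot tell whether the two entries are backed by the same underlying address.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a JGroups class): finds member names that are
// printed more than once in a "New view : [a, b, ...]" log line.
final class ViewDuplicates {

    static List<String> findDuplicates(String viewLine) {
        // Skip earlier brackets such as the thread name by starting the
        // search at the "New view" marker when it is present.
        int marker = viewLine.indexOf("New view");
        int open = viewLine.indexOf('[', Math.max(marker, 0));
        int close = viewLine.lastIndexOf(']');
        if (open < 0 || close <= open) {
            return List.of(); // no member list found
        }
        Set<String> seen = new LinkedHashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String raw : viewLine.substring(open + 1, close).split(",")) {
            String member = raw.trim();
            if (member.isEmpty()) {
                continue; // tolerate stray separators or line breaks
            }
            if (!seen.add(member)) {
                duplicates.add(member); // name was already seen once
            }
        }
        return new ArrayList<>(duplicates);
    }

    public static void main(String[] args) {
        // Shortened version of the view from the log above.
        String line = "INFO domain.JGroupsMessageDispatcher [JGroupsChannelMediator-thread-2] - New view : ["
                + "capp+omsec18-p1lg.myhost.com+10.2.12.26, "
                + "capp+omsec19-p1lg.myhost.com+10.2.12.27, "
                + "capp+OMSEC51-P1LG+10.12.60.14, "
                + "capp+omsec19-p1lg.myhost.com+10.2.12.27]";
        System.out.println(findDuplicates(line));
    }
}
```

Running the same check on each "New view" line in the log would show which view first contained the repeated name.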
What is the cause of this? Any input here, @belaban?
Thanks in advance!