All peers become "unestablished" #515
Is this reproducible? And what's the minimum number of peers for which you see this happening? |
I did create an ansible script and can easily reproduce this. From trying different options to find a fix, I'd say it seems fairly brittle and easy to repro. And the number of peers required doesn't seem fixed; I've had it happen with as few as 6 and as many as 12... The typical operations I take are:
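Roughly this shape (a simplified sketch; the exact steps are in the ansible playbook, and the IPs, password variable, sleeps and consul image below are placeholders):

```sh
# on a new instance: start the router (encryption enabled)
weave launch -password "$WEAVE_PASSWORD"
sleep 15                                # give the router a moment

# on each existing instance: tell its router about the new peer
weave connect $NEW_INSTANCE_IP

# back on the new instance: start consul inside weave
weave run 10.2.1.$N/24 --name consul progrium/consul

# removing a node and adding it back is just
weave stop
sleep 30                                # sometimes longer
weave launch -password "$WEAVE_PASSWORD"
```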
We do spin up consul inside of weave; from the weave blog it seems like running consul in weave shouldn't be an issue... Also, adding more "sleep" into the operations seems to help... but not always... |
There also seems to be a correlation with removing a node and adding it back in. I've had some success when I've stopped weave on an instance for a little over a minute and then re-launched it; but with even 30 seconds between stopping and starting an instance (a network outage could easily be shorter than that) everything breaks down. Also, this is the output from
I would have expected at least one instance to have been blacklisted, but not everything to lock up like this. Also, any insights on how to recover from this kind of issue, other than going around to servers and stopping weave one by one until it recovers, would be appreciated. I've tried a couple of other options like:
But neither seems to do anything... |
Any smaller number than 6?
Does the problem occur without encryption?
The two are equivalent. In your situation I'd go for the former, since then you don't need the 'sleep'.
The main purpose of `connect` is to tell an already-running weave about peers that weren't known at launch time. Suggestions welcome on what the docs for connect should say to make this clearer.
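For reference, the two patterns look roughly like this (host addresses are placeholders; this assumes the choice is between passing peers at launch versus adding them afterwards with connect):

```sh
# former: give the new peer the addresses of existing peers at launch time
weave launch -password "$WEAVE_PASSWORD" 10.0.0.11 10.0.0.12

# latter: launch first, then point the running router at peers later
weave launch -password "$WEAVE_PASSWORD"
weave connect 10.0.0.11
```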
The use of consul here should be incidental. Indeed the problem should still arise without running any application containers. Would be nice to confirm that one way or the other. More generally, it would be great if you provided exact steps that would allow us to reproduce the problem. Even if they contain steps like "do xyz a few times, until it breaks". |
I've had much more success when I've disabled the encryption. |
"much more" == cannot get it to break at all? Note that as well as the log entry you mentioned, one of the stack traces shows a connection being stuck in the decryption code. That doesn't necessarily mean that the crypto code is faulty; it could simply be a consequence of subtly altered timing. Nevertheless, knowing whether the problem occurs at all w/o crypto would be incredibly useful. Similarly, as I mentioned it would be good to know whether the problem also arises w/o any application containers.
You really shouldn't have to do that. So, taking all the above into account, does the problem arise when you run
on one node and then repeatedly run
across, say, 8 nodes, and check the connectivity status by running `weave status` (see the sketch below)? And if you cannot get it to break that way, try with encryption enabled. |
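In other words, roughly this (with $FIRST_NODE standing in for the first node's address):

```sh
# on the first node
weave launch

# on each of the other ~8 nodes
weave launch $FIRST_NODE

# then, on any node, check connectivity
weave status
```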
It would help a lot if you could share this, feel free to email help@weave.works. |
Sorry, was busy yesterday and wasn't able to follow back here: @rade: regarding:
I only tried it twice, so hardly exhaustive, but in those attempts it didn't break, and I was able to significantly reduce the wait times I had put in without issue. Unfortunately insecure transmission of data isn't really an option for us. I don't really have an active cluster I can easily test with, so it isn't that easy to break this process down to the simplest steps and get something easy to reproduce. @errordeveloper: regarding the ansible scripts, I'll try to clean things up a bit in my scripts and send them to your email to see if they help inform... |
Also, it didn't seem that we had as much of an issue back on the "git-066d8001dd6d" image. Though at the time I couldn't say for sure that it wasn't just coincidence. But maybe something changed since then that would be a good place to start? |
@thomascramer I've just published new images (git-59ae50eb4ef9). Please give these a whirl by grabbing the latest weave script and running |
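Roughly along these lines (the exact download URL may differ from the one shown here):

```sh
sudo curl -L https://raw.githubusercontent.com/weaveworks/weave/master/weave \
    -o /usr/local/bin/weave
sudo chmod a+x /usr/local/bin/weave
weave launch   # re-launch; this should pull the matching image if it isn't present
```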
I've run through my stuff a couple of times with the git-59ae50eb4ef9 images, and it seems to be working like a champ. Looking at the weave script, but wanted to confirm: if I wanted to keep to that version, the best way to "pin" it is to |
Yes, either that or invoking weave as |
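A sketch of both options (this assumes the script takes its image tag from a variable that can be overridden from the environment, e.g. WEAVE_VERSION; check the top of your copy of the script for the actual name):

```sh
# option 1: edit the version variable near the top of /usr/local/bin/weave

# option 2: override it per invocation
WEAVE_VERSION=git-59ae50eb4ef9 weave launch
```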
May have spoken too soon... I've been able to repro in another environment; double-checking my setup here... |
And did you? |
Yes, it seems like I'm still seeing some issues in my dev environment, though I'm not able to repro in the test environment I was using earlier. It does seem better since that fix, and a simple combo of consul and weave starts up just fine... The best explanation I've come up with so far is that when I activate a weave instance on a server that already has elasticsearch on it and it starts to replicate a lot of shards, everything again starts to shut down... This is proving hard to replicate in a test environment... Not sure if it helps, but I do see a lot of errors from:
And I have confirmed that all my instances are using the new image... |
That may be a different problem. The symptoms of the error that we fixed are |
plus
do point to this possibly being a simple load-induced problem, i.e. heartbeats go missing due to high system load and hence weave thinks the associated connections are broken and will attempt to re-establish them. weave should recover from that once the load subsides. Either way though, it would be a separate issue. |
Yes, once I start to see a couple of instances show up as "unestablished" they all go "unestablished" quite rapidly, and it seems to stay that way unless I start to pull nodes out by manually going around and doing weave stop on some of the nodes. |
ok. how easy is this to reproduce in your dev env? and could we get access to that? |
Actually, let me refine the symptoms of what we fixed... you also shouldn't see many reconnects at the end of `weave status`. If you do see lots of reconnects listed then that again would be consistent with just seeing a load-induced connectivity breakage. |
It is rather easy for me to repro in my dev env. This is my current status output:
|
well, that does show all the symptoms :( Can you post the complete logs for all three nodes somewhere? |
I meant at least three, including the node from which you got the above status. |
I am curious whether the logs still show connection attempts going on. Or whether basically everything has ground to a halt. |
They say they are attempting... But yes, I can dump them in an s3 bucket; can you email some sort of account name or something I can grant access to? I think you were on the earlier help@weave email chain... |
So you see a fairly continuous stream of connection attempts in the logs? re dumping the log files... please put them somewhere and email help@weave.works with the details. |
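Something like this on each node should capture them (assuming the router is running in a container named weave, which is the default):

```sh
docker logs weave > weave-$(hostname).log 2>&1
```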
The logs do indeed contain a continuous stream of connection attempts. Looks like UDP connectivity between peers is very patchy. This could be due to the elasticsearch replication load. Does that load ever subside? I would expect connectivity to recover then. |
incidentally, when running with encryption weave has a lower tolerance for UDP packet loss. |
Well, there isn't that much load per se, and the load does go away once the weave network breaks down... obviously there is no way for it to replicate if there is no route to host. Also, sometimes I can add the instances just fine, so maybe it is just a coincidence. Initially when I saw these issues I had seen some recovery after several hours, but I haven't had the luxury of sitting and waiting for it to come back to life. Shutting down weave on a couple of nodes does seem to help it recover. Also, slowly doing a rolling restart of weave (adding a lot of sleep), roughly as sketched below, seems to help, but this typically requires bringing the weave service down for about a minute or so, bringing it back up, and then waiting another minute or so before starting up consul. I haven't tried with the current image, but with the previous one it didn't seem as problematic without the encryption. However, one of our main interest points is using weave with the encryption. |
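The rolling restart is roughly this (HOSTS, the peer list and the consul start-up are placeholders for what the ansible playbook actually does):

```sh
for host in $HOSTS; do
    ssh "$host" weave stop
    sleep 60                              # leave the node down for a while
    ssh "$host" weave launch -password "$WEAVE_PASSWORD" $PEERS
    sleep 60                              # let the router settle before consul
    ssh "$host" docker start consul       # placeholder for restarting consul
done
```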
Reproduced.
And if I run |
With NUM_WEAVES=7 I saw a few unestablished connections that hung around for a while but then disappeared. With 10 the above reproduces quite reliably. YMMV depending on hardware spec. |
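The loop I'm running is roughly this shape (host names are placeholders for however the peers are provisioned):

```sh
NUM_WEAVES=10
# launch an encrypted router on each host, all pointing at the first one
for i in $(seq 1 "$NUM_WEAVES"); do
    ssh "host$i" weave launch -password secret host1
done
# then watch connectivity from the first host
ssh host1 weave status
```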
My initial suspicion was that crypto loses sync due to excessive packet loss. But it's hard to see how that would happen with just four peers and no traffic. (4 peers is where I see the first crypto errors appearing). Also, the first error I am seeing is usually "Unable to decrypt UDP packet", and with some debugging added I can see that the offset is 0. Odds are that this really is genuinely the first packet. So why is the decryption failing? |
Here's what I think is happening:
The first few times this happens, the result is an 'unable to decrypt UDP packet' error. After a while, the same flaw causes the decryptor to stall, expecting a new nonce to come in. It's only when the other side times out and re-starts the connection that a nonce does come in, which lets the code do a few more things. But fundamentally it's all stymied at this point. |
What is interesting about the issue is that it would be one thing if a new server, X, came up and registered with servers A, B, and C, and for whatever reason X was causing timeout issues with A, B, and C. However, it seems like this issue starts to spread, and suddenly A, B, and C, which were connected and communicating just fine, start to "time out" and basically become unestablished to each other... I'm not as familiar with go, though I have been looking at your actor setup; but I'm wondering whether, when an actor is dealing with the "issues" from one "bad egg", that could in fact hinder its ability to keep up with the other requests it is processing? |
Also, this may be a dumb question, but is it possible to force the communication to be only TCP based? I don't see an obvious option for that, but it does look like it tries different options when communicating. Or would working solely via TCP be too much overhead? |
Big improvement in observed behaviour. LGTM; closes #515
@thomascramer I've just published another new set of images (git-36b13b704df4), which should hopefully fix the problem. As before, please grab the latest weave script and run |
@rade, Thanks, I've been putting it through the paces here today, and so far so good. |
I'm not sure if it is the current weave image that was pushed out last week, or testing with more instances, or what; but essentially at some point, when spinning up weave instances or restarting them (i.e. weave stop, let it be dead for a while, then spin it up again; basically testing a rolling update, network interruption, system restart, or what have you), the weave peers start having connectivity issues and eventually basically all weave peers across all weave instances show up as "unestablished." Containers connected via the weave instances also aren't able to route traffic while this is happening... Sometimes waiting a couple of hours resolves the issue, but often it seems the only way to recover is to just shut down instances until things are happy again. We are very eager to utilize weave, but we can't proceed with weave in this state...
I'm currently on version:
I'm also working on EC2 instances inside a VPC, and all EC2 instances are in the same region and shouldn't have issues communicating with one another... We are currently trying to wire up 12 instances together (I have set -connlimit 50);
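Each instance's router is launched along these lines (the peer IPs and password variable below are placeholders):

```sh
weave launch -connlimit 50 -password "$WEAVE_PASSWORD" \
    10.0.1.11 10.0.1.12 10.0.1.13   # ...and the remaining peer IPs
```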
The last time it happened, on one of my servers, I saw
Looks like it was able to connect to the new server "demo-xxx-instance1" then afterwards machines started registering "Received packet for unknown destination"
On the first server that seemingly had issues, dev-xxx-instance1, its logs have:
I also grabbed the dump from doing SIGQUIT:
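(For reference, the dump was captured roughly like this, assuming the router runs in a container named weave; SIGQUIT makes the Go runtime print the goroutine stacks and then exit:)

```sh
docker kill --signal=QUIT weave   # trigger the goroutine dump (terminates the router)
docker logs weave                 # the stack traces end up in the container log
```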