
Reliability of client connectivity #597

Closed
2mike2 opened this issue Jul 12, 2015 · 5 comments

@2mike2

2mike2 commented Jul 12, 2015

Hi,
I have been testing our solution, based on Orleans runtime 1.0.8, in production. We are talking about an Azure deployment of 2 medium silos as worker roles with the default cluster membership configuration, and web roles acting as grain clients in a separate cloud service deployment that lives in the same VNet as the silos and can therefore communicate with them. First of all, the tests show very good results, great work guys! During the tests the process of one of the silos got restarted: it "committed suicide" after being identified as dead by the membership protocol. According to the logs from the kamikaze silo, the whole episode took 90 seconds. During that time the second silo was alive and continued to process messages. However, during that time our web roles (the grain clients) experienced timeouts while issuing grain requests. I am aware of the timing aspects of how the membership protocol detects and handles failures.
My questions are:

  1. Will 3 silos instead of 2 provide a more robust "suspect dead voting" mechanism in such a case, resulting in a quicker "dead" vote and minimizing the effect on the clients?
  2. Do you have any other suggestions on how to treat clients in such a case?

Thanks,

I am attaching a snippet from the silo logs covering the period in question:

silo 0 (10.0.0.4:11111):

[2015-07-09 20:41:17.315 GMT 27 WARNING 101022 Runtime.Messaging.SiloMessageSender.PingSender/1 10.0.0.4:11111] Exception sending message to S10.0.0.6:11111:174157512. Message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1022416: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg) Exc level 0: System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg)
[2015-07-09 20:41:17.330 GMT 25 WARNING 100613 MembershipOracle 10.0.0.4:11111] -Did not get ping response for ping #1305 from S10.0.0.6:11111:174157512. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.4:11111:174157138 is rejecting message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1022416: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Retry count exceeded. Msg is: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1022416: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping()
[2015-07-09 20:41:37.256 GMT 22 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:13.5172130.
[2015-07-09 20:41:37.412 GMT 23 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:09.0583008.
[2015-07-09 20:41:37.818 GMT 23 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:09.0607429.
[2015-07-09 20:41:37.834 GMT 23 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:09.0607432.
[2015-07-09 20:41:38.131 GMT 21 WARNING 100815 AzureTableDataManager-GrainStateEntity 10.0.0.4:11111] Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:09.0607899.
[2015-07-09 20:41:41.026 GMT 23 WARNING 100613 MembershipOracle 10.0.0.4:11111] -Did not get ping response for ping #1306 from S10.0.0.6:11111:174157512. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.4:11111:174157138 is rejecting message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1022722: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Exception getting a sending socket to endpoint S10.0.0.6:11111:174157512
[2015-07-09 20:41:41.026 GMT 23 WARNING 100613 MembershipOracle 10.0.0.4:11111] -Did not get ping response for ping #1307 from S10.0.0.6:11111:174157512. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.4:11111:174157138 is rejecting message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1023002: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Recent (00:00:00 ago, at 2015-07-09 20:41:41.026 GMT) connection failure trying to reach target silo S10.0.0.6:11111:174157512. Going to drop Request msg 1023002 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
[2015-07-09 20:41:41.057 GMT 23 WARNING 100613 MembershipOracle 10.0.0.4:11111] -Did not get ping response for ping #1308 from S10.0.0.6:11111:174157512. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.4:11111:174157138 is rejecting message: Request S10.0.0.4:11111:174157138MembershipOracle@S0000000f->S10.0.0.6:11111:174157512MembershipOracle@S0000000f #1023331: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Recent (00:00:00 ago, at 2015-07-09 20:41:41.026 GMT) connection failure trying to reach target silo S10.0.0.6:11111:174157512. Going to drop Request msg 1023331 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
[2015-07-09 20:41:41.057 GMT 25 WARNING 100628 MembershipOracle 10.0.0.4:11111] I should be Dead according to membership table (in TryToSuspectOrKill): entry = SiloAddress=S10.0.0.4:11111:174157138 Status=Dead.
[2015-07-09 20:41:41.073 GMT 25 ERROR 100627 MembershipOracle 10.0.0.4:11111] !!!!!!!!!! I have been told I am dead, so this silo will commit suicide! I should be Dead according to membership table (in TryToSuspectOrKill): entry = SiloAddress=S10.0.0.4:11111:174157138 Status=Dead.
[2015-07-09 20:41:41.073 GMT 25 ERROR 100627 MembershipOracle 10.0.0.4:11111] !!!!!!!!!! INTERNAL FAILURE! About to crash! Fail message is: I have been told I am dead, so this silo will commit suicide! I should be Dead according to membership table (in TryToSuspectOrKill): entry = SiloAddress=S10.0.0.4:11111:174157138 Status=Dead.
[2015-07-09 20:41:42.562 GMT 25 ERROR 100001 MembershipOracle 10.0.0.4:11111] !!!!!!!!!! INTERNAL FAILURE! Application mini-dump written to file E:\approot\Orleans-MiniDump-2015-07-09-20-41-41-088Z.dmp
[2015-07-09 20:41:42.656 GMT 25 ERROR 100002 MembershipOracle 10.0.0.4:11111] !!!!!!!!!! INTERNAL FAILURE! Process crashing!
.
.

[2015-07-09 20:42:39.491 GMT 18 INFO 100406 OrleansSiloHost ] Setting Deployment Id to XXXXXXXXXXX and data connection string to DefaultEndpointsProtocol=https;AccountName=XXXXXXXXXsilos;AccountKey=<--SNIP-->
[2015-07-09 20:42:39.506 GMT 18 INFO 100403 OrleansSiloHost ] Initializing Silo YYYYYYYYYYYY on host=RD000D3A607B7E CPU count=2 running .NET version='4.0.30319.34014' Is .NET 4.5=True OS version='Microsoft Windows NT 6.2.9200.0'
[2015-07-09 20:42:39.506 GMT 18 INFO 100404 OrleansSiloHost ] Silo running with GC settings: ServerGC=False GCLatencyMode=Interactive
[2015-07-09 20:42:39.506 GMT 18 WARNING 100405 OrleansSiloHost ] Note: Silo not running with ServerGC turned on - recommend checking app config : -- and --
[2015-07-09 20:42:46.387 GMT 18 INFO 100000 MembershipFactory 10.0.0.4:11111] Creating membership oracle for type=AzureTable
[2015-07-09 20:42:46.418 GMT 16 INFO 100801 OrleansSiloInstanceManager 10.0.0.4:11111] Attached to Azure storage table OrleansSiloInstances
[2015-07-09 20:42:46.465 GMT 12 INFO 100836 OrleansSiloInstanceManager 10.0.0.4:11111] Initialized schema for Azure storage table OrleansSiloInstances

silo 1 (10.0.0.6:11111):

[2015-07-09 20:41:11.509 GMT 26 WARNING 101022 Runtime.Messaging.SiloMessageSender.PingSender/1 10.0.0.6:11111] Exception sending message to S10.0.0.4:11111:174157138. Message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #925887: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg) Exc level 0: System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg)
[2015-07-09 20:41:11.509 GMT 27 WARNING 101022 Runtime.Messaging.SiloMessageSender.SystemSender/2 10.0.0.6:11111] Exception sending message to S10.0.0.4:11111:174157138. Message: Request S10.0.0.6:11111:174157512Catalog@S0000000e->S10.0.0.4:11111:174157138DirectoryService@S0000000a #926152: Orleans.Runtime.Orleans.Runtime.IRemoteGrainDirectory:LookUp(). System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg) Exc level 0: System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host at Orleans.Messaging.OutgoingMessageSender.Process(Message msg)
[2015-07-09 20:41:11.525 GMT 22 WARNING 100613 MembershipOracle 10.0.0.6:11111] -Did not get ping response for ping #1293 from S10.0.0.4:11111:174157138. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.6:11111:174157512 is rejecting message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #925887: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Retry count exceeded. Msg is: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #925887: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping()
[2015-07-09 20:41:34.788 GMT 26 WARNING 101021 Runtime.Messaging.SiloMessageSender.PingSender/1 10.0.0.6:11111] Exception getting a sending socket to endpoint S10.0.0.4:11111:174157138 Exc level 0: System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.0.0.4:11111 at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress) at System.Net.Sockets.Socket.Connect(EndPoint remoteEP) at Orleans.Runtime.SocketManager.SendingSocketCreator(IPEndPoint target) at Orleans.Runtime.LRU`2.Get(TKey key) at Orleans.Runtime.Messaging.SiloMessageSender.GetSendingSocket(Message msg, Socket& socket, SiloAddress& targetSilo, String& error)
[2015-07-09 20:41:34.788 GMT 24 WARNING 100613 MembershipOracle 10.0.0.6:11111] -Did not get ping response for ping #1294 from S10.0.0.4:11111:174157138. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.6:11111:174157512 is rejecting message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #926185: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Exception getting a sending socket to endpoint S10.0.0.4:11111:174157138
[2015-07-09 20:41:34.788 GMT 24 WARNING 100613 MembershipOracle 10.0.0.6:11111] -Did not get ping response for ping #1295 from S10.0.0.4:11111:174157138. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.6:11111:174157512 is rejecting message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #926518: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Recent (00:00:00 ago, at 2015-07-09 20:41:34.788 GMT) connection failure trying to reach target silo S10.0.0.4:11111:174157138. Going to drop Request msg 926518 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
[2015-07-09 20:41:34.820 GMT 24 WARNING 100613 MembershipOracle 10.0.0.6:11111] -Did not get ping response for ping #1296 from S10.0.0.4:11111:174157138. Reason = Original Exc Type: Orleans.Runtime.OrleansException Message:Silo S10.0.0.6:11111:174157512 is rejecting message: Request S10.0.0.6:11111:174157512MembershipOracle@S0000000f->S10.0.0.4:11111:174157138MembershipOracle@S0000000f #926822: Orleans.Runtime.Orleans.Runtime.IMembershipService:Ping(). Reason = Recent (00:00:00 ago, at 2015-07-09 20:41:34.788 GMT) connection failure trying to reach target silo S10.0.0.4:11111:174157138. Going to drop Request msg 926822 without sending. CONNECTION_RETRY_DELAY = 00:00:01.
[2015-07-09 20:41:34.835 GMT 22 INFO 100611 MembershipOracle 10.0.0.6:11111] -Going to mark silo S10.0.0.4:11111:174157138 as DEAD in the table #1. I am the last voter: #freshVotes=0, myVoteIndex = -1, NumVotesForDeathDeclaration=2 , #activeSilos=2, suspect list=[]
[2015-07-09 20:41:34.866 GMT 22 INFO 100611 MembershipOracle 10.0.0.6:11111] -Going to mark silo S10.0.0.4:11111:174157138 as DEAD in the table #1. I am the last voter: #freshVotes=0, myVoteIndex = -1, NumVotesForDeathDeclaration=2 , #activeSilos=2, suspect list=[]
[2015-07-09 20:41:34.898 GMT 24 INFO 103004 VirtualBucketsRingProvider 10.0.0.6:11111] Removed Server S10.0.0.4:11111:174157138/x02A79268. Current view: [S10.0.0.6:11111:174157512 -> <MultiRange: Size=x100000000, %Ring=100.000%>]
[2015-07-09 20:41:34.898 GMT 24 INFO 103005 VirtualBucketsRingProvider 10.0.0.6:11111] -NotifyLocalRangeSubscribers about old <MultiRange: Size=x8C57D355, %Ring=54.822%> new <MultiRange: Size=x100000000, %Ring=100.000%> increased? True
[2015-07-09 20:41:34.913 GMT 21 INFO 100524 Catalog 10.0.0.6:11111] Catalog is deactivating 814 activations due to a failure of silo S10.0.0.4:11111:174157138/x02A79268, since it is a primary directory partiton to these grain ids.
[2015-07-09 20:41:34.929 GMT 20 INFO 602920 ReminderService 10.0.0.6:11111] Log code 102920 occurred 25 additional times in the previous 00:06:12.8133786
[2015-07-09 20:41:34.929 GMT 20 INFO 102934 ReminderService 10.0.0.6:11111] My range changed from <MultiRange: Size=x8C57D355, %Ring=54.822%> to <MultiRange: Size=x100000000, %Ring=100.000%> increased = True
[2015-07-09 20:41:35.007 GMT 24 INFO 100612 MembershipOracle 10.0.0.6:11111] Will watch (actively ping) 0 silos: []
[2015-07-09 20:41:35.007 GMT 24 WARNING 100824 AzureSiloMembershipTable 10.0.0.6:11111] Update failed due to contention on the table. Will retry. Entry SiloAddress=S10.0.0.4:11111:174157138 Status=Dead, eTag W/"datetime'2015-07-09T20%3A39%3A10.7941473Z'", table version = <21, W/"datetime'2015-07-09T17%3A05%3A21.6691694Z'">
[2015-07-09 20:41:35.007 GMT 24 INFO 100629 MembershipOracle 10.0.0.6:11111] -Failed to update S10.0.0.4:11111:174157138 status to Dead in the Membership table, due to write conflicts. Will retry.
[2015-07-09 20:42:53.937 GMT 24 INFO 103003 VirtualBucketsRingProvider 10.0.0.6:11111] Added Server S10.0.0.4:11111:174170562/x9EF3654D. Current view: [S10.0.0.6:11111:174157512 -> <MultiRange: Size=x694D75D8, %Ring=41.134%>, S10.0.0.4:11111:174170562 -> <MultiRange: Size=x96B28A28, %Ring=58.866%>]
[2015-07-09 20:42:53.937 GMT 24 INFO 103005 VirtualBucketsRingProvider 10.0.0.6:11111] -NotifyLocalRangeSubscribers about old <MultiRange: Size=x100000000, %Ring=100.000%> new <MultiRange: Size=x694D75D8, %Ring=41.134%> increased? True
[2015-07-09 20:42:53.937 GMT 24 INFO 100612 MembershipOracle 10.0.0.6:11111] Will watch (actively ping) 1 silos: [S10.0.0.4:11111:174170562]
[2015-07-09 20:42:53.937 GMT 24 INFO 100645 MembershipOracle 10.0.0.6:11111] -ReadAll (called from gossip, after local view changed, with removed duplicate deads) Membership table: 3 silos, 2 are Active, 1 are Dead, Version=<23, W/"datetime'2015-07-09T20%3A42%3A53.4760565Z'">. All silos: [SiloAddress=S10.0.0.4:11111:174157138 Status=Dead, SiloAddress=S10.0.0.4:11111:174170562 Status=Active, SiloAddress=S10.0.0.6:11111:174157512 Status=Active]

@sergeybykov
Contributor

Something is fishy here. It appears as if the two silos lost connectivity with each other. At the same time, silo 0 was reporting slow access to Azure Table:

Slow access to Azure Table OrleansGrainState for UpdateTableEntryAsync, which took 00:00:13.5172130

13 seconds for a storage operation is a LOT. This was followed by several 9-second-long operations, which may explain why the whole process took 90 seconds. The default config, which in my opinion is too lenient, is 3 consecutively failed pings 10 seconds apart. This adds up to 30 seconds before voting an unresponsive silo dead.
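
For reference, a minimal sketch of where those liveness knobs live when configuring a silo programmatically in Orleans 1.x. The member names below (ClusterConfiguration.Globals.ProbeTimeout, NumMissedProbesLimit, NumVotesForDeathDeclaration) are my reading of the 1.x API and should be verified against the exact Orleans version in use; the same settings can also be specified in the Liveness section of the silo XML configuration.

```csharp
using System;
using Orleans.Runtime.Configuration;

public static class LivenessTuningSketch
{
    // Tightens failure detection: the defaults are roughly a 10-second probe timeout and
    // 3 missed probes before suspecting a silo, i.e. ~30 seconds to declare it dead.
    public static ClusterConfiguration Tighten(ClusterConfiguration config)
    {
        config.Globals.ProbeTimeout = TimeSpan.FromSeconds(5);  // probe more often
        config.Globals.NumMissedProbesLimit = 2;                // suspect after fewer missed probes
        config.Globals.NumVotesForDeathDeclaration = 2;         // the default; 2 votes suffice in a 2-3 silo cluster
        return config;
    }
}
```

Note that tighter settings detect failures faster but also make false suicides more likely when storage or the network is slow, as it was here.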

If you share full silo and client logs, we may potentially spot additional clues to what was going on there.

@gabikliot
Contributor

Thank you for your kind words!

Regarding the problem:
I agree with Sergey: the silos suddenly lost connectivity with each other, and on top of that there was very slow access to the Azure table. It would be interesting to understand what happened.

Regardless of why they lost connectivity, and despite the Azure table slowness, everything in the logs you sent actually worked as I would expect.
The networking problems started at 41:11 on silo 1 (10.0.0.6). It took the silos about 25 seconds to vote one of them dead: at 41:42 silo 0 (10.0.0.4) killed itself, about 25 seconds after it saw its first error (41:17).
That is 25-30 seconds, as expected given the default membership failure detection settings.

It took silo 0 about a minute to restart. I don't know who restarted it or why it took a minute, but that is outside Orleans's scope. Right after it restarted, it rejoined the cluster.
No problems so far.

As for the client: you did not provide the client log, but the client should not see errors beyond the point at which all silos have agreed on the dead silo. The client actually stops using the bad silo immediately if its socket is broken. It may still see some errors if it routes through another healthy silo that does not yet know about the dead silo, but once all silos agree, all errors should stop.
Therefore, my expectation based on your logs is that the client sees up to 30 seconds of errors, which is expected given the membership failure detection settings.
Do you see anything different?
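
As a side note, one way to hide most of that 25-30 second window from end users is a caller-side retry around grain calls. This is not an Orleans feature, just a minimal sketch; the wrapper name and retry parameters are illustrative:

```csharp
using System;
using System.Threading.Tasks;

public static class GrainCallRetrySketch
{
    // Retries a timed-out grain call a few times with a short delay, enough to span
    // the failure detection window discussed above. All parameters are example values.
    public static async Task<T> CallWithRetry<T>(Func<Task<T>> grainCall,
                                                 int maxAttempts = 6,
                                                 int delaySeconds = 5)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await grainCall(); // e.g. () => someGrain.DoSomething()
            }
            catch (TimeoutException)
            {
                if (attempt >= maxAttempts) throw; // give up after roughly maxAttempts * delaySeconds seconds
                await Task.Delay(TimeSpan.FromSeconds(delaySeconds));
            }
        }
    }
}
```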

As for your questions:

  1. Will 3 silos instead of 2 provide a more robust "suspect dead voting" mechanism in such a case, resulting in a quicker "dead" vote and minimizing the effect on the clients?

No. As I said, I don't see any problem in the current situation. 3 silos will not speed it up.

  2. Do you have any other suggestions on how to treat clients in such a case?

If you want to test this scenario, measure the time from the first timeout to the last and check how long it takes. You may see the last timeout 60 seconds after the first one in your setup, since the first message is lost at time 0, the last message is lost at time 30 seconds, and it takes a further 30 seconds for that one to time out. So you may see a 60-second span from first error to last, but you will only lose messages during a 30-second window.
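
To make that measurement concrete, here is a minimal, hypothetical test-harness sketch (not from this thread) that issues a call once per second and records the first and last failure timestamps:

```csharp
using System;
using System.Threading.Tasks;

public static class ErrorWindowProbeSketch
{
    // Calls the supplied grain operation once per second for the duration of the test
    // and reports the span between the first and the last observed failure.
    public static async Task Measure(Func<Task> grainCall, TimeSpan testDuration)
    {
        DateTime? firstError = null, lastError = null;
        var end = DateTime.UtcNow + testDuration;
        while (DateTime.UtcNow < end)
        {
            try
            {
                await grainCall();
            }
            catch (Exception)
            {
                var now = DateTime.UtcNow;
                if (firstError == null) firstError = now;
                lastError = now;
            }
            await Task.Delay(TimeSpan.FromSeconds(1));
        }
        if (firstError != null)
            Console.WriteLine("First error at {0:o}, last error at {1:o}, span {2}",
                              firstError.Value, lastError.Value, lastError.Value - firstError.Value);
    }
}
```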

@2mike2
Author

2mike2 commented Jul 14, 2015

Hi guys, thanks for the details. I've checked my web clients' logs and see that the timeouts started at 2015-07-09 20:41:00.640 and ended at 2015-07-09 20:42:30.157. My grain client timeouts were configured to 5 seconds rather than the default 30 seconds. Sadly, I can't find grain client logs from the incident, probably because I didn't customize the tracing, and my Azure web role clients failed to write logs due to disk access permissions. I'll be more prepared next time, I guess. Here is a link to the full silo logs in case it helps: http://1drv.ms/1TzutzM
Thanks again for the great support as usual.
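
For reference, a 5-second client timeout like the one mentioned above can be set programmatically in Orleans 1.x along these lines. ClientConfiguration.ResponseTimeout and GrainClient.Initialize are my assumption of the 1.x client API, so verify them against the version in use; the value can also be set in the client XML configuration.

```csharp
using System;
using Orleans;
using Orleans.Runtime.Configuration;

public static class ClientTimeoutSketch
{
    public static void InitializeClient()
    {
        // Gateway / deployment settings are omitted here; they normally come from
        // ClientConfiguration.xml or the Azure role environment.
        var config = new ClientConfiguration
        {
            ResponseTimeout = TimeSpan.FromSeconds(5) // the framework default is 30 seconds
        };
        GrainClient.Initialize(config);
    }
}
```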

@gabikliot
Contributor

@2mike2, were you able to resolve this? Can we close this issue?

@2mike2
Author

2mike2 commented Jul 25, 2015

Yes, thank you
