Failing to start silo cluster after membership table inaccessible #9220

Open
grendizeras opened this issue Nov 8, 2024 · 0 comments

We run Orleans in Docker Swarm. When a datacenter fails, Swarm moves the silos to another datacenter. The Redis instance in the failed datacenter also goes down, while the second instance in the Redis cluster keeps working.

Problems that occur:
1. The Redis membership implementation does not detect the change in Redis; it keeps trying to connect to the downed instance and times out.
2. Newly started silo instances do connect to Redis, but because they see active silos in the membership table, they try to ping them and fail: some silos are marked active but are actually dead, and the old instances can't update their entries because they have no access to the membership table. The new silos then declare themselves dead. We tried changing IAmAliveTablePublishTimeout to 1 minute. Our expectation was that within 2 minutes (allowing for retries) the new silos would still start up and kill the inaccessible silos. For example, there are 9 silos in total; on datacenter shutdown, 4 silos died and Swarm tried to restart them in the other datacenter. Those 4 silos should start within 2 minutes and kill the remaining 5 silos.
In reality nothing starts and the whole cluster is down, with the following messages:
warn: Orleans.Runtime.Metadata.ClusterManifestProvider[0]

  Error retrieving silo manifest for silo S10.224.1.52:11111:90056967

  System.ObjectDisposedException: Cannot access a disposed object.

  Object name: 'IServiceProvider'.

     at Microsoft.Extensions.DependencyInjection.ServiceLookup.ThrowHelper.ThrowObjectDisposedException()

     at Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope.GetService(Type serviceType)

     at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService(IServiceProvider provider, Type serviceType)

     at Microsoft.Extensions.DependencyInjection.ServiceProviderServiceExtensions.GetRequiredService[T](IServiceProvider provider)

     at Orleans.Runtime.Metadata.ClusterManifestProvider.<>c__DisplayClass18_0.<<UpdateManifest>g__GetManifest|0>d.MoveNext() in /_/src/Orleans.Runtime/Manifest/ClusterManifestProvider.cs:line 151

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.229:11111:90056367 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:38, now is 11/08/2024 07:58:08, no update for 00:18:30.2544584, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.231:11111:90056456 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:41:07, now is 11/08/2024 07:58:08, no update for 00:17:01.1361120, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.245:11111:90056587 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:43:57, now is 11/08/2024 07:58:08, no update for 00:14:10.6124054, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.247:11111:90056616 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:43:51, now is 11/08/2024 07:58:08, no update for 00:14:16.8296069, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.228:11111:90056363 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:34, now is 11/08/2024 07:58:08, no update for 00:18:33.7728804, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.232:11111:90056493 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:42:28, now is 11/08/2024 07:58:08, no update for 00:15:40.0318086, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.1.2:11111:90056742 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:45:53, now is 11/08/2024 07:58:08, no update for 00:12:15.2249331, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.1.52:11111:90056967 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:49:31, now is 11/08/2024 07:58:08, no update for 00:08:36.9655455, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.229:11111:90056367 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:38, now is 11/08/2024 07:58:08, no update for 00:18:30.2683603, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.231:11111:90056456 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:41:07, now is 11/08/2024 07:58:08, no update for 00:17:01.1500076, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.245:11111:90056587 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:43:57, now is 11/08/2024 07:58:08, no update for 00:14:10.6263153, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.247:11111:90056616 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:43:51, now is 11/08/2024 07:58:08, no update for 00:14:16.8435249, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.228:11111:90056363 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:34, now is 11/08/2024 07:58:08, no update for 00:18:33.7867953, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.0.232:11111:90056493 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:42:28, now is 11/08/2024 07:58:08, no update for 00:15:40.0457486, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.1.2:11111:90056742 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:45:53, now is 11/08/2024 07:58:08, no update for 00:12:15.2388805, which is more than 00:01:20.

warn: Orleans.Runtime.MembershipService.MembershipTableManager[100625]

  Noticed that silo S10.224.1.52:11111:90056967 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:49:31, now is 11/08/2024 07:58:08, no update for 00:08:36.9794955, which is more than 00:01:20.

warn: Orleans.Runtime.Silo[100418]

  Silo shutdown completed (non-graceful)!

Unhandled exception. Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.224.0.229:11111:90056367, will retry after 953.288ms

at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99

at Orleans.Runtime.Messaging.MessageCenter.g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 236

at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81

at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117

at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51

at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88

at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 90

at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 739

at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29

at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 368

at Orleans.Runtime.Messaging.MessageCenter.g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 487

at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81

at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117

at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51

at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88

at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync(GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 98

at Program.<>c.<<<Main>$>b__0_2>d.MoveNext() in /home/vsts/work/1/s/Slots/Ardados.Slots.SiloHost/Program.cs:line 38

--- End of stack trace from previous location ---

at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct) in /_/src/Orleans.Runtime/Lifecycle/SiloLifecycleSubject.cs:line 134

at Orleans.LifecycleSubject.OnStart(CancellationToken cancellationToken) in /_/src/Orleans.Core/Lifecycle/LifecycleSubject.cs:line 118

at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in /_/src/Orleans.Runtime/Scheduler/ClosureWorkItem.cs:line 33

at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Silo/Silo.cs:line 192

at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Hosting/SiloHostedService.cs:line 28

at Microsoft.Extensions.Hosting.Internal.Host.b__15_1(IHostedService service, CancellationToken token)

at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)

at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)

at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)

at Microsoft.Extensions.Hosting.HostingAbstractionsHostExtensions.RunAsync(IHost host, CancellationToken token)

at Program.<Main>$(String[] args) in /home/vsts/work/1/s/Slots/Ardados.Slots.SiloHost/Program.cs:line 44

at Program.<Main>(String[] args)

In the Redis membership table, the number of newly inserted silo records keeps growing.
I guess the silos start up, write their state into the table, then try to check the other silos, see that not all of them are pingable, and kill themselves. This repeats in a never-ending loop.

What is wrong with our assumption, and why do we get this error on startup?
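
To make the MembershipTableManager warnings above easier to read, here is a rough Python sketch that pulls the silo address and the two timestamps out of each warning and reports how stale every entry is. The regular expression and the `MM/dd/yyyy` date format are assumptions based only on the lines pasted in this issue, not on any documented Orleans log schema, and `stale_silos` is a hypothetical helper name. The 00:01:20 threshold is simply the value printed in the warnings.

```python
import re
from datetime import datetime, timedelta

# Pattern assumed from the warnings pasted in this issue (not an official log schema).
LINE_RE = re.compile(
    r"silo (?P<silo>S\S+) has not updated .*?"
    r"Last update was at (?P<last>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}), "
    r"now is (?P<now>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
)
TS_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed month/day/year, as in "11/08/2024"


def stale_silos(log_text, allowed):
    """Return {silo: staleness} for silos whose last update is older than `allowed`."""
    result = {}
    for match in LINE_RE.finditer(log_text):
        last = datetime.strptime(match["last"], TS_FORMAT)
        now = datetime.strptime(match["now"], TS_FORMAT)
        age = now - last
        if age > allowed:
            # If a silo is reported more than once, keep the largest age seen.
            result[match["silo"]] = max(age, result.get(match["silo"], age))
    return result


# Two of the warnings from this issue, each on a single line.
sample = """
Noticed that silo S10.224.0.229:11111:90056367 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:39:38, now is 11/08/2024 07:58:08, no update for 00:18:30.2544584, which is more than 00:01:20.
Noticed that silo S10.224.1.52:11111:90056967 has not updated it's IAmAliveTime table column recently. Last update was at 11/08/2024 07:49:31, now is 11/08/2024 07:58:08, no update for 00:08:36.9655455, which is more than 00:01:20.
"""

for silo, age in stale_silos(sample, timedelta(minutes=1, seconds=20)).items():
    print(silo, age)
```

Because the logged timestamps are rounded to whole seconds, the computed ages can differ by up to a second from the "no update for" value in the warning. Run against the full log, this should list every silo named in the warnings, which matches the picture of no entry in the table being refreshed.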
