External cluster client host fails to start up when all gateways are dead initially #8716

Closed
amoerie opened this issue Nov 10, 2023 · 3 comments · Fixed by #8793

amoerie commented Nov 10, 2023

Orleans: 7.2.1

Relates to #7436

When our application starts up and tries to connect to an external Orleans cluster (which is completely offline), the host that contains the cluster client crashes during its .StartAsync method.

This is what we see in the logs:

Orleans.Runtime.SiloUnavailableException: Could not find any gateway in Orleans.Runtime.Membership.AdoNetGatewayListProvider. Orleans client cannot initialize.
   at async Task Orleans.Messaging.GatewayManager.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Core/Messaging/GatewayManager.cs:line 75
   at async Task Orleans.OutsideRuntimeClient.StartInternal(CancellationToken cancellationToken)+(?) => { } in /_/src/Orleans.Core/Runtime/OutsideRuntimeClient.cs:line 156
   at async Task Orleans.OutsideRuntimeClient.StartInternal(CancellationToken cancellationToken)+ExecuteWithRetries(?) in /_/src/Orleans.Core/Runtime/OutsideRuntimeClient.cs:line 183
   at async Task Orleans.OutsideRuntimeClient.StartInternal(CancellationToken cancellationToken) in /_/src/Orleans.Core/Runtime/OutsideRuntimeClient.cs:line 155
   at async Task Orleans.OutsideRuntimeClient.Start(CancellationToken cancellationToken) in /_/src/Orleans.Core/Runtime/OutsideRuntimeClient.cs:line 144
   at async Task Orleans.ClusterClient.StartAsync(CancellationToken cancellationToken) in /_/src/Orleans.Core/Core/ClusterClient.cs:line 72
   at async Task Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at async Task Dobco.POW4.DicomProcessor.Client.Orleans.DpClusterClientBackgroundService.ExecuteAsync(CancellationToken stoppingToken) in C:/git/pow4/Dobco.POW4.DicomProcessor.Client/Orleans/DpClusterClientBackgroundService.cs:line 35

Technically, it's fine that an exception is logged (the cluster is offline, after all). But because the host crashes during startup, no reconnection attempt is ever made, so if the cluster comes up a little later, the client never connects.

Today, we're working around this issue with the following background service:

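using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Orleans;          // IClusterClient
using Orleans.Runtime;  // OrleansException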

public interface IClusterClientProvider
{
    IHost Host { get; }
    IClusterClient ClusterClient { get; }
}

public sealed class ClusterClientBackgroundService : BackgroundService
{
    private readonly ILogger<ClusterClientBackgroundService> _logger;
    private readonly IClusterClientProvider _clusterClientProvider;
    private bool _isHostStarted;
    
    public ClusterClientBackgroundService(
        ILogger<ClusterClientBackgroundService> logger,
        IClusterClientProvider clusterClientProvider)
    {
        _logger = logger ?? throw new ArgumentNullException(nameof(logger));
        _clusterClientProvider = clusterClientProvider ?? throw new ArgumentNullException(nameof(clusterClientProvider));
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Setting up the host and the cluster client is omitted for brevity
        var host = _clusterClientProvider.Host;
        var retryDelay = TimeSpan.FromSeconds(30);
        while (!stoppingToken.IsCancellationRequested && !_isHostStarted)
        {
            try
            {
                try
                {
                    _logger.LogInformation("Starting cluster client host");

                    await host.StartAsync(stoppingToken);

                    _isHostStarted = true;

                    _logger.LogInformation("Cluster client host started successfully");
                }
                catch (OrleansException e)
                {
                    _isHostStarted = false;
                    _logger.LogWarning(e, "Failed to start cluster client host, will retry in {RetryDelay}", retryDelay);
                    await Task.Delay(retryDelay, stoppingToken);
                }
            }
            catch (OperationCanceledException)
            {
                // Ignored
            }
        }
    }

    public override async Task StopAsync(CancellationToken cancellationToken)
    {
        _logger.LogInformation("Shutting down cluster client");
        
        await base.StopAsync(cancellationToken);
        
        var host = _clusterClientProvider.Host;
        
        if (_isHostStarted)
        {
            try
            {
                await host.StopAsync(CancellationToken.None);
            }
            catch(Exception e)
            {
                _logger.LogWarning(e, "An error occurred while stopping the cluster client host");
            }
        }
        
        switch (host)
        {
            case IAsyncDisposable asyncDisposable:
                await asyncDisposable.DisposeAsync();
                break;
            case IDisposable disposable:
                disposable.Dispose();
                break;
        }
        
        _logger.LogInformation("Successfully shut down cluster client");
    }
}

It would be better if the GatewayManager gracefully handled all silos being offline and simply retried the initial connection from time to time.

@ghost ghost added the Needs: triage 🔍 label Nov 10, 2023
@amoerie amoerie changed the title External cluster client host fails to start up when all silos are dead initially External cluster client host fails to start up when all gateways are dead initially Nov 10, 2023
@slawomirpiotrowski

A connection retry filter should do the trick:

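    // Retry state captured by the filter. The starting values are assumptions;
    // the original snippet did not show where these variables were declared.
    var retryCount = 0;
    var retryPause = 1; // seconds, doubled after each retry
    // Create the logger once instead of building a new factory on every retry.
    var logger = LoggerFactory.Create(builder => builder.AddConsole()).CreateLogger(typeof(Program));

    client.UseConnectionRetryFilter(async (exception, token) =>
    {
        // Retry at most five times, and only while no gateway is reachable.
        if (exception is SiloUnavailableException && ++retryCount <= 5)
        {
            logger.LogWarning("Orleans cluster is not ready. Retrying in {RetryPause} seconds...", retryPause);
            await Task.Delay(retryPause * 1000, token);
            retryPause *= 2;
            return true;
        }
        return false;
    });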
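
For anyone who wants the backoff state outside of a closure, here is a minimal sketch of the same idea as a small class. BackoffRetryPolicy and its parameters are hypothetical names, not part of Orleans; the only thing taken from the snippet above is the filter shape (an exception and a cancellation token in, a Task<bool> out). The delay is injectable so the logic can be exercised without actually waiting.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not an Orleans type): bounded exponential backoff
// with the same shape as the connection retry filter delegate.
public sealed class BackoffRetryPolicy
{
    private readonly int _maxRetries;
    private readonly Func<Exception, bool> _isTransient;
    private readonly Func<TimeSpan, CancellationToken, Task> _delay;
    private TimeSpan _pause;
    private int _attempts;

    public BackoffRetryPolicy(
        int maxRetries,
        TimeSpan initialPause,
        Func<Exception, bool> isTransient,
        Func<TimeSpan, CancellationToken, Task>? delay = null)
    {
        _maxRetries = maxRetries;
        _pause = initialPause;
        _isTransient = isTransient ?? throw new ArgumentNullException(nameof(isTransient));
        // Default to a real wait; tests can pass a recording stub instead.
        _delay = delay ?? ((d, ct) => Task.Delay(d, ct));
    }

    // Returns true (after waiting) when the caller should retry, false to give up.
    public async Task<bool> ShouldRetryAsync(Exception exception, CancellationToken token)
    {
        if (!_isTransient(exception) || _attempts >= _maxRetries)
        {
            return false;
        }

        _attempts++;
        await _delay(_pause, token);
        _pause = TimeSpan.FromTicks(_pause.Ticks * 2); // double the pause for the next attempt
        return true;
    }
}
```

Assuming the same client builder as above, this could be wired up with something like client.UseConnectionRetryFilter(policy.ShouldRetryAsync), where the policy is constructed with ex => ex is SiloUnavailableException as the transient check. The policy instance is stateful and not thread-safe, so this sketch assumes one instance per client and sequential invocation of the filter.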

amoerie commented Dec 5, 2023

Thanks!

I would understand if this issue were closed, but can I still vote for maybe adding such a retry filter as the default behavior, or at least documenting this clearly somewhere? If it already is, my apologies, but I failed to find it the first time around...

@ReubenBond
Member

> can I still vote for maybe adding such a retry filter as the default behavior, or at least documenting this clearly somewhere?

Good idea, let's make this the default.
