[Epic]: Support Multi-clustering for Orleans #7485

rafikiassumani-msft · 2022-01-10T03:43:00Z

Provide better geo-redundancy (different regions)
Data isolation needs in different regions (Country or region data governance requirements)
Provide grain activation capabilities at the edge (near)

We might need to add documentation on why this might not be needed for certain use cases.

ReubenBond · 2022-01-13T22:22:40Z

We discussed this with an internal partner team today and have some ideas for how to implement this well. I'm documenting some of the takeaways of the discussion here for our own good.

I suggest we call this feature "metaclusters" because it involves a cluster of clusters (a cluster is a set of collaborating processes; a metacluster is a set of sets of processes).

We want to support communication between multiple Orleans clusters, typically geographically separated.

This will involve some changes to Orleans' core and we have a few guiding principles:

The feature should be opt-in and pay-for-what-you-use
There should be no unavoidable single points of failure in the design
Clusters should be able to communicate with each other via a load balancer (TCP/HTTPS) and should not require a shared VPN / full IP-addressability to each server in each cluster.

Our current thinking in terms of design and work:

Metacluster Membership

We will add a new provider for mapping ClusterIds to communication endpoints (TCP/HTTPS gateway endpoints) as well as for indicating liveness.
The default implementation of this will use a globally accessible table (eg, Azure Table Storage)
Administrators can manipulate this table to add/remove clusters from the metacluster (possibly using tooling / management APIs)
Enhancement: allow clusters to check each other for liveness and automatically update the table with status (Up/Down) depending on network connectivity. Clusters do not crash & restart when they detect that they are marked Down; instead, they check that they can communicate with the currently Up clusters before resetting their status in the table to Up again.
Conceptually, to remove the single-point-of-failure on a globally accessible database in the future, we can use a static list of cluster endpoints as a seed and then use consensus among them to determine which are alive or not. This is complex in itself, so it's best to defer this while keeping the option open.
Clusters are responsible for managing liveness of their constituent servers themselves. I.e., there are no cross-cluster per-server liveness checks.
Clusters read each other's membership using API calls against the cluster gateway (rather than directly reading membership tables from storage). This can also be used to get cluster manifests, etc, for metacluster-aware placement.
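
As a rough illustration of the table-based membership described above, here is a minimal, language-neutral sketch in Python. It is not an Orleans API: `MetaclusterTable`, `ClusterEntry`, `try_recover`, and the `can_reach` callback are all hypothetical names, the table is an in-memory stand-in for the globally accessible table, and requiring that every currently Up peer be reachable (rather than some quorum of them) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    UP = "Up"
    DOWN = "Down"


@dataclass
class ClusterEntry:
    cluster_id: str
    gateway_endpoint: str  # e.g. a load-balanced TCP/HTTPS gateway address
    status: Status = Status.UP


class MetaclusterTable:
    """In-memory stand-in for the globally accessible table (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[str, ClusterEntry] = {}

    def add_cluster(self, cluster_id: str, gateway_endpoint: str) -> None:
        self._entries[cluster_id] = ClusterEntry(cluster_id, gateway_endpoint)

    def remove_cluster(self, cluster_id: str) -> None:
        self._entries.pop(cluster_id, None)

    def get(self, cluster_id: str) -> ClusterEntry:
        return self._entries[cluster_id]

    def set_status(self, cluster_id: str, status: Status) -> None:
        self._entries[cluster_id].status = status

    def up_clusters(self, exclude: str = "") -> List[ClusterEntry]:
        """All clusters currently marked Up, optionally excluding one id."""
        return [
            e for e in self._entries.values()
            if e.status is Status.UP and e.cluster_id != exclude
        ]


def try_recover(table: MetaclusterTable, self_id: str,
                can_reach: Callable[[str], bool]) -> bool:
    """If this cluster is marked Down, mark it Up again only when it can
    reach every peer that is currently Up. Returns True if it is Up afterwards.
    Note: with no Up peers at all, the check passes trivially (assumption)."""
    if table.get(self_id).status is Status.UP:
        return True
    peers = table.up_clusters(exclude=self_id)
    if all(can_reach(p.gateway_endpoint) for p in peers):
        table.set_status(self_id, Status.UP)
        return True
    return False
```

The point of the sketch is that a cluster marked Down stays running and only flips itself back to Up after a connectivity check, instead of crashing and restarting.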

Routing/Networking

Servers will connect to each Up endpoint in the metacluster table and route all messages destined for any server in the corresponding cluster via that connection.
Therefore, each server will establish a connection to every server which is a part of its cluster as well as a connection to a single gateway in every other cluster. When the connection drops (eg because that gateway shuts down) they will retry the connection indefinitely (likely with some backoff).
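
The routing rule above could be sketched roughly as follows, again in Python as a language-neutral illustration rather than Orleans code. `next_hop`, `backoff_delays`, and the shape of the `gateways` mapping are hypothetical, and the exponential, capped schedule is only one possible reading of "some backoff".

```python
from typing import Dict, Iterator, Optional


def next_hop(target_cluster: str, target_silo: str, local_cluster: str,
             gateways: Dict[str, str]) -> Optional[str]:
    """Pick where to send a message.

    Silos in the local cluster are contacted directly. Silos in any other
    cluster are reached through that cluster's single gateway connection.
    `gateways` maps the id of each Up foreign cluster to its gateway
    endpoint; None is returned when the target cluster is unknown or not Up.
    """
    if target_cluster == local_cluster:
        return target_silo
    return gateways.get(target_cluster)


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield an unbounded sequence of reconnect delays in seconds:
    exponential growth, capped, so that retries continue indefinitely."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)
```

A reconnect loop would sleep for each yielded delay between attempts and start a fresh generator once the connection to the gateway is re-established.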

Placement:

SiloAddress should have some way of identifying the cluster which it belongs to. That might include embedding a cluster id, or it might involve migrating to a string instead of an IPEndPoint, where the string can map to an IPEndPoint for local communication, either via parsing, via a mapping included in the membership table (eg, "ClusterX/SiloY" has endpoints: { IPv4: 10.0.0.2:11111, IPv6: ::::2:11111, FQDN: "silox.clustery.internal.contoso.com" }, or something), or via an external mapping.
Therefore, IGrainLocator implementations need some mechanism for returning foreign addresses. Typically, we imagine that only a subset of grain types will be allowed to exist across clusters (single instance per metacluster). That would use the existing mechanisms (customizable grain locator/directory per grain).
Orleans checks that the target silo is live before routing calls to grains. Therefore, the check needs to incorporate some knowledge of the metacluster.
It's the job of the developer to configure a placement provider which uses a globally accessible directory: that is not a part of the metacluster feature.
Even when metaclusters are enabled, we should require that placement providers specifically request "the list of all compatible servers in the metacluster" when performing placement. By default, they should only receive the list of local, compatible servers. An alternative is that we have some interface property (eg, bool IsMetaclusterAware { get; }) before we decide to feed it all compatible servers in the metacluster.
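
To make the opt-in idea in the last point concrete, here is a small Python sketch (language-neutral, not Orleans code). The `is_metacluster_aware` attribute mirrors the suggested `bool IsMetaclusterAware { get; }` property; `Silo`, `FirstLocal`, `PreferCluster`, and `candidates_for` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Silo:
    cluster_id: str
    name: str


class FirstLocal:
    """A placement provider that has not opted in to the metacluster."""
    is_metacluster_aware = False

    def pick(self, candidates: List[Silo]) -> Silo:
        return candidates[0]


class PreferCluster:
    """A provider that opts in and prefers silos in one named cluster."""
    is_metacluster_aware = True

    def __init__(self, preferred: str) -> None:
        self.preferred = preferred

    def pick(self, candidates: List[Silo]) -> Silo:
        for silo in candidates:
            if silo.cluster_id == self.preferred:
                return silo
        return candidates[0]


def candidates_for(provider, local_cluster: str,
                   compatible: List[Silo]) -> List[Silo]:
    """Feed the provider every compatible silo in the metacluster only if it
    opted in; by default it sees local, compatible silos only."""
    if getattr(provider, "is_metacluster_aware", False):
        return list(compatible)
    return [s for s in compatible if s.cluster_id == local_cluster]
```

With this shape, existing providers keep their current behavior when metaclusters are enabled, which fits the pay-for-what-you-use principle.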

Non-goals:

We want to avoid any notion that this is some kind of panacea for globe-spanning applications which automatically do the most optimal thing in any given situation. For example, we are not planning to give Orleans the notion of the inter-cluster link latency or bandwidth, and Orleans won't divine which cluster is the best to place a given grain; it has no intrinsic knowledge about data sovereignty, etc. The goal is to provide mechanisms for building globe-spanning applications rather than a packaged solution.
Grains are either globally accessible or locally accessible, depending on their configured grain locator: if IMyGrain uses a locally-scoped grain directory in each cluster, then sending a reference to it to a globally-scoped grain in some other cluster will result in a reference which is locally scoped to that foreign cluster, not the originating cluster. If there's a need to have per-cluster grains, that can be accomplished by encoding something into the grain id and interpreting that at placement time.
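
One possible way to encode a cluster into a grain id and interpret it at placement time, shown as a Python sketch under stated assumptions: the `cluster/key` string format, the separator, and the helpers `make_key`, `split_key`, and `target_cluster` are hypothetical and are not an Orleans convention.

```python
from typing import Optional, Tuple

SEPARATOR = "/"  # assumed delimiter between cluster id and the local key


def make_key(cluster_id: str, local_key: str) -> str:
    """Build a grain key that pins the grain to one cluster."""
    if not cluster_id or SEPARATOR in cluster_id:
        raise ValueError("cluster id must be non-empty and contain no separator")
    return f"{cluster_id}{SEPARATOR}{local_key}"


def split_key(key: str) -> Tuple[Optional[str], str]:
    """Return (cluster_id, local_key); cluster_id is None for plain keys."""
    cluster_id, sep, rest = key.partition(SEPARATOR)
    if not sep:
        return None, key
    return cluster_id, rest


def target_cluster(key: str, local_cluster: str) -> str:
    """Placement-time decision: use the encoded cluster if there is one,
    otherwise fall back to the cluster handling the call."""
    cluster_id, _ = split_key(key)
    return cluster_id or local_cluster
```

A metacluster-aware placement provider could call something like `target_cluster` and then restrict its candidate silos to that cluster.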

cc @JohnMorman @juyuz

JesOb · 2022-05-11T12:19:08Z

Does that mean Orleans will be able to run with ASP.NET Core in Google App Engine Flexible?

Magazin80 · 2022-10-31T20:56:33Z

Is multi-clustering supported in the latest release v3.X? Documentation says that multi-clustering was removed in v2. Does it mean that we should not plan for it for now? Thanks!

ElanHasson · 2022-11-01T16:26:43Z

@Magazin80, there is no multi-clustering in v3.x or, as of yet, in v7.0 (formerly 4.0). I'll defer to @ReubenBond on the timeline.

jan-johansson-mr · 2023-02-20T04:31:06Z

Hi @ReubenBond and @ElanHasson,

The support for metaclusters seems good to me; that is, in my mind, a client can use a cluster meta ID that points to a collection of clusters, where the destination silo may be in any of the "physical" clusters. But I am reacting to a wording in the proposal, and the wording is pay as you go. Does that mean that this metaclusters feature, if done, is only applicable to Azure?

So far, I've been able to use Orleans on-prem and with Azure solutions, but is the direction for some Orleans features to be locked in to Azure in the future? This worries me. Another such feature is transactions, where the transactional provider that ships out-of-the-box with Orleans only works on Azure; there is no out-of-the-box provider that works with ADO. Of course, one can write such a provider, and I did. However, a metaclusters feature tailored only for Azure would be something completely different, if that is the intended direction.

Kindly

0x53A · 2023-02-22T22:55:34Z

> [...] the wording is pay as you go. Does that mean that this metaclusters feature, if done, is only applicable to Azure?

I assume the term was meant on a technical level, not a billing level. Typically, pay as you go means that if you don't use the feature, you don't "pay" for it in regards to performance, complexity, etc.

From my side, I'd be really interested in this for connecting Cloud (typically, but not necessarily, Azure) and On-Premise.

Currently we use a WCF server in the cloud and a WCF client on-premise, and then run duplex communication through a persistent WebSocket using the net.http Binding and Client Callbacks.

We would be interested in modernizing this solution. The complexity comes from the fact that the On-Premise client / cluster / whatever you want to call it, is NOT directly internet addressable. Instead it will initiate the connection itself, and then everything needs to be routed through this one persistent connection. If the connection breaks, only the On-Premise part can re-establish the connection.

ReubenBond · 2023-02-22T23:41:49Z

> > [...] the wording is pay as you go. Does that mean that this metaclusters feature, if done, is only applicable to Azure?
>
> I assume the term was meant on a technical level, not a billing level. Typically, pay as you go means that if you don't use the feature, you don't "pay" for it in regards to performance, complexity, etc.

Yes, precisely.

> The complexity comes from the fact that the On-Premise client / cluster / whatever you want to call it, is NOT directly internet addressable.

In your case, would a solution based upon a VPN be sufficient?

0x53A · 2023-02-23T00:19:13Z

> In your case, would a solution based upon a VPN be sufficient?

No, we have multiple customers running our software on-premise, with the cloud portion hosted by us. We don't really want to set up hundreds of VPNs. That's why using HTTP/WebSockets is so valuable (compared to raw TCP): it's easy to allow-list on the customer's firewall and easy to route through nginx/App Gateway.

ReubenBond · 2023-04-14T14:59:03Z

> No, we have multiple customers running our software on-premise, with the cloud portion hosted by us.

That is a non-goal of Orleans: it is not designed to allow untrusted third parties to connect directly to an Orleans cluster. For that, you should use a gateway, eg based on HTTP APIs, SignalR, gRPC, etc.

KSemenenko · 2023-04-15T20:52:01Z

My case: there is a cluster in the US and a cluster in the EU, and each cluster has a WebAPI client and a silo.
Data is not distributed across regions.
What should happen if a user from the US connects to the EU cluster?
In that case, I would like the WebAPI in the EU to be able to connect to the US silo and serve the client.

ReubenBond · 2023-04-17T15:32:03Z

@KSemenenko, is your case addressed by the Orleans.MultiClient repo? Admittedly, that needs to be updated to support Orleans 7.

StephenStrickland · 2024-01-24T17:58:51Z

Any chance this is on the radar for this year?

alimozdemir · 2025-01-26T12:02:10Z

> @KSemenenko, is your case addressed by the Orleans.MultiClient repo? Admittedly, that needs to be updated to support Orleans 7

I believe this should be part of Orleans itself, using keyed services. E.g., when I want a cluster client for a specific cluster, I should be able to get it like [FromKeyedServices("clusterId")] IClusterClient client. What do you think, @ReubenBond?

rafikiassumani-msft mentioned this issue Jan 10, 2022

[Epic]: Orleans Enhancements for .NET7 #7467

Closed

20 tasks

ghost added the Needs: triage 🔍 label Jan 10, 2022

rafikiassumani-msft added epic area-deployment Orleans Deployment issues area-hosting and removed Needs: triage 🔍 labels Jan 10, 2022

rafikiassumani-msft changed the title ~~Multi-clustering features (Cost: XXL) - Epic~~ [Epic]: Support Multi-clustering for Orleans Jan 10, 2022

rafikiassumani-msft added this to the .NET 7 Planning milestone Jan 10, 2022

rafikiassumani-msft added the cost: XXL label Jan 10, 2022

rafikiassumani-msft assigned ReubenBond Jan 13, 2022

ReubenBond mentioned this issue Jan 28, 2022

[WIP] SiloAddress enhancements to support metaclusters and more networking scenarios #7517

Open

rafikiassumani-msft modified the milestones: .NET 7 Planning, 4.0.0 Mar 25, 2022

ReubenBond assigned rafikiassumani-msft Apr 19, 2023

VincentH-Net mentioned this issue Dec 4, 2023

Add support of Orleans in Aspire dotnet/aspire#1117

Merged
