[Epic]: Support Multi-clustering for Orleans #7485

Open
3 tasks
Tracked by #7467
rafikiassumani-msft opened this issue Jan 10, 2022 · 13 comments

@rafikiassumani-msft

rafikiassumani-msft commented Jan 10, 2022

  • Provide better geo-redundancy (different regions)
  • Data isolation needs in different regions (Country or region data governance requirements)
  • Provide grain activation capabilities at the edge (near)

We might also need to add documentation on why this may not be needed for certain use cases.

ghost added the Needs: triage 🔍 label on Jan 10, 2022
rafikiassumani-msft changed the title from "Multi-clustering features (Cost: XXL) - Epic" to "[Epic]: Support Multi-clustering for Orleans" on Jan 10, 2022
rafikiassumani-msft added this to the .NET 7 Planning milestone on Jan 10, 2022
@ReubenBond
Member

ReubenBond commented Jan 13, 2022

We discussed this with an internal partner team today and have some ideas for how to implement this well. I'm documenting some of the takeaways of the discussion here for our own good.

I suggest we call this feature "metaclusters" because it involves a cluster of clusters (a cluster is a set of collaborating processes; a metacluster is a set of sets of processes).

We want to support communication between multiple Orleans clusters, typically geographically separated.

This will involve some changes to Orleans' core and we have a few guiding principles:

  • The feature should be opt-in and pay-for-what-you-use
  • There should be no unavoidable single points of failure in the design
  • Clusters should be able to communicate with each other via a load balancer (TCP/HTTPS) and should not require a shared VPN / full IP-addressability to each server in each cluster.

Our current thinking in terms of design and work:

Metacluster Membership

  • We will add a new provider for mapping ClusterIds to communication endpoints (TCP/HTTPS gateway endpoints) as well as for indicating liveness.
  • The default implementation of this will use a globally accessible table (e.g., Azure Table Storage).
  • Administrators can manipulate this table to add/remove clusters from the metacluster (possibly using tooling / management APIs)
  • Enhancement: allow clusters to check each other for liveness and automatically update the table with status (Up/Down) depending on network connectivity. Clusters do not crash and restart when they detect that they are marked Down. Instead, they check that they can communicate with the currently Up clusters before resetting their status in the table to Up again.
  • Conceptually, to remove the single-point-of-failure on a globally accessible database in the future, we can use a static list of cluster endpoints as a seed and then use consensus among them to determine which are alive or not. This is complex in itself, so it's best to defer this while keeping the option open.
  • Clusters are responsible for managing liveness of their constituent servers themselves, i.e., there are no cross-cluster per-server liveness checks.
  • Clusters read each other's membership using API calls against the cluster gateway (rather than directly reading membership tables from storage). This can also be used to get cluster manifests, etc, for metacluster-aware placement.
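A minimal sketch of the table-based membership idea above, written in Python purely for illustration (Orleans itself is .NET). Every name here (MetaclusterTable, ClusterEntry, try_recover, the gateway strings) is hypothetical and not an Orleans API. The in-memory dictionary stands in for the globally accessible table, and can_reach is an assumed callback that probes a gateway endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ClusterEntry:
    gateway: str          # load-balanced TCP/HTTPS gateway endpoint (hypothetical format)
    status: str = "Up"    # "Up" or "Down"


class MetaclusterTable:
    """In-memory stand-in for the globally accessible membership table."""

    def __init__(self) -> None:
        self.entries: Dict[str, ClusterEntry] = {}

    def add_cluster(self, cluster_id: str, gateway: str) -> None:
        # Administrators add clusters to the metacluster.
        self.entries[cluster_id] = ClusterEntry(gateway)

    def remove_cluster(self, cluster_id: str) -> None:
        self.entries.pop(cluster_id, None)

    def mark(self, cluster_id: str, status: str) -> None:
        self.entries[cluster_id].status = status

    def up_clusters(self) -> List[str]:
        return sorted(cid for cid, e in self.entries.items() if e.status == "Up")


def try_recover(table: MetaclusterTable, self_id: str,
                can_reach: Callable[[str], bool]) -> bool:
    """A cluster marked Down sets itself back to Up only if it can reach
    every cluster that is currently Up. It never restarts itself."""
    if table.entries[self_id].status != "Down":
        return False
    peers = [cid for cid in table.up_clusters() if cid != self_id]
    # Note: with no Up peers this is vacuously true; a real design would
    # have to decide what that case should mean.
    if all(can_reach(table.entries[cid].gateway) for cid in peers):
        table.mark(self_id, "Up")
        return True
    return False
```

In this sketch the recovery check is driven by the Down cluster itself, which matches the "do not crash and restart" behaviour described above; how other clusters decide to mark a peer Down is left out.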

Routing/Networking

  • Servers will connect to each Up endpoint in the metacluster table and route all messages destined for any server in that endpoint's cluster via that connection.
  • Therefore, each server will establish a connection to every server which is a part of its cluster as well as a connection to a single gateway in every other cluster. When the connection drops (e.g., because that gateway shuts down) they will retry the connection indefinitely (likely with some backoff).
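The connection and routing rules above can be sketched as follows. This is an illustrative Python model under stated assumptions, not Orleans code: connection_targets, next_hop and backoff_delays are hypothetical helpers, up_gateways is an assumed mapping from cluster id to the gateway endpoint of each Up cluster, and the capped exponential backoff is only one possible reading of "some backoff".

```python
from typing import Dict, Iterator, List, Tuple


def connection_targets(local_cluster: str, local_silos: List[str], self_silo: str,
                       up_gateways: Dict[str, str]) -> List[Tuple[str, str]]:
    """Peers a single server connects to: every other silo in its own cluster,
    plus exactly one gateway for each remote cluster that is Up."""
    targets = [("silo", s) for s in local_silos if s != self_silo]
    targets += [("gateway", gw) for cid, gw in sorted(up_gateways.items())
                if cid != local_cluster]
    return targets


def next_hop(local_cluster: str, up_gateways: Dict[str, str],
             target_cluster: str, target_silo: str) -> Tuple[str, str]:
    """Local messages go straight to the target silo; messages for a remote
    cluster all go through that cluster's single gateway connection."""
    if target_cluster == local_cluster:
        return ("silo", target_silo)
    if target_cluster not in up_gateways:
        raise LookupError(f"cluster {target_cluster!r} is not Up in the metacluster table")
    return ("gateway", up_gateways[target_cluster])


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Endless sequence of reconnect delays in seconds: exponential, capped."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)
```

The generator never terminates on purpose, mirroring the "retry indefinitely" rule; a caller would sleep for each yielded delay between reconnect attempts and stop pulling from it once the connection is re-established.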

Placement:

  • SiloAddress should have some way of identifying the cluster to which it belongs. That might involve embedding a cluster id, or it might involve migrating to a string instead of an IPEndPoint, where the string can map to an IPEndPoint for local communication, either via parsing, via a mapping included in the membership table (e.g., "ClusterX/SiloY" has endpoints: { IPv4: 10.0.0.2:11111, IPv6: ::::2:11111, FQDN: "silox.clustery.internal.contoso.com" }, or something similar), or via an external mapping.
  • Therefore, IGrainLocator implementations need some mechanism for returning foreign addresses. Typically, we imagine that only a subset of grain types will be allowed to exist across clusters (single instance per metacluster). That would use the existing mechanisms (customizable grain locator/directory per grain).
  • Orleans checks that the target silo is live before routing calls to grains. Therefore, the check needs to incorporate some knowledge of the metacluster.
  • It's the job of the developer to configure a placement provider which uses a globally accessible directory: that is not a part of the metacluster feature.
  • Even when metaclusters are enabled, we should require that placement providers specifically request "the list of all compatible servers in the metacluster" when performing placement. By default, they should only receive the list of local, compatible servers. An alternative is to check some interface property on the provider (e.g., bool IsMetaclusterAware { get; }) before we decide to feed it all compatible servers in the metacluster.
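The opt-in rule in the last bullet can be sketched as a candidate filter. This is a Python illustration only: the Silo record, the two provider classes and candidates_for are hypothetical names, and is_metacluster_aware simply mirrors the IsMetaclusterAware property floated above rather than any existing Orleans interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Silo:
    cluster_id: str
    name: str
    compatible: bool = True   # can host the grain type being placed


class PlacementProvider:
    # Default: the provider is only ever shown silos from its own cluster.
    is_metacluster_aware = False


class GlobalPlacementProvider(PlacementProvider):
    # Explicit opt-in to seeing every compatible silo in the metacluster.
    is_metacluster_aware = True


def candidates_for(provider: PlacementProvider, local_cluster: str,
                   all_silos: List[Silo]) -> List[Silo]:
    """Return the silos a provider may choose from when placing a grain."""
    compatible = [s for s in all_silos if s.compatible]
    if provider.is_metacluster_aware:
        return compatible
    return [s for s in compatible if s.cluster_id == local_cluster]
```

Keeping the default local means existing placement providers behave exactly as before when metaclusters are switched on, which fits the opt-in, pay-for-what-you-use principle stated earlier.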

Non-goals:

  • We want to avoid any notion that this is some kind of panacea for globe-spanning applications which automatically do the optimal thing in any given situation. For example, we are not planning to give Orleans the notion of inter-cluster link latency or bandwidth, Orleans won't divine which cluster is the best place for a given grain, and it has no intrinsic knowledge about data sovereignty, etc. The goal is to provide mechanisms for building globe-spanning applications rather than a packaged solution.
  • Grains are either globally accessible or locally accessible, depending on their configured grain locator: if IMyGrain uses a locally-scoped grain directory in each cluster, then sending a reference to it to a globally-scoped grain in some other cluster will result in a reference which is locally scoped to that foreign cluster, not the originating cluster. If there's a need to have per-cluster grains, that can be accomplished by encoding something into the grain id and interpreting that at placement time.
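One way to read "encoding something into the grain id and interpreting that at placement time" is sketched below in Python. The "cluster/key" format, the helper names and the CRC-based silo choice are all assumptions made for this illustration; none of them come from Orleans.

```python
import zlib
from typing import Dict, List, Tuple

SEPARATOR = "/"   # hypothetical delimiter between cluster id and local key


def make_grain_key(cluster_id: str, local_key: str) -> str:
    """Prefix the grain key with the cluster that should host the grain."""
    if not cluster_id or SEPARATOR in cluster_id:
        raise ValueError("cluster id must be non-empty and must not contain the separator")
    return f"{cluster_id}{SEPARATOR}{local_key}"


def split_grain_key(key: str) -> Tuple[str, str]:
    """Recover (cluster id, local key) from an encoded grain key."""
    cluster_id, sep, local_key = key.partition(SEPARATOR)
    if not sep or not cluster_id:
        raise ValueError(f"grain key {key!r} carries no cluster id")
    return cluster_id, local_key


def pick_silo(key: str, silos_by_cluster: Dict[str, List[str]]) -> str:
    """Placement-time interpretation: only silos of the encoded cluster qualify."""
    cluster_id, _ = split_grain_key(key)
    silos = silos_by_cluster.get(cluster_id)
    if not silos:
        raise LookupError(f"no silos known for cluster {cluster_id!r}")
    # Stable choice within the cluster; crc32 is used because Python's built-in
    # hash() of a string differs between processes.
    return silos[zlib.crc32(key.encode("utf-8")) % len(silos)]
```

Because the cluster id travels inside the key, a reference passed to another cluster still resolves to the originating cluster, which sidesteps the re-scoping behaviour described in the bullet above.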

cc @JohnMorman @juyuz

@JesOb

JesOb commented May 11, 2022

Does that mean Orleans will be able to run with ASP.NET Core in Google App Engine Flexible?

@Magazin80

Is multi-clustering supported in the latest release v3.X? Documentation says that multi-clustering was removed in v2. Does it mean that we should not plan for it for now? Thanks!

@ElanHasson
Contributor

@Magazin80, there is no multi-clustering in v3.x, nor, as of yet, in v7.0 (formerly 4.0). I'll defer to @ReubenBond on the timeline.

@jan-johansson-mr
Contributor

jan-johansson-mr commented Feb 20, 2023

Hi @ReubenBond and @ElanHasson,

The support for metaclusters seems good to me. That is, in my mind, a client can use a cluster meta ID that points to a collection of clusters, where the destination silo may be in any of the "physical" clusters. But I am reacting to a wording in the proposal, and the wording is pay as you go. Does that mean that this metaclusters feature, if done, is only applicable to Azure?

So far, I've been able to use Orleans on-prem and with Azure solutions, but is the direction for some Orleans features to be locked in to Azure in the future? This worries me. Another such feature is transactions, where the transactional provider that ships out of the box with Orleans only works on Azure; no out-of-the-box provider works on ADO.NET. Of course, one can write such a provider, and I did. However, this metaclusters feature, if tailored only for Azure, is something completely different, if this is the intended direction.

Kindly

@0x53A

0x53A commented Feb 22, 2023

> [...] the wording is pay as you go. Does that mean that this metaclusters feature, if done, is only applicable to Azure?

I assume the term was meant on a technical level, not a billing level. Typically, pay as you go means that if you don't use the feature, you don't "pay" for it in regards to performance, complexity, etc.


From my side, I'd be really interested in this for connecting Cloud (typically, but not necessarily, Azure) and On-Premise.

Currently we use a WCF Server in the cloud, a WCF Client on-premise, and then run duplex communication through a persistent Websocket using the net.http Binding and Client Callbacks.

We would be interested in modernizing this solution. The complexity comes from the fact that the On-Premise client / cluster / whatever you want to call it, is NOT directly internet addressable. Instead it will initiate the connection itself, and then everything needs to be routed through this one persistent connection. If the connection breaks, only the On-Premise part can re-establish the connection.

@ReubenBond
Member

> > [...] the wording is pay as you go. Does that mean that this metaclusters feature, if done, is only applicable to Azure?
>
> I assume the term was meant on a technical level, not a billing level. Typically, pay as you go means that if you don't use the feature, you don't "pay" for it in regards to performance, complexity, etc.

Yes, precisely.

> The complexity comes from the fact that the On-Premise client / cluster / whatever you want to call it, is NOT directly internet addressable.

In your case, would a solution based upon a VPN be sufficient?

@0x53A

0x53A commented Feb 23, 2023

> In your case, would a solution based upon a VPN be sufficient?

No, we have multiple customers running our software on-premise, with the cloud portion hosted by us. We don't really want to set up hundreds of VPNs. That's why using HTTP/WebSockets is so valuable compared to raw TCP: it's easy to allow-list on the customer's firewall and easy to route through nginx/app gateway.

@ReubenBond
Member

> No, we have multiple customers running our software on-premise, with the cloud portion hosted by us.

That is a non-goal of Orleans: it is not designed to allow untrusted third parties to connect directly to an Orleans cluster. For that, you should use a gateway, eg based on HTTP APIs, SignalR, gRPC, etc.

@KSemenenko

My case: there is a cluster in the US and a cluster in the EU, each with a WebAPI client and a silo.
Data is not distributed across regions.
What should happen if a user from the US connects to the EU cluster?
In that case, I would like the WebAPI in the EU to be able to connect to the US silo and serve the client.

@ReubenBond
Member

@KSemenenko, is your case addressed by the Orleans.MultiClient repo? Admittedly, that needs to be updated to support Orleans 7.

@StephenStrickland

Any chance this is on the radar for this year?

@alimozdemir

> @KSemenenko, is your case addressed by the Orleans.MultiClient repo? Admittedly, that needs to be updated to support Orleans 7

I believe this should be part of Orleans itself, using keyed services. For example, when I want a client for a particular cluster, I should be able to request it with something like [FromKeyedServices("clusterId")] IClusterClient client. What do you think, @ReubenBond?
