
Automatically choose workspace-cluster based on lowest latency. #5596

Open
meysholdt opened this issue Sep 8, 2021 · 12 comments
Labels: aspect: performance · meta: never-stale · team: webapp · type: feature request

Comments

@meysholdt
Member

Context: #5534 (comment)

Problem Statement

We currently have workspace clusters in one region in the EU and one region in the US. To offer service at a good latency (e.g. < 100 ms), we will need more clusters, maybe as many as one or two per continent. See https://gcping.com/ for your personal latency to every Google Cloud region. See the GCP network map for available regions and connections between them.

Prior Art

Proposed Solution

The user's web browser should measure the latency to every available workspace cluster and send the measurements to the gitpod-server, so that the server can make an informed decision about which workspace-cluster is best for the user.

Considerations

  • latency measurement should not slow down workspace startup time
  • the decision which workspace-cluster to choose should remain with the gitpod-server, because in the future other factors besides latency may influence the decision, for example cluster health.

Proposed Design Choices:

  • to keep workspace startup fast, the latency measurement should be cached, for example in a cookie in the web browser.
  • to keep workspace startup fast, the latency measurement should preferably not happen when a workspace starts, but when the user visits any Gitpod website.
  • every workspace cluster should have a public endpoint that can be "pinged" from the web browser for latency measurement.
  • the server should make a cache-key and the ws-cluster endpoints available to the user. The cache-key should encode the user's public IP address, so that the latency is measured again if the user changes their network.

Example Flow 1:

  1. the user visits gitpod.io/workspaces.
  2. the user's browser receives {"cache-key": "FJJDSKD", "clusters": {"us07": "https://us07.gitpod.io/ping", "sing01": "https://sing01.gitpod.io/ping"}}
  3. the user's browser measures the latency to all clusters in the background and stores the result in a cookie: {"us07": 230, "sing01": 60}
  4. when the user opens a workspace, the cookie is sent to the gitpod-server, and the server uses the latency measurements to choose the best workspace cluster.
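A minimal browser-side sketch of steps 2–4 in TypeScript; the /ping endpoints, the cookie name, and the response shape are assumptions taken from this proposal, not an existing API:

```ts
// Hypothetical response shape from the gitpod-server (step 2 above).
interface ClusterPingInfo {
  "cache-key": string;
  clusters: Record<string, string>; // cluster name -> ping URL
}

// Measure a single round trip to a cluster's ping endpoint.
async function measureRtt(pingUrl: string): Promise<number> {
  const start = performance.now();
  await fetch(pingUrl, { mode: "no-cors", cache: "no-store" });
  return Math.round(performance.now() - start);
}

// Ping all clusters in parallel and cache the results in a cookie,
// together with the server-provided cache-key (step 3 above).
async function measureAndCacheLatencies(info: ClusterPingInfo): Promise<void> {
  const entries = await Promise.all(
    Object.entries(info.clusters).map(
      async ([name, url]) => [name, await measureRtt(url)] as const,
    ),
  );
  const payload = { key: info["cache-key"], latencies: Object.fromEntries(entries) };
  document.cookie =
    "gitpod-cluster-latency=" +
    encodeURIComponent(JSON.stringify(payload)) +
    "; path=/; max-age=86400";
}
```

A single-shot fetch measures TCP/TLS setup plus one round trip rather than a pure ping, but as a relative ranking between clusters that is usually good enough.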

Example Flow 2:

  1. the user opens a workspace. The cookie is already there. No delay during workspace-start.

Example Flow 3:

  1. the user opens a workspace. The cookie is not yet there. This is the case we want to avoid, but I don't think it can be avoided all the time.
  2. measure the latency. Maybe the measurement can be aborted when the first workspace-cluster responds, because the first to respond will also be the one with the lowest latency (duh!). While there is the risk that the measurement is slightly inaccurate and repeated measurements would be needed for more accurate results, it seems like a good compromise to preserve fast workspace startup time. This way, if no cookie is present, 15 to ~200 ms will be added to the workspace startup time.
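The abort-on-first-response idea in step 2 maps directly onto Promise.any; a sketch, assuming the same hypothetical ping endpoints as above:

```ts
// Resolve as soon as the first cluster answers. In a single-shot measurement,
// the first responder is by definition the one with the lowest observed RTT.
// Promise.any also ignores clusters whose ping fails entirely.
async function pickFirstResponder(clusters: Record<string, string>): Promise<string> {
  return Promise.any(
    Object.entries(clusters).map(async ([name, url]) => {
      await fetch(url, { mode: "no-cors", cache: "no-store" });
      return name;
    }),
  );
}
```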
@csweichel
Contributor

Excellent idea - but we really don't have time for this right now.
We'll want to revisit workspace cluster selection once we make a decision on multi-meta.

@jankeromnes
Contributor

jankeromnes commented Sep 16, 2021

Prior Art

FYI, that proposal is to temporarily gather ping times to all possible GCP regions, in order to decide "where should we create a brand new cluster next?" (and then stop collecting ping times, make a decision, and create the cluster)

The proposal was not to collect ping times in order to decide "which workspace cluster should be used right now?" -- doesn't GCP's load balancer already do that automatically? How does the US vs EU selection work right now? (I assume it's not some custom code we wrote, but GCP selecting a reasonable cluster automatically -- I would hope this would also work with 3 or more clusters without requiring us to write custom code for this)

@bigint

bigint commented Sep 16, 2021

I think the selection algorithm is broken. I'm from India, where the nearby location is the EU, but whenever I fire up a new workspace it gets created in the US region.

Also, when I tried with a VPN from Vienna, the workspace was created in the EU region.

🤔

@jankeromnes
Contributor

jankeromnes commented Oct 11, 2021

⚠️ Just to reiterate: This issue suspiciously sounds like we want to re-implement something as standard as a load balancer.

I don't think we want to implement and maintain custom code that measures latency, caches it, and acts upon this data.

If possible, it would be much preferable to let Google Cloud pick the best workspace cluster automatically(!)


Inspiration: Best practices for Compute Engine regions selection > Use Cloud Load Balancing and Cloud CDN:

Cloud Load Balancing, such as HTTP(S) load balancing, TCP, and SSL proxy load balancing, let you automatically redirect users to the closest region where there are backends with available capacity.

@csweichel
Contributor

I don't think we want to implement and maintain custom code that measures latency, caches it, and acts upon this data.

If possible, it would be much preferable to let Google Cloud pick the best workspace cluster automatically(!)

Cloud Load Balancing, such as HTTP(S) load balancing, TCP, and SSL proxy load balancing, let you automatically redirect users to the closest region where there are backends with available capacity.

The reason we need to build/maintain something ourselves is that the StartWorkspace request, which would need to be regional, does not go through a regional load balancer: it's issued from server to ws-manager, not from the (regional) user's browser.

@csweichel
Contributor

csweichel commented Nov 15, 2021

The minimal steps to make automatic cluster choices would be:

  1. add a kind of "ping" endpoint to ws-proxy, so that e.g. ws-eu18.gitpod.io does not answer with 404
  2. add a getAllRegions function to WorkspaceManagerClientProvider which returns a list of ping URLs and names.
  3. make the dashboard execute the RTT pings as outlined above.
  4. extend the createWorkspace and startWorkspace calls on server so that they take a cluster preference, which would then be passed in via the ExtendedUser and become an admission preference. Note that this way the cluster preference plays nicely with the cluster score and status.
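A rough TypeScript sketch of how steps 3 and 4 could fit together on the server side; the type names and report shape are assumptions for illustration, not the actual server API:

```ts
// Hypothetical: RTT measurements collected by the dashboard (step 3).
interface ClusterLatencyReport {
  cacheKey: string;
  latencies: Record<string, number>; // cluster name -> RTT in ms
}

// Derive the cluster preference passed into createWorkspace/startWorkspace
// (step 4). With no measurement, return undefined so the existing
// score/status-based selection applies unchanged.
function chooseClusterPreference(report?: ClusterLatencyReport): string | undefined {
  if (!report || Object.keys(report.latencies).length === 0) {
    return undefined;
  }
  return Object.entries(report.latencies)
    .sort(([, rttA], [, rttB]) => rttA - rttB)[0][0];
}
```

Treating the preference as one admission input rather than a hard override is what keeps it compatible with the cluster score and status mentioned in step 4.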

Offline we discussed the option of making the workspace cluster (or region) choice explicit on the dashboard. By default we'd select the cluster with the lowest RTT (as outlined above).

However, focusing on the individual cluster instead of a region has several drawbacks:

  • it's noisy on the dashboard because clusters change very often (with every new workspace deployment)
  • we need to measure often because of the many cluster changes

Instead, we could introduce regions for clusters. We'd introduce a new region field as an admission constraint and on the ws-manager-bridge API. New cluster registrations could provide the region when they're registered. We'd assume that, from a latency perspective, all clusters within a region are equivalent, i.e. a measurement against one cluster carries over to every other cluster in the same region.
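A sketch of what that region field could look like on the registration side; the field and type names are assumptions about the ws-manager-bridge API, not its actual schema:

```ts
// Hypothetical registration payload carrying the proposed region field.
interface WorkspaceClusterRegistration {
  name: string;   // e.g. "eu18" — changes with every workspace deployment
  url: string;
  region: string; // e.g. "europe" — stable across deployments
}

// With region-level measurements, one RTT sample covers every cluster in
// that region, so frequent cluster redeployments don't invalidate it.
function rttForCluster(
  cluster: WorkspaceClusterRegistration,
  regionRtts: Record<string, number>, // region -> measured RTT in ms
): number | undefined {
  return regionRtts[cluster.region];
}
```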

@jankeromnes jankeromnes added aspect: performance anything related to performance type: feature request New feature or request labels Dec 8, 2021
@meysholdt
Member Author

Not sure why this got labeled "platform". The enhancements would mostly need to happen in components owned by the meta team.

@meysholdt meysholdt added team: webapp Issue belongs to the WebApp team and removed team: devx labels Dec 22, 2021
@csweichel csweichel moved this to In Progress in 🌌 Workspace Team Jan 6, 2022
@atduarte atduarte changed the title Automatically chose workspace-cluster based on lowest latency. Automatically choose workspace-cluster based on lowest latency. Jan 29, 2022
@kylos101 kylos101 moved this from In Progress to Scheduled in 🌌 Workspace Team Jan 31, 2022
@kylos101 kylos101 removed the status in 🌌 Workspace Team Feb 24, 2022
@stale

stale bot commented Apr 30, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Apr 30, 2022
@bigint

bigint commented Apr 30, 2022

This is still not yet fixed 🤔

From India it always chooses US clusters instead of EU.

@stale stale bot removed the meta: stale This issue/PR is stale and will be closed soon label Apr 30, 2022
@stale

stale bot commented Jul 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Jul 31, 2022
@stale stale bot closed this as completed Aug 13, 2022
@stale stale bot moved this to Done in 🍎 WebApp Team Aug 13, 2022
@chientrm

Nah. Just get the user's coordinates via their IP address and pick the nearest server. Every server should be located in a city. AFAIK Gitpod's running on GCP.
Moreover, many cloud providers like Cloudflare Pages/Workers already append the IP and lat/long to HTTP request headers 🤭.
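For illustration, Cloudflare Workers do expose the visitor's approximate location on request.cf (latitude/longitude, as strings). A nearest-server sketch under the assumption that each cluster has known coordinates; the cluster names and coordinates here are made up:

```ts
// Hypothetical cluster coordinates (assumed GCP region locations).
const CLUSTERS: Record<string, { lat: number; lon: number }> = {
  us07: { lat: 41.26, lon: -95.86 }, // assumed: us-central1 (Council Bluffs)
  eu45: { lat: 50.45, lon: 3.82 },   // assumed: europe-west1 (St. Ghislain)
};

// Great-circle distance (haversine), in kilometers.
function distanceKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const rad = (d: number) => (d * Math.PI) / 180;
  const dLat = rad(lat2 - lat1);
  const dLon = rad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(lat1)) * Math.cos(rad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 6371 * 2 * Math.asin(Math.sqrt(a));
}

function nearestCluster(lat: number, lon: number): string {
  return Object.entries(CLUSTERS)
    .map(([name, c]) => [name, distanceKm(lat, lon, c.lat, c.lon)] as const)
    .sort(([, a], [, b]) => a - b)[0][0];
}
```

The caveat, and why the latency-measurement proposal above is arguably more robust: geographic distance is only a proxy for network latency, and IP geolocation can be wrong, e.g. behind a VPN, as in the report earlier in this thread.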

@meysholdt meysholdt reopened this Sep 21, 2022
Repository owner moved this from Done to In Progress in 🍎 WebApp Team Sep 21, 2022
@axonasif axonasif removed the status in 🍎 WebApp Team Sep 21, 2022
@stale stale bot closed this as completed Oct 19, 2022
@stale stale bot moved this to In Validation in 🍎 WebApp Team Oct 19, 2022
@kylos101 kylos101 reopened this Dec 8, 2022
Repository owner moved this from In Validation to Scheduled in 🍎 WebApp Team Dec 8, 2022
@kylos101
Contributor

kylos101 commented Dec 8, 2022

👋 @geropl reopening, perhaps something we can discuss to see if it can be included in an iteration early next year?
