Issues with RunPod as an offline provider #1118

Closed
TheBits opened this issue Apr 11, 2024 · 17 comments

Comments

@TheBits
Contributor

TheBits commented Apr 11, 2024

After RunPod was added, we observed a few issues with its functionality.

  1. The availability information is accurate only at the moment the API call is made.
  2. The list of offers does not show all possible offers.
  3. What problems need to be resolved for RunPod to become an online provider?

In two issues, the integrity tests were adjusted to match the current availability (dstackai/gpuhunt#56, dstackai/gpuhunt#58).

@Bihan
Collaborator

Bihan commented Apr 11, 2024

@TheBits
> 2. The list of offers does not show all possible offers

Are some of the possible offers missing even at the time of the API call?

@peterschmidt85
Contributor

@Bihan
Here's what @TheBits meant:
The static catalog now includes only the offers that were available at the time of its generation.
Later, when a user invokes dstack run, it shows the available offers based on the static catalog, which may no longer be current. Thus, the user may not see some of the offers that are actually available.

@Bihan
Collaborator

Bihan commented Apr 11, 2024

@TheBits
> 3. What problems need to be resolved for RunPod to become an online provider?

A. Best resolution: to become an online provider, RunPod should provide a single API that returns the available machines together with their GPU counts and datacenters. The current API requires the GPU count as a mandatory variable and returns no datacenter information.

B. When the user provides both a gpu_count and a region filter, e.g. dstack run . -b runpod -r EU-SE-1 --gpu 1, we can make it online. But for a plain dstack run . -b runpod we need to call the API multiple times to cover the datacenters and GPU counts. These multiple calls create a performance issue.

Note: I have noticed that RunPod has changed its web layout. It may have changed its API too. I am checking.

@Bihan
Collaborator

Bihan commented Apr 11, 2024

> @Bihan Here's what @TheBits meant: The static catalog now includes only the offers that were available at the time of its generation. Later, when a user invokes dstack run, it shows the available offers based on the static catalog, which may no longer be current. Thus, the user may not see some of the offers that are actually available.

@peterschmidt85 Yes, that is true.

@peterschmidt85
Contributor

peterschmidt85 commented Apr 11, 2024

There is also option C – make our version of the static catalog that includes all the offers.
Each night we would query RunPod's API and check that our internal catalog includes every offer it returns. If we see a new offer, we add it to our catalog.
How about that?
That would take a bit more effort but won't require an API from RunPod.
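
A minimal sketch of the nightly merge described in option C, assuming offers can be compared by a small key. The `Offer` fields and the `merge_offers` helper are hypothetical names used for illustration; they are not taken from gpuhunt.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    # Hypothetical offer key; real catalog rows would carry more fields.
    gpu_name: str
    gpu_count: int
    region: str


def merge_offers(catalog: set[Offer], observed: set[Offer]) -> tuple[set[Offer], set[Offer]]:
    """Return (updated catalog, newly added offers).

    Offers already in the catalog are kept even when tonight's API call
    did not return them, so the catalog only grows over time.
    """
    new_offers = observed - catalog
    return catalog | new_offers, new_offers


catalog = {Offer("A100", 1, "EU-SE-1")}
tonight = {Offer("A100", 1, "EU-SE-1"), Offer("A100", 2, "EU-SE-1")}
catalog, added = merge_offers(catalog, tonight)
print(len(catalog), len(added))
```

Because nothing is ever removed, such a catalog would list every offer seen so far, not only the ones available at generation time; whether an offer is available at run time would still have to be checked separately.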

@Bihan
Collaborator

Bihan commented Apr 11, 2024

> There is also option C – make our version of the static catalog that includes all the offers. Each night we will trigger RunPod's API and check that our internal catalog includes it. If we see a new offer, we add it to our catalog. How about that? That would take a bit more effort but won't require an API from RunPod.

@peterschmidt85 Yes, we can do that, but what if the catalog changes faster than the trigger interval?

I want to share an idea for making RunPod online: follow the same flow that RunPod's web console uses.

E.g.:
Case A: The user requests all offers: dstack run . -b runpod
List the catalog offers online using RunPod's "get GPU types" query. The API's response is:

    [
      {
        "maxGpuCount": 8,
        "id": "NVIDIA A100 80GB PCIe",
        "displayName": "A100 80GB",
        "manufacturer": "Nvidia",
        "memoryInGb": 80,
        "cudaCores": 0,
        "secureCloud": true,
        "communityCloud": true,
        "securePrice": 1.89,         # price for gpu_count = 1
        "communityPrice": 1.59,      # price for gpu_count = 1
        "communitySpotPrice": 0.89   # price for gpu_count = 1
      },
      {...},
      {...}
    ]

This response does not provide the location, but we don't need it because the user has not supplied a region argument and we only need the cheapest offer. The region field can have the value "Any".

Case B: The user requests with a gpu argument: dstack run . -b runpod --gpu 3
List the catalog offers as above with region = "Any". If the first offer has a machine with gpu_count = 3, start provisioning; otherwise automatically move on to the subsequent offers with gpu_count = 3.

I can explore these cases and try an implementation.
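
The two cases above can be sketched roughly as follows. The helper names `expand_offers` and `cheapest` are hypothetical, and the sketch assumes that the price of an N-GPU machine is N times the per-GPU price in the response, which this thread does not confirm.

```python
from __future__ import annotations


def expand_offers(gpu_types: list[dict], price_key: str = "communityPrice") -> list[dict]:
    """Turn each GPU type into one offer per GPU count, cheapest first."""
    offers = []
    for gpu in gpu_types:
        unit_price = gpu.get(price_key)
        if unit_price is None:
            continue  # this GPU type has no price under the chosen tier
        for count in range(1, gpu["maxGpuCount"] + 1):
            offers.append({
                "gpu": gpu["id"],
                "gpu_count": count,
                "region": "Any",  # the response carries no datacenter information
                # Assumption: total price scales linearly with the GPU count.
                "price": round(unit_price * count, 2),
            })
    return sorted(offers, key=lambda o: o["price"])


def cheapest(offers: list[dict], gpu_count: int) -> dict | None:
    """Return the first (cheapest) offer with the requested GPU count, if any."""
    return next((o for o in offers if o["gpu_count"] == gpu_count), None)


gpu_types = [{"id": "NVIDIA A100 80GB PCIe", "maxGpuCount": 8, "communityPrice": 1.59}]
offer = cheapest(expand_offers(gpu_types), gpu_count=3)
print(offer)
```

Falling through to subsequent offers when provisioning fails would be a loop over the filtered list and is left out here.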

@peterschmidt85
Contributor

> This response does not provide location, but we don't need it because user has not supplied region argument and we only need the cheapest offer. The region field can have value "Any"

Not sure I'm fond of this one TBH.

@Bihan
Collaborator

Bihan commented Apr 11, 2024

> > This response does not provide location, but we don't need it because user has not supplied region argument and we only need the cheapest offer. The region field can have value "Any"
>
> Not sure I'm fond of this one TBH.

@peterschmidt85 Getting the datacenter information requires 8 API calls (one per datacenter) and takes 2 s. If 2 s is acceptable performance, then I can make RunPod online.
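
For illustration, the per-datacenter calls could be fanned out over a thread pool and merged, as in the sketch below. The `fetch` callable stands in for the real per-datacenter request and `collect_offers` is a hypothetical helper; whether RunPod's API tolerates concurrent requests, and whether this would bring the total below the 2 s reported here, is not established in this thread.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def collect_offers(
    datacenters: Iterable[str],
    fetch: Callable[[str], list[dict]],
    max_workers: int = 8,
) -> list[dict]:
    """Call fetch(datacenter) for each datacenter concurrently and tag results."""
    datacenters = list(datacenters)
    offers: list[dict] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input datacenters.
        for dc, dc_offers in zip(datacenters, pool.map(fetch, datacenters)):
            offers.extend({**offer, "region": dc} for offer in dc_offers)
    return offers


def fake_fetch(datacenter_id: str) -> list[dict]:
    # Stand-in for the real API call; returns one made-up offer per datacenter.
    return [{"gpu": "A100 80GB", "gpu_count": 1}]


print(collect_offers(["EU-SE-1", "DC-2"], fake_fetch))
```

Passing `fetch` in as a parameter keeps the merging logic testable without network access.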

@peterschmidt85
Contributor

@Bihan But what about the option C I suggested above?

@Bihan
Collaborator

Bihan commented Apr 11, 2024

> @Bihan But what about the option C I suggested above?

@peterschmidt85 What if Runpod changes its catalog before the trigger happens?

@TheBits
Contributor Author

TheBits commented Apr 11, 2024

In my opinion, 2 seconds is not a significant lag.

> But what about the option C I suggested above?

@peterschmidt85 The number of offers with availability fluctuates frequently. At night, there were 207 offers. Right now, the number of offers is between 185 and 189.

@peterschmidt85
Contributor

This issue is stale because it has been open for 30 days with no activity.

@Bihan
Collaborator

Bihan commented May 13, 2024

> This issue is stale because it has been open for 30 days with no activity.

@peterschmidt85 The solution is to implement RunPod as an online provider. However, to do that we need an API that returns all machine types across all datacenters. RunPod does not offer such an API.

We do have a workaround for implementing RunPod as an online provider, but it comes with a performance cost: it takes 2 s to fetch all the offers.

@r4victor
Collaborator

Currently, dstack uses the gpuhunt RunPod catalog, which is collected daily. It includes only the offers available at the time of catalog generation. Since RunPod availability changes throughout the day, some offers may have appeared or disappeared by the time the user runs dstack run.

A potentially good and simple solution could be to start collecting the RunPod catalog more frequently (e.g. every hour). Some offers might still be missing, but that won't be critical. The specific interval is to be determined.

Making RunPod an online provider is not an option at the moment.

@r4victor r4victor removed the major label Jun 11, 2024
@Bihan
Collaborator

Bihan commented Jun 11, 2024

> Currently, dstack uses the gpuhunt RunPod catalog, which is collected daily. It includes only the offers available at the time of catalog generation. Since RunPod availability changes throughout the day, some offers may have appeared or disappeared by the time the user runs dstack run.
>
> A potentially good and simple solution could be to start collecting the RunPod catalog more frequently (e.g. every hour). Some offers might still be missing, but that won't be critical. The specific interval is to be determined.
>
> Making RunPod an online provider is not an option at the moment.

@r4victor This means we need to modify the GitHub workflow to collect the RunPod catalog every hour, while the existing jobs continue to run daily as before. Should I modify the workflow?

@r4victor
Collaborator

@Bihan, yeah, one possible solution would be to separate the backend catalogs. This would require refactoring gpuhunt and also means introducing a new catalog version (v2), since the catalogs would be stored differently.

Alternatively, we can trigger the Collect and publish catalogs workflow more frequently for all providers (e.g. every hour).

@peterschmidt85, this second solution won't cost us much, and I'd recommend it since it's trivial to start with.

@peterschmidt85
Contributor

Agree!
