[RFC]: Refactor CI/CD

### Motivation.

vLLM's CI/CD has built up over the years and has grown in a less than ideal way.

We have the following problems:
- CI takes a very long time, especially for a per-commit cycle
- CI has failures that cannot be reproduced on every machine because of numerical differences
- CI fails on models that are outside the 80/20 of our usage, even though those tests run on every commit
- Failures in early tests often leave vLLM without a proper cleanup, which causes failures across many later tests and makes it hard to identify what is actually wrong
- CI does NOT cover the models that actually matter (because there is not enough GPU memory) or the hardware that actually matters to the majority of our users
- CI does NOT cover performance!
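
To illustrate the cleanup problem, here is a minimal sketch of one way to stop an early failure from leaking state into later tests. `FakeEngine` and `managed_engine` are hypothetical names invented for this example, not vLLM APIs. The only point is that teardown sits in a `finally` block, so it runs even when the test body raises.

```python
import contextlib


class FakeEngine:
    """Hypothetical stand-in for a resource that must be released after a test."""

    live = 0  # number of engines currently holding resources

    def __init__(self):
        FakeEngine.live += 1

    def shutdown(self):
        FakeEngine.live -= 1


@contextlib.contextmanager
def managed_engine():
    """Yield an engine and always shut it down, even if the test body fails."""
    engine = FakeEngine()
    try:
        yield engine
    finally:
        # Runs on success and on failure, so later tests start from a clean state.
        engine.shutdown()


def run_failing_test():
    """Simulate a test that fails while the engine is still alive."""
    with managed_engine():
        raise AssertionError("simulated early test failure")
```

With this pattern, a failing test still releases what it acquired, so a single early failure is less likely to show up as a cascade of unrelated failures.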

These issues create a bad developer experience for vLLM and cause problems like the CI "death spiral", where we get into a cycle of force-merges.

These issues also create challenges for vLLM's multiple hardware backends, because we are unable to get a reliable signal that keeps blocking issues out.

### Proposed Change.

The CI needs to be completely refactored and the culture around vLLM CI needs to be significantly improved.

Overall Goals:
- Remove V0 tests and migrate any missing coverage into V1
- Reduce per-commit CI to <15 minutes end-to-end
- Refactor all tests that rely on golden strings or exact numerics so that they are stable
- Refocus model testing onto the top 10 models that matter to 99% of users
- Acquire hardware resources that match the deployment patterns which matter to our users
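
As one possible reading of the golden-string goal above, a test could compare per-token log-probabilities within a tolerance instead of requiring an exact string match. This is a sketch only: `logprobs_close` is a hypothetical helper and the tolerance values are placeholder assumptions, not settings proposed in this RFC.

```python
import math


def logprobs_close(reference, candidate, rel_tol=1e-2, abs_tol=1e-3):
    """Return True if two sequences of log-probabilities agree within tolerance.

    The default tolerances are illustrative assumptions, not tuned values.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )
```

A check of this kind tolerates small hardware-dependent drift while still failing on a real regression, for example `logprobs_close([-0.105, -2.30], [-0.1051, -2.301])` passes whereas `logprobs_close([-0.105], [-0.5])` does not.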

This will require a concerted effort from the vLLM community.

We would appreciate anyone's help in executing this effort. We are starting next week.

There are also some good ideas in this issue: https://github.com/vllm-project/vllm/issues/20218

### Feedback Period.

3 days

### CC List.

@russellb @njhill @kushanam @shajrawi @simon-mo @andy-neuma @dougbtv 
 

### Any Other Things.

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[RFC]: Refactor CI/CD #22992