feat: Add GPT OSS 20B and 120B #145
Changes from all commits
New file (43 lines): compose service definition for the 120B model.

```yaml
services:
  gpt_120b_gpu:
    image: nillion/nilai-vllm:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ulimits:
      memlock: -1
      stack: 67108864
    env_file:
      - .env
    restart: unless-stopped
    depends_on:
      etcd:
        condition: service_healthy
    command: >
      --model openai/gpt-oss-120b
      --gpu-memory-utilization 0.95
      --max-model-len 100000
      --max-num-batched-tokens 100000
      --tensor-parallel-size 1
      --uvicorn-log-level warning
    environment:
      - SVC_HOST=gpt_120b_gpu
      - SVC_PORT=8000
      - ETCD_HOST=etcd
      - ETCD_PORT=2379
      - TOOL_SUPPORT=true
    volumes:
      - hugging_face_models:/root/.cache/huggingface  # cache models
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      retries: 3
      start_period: 60s
      timeout: 10s
volumes:
  hugging_face_models:
```
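As a usage sketch (the compose file name here is an assumption, not stated in the diff; the service name and health endpoint come from the file above), the service could be brought up and its health endpoint probed like this:

```shell
# Assumed file name; substitute the actual compose file added by this PR.
docker compose -f docker-compose.gpt-120b.yml up -d

# The container's healthcheck polls this same endpoint every 30s;
# probe it manually from inside the service container:
docker compose -f docker-compose.gpt-120b.yml exec gpt_120b_gpu \
  curl -f http://localhost:8000/health
```

Note that `start_period: 60s` gives vLLM a grace window to load the 120B weights before failed probes count toward the `retries: 3` threshold.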
New file (43 lines): compose service definition for the 20B model. It mirrors the 120B service, differing in the service name, model, and a lower `--gpu-memory-utilization` of 0.85.

```yaml
services:
  gpt_20b_gpu:
    image: nillion/nilai-vllm:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ulimits:
      memlock: -1
      stack: 67108864
    env_file:
      - .env
    restart: unless-stopped
    depends_on:
      etcd:
        condition: service_healthy
    command: >
      --model openai/gpt-oss-20b
      --gpu-memory-utilization 0.85
      --max-model-len 100000
      --max-num-batched-tokens 100000
      --tensor-parallel-size 1
      --uvicorn-log-level warning
    environment:
      - SVC_HOST=gpt_20b_gpu
      - SVC_PORT=8000
      - ETCD_HOST=etcd
      - ETCD_PORT=2379
      - TOOL_SUPPORT=true
    volumes:
      - hugging_face_models:/root/.cache/huggingface  # cache models
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      retries: 3
      start_period: 60s
      timeout: 10s
volumes:
  hugging_face_models:
```
Change to the concurrent rate-limit handler (hunk `@@ -109,10 +109,9 @@ async def chat_completion_concurrent_rate_limit(request: Request) -> Tuple[int,`):

```diff
     except ValueError:
         raise HTTPException(status_code=400, detail="Invalid request body")
     key = f"chat:{chat_request.model}"
-    try:
-        limit = MODEL_CONCURRENT_RATE_LIMIT[chat_request.model]
-    except KeyError:
-        raise HTTPException(status_code=400, detail="Invalid model name")
+    limit = MODEL_CONCURRENT_RATE_LIMIT.get(
+        chat_request.model, MODEL_CONCURRENT_RATE_LIMIT.get("default", 50)
+    )
     return limit, key
```

Comment on lines +112 to +114 (Member, Author):

> This change is the most relevant one. If `MODEL_CONCURRENT_RATE_LIMIT` has no entry for the given model, the lookup falls back to the `"default"` entry, which works for any model, and otherwise to 50. This prevents a failure state in most cases.
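The fallback behavior described in the comment above can be sketched in isolation; the table contents below are hypothetical stand-ins, only the lookup pattern matches the PR:

```python
# Hypothetical stand-in for the service's per-model concurrency table.
MODEL_CONCURRENT_RATE_LIMIT = {
    "openai/gpt-oss-120b": 20,
    "default": 30,
}

def concurrent_limit(model: str) -> int:
    # Unknown models fall back to the "default" entry; if no "default"
    # entry is configured either, a hard-coded 50 is used. No model name
    # can trigger a KeyError anymore.
    return MODEL_CONCURRENT_RATE_LIMIT.get(
        model, MODEL_CONCURRENT_RATE_LIMIT.get("default", 50)
    )

print(concurrent_limit("openai/gpt-oss-120b"))  # 20 (explicit entry)
print(concurrent_limit("some/unknown-model"))   # 30 (falls back to "default")
```

This trades a 400 response for a permissive default, so misspelled model names are now rate-limited rather than rejected at this layer.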