Open
Description
System Info
- ghcr.io/predibase/lorax:a8ca5cb
- Ubuntu 20.04
- GPU A10G
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
docker run --gpus 1 -v ./data:/data -p 8005:80 ghcr.io/predibase/lorax:a8ca5cb \
--prefix-caching true \
--port 80 \
--model-id Open-Orca/Mistral-7B-OpenOrca \
--cuda-memory-fraction 0.8 \
--sharded false \
--max-waiting-tokens 20 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--hostname 0.0.0.0 \
--max-concurrent-requests 512 \
--max-best-of 1 \
--max-batch-prefill-tokens 4096 \
--max-active-adapters 10 \
--adapter-source local \
--adapter-cycle-time-s 2 \
--json-output \
--disable-custom-kernels \
--dtype float16
Expected behavior
The server starts successfully and prefix caching works as expected.
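One rough way to check this once the server is up is to send two requests that share a long prompt prefix. The sketch below is hedged: it assumes the server is reachable on host port 8005 (from the `-p 8005:80` mapping above) and that the standard LoRAX `/generate` REST endpoint is used. The `LORAX_URL` and `RUN_REQUESTS` variables, the prompt text, and the helper function names are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: send two prompts that share a prefix to exercise prefix caching.
# Assumption: host port 8005 maps to the container's port 80.
LORAX_URL="${LORAX_URL:-http://127.0.0.1:8005}"

# Hypothetical shared prefix; a real test would use a much longer one.
SHARED_PREFIX="You are a helpful assistant. Answer concisely."

# Build the JSON body for /generate.
# The inputs here contain no quotes or backslashes, so plain printf is enough.
build_payload() {
  printf '{"inputs": "%s %s", "parameters": {"max_new_tokens": 32}}' \
    "$SHARED_PREFIX" "$1"
}

# POST one prompt to the server and print the raw JSON response.
send_request() {
  curl -sS -X POST "$LORAX_URL/generate" \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$1")"
}

# Only contact the server when explicitly requested.
if [ "${RUN_REQUESTS:-0}" = "1" ]; then
  send_request "What is LoRA?"
  send_request "What is a KV cache?"
fi
```

Running it with `RUN_REQUESTS=1` sends both prompts. If prefix caching is active, the second request should be able to reuse the cached shared prefix, though this script does not measure that by itself.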