Fail to run server with prefix-caching option #599

@prd-tuong-nguyen

System Info

  • ghcr.io/predibase/lorax:a8ca5cb
  • Ubuntu 20.04
  • GPU A10G

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

docker run --gpus 1 -v ./data:/data -p 8005:80 ghcr.io/predibase/lorax:a8ca5cb \
  --prefix-caching true \
  --port 80 \
  --model-id Open-Orca/Mistral-7B-OpenOrca \
  --cuda-memory-fraction 0.8 \
  --sharded false \
  --max-waiting-tokens 20 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --hostname 0.0.0.0 \
  --max-concurrent-requests 512 \
  --max-best-of 1 \
  --max-batch-prefill-tokens 4096 \
  --max-active-adapters 10 \
  --adapter-source local \
  --adapter-cycle-time-s 2 \
  --json-output \
  --disable-custom-kernels \
  --dtype float16

Expected behavior

The server starts successfully and prefix caching works as expected.
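One rough way to check the expected behavior, once the server is up, is to time two requests that share a long prompt prefix. The sketch below assumes the container from the reproduction command is running with host port 8005 mapped to the server (as in `-p 8005:80`) and that the `/generate` endpoint is reachable. `LORAX_URL`, `build_payload`, and `time_request` are hypothetical names introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the server from the reproduction command is running
# and reachable on host port 8005 (from `-p 8005:80`).
LORAX_URL="${LORAX_URL:-http://localhost:8005}"

# Hypothetical helper: build a /generate request body.
# Only safe for prompts without double quotes or backslashes.
build_payload() {
  printf '{"inputs": "%s", "parameters": {"max_new_tokens": %d}}' "$1" "$2"
}

# Hypothetical helper: send one request and print the total time in seconds.
time_request() {
  curl -s -o /dev/null -w '%{time_total}\n' \
    -X POST "$LORAX_URL/generate" \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" 16)"
}

# Example usage (requires a running server):
#   PREFIX="You are a helpful assistant. Answer briefly."
#   time_request "$PREFIX What is 2+2?"
#   time_request "$PREFIX What is 3+3?"
```

If prefix caching is active, the second request may complete faster than the first, but timings are noisy, so treat this as a smoke test and not as proof either way.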
