ENH: Removed the max tokens limitation and boost performance by avoid unnecessary repeated cuda device detection. #1429

Merged · 6 commits into xorbitsai:main · May 8, 2024

Conversation

@mikeshi80 (Contributor) commented May 6, 2024

Revert codeqwen1.5 to a 64k context length.

Also optimized the code to avoid repeatedly querying whether a CUDA device exists and counting the CUDA devices, to boost performance.
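For reference, here is a minimal sketch of the caching approach described above. This is not the actual diff: the helper name cuda_count and the use of functools.lru_cache are assumptions, but the idea is the same, detect the CUDA devices once and reuse the result.

import functools

import torch

@functools.lru_cache(maxsize=None)
def cuda_count() -> int:
    # Detect CUDA availability and count the devices exactly once;
    # every later call returns the memoized result instead of
    # re-querying the driver.
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.device_count()

Hot paths that previously called torch.cuda.is_available() and torch.cuda.device_count() on every request then pay the detection cost only once per process.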

@XprobeBot added the gpu label May 6, 2024
@XprobeBot added this to the v0.11.0 milestone May 6, 2024
@qinxuye changed the title from "Remove max tokens limitation" to "ENH: Remove max tokens limitation" May 6, 2024
@XprobeBot added the enhancement (New feature or request) label May 6, 2024
@qinxuye (Contributor) commented May 6, 2024

Can you show that, without modifying the max_tokens field, passing a large max_tokens (e.g. 64k) causes an issue with chat?

@mikeshi80 changed the title from "ENH: Remove max tokens limitation" to "ENH: Boost performance by avoid unnecessary repeated cuda device detection." May 6, 2024
@mikeshi80 changed the title from "ENH: Boost performance by avoid unnecessary repeated cuda device detection." to "ENH: Removed the max tokens limitation and boost performance by avoid unnecessary repeated cuda device detection." May 7, 2024
@mikeshi80 (Contributor, Author) commented

Because the max_tokens field is constrained so that its value cannot exceed 32768, the server reports the following error when max_tokens goes above that limit:

xinference-local: raise validation_error
xinference-local: pydantic.v1.error_wrappers.ValidationError: 1 validation error for CreateChatCompletion
xinference-local: max_tokens
xinference-local: ensure this value is less than or equal to 32768 (type=value_error.number.not_le; limit_value=32768)

I also printed CreateChatCompletion's schema with schema_json. Note: some properties have been omitted below to keep the text short.

{
  "title": "CreateChatCompletion",
  "type": "object",
  "properties": {
    "frequency_penalty": {
      "title": "Frequency Penalty",
      "default": 0.0,
      "type": "number"
    },
    "logit_bias": {
      "title": "Logit Bias",
      "type": "object",
      "additionalProperties": {
        "type": "number"
      }
    },
    "logprobs": {
      "title": "Logprobs",
      "type": "integer"
    },
    "max_tokens": {
      "title": "Max Tokens",
      "description": "The maximum number of tokens to generate.",
      "default": 1024,
      "minimum": 1,
      "maximum": 32768,
      "type": "integer"
    },
    "stream_interval": {
      "title": "Stream Interval",
      "default": 2,
      "type": "integer"
    }
  },
  "required": [
    "model"
  ],
  "additionalProperties": false
}

The limit on max_tokens is defined in the schema itself, so CreateChatCompletion.parse_obj raises a validation error whenever max_tokens exceeds 32768.
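A minimal reproduction of this behavior, assuming the pydantic v1 compatibility API that the traceback shows (pydantic.v1). The model below is a stripped-down stand-in for the real CreateChatCompletion, not its actual definition:

from pydantic.v1 import BaseModel, Field, ValidationError

class CreateChatCompletion(BaseModel):
    # le=32768 is what produces "maximum": 32768 in the schema above.
    max_tokens: int = Field(default=1024, ge=1, le=32768)

try:
    CreateChatCompletion.parse_obj({"max_tokens": 65536})
except ValidationError as e:
    print(e)  # ensure this value is less than or equal to 32768

Dropping the upper bound while keeping the lower one removes the limitation, in the spirit of this PR:

class CreateChatCompletionUnbounded(BaseModel):
    max_tokens: int = Field(default=1024, ge=1)

CreateChatCompletionUnbounded.parse_obj({"max_tokens": 65536})  # now valid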

@qinxuye (Contributor) commented May 8, 2024

Could you rebase onto the main branch?

@mikeshi80 (Contributor, Author) commented

Synchronized.

@qinxuye (Contributor) left a comment


LGTM

@qinxuye merged commit dc8124b into xorbitsai:main May 8, 2024
12 checks passed