ENH: Removed the max tokens limitation and boost performance by avoid unnecessary repeated cuda device detection. #1429

Merged · 6 commits into xorbitsai:main · May 8, 2024

Conversation

@mikeshi80 (Contributor) commented May 6, 2024

Revert codeqwen1.5 to a 64k context length.

Also optimized the code to avoid repeatedly querying whether a CUDA device exists and counting the CUDA devices, to boost performance.
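For reference, here is a minimal sketch of the caching approach described above. This is not the actual diff: the helper name cuda_count and the use of functools.lru_cache are assumptions, but the idea is the same, detect the CUDA devices once and reuse the result.

import functools

import torch

@functools.lru_cache(maxsize=None)
def cuda_count() -> int:
    # Detect CUDA availability and count the devices exactly once;
    # every later call returns the memoized result instead of
    # re-querying the driver.
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.device_count()

Hot paths that previously called torch.cuda.is_available() and torch.cuda.device_count() on every request then pay the detection cost only once per process.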

@XprobeBot added the gpu label May 6, 2024
@XprobeBot added this to the v0.11.0 milestone May 6, 2024
@qinxuye changed the title from "Remove max tokens limitation" to "ENH: Remove max tokens limitation" May 6, 2024
@XprobeBot added the enhancement (New feature or request) label May 6, 2024
@qinxuye (Contributor) commented May 6, 2024

Can you show that, without modifying the max_tokens field, passing a large max_tokens (e.g. 64k) causes an issue with chat?

@mikeshi80 changed the title from "ENH: Remove max tokens limitation" to "ENH: Boost performance by avoid unnecessary repeated cuda device detection." May 6, 2024
@mikeshi80 changed the title from "ENH: Boost performance by avoid unnecessary repeated cuda device detection." to "ENH: Removed the max tokens limitation and boost performance by avoid unnecessary repeated cuda device detection." May 7, 2024
@mikeshi80 (Contributor, Author) commented

Because the max_tokens field is constrained so that its value cannot exceed 32768, the server reports the following error when max_tokens goes above that limit:

xinference-local: raise validation_error
xinference-local: pydantic.v1.error_wrappers.ValidationError: 1 validation error for CreateChatCompletion
xinference-local: max_tokens
xinference-local: ensure this value is less than or equal to 32768 (type=value_error.number.not_le; limit_value=32768)

I also printed CreateChatCompletion's schema with schema_json. Note: some properties have been omitted below to keep the text short.

{
  "title": "CreateChatCompletion",
  "type": "object",
  "properties": {
    "frequency_penalty": {
      "title": "Frequency Penalty",
      "default": 0.0,
      "type": "number"
    },
    "logit_bias": {
      "title": "Logit Bias",
      "type": "object",
      "additionalProperties": {
        "type": "number"
      }
    },
    "logprobs": {
      "title": "Logprobs",
      "type": "integer"
    },
    "max_tokens": {
      "title": "Max Tokens",
      "description": "The maximum number of tokens to generate.",
      "default": 1024,
      "minimum": 1,
      "maximum": 32768,
      "type": "integer"
    },
    "stream_interval": {
      "title": "Stream Interval",
      "default": 2,
      "type": "integer"
    }
  },
  "required": [
    "model"
  ],
  "additionalProperties": false
}

The limit on max_tokens is defined in the schema itself, so CreateChatCompletion.parse_obj raises a validation error whenever max_tokens exceeds 32768.
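A minimal reproduction of this behavior, assuming the pydantic v1 compatibility API that the traceback shows (pydantic.v1). The model below is a stripped-down stand-in for the real CreateChatCompletion, not its actual definition:

from pydantic.v1 import BaseModel, Field, ValidationError

class CreateChatCompletion(BaseModel):
    # le=32768 is what produces "maximum": 32768 in the schema above.
    max_tokens: int = Field(default=1024, ge=1, le=32768)

try:
    CreateChatCompletion.parse_obj({"max_tokens": 65536})
except ValidationError as e:
    print(e)  # ensure this value is less than or equal to 32768

Dropping the upper bound while keeping the lower one removes the limitation, in the spirit of this PR:

class CreateChatCompletionUnbounded(BaseModel):
    max_tokens: int = Field(default=1024, ge=1)

CreateChatCompletionUnbounded.parse_obj({"max_tokens": 65536})  # now valid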

@qinxuye (Contributor) commented May 8, 2024

Could you rebase onto the main branch?

@mikeshi80 (Contributor, Author) commented

Synchronized.

@qinxuye (Contributor) left a comment


LGTM

@qinxuye merged commit dc8124b into xorbitsai:main May 8, 2024
12 checks passed