The tokenizer is implemented in `autopr.utils.tokenizer.get_tokenizer`, and called at `autopr/utils/repo.py:124` and `autopr/repos/completions_repo.py:28`. Currently it uses transformers' `GPT2TokenizerFast`, which doesn't match the tokenization of the OpenAI models we call, so the token counts it produces are wrong.
Here's an example from OpenAI's cookbook on how to calculate token length for messages:
```python
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo":
        print("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")
    elif model == "gpt-4":
        print("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")
        return num_tokens_from_messages(messages, model="gpt-4-0314")
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
```
Our implementation should support both messages for chat completions models and simple strings for ordinary completions models (the tokenizer currently supports only simple strings).
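One way to support both input shapes is a single counting helper that branches on the input type. The sketch below is a hypothetical design, not existing autopr code: the name `count_tokens` and the `encode` parameter are assumptions. The encoder is passed in as a callable (e.g. a tiktoken `Encoding.encode` bound method) so the same logic works for any model's tokenizer, and the per-message overheads default to the gpt-3.5-turbo-0301 values from the cookbook snippet above.

```python
from typing import Callable, Dict, List, Union

Message = Dict[str, str]

def count_tokens(
    prompt: Union[str, List[Message]],
    encode: Callable[[str], list],
    tokens_per_message: int = 4,
    tokens_per_name: int = -1,
    reply_priming_tokens: int = 3,
) -> int:
    """Count tokens for a plain completion prompt or a chat message list.

    `encode` is the model's tokenizer function; the overhead defaults are
    the gpt-3.5-turbo-0301 values and would need adjusting per model.
    """
    # Ordinary completions models: the prompt is tokenized as-is.
    if isinstance(prompt, str):
        return len(encode(prompt))
    # Chat completions models: add per-message framing overhead.
    num_tokens = 0
    for message in prompt:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    # Every reply is primed with <|start|>assistant<|message|>.
    return num_tokens + reply_priming_tokens
```

Injecting the encoder also makes the logic testable with a stub tokenizer, without downloading any encoding files.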