The tokenizer is implemented in `autopr.utils.tokenizer.get_tokenizer`, and called at `autopr/utils/repo.py:124` and `autopr/repos/completions_repo.py:28`. Currently it uses transformers' `GPT2TokenizerFast`, which doesn't match the tokenization of the OpenAI models we call, so the token counts it produces are wrong.
Here's an example from OpenAI's cookbook on how to calculate token length for messages:
```python
import tiktoken

def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301"):
    """Returns the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model == "gpt-3.5-turbo":
        print("Warning: gpt-3.5-turbo may change over time. Returning num tokens assuming gpt-3.5-turbo-0301.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0301")
    elif model == "gpt-4":
        print("Warning: gpt-4 may change over time. Returning num tokens assuming gpt-4-0314.")
        return num_tokens_from_messages(messages, model="gpt-4-0314")
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif model == "gpt-4-0314":
        tokens_per_message = 3
        tokens_per_name = 1
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
```
Our implementation should support both messages for chat completions models and simple strings for ordinary completions models (the tokenizer currently supports only simple strings).
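One way to support both input shapes is a single counting helper that branches on the input type. The sketch below is a hypothetical design, not existing autopr code: the name `count_tokens` and the `encode` parameter are assumptions. The encoder is passed in as a callable (e.g. a tiktoken `Encoding.encode` bound method) so the same logic works for any model's tokenizer, and the per-message overheads default to the gpt-3.5-turbo-0301 values from the cookbook snippet above.

```python
from typing import Callable, Dict, List, Union

Message = Dict[str, str]

def count_tokens(
    prompt: Union[str, List[Message]],
    encode: Callable[[str], list],
    tokens_per_message: int = 4,
    tokens_per_name: int = -1,
    reply_priming_tokens: int = 3,
) -> int:
    """Count tokens for a plain completion prompt or a chat message list.

    `encode` is the model's tokenizer function; the overhead defaults are
    the gpt-3.5-turbo-0301 values and would need adjusting per model.
    """
    # Ordinary completions models: the prompt is tokenized as-is.
    if isinstance(prompt, str):
        return len(encode(prompt))
    # Chat completions models: add per-message framing overhead.
    num_tokens = 0
    for message in prompt:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    # Every reply is primed with <|start|>assistant<|message|>.
    return num_tokens + reply_priming_tokens
```

Injecting the encoder also makes the logic testable with a stub tokenizer, without downloading any encoding files.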