
performance improvements, #2

Open · wants to merge 1 commit into main
Conversation

kayhantolga

Hello,

Thank you for your hard work on the project. I made some improvements to the code with help from ChatGPT, and they resulted in a 40% increase in performance. I plan to use the improved version of the tokenizer in my OpenAI SDK project, and I suggest you consider implementing the changes too. All the unit tests passed, but my ReSharper made some unnecessary code changes; I apologize for that.

Additionally, I think changing BYTES_TO_UNICODE_CACHE to a Dictionary may provide a slight performance boost, but I'm not sure if it's safe. I don't have enough knowledge on this topic yet.
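
For illustration only, roughly what I mean (the field and helper names here are placeholders, not the repo's actual code); a plain Dictionary should be safe as long as it is fully built before first use and never modified afterwards:

    // Hypothetical sketch (not the repo's actual code): build the byte-to-unicode map once
    // as a plain Dictionary. Concurrent reads of a Dictionary are thread-safe as long as
    // nothing writes to it after initialization.
    private static readonly Dictionary<byte, char> BYTES_TO_UNICODE_CACHE = BuildBytesToUnicode();

    private static Dictionary<byte, char> BuildBytesToUnicode()
    {
        // Standard GPT-2/GPT-3 mapping: printable bytes keep their own code point,
        // every other byte is shifted above 255 so it still has a visible stand-in character.
        var map = new Dictionary<byte, char>(256);
        var shift = 0;
        for (var b = 0; b < 256; b++)
        {
            var printable = (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
            map[(byte)b] = printable ? (char)b : (char)(256 + shift++);
        }
        return map;
    }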

my repo: https://github.com/betalgo/openai
[benchmark1 screenshot]

Commit: lang version support changed to latest
dluc (Owner) commented Mar 22, 2023

thank you @kayhantolga this is great! :-)

FYI I'm also porting the tokenizer into Semantic Kernel ("SK") and will take these improvements over there too, so this is perfect timing. As a plus, with SK the library will also be available as a NuGet package signed by Microsoft, which is safer for production.

The repo here will remain up anyway, and I hope to check in some other work in the future.

By the way, since you're looking at this: I found that the OpenAI regex seems to be vulnerable to ReDoS (regular expression denial of service).
Any chance you looked into that too?

string pat = @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+";

I'm going to add a timeout to the regex; you might want to copy that too.
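
A minimal sketch of what that could look like (the field name, the Compiled option, and the 5-second value are illustrative choices, not settled decisions):

    // Sketch: compile the pattern once and give it a match timeout so a pathological input
    // throws RegexMatchTimeoutException instead of hanging the process.
    private static readonly Regex EncodingRegex = new Regex(
        @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+",
        RegexOptions.Compiled,
        TimeSpan.FromSeconds(5));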

dluc (Owner) commented Mar 23, 2023

@kayhantolga I tried the changes with a single big file and could not see any perf change, so it looks like the difference only shows up around when the cache instances are created. Could you share some info about the stress test you ran?

kayhantolga (Author) commented

I was thinking of publishing my tests, but I realized they are a mess at the moment. I need to tidy them up first, but I've attached a screenshot you can check. 😅

My improvements focused on caching the regular expressions. The gain is most noticeable when handling many requests with small pieces of data.

Unfortunately, I do not have experience with regex, so I can't help with ReDoS.

Another improvement we made in our repository is returning results lazily with yield and adding another method that only returns the token count. These changes were not included in the benchmark, but I believe they will also improve performance. (Code shared below.)

[screenshot of benchmark results]

    /// <summary>
    ///     Encode the text into GPT-3 token ids. This method expects LF style EOL; if your text uses CR LF style EOL, set cleanUpCREOL to true.
    /// </summary>
    /// <param name="text">The text to encode</param>
    /// <param name="cleanUpCREOL">Set to true to strip CR characters before encoding</param>
    /// <returns>The sequence of token ids</returns>
    public static IEnumerable<int> Encode(string text, bool cleanUpCREOL = false)
    {
        if (string.IsNullOrEmpty(text))
        {
            yield break;
        }

        if (cleanUpCREOL)
        {
            text = text.Replace("\r", "");
        }

        foreach (var newToken in GetNewTokens(text))
        {
            yield return TokenizerGpt3Settings.Encoder[newToken];
        }
    }

    /// <summary>
    ///     Get the token count. This method expects LF style EOL; if your text uses CR LF style EOL, set cleanUpCREOL to true.
    /// </summary>
    /// <param name="text">The text to count tokens for</param>
    /// <param name="cleanUpCREOL">Set to true to strip CR characters before counting</param>
    /// <returns>The number of tokens</returns>
    public static int TokenCount(string text, bool cleanUpCREOL = false)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }

        if (cleanUpCREOL)
        {
            text = text.Replace("\r", "");
        }

        return GetNewTokens(text).Count();
    }

    private static IEnumerable<string> GetNewTokens(string text)
    {
        // Reuse the cached byte-to-unicode map and the precompiled regex on every call.
        var byteEncoder = BytesToUnicodeCache;
        var matches = EncodingRegex.Matches(text);

        foreach (var match in matches.Cast<Match>())
        {
            // Re-encode each match's UTF-8 bytes as their unicode stand-in characters,
            // then apply byte pair encoding and split the result into individual tokens.
            var tokenBytes = Encoding.UTF8.GetBytes(match.Value);
            var token = new string(Array.ConvertAll(tokenBytes, x => byteEncoder[x]));
            var newTokens = BytePairEncoding(token).Split(' ');

            foreach (var newToken in newTokens)
            {
                yield return newToken;
            }
        }
    }
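
A quick usage sketch for the two entry points (TokenizerGpt3 is just a placeholder class name for wherever these methods live):

    // TokenizerGpt3 is a hypothetical class name; adjust to wherever Encode/TokenCount are defined.
    var text = "The quick brown fox jumps over the lazy dog.";

    // Encode is lazy: token ids are produced only as the caller consumes them.
    foreach (var tokenId in TokenizerGpt3.Encode(text))
    {
        Console.WriteLine(tokenId);
    }

    // TokenCount walks the same pipeline but skips the id lookup and never builds a list.
    var count = TokenizerGpt3.TokenCount(text);
    Console.WriteLine($"{count} tokens");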
