
performance improvements, #2

Open · wants to merge 1 commit into main
Conversation

kayhantolga

Hello,

Thank you for your hard work on the project. I made some improvements to the code with help from ChatGPT, and they resulted in a 40% increase in performance. I plan to use the improved version of the tokenizer in my OpenAI SDK project, and I suggest you consider implementing the changes too. All the unit tests passed, but my ReSharper made some unnecessary code changes; I apologize for that.

Additionally, I think changing BYTES_TO_UNICODE_CACHE to a Dictionary may provide a slight performance boost, but I'm not sure if it's safe. I don't have enough knowledge on this topic yet.
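
For illustration only, roughly what I mean (the field and helper names here are placeholders, not the repo's actual code); a plain Dictionary should be safe as long as it is fully built before first use and never modified afterwards:

    // Hypothetical sketch (not the repo's actual code): build the byte-to-unicode map once
    // as a plain Dictionary. Concurrent reads of a Dictionary are thread-safe as long as
    // nothing writes to it after initialization.
    private static readonly Dictionary<byte, char> BYTES_TO_UNICODE_CACHE = BuildBytesToUnicode();

    private static Dictionary<byte, char> BuildBytesToUnicode()
    {
        // Standard GPT-2/GPT-3 mapping: printable bytes keep their own code point,
        // every other byte is shifted above 255 so it still has a visible stand-in character.
        var map = new Dictionary<byte, char>(256);
        var shift = 0;
        for (var b = 0; b < 256; b++)
        {
            var printable = (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
            map[(byte)b] = printable ? (char)b : (char)(256 + shift++);
        }
        return map;
    }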

my repo: https://github.com/betalgo/openai
[benchmark1 screenshot]

Commit: lang version support changed to latest
dluc (Owner) commented Mar 22, 2023

thank you @kayhantolga this is great! :-)

FYI I'm also porting the tokenizer into Semantic Kernel ("SK") and will take these improvements over there too, so this is perfect timing. As a plus, with SK the library will also be available as a NuGet package signed by Microsoft, which is safer for production.

The repo here will remain up anyway, and I hope to check in some other work in the future.

By the way, since you're looking at this: I found that the OpenAI regex seems to be vulnerable to ReDoS (regular expression denial of service).
Any chance you looked into that too?

string pat = @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+";

I'm going to add a timeout to the regex; you might want to copy that too.
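
A minimal sketch of what that could look like (the field name, the Compiled option, and the 5-second value are illustrative choices, not settled decisions):

    // Sketch: compile the pattern once and give it a match timeout so a pathological input
    // throws RegexMatchTimeoutException instead of hanging the process.
    private static readonly Regex EncodingRegex = new Regex(
        @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+",
        RegexOptions.Compiled,
        TimeSpan.FromSeconds(5));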

dluc (Owner) commented Mar 23, 2023

@kayhantolga I tried the changes with a single big file and could not see any perf change, so it looks like the difference only shows up around when the cache instances are created. Could you share some info about the stress test you ran?

kayhantolga (Author) commented

I was thinking of publishing my tests, but I realized they are a mess at the moment. I need to tidy them up first, but I've attached a screenshot you can check. 😅

My improvements focused on caching the regular expressions. The gain is most noticeable when handling many requests with small pieces of data.

Unfortunately, I do not have experience with regex, so I can't help with ReDoS.

Another improvement we made in our repository is returning results lazily with yield and adding another method that only returns the token count. These changes were not included in the benchmark, but I believe they will also improve performance. (Code shared below.)

[screenshot of benchmark results]

    /// <summary>
    ///     Encode the text into GPT-3 token ids. This method expects LF style EOL; if your text uses CR LF style EOL, set cleanUpCREOL to true.
    /// </summary>
    /// <param name="text">The text to encode</param>
    /// <param name="cleanUpCREOL">Set to true to strip CR characters before encoding</param>
    /// <returns>The sequence of token ids</returns>
    public static IEnumerable<int> Encode(string text, bool cleanUpCREOL = false)
    {
        if (string.IsNullOrEmpty(text))
        {
            yield break;
        }

        if (cleanUpCREOL)
        {
            text = text.Replace("\r", "");
        }

        foreach (var newToken in GetNewTokens(text))
        {
            yield return TokenizerGpt3Settings.Encoder[newToken];
        }
    }

    /// <summary>
    ///     Get the token count. This method expects LF style EOL; if your text uses CR LF style EOL, set cleanUpCREOL to true.
    /// </summary>
    /// <param name="text">The text to count tokens for</param>
    /// <param name="cleanUpCREOL">Set to true to strip CR characters before counting</param>
    /// <returns>The number of tokens</returns>
    public static int TokenCount(string text, bool cleanUpCREOL = false)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }

        if (cleanUpCREOL)
        {
            text = text.Replace("\r", "");
        }

        return GetNewTokens(text).Count();
    }

    private static IEnumerable<string> GetNewTokens(string text)
    {
        // Reuse the cached byte-to-unicode map and the precompiled regex on every call.
        var byteEncoder = BytesToUnicodeCache;
        var matches = EncodingRegex.Matches(text);

        foreach (var match in matches.Cast<Match>())
        {
            // Re-encode each match's UTF-8 bytes as their unicode stand-in characters,
            // then apply byte pair encoding and split the result into individual tokens.
            var tokenBytes = Encoding.UTF8.GetBytes(match.Value);
            var token = new string(Array.ConvertAll(tokenBytes, x => byteEncoder[x]));
            var newTokens = BytePairEncoding(token).Split(' ');

            foreach (var newToken in newTokens)
            {
                yield return newToken;
            }
        }
    }
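
A quick usage sketch for the two entry points (TokenizerGpt3 is just a placeholder class name for wherever these methods live):

    // TokenizerGpt3 is a hypothetical class name; adjust to wherever Encode/TokenCount are defined.
    var text = "The quick brown fox jumps over the lazy dog.";

    // Encode is lazy: token ids are produced only as the caller consumes them.
    foreach (var tokenId in TokenizerGpt3.Encode(text))
    {
        Console.WriteLine(tokenId);
    }

    // TokenCount walks the same pipeline but skips the id lookup and never builds a list.
    var count = TokenizerGpt3.TokenCount(text);
    Console.WriteLine($"{count} tokens");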
