performance improvements, #2
base: main
Conversation
Language version support changed to latest
Thank you @kayhantolga, this is great! :-) FYI, I'm also porting the tokenizer into Semantic Kernel ("SK"), and I will take these improvements over there too; perfect timing. As a plus, with SK the library will also be available as a NuGet package signed by Microsoft, which is safer for production. The repo here will remain up anyway, and I hope to check in some other work in the future. By the way, since you're looking at this: I found that the OpenAI regex seems to be vulnerable to ReDoS.
I'm going to add a timeout to the regex; you might want to copy that too.
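For anyone who wants to copy that fix, here is a minimal sketch of a regex match timeout, assuming the splitting pattern is the standard GPT-2/GPT-3 BPE regex held in one static field (the wrapper class name is illustrative; `EncodingRegex` matches the field used in the code further down):

```csharp
using System;
using System.Text.RegularExpressions;

internal static class TokenizerRegex
{
    // The standard GPT-2/GPT-3 BPE splitting pattern.
    private const string Pattern =
        @"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+";

    // Creating the Regex once (cached and compiled) with a matchTimeout makes
    // Matches() throw RegexMatchTimeoutException on pathological input instead
    // of hanging the thread indefinitely.
    internal static readonly Regex EncodingRegex = new Regex(
        Pattern,
        RegexOptions.Compiled,
        TimeSpan.FromSeconds(5));
}
```

Callers would then catch `RegexMatchTimeoutException` and fail the request rather than block on it.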
@kayhantolga I tried the changes with a single big file and I could not see any perf change, so it looks like the difference is only in when the cached instances are created. Could you share some info about the stress test you ran?
I was thinking of publishing my tests, but I realized they are a mess at the moment. I need to tidy them up; in the meantime I attached a screenshot which you can check. 😅 My improvements focused on caching the regular expressions, which is most noticeable when handling many requests with small pieces of data. Unfortunately, I don't have experience with regex, so I can't help with ReDoS. Another improvement we made in our repository is yielding results lazily and adding another method that only returns the token count. These changes were not included in the benchmark, but I believe they will also improve performance. Shared code piece:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

/// <summary>
/// Encode. This method uses LF-style EOL; if you use CRLF-style EOL, set cleanUpCREOL to true.
/// </summary>
/// <param name="text">The text to encode.</param>
/// <param name="cleanUpCREOL">Set to true to strip CR characters before encoding.</param>
/// <returns>The token ids, yielded lazily.</returns>
public static IEnumerable<int> Encode(string text, bool cleanUpCREOL = false)
{
    if (string.IsNullOrEmpty(text))
    {
        yield break;
    }

    if (cleanUpCREOL)
    {
        text = text.Replace("\r", "");
    }

    foreach (var newToken in GetNewTokens(text))
    {
        yield return TokenizerGpt3Settings.Encoder[newToken];
    }
}

/// <summary>
/// Get the token count. This method uses LF-style EOL; if you use CRLF-style EOL, set cleanUpCREOL to true.
/// </summary>
/// <param name="text">The text to count tokens for.</param>
/// <param name="cleanUpCREOL">Set to true to strip CR characters before counting.</param>
/// <returns>The number of tokens in the text.</returns>
public static int TokenCount(string text, bool cleanUpCREOL = false)
{
    if (string.IsNullOrEmpty(text))
    {
        return 0;
    }

    if (cleanUpCREOL)
    {
        text = text.Replace("\r", "");
    }

    return GetNewTokens(text).Count();
}

private static IEnumerable<string> GetNewTokens(string text)
{
    var byteEncoder = BytesToUnicodeCache;
    var matches = EncodingRegex.Matches(text);
    foreach (var match in matches.Cast<Match>())
    {
        // Map each UTF-8 byte to its printable unicode stand-in, then run BPE
        // and split the merged result back into individual tokens.
        var tokenBytes = Encoding.UTF8.GetBytes(match.Value);
        var token = new string(Array.ConvertAll(tokenBytes, x => byteEncoder[x]));
        var newTokens = BytePairEncoding(token).Split(' ');
        foreach (var newToken in newTokens)
        {
            yield return newToken;
        }
    }
}
```
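A quick usage sketch for the two methods above (`TokenizerGpt3` is a placeholder name for whatever static class hosts them):

```csharp
// Stream token ids lazily; encoding work stops as soon as the consumer does.
foreach (var tokenId in TokenizerGpt3.Encode("Hello,\r\nworld!", cleanUpCREOL: true))
{
    Console.WriteLine(tokenId);
}

// Count tokens without materializing a token list.
var count = TokenizerGpt3.TokenCount("Hello, world!");
Console.WriteLine($"Token count: {count}");
```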
Hello,
Thank you for your hard work on the project. I made some improvements to the code with help from ChatGPT, and they resulted in a 40% performance increase. I plan to use the improved version of the tokenizer in my OpenAI SDK project, and I suggest you consider implementing the changes too. All the unit tests passed, but my ReSharper made some unnecessary code changes; I apologize for that.
Additionally, I think changing BYTES_TO_UNICODE_CACHE to a Dictionary may provide a slight performance boost, but I'm not sure whether it's safe; I don't have enough knowledge on this topic yet (see the sketch at the end of this comment).
My repo: https://github.com/betalgo/openai
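For what it's worth, a minimal sketch of a thread-safe dictionary version of that cache, assuming the mapping follows the GPT-2 byte-to-unicode scheme and is never mutated after construction (the class and member names here are illustrative):

```csharp
using System;
using System.Collections.Generic;

internal static class TokenizerCaches
{
    // Lazy<T> guarantees the dictionary is built exactly once, even under
    // concurrent first access; afterwards, lookups are lock-free and safe
    // because the dictionary is never mutated again.
    private static readonly Lazy<IReadOnlyDictionary<byte, char>> LazyBytesToUnicode =
        new Lazy<IReadOnlyDictionary<byte, char>>(BuildBytesToUnicode);

    internal static IReadOnlyDictionary<byte, char> BytesToUnicodeCache =>
        LazyBytesToUnicode.Value;

    private static IReadOnlyDictionary<byte, char> BuildBytesToUnicode()
    {
        // GPT-2 scheme: printable bytes map to themselves; the remaining 68
        // bytes are shifted into the range starting at U+0100 so that every
        // byte has a printable unicode stand-in.
        var map = new Dictionary<byte, char>(256);
        var shift = 0;
        for (var b = 0; b < 256; b++)
        {
            var isPrintable = (b >= '!' && b <= '~')
                              || (b >= 0xA1 && b <= 0xAC)
                              || (b >= 0xAE && b <= 0xFF);
            map[(byte)b] = isPrintable ? (char)b : (char)(256 + shift++);
        }
        return map;
    }
}
```

Because the dictionary is fully built before it becomes visible and is only ever read afterwards, concurrent lookups are safe; mutation during reads is what would not be.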