First round of perf improvements for tiktoken #7012

Merged (1 commit) on Feb 18, 2024

stephentoub
Member

Before:

| Method | Mean | Allocated |
|---|---|---|
| CountTokensCached | 3.677 s | 4.82 GB |
| CountTokensUncached | 2.309 s | 3.03 GB |

After:

| Method | Mean | Allocated |
|---|---|---|
| CountTokensCached | 2.545 s | 637.63 MB |
| CountTokensUncached | 1.627 s | 408.34 MB |

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.ML.Tokenizers;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private Tokenizer _tokenizer;
    private string[] _tests;

    [GlobalSetup]
    public async Task Setup()
    {
        _tokenizer = await Tokenizer.CreateByModelNameAsync("gpt-3.5-turbo");
        using HttpClient client = new HttpClient();
        string text = string.Concat(Enumerable.Repeat(Poem, 8));
        _tests = new string[8192]; // LruCache size
        for (int i = 0; i < _tests.Length; i++)
        {
            _tests[i] = text.Substring(0, text.Length - i);
        }
    }

    [Benchmark]
    public int CountTokensCached()
    {
        int sum = 0;
        for (int i = 0; i < _tests.Length; i++)
        {
            sum += _tokenizer.CountTokens(_tests[0]); // reuse same input each time
        }
        return sum;
    }

    [Benchmark]
    public int CountTokensUncached()
    {
        int sum = 0;
        for (int i = 0; i < _tests.Length; i++)
        {
            sum += _tokenizer.CountTokens(_tests[i]); // change the input to defeat the cache
        }
        return sum;
    }

    private const string Poem = """
        **Paws of Joy**

        In the morning's tender light,
        When dew-kissed grass awaits the sun,
        There stirs a creature, full of might,
        A friend whose loyalty is never undone.

        **The Dog**, with eyes like galaxies,
        Wags its tail, a metronome of glee,
        Its heart a map of boundless territories,
        Guiding us through life's vast sea.

        **Furry sentinels**, guardians of our hearth,
        They chase their tails in playful mirth,
        Their barks a symphony of love and merriment,
        Echoing through the quiet moments we've spent.

        **Nose to ground**, they follow scents,
        Unraveling mysteries with fervent intent,
        From squirrel trails to forgotten dreams,
        They lead us to places we've never seen.

        **Golden retrievers** with hearts of gold,
        **Dachshunds** with determination untold,
        **Greyhounds** racing against the wind,
        Each breed a chapter in the story they've pinned.

        **Labradors** dive into lakes with glee,
        **Chihuahuas** strut like tiny royalty,
        **Huskies** howl at the moon's silver glow,
        And **puppies**, oh sweet puppies, steal the show.

        Their eyes speak of trust, unwavering and true,
        Their fur holds secrets whispered by the dew,
        In their presence, worries seem to fade,
        As they teach us the art of living unafraid.

        So here's to the dogs, our steadfast friends,
        Who mend our hearts and heal life's bends,
        May their tails forever wag, their noses explore,
        For in their love, we find solace evermore.
        """;
}

@stephentoub
Member Author

cc: @tarekgh

 {
-    string? line = reader.ReadLine();
+    string? line = useAsync ?
Member Author

This is more of a reliability/scalability fix rather than throughput. The asynchronous creation code path and the synchronous creation code path were both sharing this routine, which resulted in the asynchronous code path doing synchronous I/O (on a network stream). Now the async path does async and the sync path does sync, still sharing this same routine.
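The shared-routine pattern described here can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `VocabReader`, `CountLines`, `CountLinesAsync`, and `CountLinesCoreAsync` are hypothetical names, and only the `useAsync` flag and the `reader.ReadLine()` call come from the diff above.

```csharp
#nullable enable
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper type; the names here are illustrative and not taken from the PR.
static class VocabReader
{
    // Single shared routine: performs async I/O only when useAsync is true.
    private static async ValueTask<int> CountLinesCoreAsync(TextReader reader, bool useAsync)
    {
        int count = 0;
        while (true)
        {
            string? line = useAsync
                ? await reader.ReadLineAsync().ConfigureAwait(false)
                : reader.ReadLine();

            if (line is null)
            {
                break; // end of input
            }

            if (line.Length > 0)
            {
                count++; // count non-empty lines only
            }
        }

        return count;
    }

    // Sync entry point: with useAsync false no await is ever reached,
    // so the returned ValueTask is already complete and nothing blocks here.
    public static int CountLines(TextReader reader) =>
        CountLinesCoreAsync(reader, useAsync: false).GetAwaiter().GetResult();

    // Async entry point: the same routine, now awaiting each read.
    public static ValueTask<int> CountLinesAsync(TextReader reader) =>
        CountLinesCoreAsync(reader, useAsync: true);
}
```

The design choice is that the parsing logic lives in one place, while each public entry point keeps its own I/O mode.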

Member

Why don't we always do the async operation here, regardless of whether the call comes from a sync or an async caller?

Member Author

stephentoub, Feb 18, 2024

Because if this is async, a sync caller will be forced to block this thread until the operation completes. The operation could easily require a thread pool thread to complete, yet if this is on a thread pool thread, it'd be blocking one of the very resources needed to make forward progress. The thing less scalable than sync I/O is sync-over-async I/O.
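To make the distinction concrete, here is a small sketch contrasting the two shapes a synchronous API can take. `ReadExamples` and both method names are hypothetical and exist only for illustration; the `StreamReader` calls are standard .NET APIs.

```csharp
using System.IO;

// Hypothetical illustration; neither method is taken from the PR.
static class ReadExamples
{
    // Sync-over-async: the calling thread blocks until the async operation finishes.
    // If completing that operation needs a thread pool thread and the caller is itself
    // occupying one, the caller is holding a resource the operation may be waiting for.
    public static string ReadBlockingOnAsync(Stream stream)
    {
        using var reader = new StreamReader(stream);
        return reader.ReadToEndAsync().GetAwaiter().GetResult();
    }

    // Plain sync I/O: the calling thread does the read itself,
    // with no dependency on another thread to complete it.
    public static string ReadSync(Stream stream)
    {
        using var reader = new StreamReader(stream);
        return reader.ReadToEnd();
    }
}
```

Both methods return the same text; the difference is only in how the calling thread spends its time while the read is in flight.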

Member

But I thought you wanted a constructor that executes asynchronously too, no?

Member Author

I do (a factory method rather than a ctor, but yeah). It will use this.

codecov bot commented Feb 17, 2024

Codecov Report

Attention: 67 lines in your changes are missing coverage. Please review.

Comparison is base (4635a86) 68.80% compared to head (ed88215) 68.81%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7012   +/-   ##
=======================================
  Coverage   68.80%   68.81%           
=======================================
  Files        1258     1258           
  Lines      250652   250643    -9     
  Branches    25602    25606    +4     
=======================================
  Hits       172472   172472           
+ Misses      71548    71546    -2     
+ Partials     6632     6625    -7     
| Flag | Coverage Δ |
|---|---|
| Debug | 68.81% <70.08%> (+<0.01%) ⬆️ |
| production | 63.28% <70.08%> (+<0.01%) ⬆️ |
| test | 88.44% <ø> (ø) |

| Files | Coverage Δ |
|---|---|
| ...Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs | 100.00% <100.00%> (ø) |
| src/Microsoft.ML.Tokenizers/TokenizerResult.cs | 100.00% <100.00%> (ø) |
| ...Microsoft.ML.Tokenizers/Utils/ByteArrayComparer.cs | 100.00% <100.00%> (+35.29%) ⬆️ |
| ...rc/Microsoft.ML.Tokenizers/Model/EnglishRoberta.cs | 67.36% <85.71%> (ø) |
| ...rc/Microsoft.ML.Tokenizers/PreTokenizer/Roberta.cs | 66.66% <80.00%> (+9.52%) ⬆️ |
| ...ML.Tokenizers/PreTokenizer/TikTokenPreTokenizer.cs | 90.24% <94.73%> (+12.58%) ⬆️ |
| ...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs | 94.82% <75.00%> (-0.42%) ⬇️ |
| ...crosoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs | 94.44% <90.00%> (+8.73%) ⬆️ |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 83.61% <88.88%> (+0.20%) ⬆️ |
| ...c/Microsoft.ML.Tokenizers/Utils/IListExtensions.cs | 25.00% <0.00%> (-16.67%) ⬇️ |
... and 1 more

... and 4 files with indirect coverage changes

@tarekgh
Member

tarekgh commented Feb 18, 2024

Would this be better handled in the csproj? We could include this file only when we target net8.0.

Refers to: src/Microsoft.ML.Tokenizers/AssemblyInfo.cs:8 in ed88215.

private static unsafe int GetUtf8Bytes(ReadOnlySpan<char> source, Span<byte> destination)
{
#if NETCOREAPP
Member

We already have .netstandard and .netcoreapp cs files in the project. It would be good to move this code into those two files and avoid using #if.

{
if ((uint)utf8ByteCount + (uint)tokenBytes.Length > (uint)utf8Bytes.Length)
{
ArrayPoolGrow(ref utf8Bytes, ref arrayPoolArray, utf8ByteCount + tokenBytes.Length);
Member

tarekgh, Feb 18, 2024

ArrayPoolGrow(ref utf8Bytes, ref arrayPoolArray, utf8ByteCount + tokenBytes.Length);

Should we grow the array with some extra space, in case more needs to be added later, so that it does not have to be regrown?

Member

Never mind, it is not going to grow any further in this path.
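For readers unfamiliar with the pattern under discussion, a grow helper over `ArrayPool<byte>` might look roughly like the sketch below. The body of the PR's `ArrayPoolGrow` is not shown in this thread, so `PooledBuffer.Grow` is a hypothetical stand-in with an assumed signature; only `ArrayPool<byte>.Shared.Rent` and `Return` are known .NET APIs.

```csharp
#nullable enable
using System;
using System.Buffers;

// Hypothetical stand-in for the PR's ArrayPoolGrow; name, signature, and body are assumptions.
static class PooledBuffer
{
    // Replaces 'span' with a rented buffer of at least 'requiredCapacity' bytes,
    // preserving the existing contents. Callers must pass requiredCapacity >= span.Length.
    public static void Grow(ref Span<byte> span, ref byte[]? poolArray, int requiredCapacity)
    {
        // Rent may hand back a larger array than requested.
        byte[] newArray = ArrayPool<byte>.Shared.Rent(requiredCapacity);

        // Copy the current contents into the start of the new buffer.
        span.CopyTo(newArray);

        // Remember the previously rented array (if any) so it can go back to the pool.
        byte[]? toReturn = poolArray;
        span = poolArray = newArray;

        if (toReturn is not null)
        {
            ArrayPool<byte>.Shared.Return(toReturn);
        }
    }
}
```

Because this sketch rents exactly the requested capacity, over-allocating would only pay off if the buffer were appended to again afterwards, which is the point settled in the exchange above.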

Member

tarekgh left a comment

This is amazing. Thanks a lot @stephentoub.

I'll merge this and I can do the #if cleanup things in another PR.

tarekgh merged commit f976424 into dotnet:main on Feb 18, 2024
25 checks passed
tarekgh added the enhancement (New feature or request) label and removed the community-contribution label on Feb 18, 2024
github-actions bot locked and limited conversation to collaborators on Mar 20, 2024