Skip to content

Conversation

@JackStouffer
Copy link
Contributor

@JackStouffer JackStouffer commented Jul 22, 2016

  • Improve docs
  • removed repeated code between this and byUTF
  • removed indiscriminate pure application
  • Range-ified

@9il
Copy link
Member

9il commented Jul 22, 2016

Benchmarks please

@JackStouffer
Copy link
Contributor Author

JackStouffer commented Jul 22, 2016

@9il

import std.stdio;
import std.datetime;
import std.conv;
import std.traits;
import std.utf;

enum testCount = 1_000_000;

string toUTF82(S)(S s) if (isSomeString!S)
{
    import std.array : array;
    return s.byChar.array;
}

void main()
{
    dstring asciiTest = "The quick brown fox jumped over the lazy dog";
    dstring utfTest = "Приве́т नमस्ते שָׁלוֹם";

    auto result  = to!Duration(benchmark!(() => toUTF8(asciiTest))(testCount)[0]);
    auto result2 = to!Duration(benchmark!(() => toUTF82(asciiTest))(testCount)[0]);
    auto result3 = to!Duration(benchmark!(() => toUTF8(utfTest))(testCount)[0]);
    auto result4 = to!Duration(benchmark!(() => toUTF82(utfTest))(testCount)[0]);

    writeln("ASCII w/ old", "\t", result);
    writeln("ASCII w/ new", "\t", result2);
    writeln("UTF w/ old", "\t", result3);
    writeln("UTF w/ new", "\t", result4);
}
$ ldc2 -O5 -release test.d && ./test
ASCII w/ old    162 ms, 44 μs, and 3 hnsecs
ASCII w/ new    877 ms and 814 μs
UTF w/ old  1 sec, 317 ms, 308 μs, and 1 hnsec
UTF w/ new  879 ms, 519 μs, and 6 hnsecs

Due to the lack of a fast track for ASCII in byUTF, this has a significant performance degradation for ASCII.

I will open a PR with a ASCII case in byUTF.

@quickfur
Copy link
Member

ping @DmitryOlshansky
Please review, since you know this module best.

@DmitryOlshansky
Copy link
Member

Then we have to add fast path for ASCII in byUTF

@JackStouffer
Copy link
Contributor Author

Needs #4642

@JackStouffer
Copy link
Contributor Author

Numbers with latest change

$ ldc2 -O5 -release test.d && ./test

ASCII w/ old    150 ms, 528 μs, and 8 hnsecs
ASCII w/ new    601 ms, 348 μs, and 5 hnsecs
UTF w/ old  1 sec, 327 ms, 231 μs, and 5 hnsecs
UTF w/ old  714 ms, 615 μs, and 3 hnsecs

@JackStouffer
Copy link
Contributor Author

JackStouffer commented Jul 23, 2016

Interesting thing to note:

Having even a couple UTF characters in your string causes the performance of the original to drop like a rock. While the ASCII-only situation with the new code is significantly worse, having even a few unicode characters in your string will mean you will get a ~45% speed increase.

This benchmark uses the changes in #4642

$ ldc2 -O5 -release -boundscheck=off test.d && ./test

ASCII w/ old    143 ms, 211 μs, and 9 hnsecs
ASCII w/ new    522 ms and 566 μs
UTF w/ old      1 sec, 444 ms, 939 μs, and 9 hnsecs
UTF w/ old      674 ms, 635 μs, and 9 hnsecs
Mixed w/ old    1 sec, 343 ms, 200 μs, and 9 hnsecs
Mixed w/ new    678 ms, 695 μs, and 3 hnsecs
import std.stdio;
import std.datetime;
import std.conv;
import std.traits;
import std.utf;
import std.range;
import std.algorithm;

enum testCount = 1_000_000;

string toUTF82(S)(S s) if (isSomeString!S)
{
    import std.array : appender;
    auto app = appender!string();
    app.reserve(s.length);
    foreach (c; s.byUTF2!char)
        app.put(c);
    return app.data;
}

template byUTF2(C) if (isSomeChar!C)
{
    static if (!is(Unqual!C == C))
        alias byUTF2 = byUTF2!(Unqual!C);
    else:

    auto ref byUTF2(R)(R r)
        if (isAutodecodableString!R && isInputRange!R && isSomeChar!(ElementEncodingType!R))
    {
        return byUTF2(r.byCodeUnit());
    }

    auto ref byUTF2(R)(R r)
        if (!isAutodecodableString!R && isInputRange!R && isSomeChar!(ElementEncodingType!R))
    {
        alias RC = Unqual!(ElementEncodingType!R);

        static if (is(RC == C))
        {
            return r.byCodeUnit();
        }
        else
        {
            static struct Result
            {
                this(ref R r)
                {
                    this.r = r;
                }

                @property bool empty()
                {
                    return pos == fill && r.empty;
                }

                @property auto front()
                {
                    if (pos == fill)
                    {
                        pos = 0;
                        auto c = r.front;

                        if (c <= 0x7F)
                        {
                            buf[0] = cast(C) c;
                            fill = 1;
                            r.popFront;
                        }
                        else
                        {
                            static if (is(RC == dchar))
                            {
                                fill = cast(ushort) encode!(UseReplacementDchar.yes)(buf, r.front);
                                r.popFront;
                            }
                            else
                            {
                                fill = cast(ushort) encode!(UseReplacementDchar.yes)(
                                    buf, decodeFront!(UseReplacementDchar.yes)(r));
                            }
                        }
                    }
                    return buf[pos];
                }

                void popFront()
                {
                    if (pos == fill)
                        front;
                    ++pos;
                }

                static if (isForwardRange!R)
                {
                    @property auto save()
                    {
                        auto ret = this;
                        ret.r = r.save;
                        return ret;
                    }
                }

            private:

                R r;
                C[4 / C.sizeof] buf = void;
                ushort pos, fill;
            }

            return Result(r);
        }
    }
}

void main()
{
    dstring asciiTest = "The quick brown fox jumped over the lazy dog";
    dstring utfTest = "Приве́т नमस्ते שָׁלוֹם";
    dstring mixedTest = "The quick brown fox Приве́т jumped over the lazy dog";

    auto result  = to!Duration(benchmark!(() => toUTF8(asciiTest))(testCount)[0]);
    auto result2 = to!Duration(benchmark!(() => toUTF82(asciiTest))(testCount)[0]);

    auto result3 = to!Duration(benchmark!(() => toUTF8(utfTest))(testCount)[0]);
    auto result4 = to!Duration(benchmark!(() => toUTF82(utfTest))(testCount)[0]);

    auto result5 = to!Duration(benchmark!(() => toUTF8(mixedTest))(testCount)[0]);
    auto result6 = to!Duration(benchmark!(() => toUTF82(mixedTest))(testCount)[0]);

    writeln("ASCII w/ old", "\t", result);
    writeln("ASCII w/ new", "\t", result2);
    writeln("UTF w/ old", "\t", result3);
    writeln("UTF w/ old", "\t", result4);
    writeln("Mixed w/ old", "\t", result5);
    writeln("Mixed w/ new", "\t", result6);
}

@JackStouffer
Copy link
Contributor Author

Will squish the commits if this is approved

@codecov-io
Copy link

Current coverage is 88.68% (diff: 66.66%)

Merging #4640 into master will increase coverage by <.01%

@@             master      #4640   diff @@
==========================================
  Files           121        121          
  Lines         73827      73810    -17   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          65471      65457    -14   
+ Misses         8356       8353     -3   
  Partials          0          0          

Powered by Codecov. Last update 6db08d3...ff96a5e

@andralex
Copy link
Member

Auto-merge toggled on

@andralex
Copy link
Member

@JackStouffer great, please finish

@JackStouffer
Copy link
Contributor Author

Doc failure seems unrelated

@andralex andralex merged commit 76cd4bd into dlang:master Sep 17, 2016
@JackStouffer JackStouffer deleted the utf branch May 22, 2017 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants