Make std.utf.toUTF8 DRY by using byUTF internally #4640

JackStouffer · 2016-07-22T14:47:54Z

Improve docs
removed repeated code between this and byUTF
removed indiscriminate pure application
Range-ified

9il · 2016-07-22T16:31:08Z

Benchmarks please

JackStouffer · 2016-07-22T17:11:15Z

import std.stdio;
import std.datetime;
import std.conv;
import std.traits;
import std.utf;

enum testCount = 1_000_000;

string toUTF82(S)(S s) if (isSomeString!S)
{
    import std.array : array;
    return s.byChar.array;
}

void main()
{
    dstring asciiTest = "The quick brown fox jumped over the lazy dog";
    dstring utfTest = "Приве́т नमस्ते שָׁלוֹם";

    auto result  = to!Duration(benchmark!(() => toUTF8(asciiTest))(testCount)[0]);
    auto result2 = to!Duration(benchmark!(() => toUTF82(asciiTest))(testCount)[0]);
    auto result3 = to!Duration(benchmark!(() => toUTF8(utfTest))(testCount)[0]);
    auto result4 = to!Duration(benchmark!(() => toUTF82(utfTest))(testCount)[0]);

    writeln("ASCII w/ old", "\t", result);
    writeln("ASCII w/ new", "\t", result2);
    writeln("UTF w/ old", "\t", result3);
    writeln("UTF w/ new", "\t", result4);
}

$ ldc2 -O5 -release test.d && ./test
ASCII w/ old    162 ms, 44 μs, and 3 hnsecs
ASCII w/ new    877 ms and 814 μs
UTF w/ old  1 sec, 317 ms, 308 μs, and 1 hnsec
UTF w/ new  879 ms, 519 μs, and 6 hnsecs

Due to the lack of a fast track for ASCII in byUTF, this has a significant performance degradation for ASCII.

I will open a PR with a ASCII case in byUTF.

quickfur · 2016-07-22T17:12:46Z

ping @DmitryOlshansky
Please review, since you know this module best.

DmitryOlshansky · 2016-07-22T17:44:17Z

Then we have to add fast path for ASCII in byUTF

JackStouffer · 2016-07-22T19:43:36Z

Needs #4642

JackStouffer · 2016-07-22T20:12:42Z

Numbers with latest change

$ ldc2 -O5 -release test.d && ./test

ASCII w/ old    150 ms, 528 μs, and 8 hnsecs
ASCII w/ new    601 ms, 348 μs, and 5 hnsecs
UTF w/ old  1 sec, 327 ms, 231 μs, and 5 hnsecs
UTF w/ old  714 ms, 615 μs, and 3 hnsecs

JackStouffer · 2016-07-23T19:26:44Z

Interesting thing to note:

Having even a couple UTF characters in your string causes the performance of the original to drop like a rock. While the ASCII-only situation with the new code is significantly worse, having even a few unicode characters in your string will mean you will get a ~45% speed increase.

This benchmark uses the changes in #4642

$ ldc2 -O5 -release -boundscheck=off test.d && ./test

ASCII w/ old    143 ms, 211 μs, and 9 hnsecs
ASCII w/ new    522 ms and 566 μs
UTF w/ old      1 sec, 444 ms, 939 μs, and 9 hnsecs
UTF w/ old      674 ms, 635 μs, and 9 hnsecs
Mixed w/ old    1 sec, 343 ms, 200 μs, and 9 hnsecs
Mixed w/ new    678 ms, 695 μs, and 3 hnsecs

import std.stdio;
import std.datetime;
import std.conv;
import std.traits;
import std.utf;
import std.range;
import std.algorithm;

enum testCount = 1_000_000;

string toUTF82(S)(S s) if (isSomeString!S)
{
    import std.array : appender;
    auto app = appender!string();
    app.reserve(s.length);
    foreach (c; s.byUTF2!char)
        app.put(c);
    return app.data;
}

template byUTF2(C) if (isSomeChar!C)
{
    static if (!is(Unqual!C == C))
        alias byUTF2 = byUTF2!(Unqual!C);
    else:

    auto ref byUTF2(R)(R r)
        if (isAutodecodableString!R && isInputRange!R && isSomeChar!(ElementEncodingType!R))
    {
        return byUTF2(r.byCodeUnit());
    }

    auto ref byUTF2(R)(R r)
        if (!isAutodecodableString!R && isInputRange!R && isSomeChar!(ElementEncodingType!R))
    {
        alias RC = Unqual!(ElementEncodingType!R);

        static if (is(RC == C))
        {
            return r.byCodeUnit();
        }
        else
        {
            static struct Result
            {
                this(ref R r)
                {
                    this.r = r;
                }

                @property bool empty()
                {
                    return pos == fill && r.empty;
                }

                @property auto front()
                {
                    if (pos == fill)
                    {
                        pos = 0;
                        auto c = r.front;

                        if (c <= 0x7F)
                        {
                            buf[0] = cast(C) c;
                            fill = 1;
                            r.popFront;
                        }
                        else
                        {
                            static if (is(RC == dchar))
                            {
                                fill = cast(ushort) encode!(UseReplacementDchar.yes)(buf, r.front);
                                r.popFront;
                            }
                            else
                            {
                                fill = cast(ushort) encode!(UseReplacementDchar.yes)(
                                    buf, decodeFront!(UseReplacementDchar.yes)(r));
                            }
                        }
                    }
                    return buf[pos];
                }

                void popFront()
                {
                    if (pos == fill)
                        front;
                    ++pos;
                }

                static if (isForwardRange!R)
                {
                    @property auto save()
                    {
                        auto ret = this;
                        ret.r = r.save;
                        return ret;
                    }
                }

            private:

                R r;
                C[4 / C.sizeof] buf = void;
                ushort pos, fill;
            }

            return Result(r);
        }
    }
}

void main()
{
    dstring asciiTest = "The quick brown fox jumped over the lazy dog";
    dstring utfTest = "Приве́т नमस्ते שָׁלוֹם";
    dstring mixedTest = "The quick brown fox Приве́т jumped over the lazy dog";

    auto result  = to!Duration(benchmark!(() => toUTF8(asciiTest))(testCount)[0]);
    auto result2 = to!Duration(benchmark!(() => toUTF82(asciiTest))(testCount)[0]);

    auto result3 = to!Duration(benchmark!(() => toUTF8(utfTest))(testCount)[0]);
    auto result4 = to!Duration(benchmark!(() => toUTF82(utfTest))(testCount)[0]);

    auto result5 = to!Duration(benchmark!(() => toUTF8(mixedTest))(testCount)[0]);
    auto result6 = to!Duration(benchmark!(() => toUTF82(mixedTest))(testCount)[0]);

    writeln("ASCII w/ old", "\t", result);
    writeln("ASCII w/ new", "\t", result2);
    writeln("UTF w/ old", "\t", result3);
    writeln("UTF w/ old", "\t", result4);
    writeln("Mixed w/ old", "\t", result5);
    writeln("Mixed w/ new", "\t", result6);
}

JackStouffer · 2016-07-24T01:11:46Z

Will squish the commits if this is approved

codecov-io · 2016-07-28T20:30:15Z

Current coverage is 88.68% (diff: 66.66%)

Merging #4640 into master will increase coverage by <.01%

@@             master      #4640   diff @@
==========================================
  Files           121        121          
  Lines         73827      73810    -17   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          65471      65457    -14   
+ Misses         8356       8353     -3   
  Partials          0          0

Powered by Codecov. Last update 6db08d3...ff96a5e

andralex · 2016-09-17T00:09:29Z

Auto-merge toggled on

andralex · 2016-09-17T00:09:47Z

@JackStouffer great, please finish

JackStouffer · 2016-09-17T13:24:07Z

Doc failure seems unrelated

JackStouffer force-pushed the utf branch from 149b96c to d07de80 Compare July 22, 2016 17:38

JackStouffer force-pushed the utf branch from f82ee55 to 2669fc7 Compare July 28, 2016 20:06

andralex added the pending autotester label Sep 17, 2016

JackStouffer added 6 commits September 17, 2016 14:18

Make std.utf.toUTF8 DRY by using byChar internally

0c550b7

Remove over applied pure attribute from std.utf

65c4648

use appender instead of array

2f24a0a

Added string specific logic

3cc0a7c

Range-ified

ed92b3d

Improved docs and added tests

e096f29

JackStouffer force-pushed the utf branch from 2669fc7 to e096f29 Compare September 17, 2016 13:20

andralex merged commit 76cd4bd into dlang:master Sep 17, 2016

JackStouffer deleted the utf branch May 22, 2017 17:29

Uh oh!

Make std.utf.toUTF8 DRY by using byUTF internally #4640

Make std.utf.toUTF8 DRY by using byUTF internally #4640

Uh oh!

Conversation

JackStouffer commented Jul 22, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

9il commented Jul 22, 2016

Uh oh!

JackStouffer commented Jul 22, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

quickfur commented Jul 22, 2016

Uh oh!

DmitryOlshansky commented Jul 22, 2016

Uh oh!

JackStouffer commented Jul 22, 2016

Uh oh!

JackStouffer commented Jul 22, 2016

Uh oh!

JackStouffer commented Jul 23, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JackStouffer commented Jul 24, 2016

Uh oh!

codecov-io commented Jul 28, 2016

Current coverage is 88.68% (diff: 66.66%)

Uh oh!

andralex commented Sep 17, 2016

Uh oh!

andralex commented Sep 17, 2016

Uh oh!

JackStouffer commented Sep 17, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

JackStouffer commented Jul 22, 2016 •

edited

Loading

JackStouffer commented Jul 22, 2016 •

edited

Loading

JackStouffer commented Jul 23, 2016 •

edited

Loading