Skip to content

Conversation

@JackStouffer
Copy link
Contributor

Using the AST viewer on run.dlang.io, I noticed that foreach loops do a lot of extra copying that isn't always necessary. E.g. this

auto m = "test".takeExactly(3);
foreach (c; m)
{
}

immutable size = 5;
foreach (i; 0 .. size)
{
}

becomes

Result m = takeExactly("test", 3LU)
{
	Result __r71 = m;
	for (; !__r71.empty(); __r71.popFront())
	{
		dchar c = __r71.front();
	}
}

immutable immutable(int) size = 5;
{
	int __key72 = 0;
	int __limit73 = 5;
	for (; __key72 < __limit73; __key72 += 1)
	{
		int i = __key72;
	}
}

This PR changes some loops to be for loops without extra copying.

@dlang-bot
Copy link
Contributor

Thanks for your pull request, @JackStouffer!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

std/stdio.d Outdated
char[4] buf = void;
immutable len = encode(buf, c);
foreach (i ; 0 .. len)
immutable len = encode!(UseReplacementDchar.yes)(buf, c);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I changed this for two reasons,

  1. to be in-line with the lines above which don't throw
  2. DMD can forgo adding the exception-handling code if the function is marked as nothrow, which encode is when using replacement dchars

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never be silently accepting/converting garbage! UseReplacementDchar.yes needs to be used with a lot of caution. If this does what I think it does (replace any kind of bad UTF with a single replacement character), then it's a bad idea. See https://issues.dlang.org/show_bug.cgi?id=14519

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I see you're just copying the above, which you've added in #5083 to replace toUtf8 which did the same thing, but I think we need to change this. Silently ignoring errors and throwing away information is never a good idea.

@JackStouffer JackStouffer force-pushed the lockingtextwriter-micro-optimization branch from 544bb7b to 0b1f00a Compare January 29, 2018 16:09
std/stdio.d Outdated
char[4] buf;
immutable size = encode!(UseReplacementDchar.yes)(buf, c);
foreach (i ; 0 .. size)
for (int i; i < size; ++i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is for really faster than foreach here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foreach makes three extra ints

    int size = 5;
    foreach (i; 0 .. size)
    {
    }
    
    for (int j = 0; j < size; ++j)
    {
    }

becomes

int size = 5;
{
	int __key46 = 0;
	int __limit47 = size;
	for (; __key46 < __limit47; __key46 += 1)
	{
		int i = __key46;
	}
}
{
	int j = 0;
	for (; j < size; j += 1)
	{
		{
		}
	}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that actually affect generated code, though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My knowledge of ASM is limited, but it sure looks like it does

foreach

.text._Dmain    segment
    assume  CS:.text._Dmain
_Dmain:
        push    RBP
        mov RBP,RSP
        sub RSP,010h
        mov EAX,5
        mov -010h[RBP],EAX
        mov dword ptr -0Ch[RBP],0
        mov -8[RBP],EAX
L1A:        mov ECX,-0Ch[RBP]
        cmp ECX,-8[RBP]
        jge L2D
        mov EDX,-0Ch[RBP]
        mov -4[RBP],EDX
        inc dword ptr -0Ch[RBP]
        jmp short   L1A
L2D:        xor EAX,EAX
        leave
        ret
        add [RAX],AL
        add [RAX],AL
.text._Dmain    ends

for

.text._Dmain    segment
    assume  CS:.text._Dmain
_Dmain:
        push    RBP
        mov RBP,RSP
        sub RSP,010h
        mov dword ptr -8[RBP],5
        mov dword ptr -4[RBP],0
L16:        mov EAX,-4[RBP]
        cmp EAX,-8[RBP]
        jge L23
        inc dword ptr -4[RBP]
        jmp short   L16
L23:        xor EAX,EAX
        leave
        ret
        add 0[RCX],AL
.text._Dmain    ends

Copy link
Member

@quickfur quickfur Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the number of int declarations may not have any correlation with the actual number of stack variables allocated by the asm. Given the simplicity of the loops above, I'm expecting that even dmd would enregister the loop variables, so the argument about the number of int declarations isn't even relevant. Again, please show the asm to prove that the for-loop is better than the foreach loop. It's a step backwards in readability/maintainability, and I don't think we should do it unless it's a clear win across the board for dmd/gdc/ldc.

Copy link
Member

@CyberShadow CyberShadow Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@quickfur , Jack posted some asm above, do you not see his comment? #6091 (comment)

@JackStouffer I'm guessing that's without -O? Phobos is built with -O -release, so we should be comparing disassemblies from code built with those flags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I must have been looking at a stale page when I wrote my comment.

But yeah, it does look like that's without -O. For performance comparisons dmd -O is really the bare minimum; I would actually prefer to see what gdc/ldc do with the same code, because they have better loop optimizations than dmd. (Though in the latter case the challenge becomes how to prevent overly-aggressive optimizations like in ldc, where the optimizer just runs the whole thing at compile-time and substitutes the answer directly. :-P)

std/stdio.d Outdated
char[4] buf = void;
immutable len = encode(buf, c);
foreach (i ; 0 .. len)
immutable len = encode!(UseReplacementDchar.yes)(buf, c);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should never be silently accepting/converting garbage! UseReplacementDchar.yes needs to be used with a lot of caution. If this does what I think it does (replace any kind of bad UTF with a single replacement character), then it's a bad idea. See https://issues.dlang.org/show_bug.cgi?id=14519

@JackStouffer JackStouffer force-pushed the lockingtextwriter-micro-optimization branch from 0b1f00a to 9a0e34a Compare January 29, 2018 17:20
@JackStouffer
Copy link
Contributor Author

@CyberShadow Removed the encoding change from this PR

@CyberShadow
Copy link
Member

Thanks! I did not object to it considering the circumstances, but I do think it's not a direction we should be moving towards.

for (; !writeme.empty; writeme.popFront)
{
put(c);
put(writeme.front);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this actually makes a difference in the generated asm? I'd like to see the asm output before approving this.

Using raw for-loops instead of foreach IMO is a step backward, and should not be done unless there's a very good reason for it. My impression is that a properly-optimizing compiler would eliminate the supposedly-redundant loads easily. (And if not, it's the compiler that should be fixed; the user code should not be uglified to work around a compiler deficiency.) Don't be so quick to believe intermediate pseudo-source code representations of the compiled code; always check the asm to see what's actually being executed at runtime.

Also, I'd like to see what gdc/ldc outputs for both the foreach loop and the raw for loop. Dmd's optimizer isn't the best when it comes to loop optimizations.

@JackStouffer
Copy link
Contributor Author

@quickfur @CyberShadow DMD properly optimizes the foreach (i; 0 .. 5) case. There's no difference in the ASM of it and the for loop.

However, with ranges DMD can't get rid of the copy because it would change the behavior of the code

With this code

import std.utf;
void main()
{
    int res;
    auto m = "test".byCodeUnit;
    for (; !m.empty; m.popFront)
    {
        res += m.front;
    }
}

This is produced

.text._Dmain    segment
    assume  CS:.text._Dmain
_Dmain:
        push    RBP
        mov RBP,RSP
        sub RSP,010h
        lea RDX,_TMP0@PC32[RIP]
        mov EDI,4
        mov RSI,RDX
        call      pure nothrow @nogc @safe std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[])@PLT32
        mov -010h[RBP],RAX
        mov -8[RBP],RDX
        lea RDI,-010h[RBP]
        call      const pure nothrow @property @nogc @safe bool std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl.empty()@PLT32
        xor AL,1
        je  L50
L31:        lea RDI,-010h[RBP]
        call      inout pure nothrow ref @property @nogc @safe immutable(char) std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl.front()@PLT32
        lea RDI,-010h[RBP]
        call      pure nothrow @nogc @safe void std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl.popFront()@PLT32
        lea RDI,-010h[RBP]
        call      const pure nothrow @property @nogc @safe bool std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl.empty()@PLT32
        xor AL,1
        jne L31
L50:        xor EAX,EAX
        mov RSP,RBP
        pop RBP
        ret
        add 0[RCX],AH
.text._Dmain    ends

but this code

import std.utf;
void main()
{
    int res;
    auto m = "test".byCodeUnit;
    foreach (c; m)
    {
        res += c;
    }
}

produces this

.text._Dmain    segment
    assume  CS:.text._Dmain
_Dmain:
        push    RBP
        mov RBP,RSP
        sub RSP,020h
        lea RDX,_TMP0@PC32[RIP]
        mov EDI,4
        mov RSI,RDX
        call      pure nothrow @nogc @safe std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[])@PLT32
        mov -010h[RBP],RAX
        mov -8[RBP],RDX
        mov RAX,-010h[RBP]
        mov -020h[RBP],RAX
        mov RCX,-8[RBP]
        mov -018h[RBP],RCX
        lea RDI,-020h[RBP]
        call      const pure nothrow @property @nogc @safe bool std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl.empty()@PLT32
        xor AL,1
        je  L60
L41:        lea RDI,-020h[RBP]
        call      inout pure nothrow ref @property @nogc @safe immutable(char) std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl.front()@PLT32
        lea RDI,-020h[RBP]
        call      pure nothrow @nogc @safe void std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl.popFront()@PLT32
        lea RDI,-020h[RBP]
        call      const pure nothrow @property @nogc @safe bool std.utf.byCodeUnit!(immutable(char)[]).byCodeUnit(immutable(char)[]).ByCodeUnitImpl.empty()@PLT32
        xor AL,1
        jne L41
L60:        xor EAX,EAX
        mov RSP,RBP
        pop RBP
        ret
        add 0[RCX],AH
.text._Dmain    ends

@CyberShadow
Copy link
Member

However, with ranges DMD can't get rid of the copy because it would change the behavior of the code

That's less unexpected; my question was about the

-                        foreach (i ; 0 .. size)
+                        for (int i; i < size; ++i) 

change, which seemed excessive.

@JackStouffer JackStouffer force-pushed the lockingtextwriter-micro-optimization branch from 9a0e34a to 93129f5 Compare January 29, 2018 17:58
@JackStouffer
Copy link
Contributor Author

You know, I just noticed that loop does auto-decoding on all wide strings. Initially I thought that it was only operating on ranges. I'm going to close this and open a new PR which fixes both this issue and the wide string issue.

Thanks @CyberShadow and @quickfur for the review.

@JackStouffer JackStouffer deleted the lockingtextwriter-micro-optimization branch January 29, 2018 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants