
Conversation

@JackStouffer JackStouffer commented Jan 29, 2018

  • Removed code that did extra work by decoding wstrings to dchars and then re-encoding them to chars. In the case of writing a wstring to a wide-oriented file, put no longer has to do any encoding or decoding at all, whereas it did a bunch of extra work before.
  • Made the wchar path throw a UTF exception on bad data
  • Improved the docs for LockingTextWriter

CC @CyberShadow

@dlang-bot

Thanks for your pull request, @JackStouffer!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

@JackStouffer JackStouffer force-pushed the lockingtextwriter-put branch from 417c8c5 to 99e7478 Compare January 29, 2018 18:38
auto result = trustedFwrite(file_._p.handle, writeme);
if (result != writeme.length) errnoEnforce(0);
return;
}
Member

Won't the else case (i.e., orientation_ > 0) have a bug, now that you made the original fallthrough code an else to the static if?

Contributor Author

Yes. Fixed

Member

Hmm. Can't say I'm a fan of the copy-pasta. What about inserting a return for the isSomeString!A && C.sizeof > 1 case, and getting rid of the else after that, so that we can merge these two identical for-loops? I feel bad for nitpicking on this level, but the code duplication seems unwarranted in this case.

std/stdio.d Outdated
for (; !writeme.empty; writeme.popFront)
{
put(writeme.front);
}
Member

This seems to be a strange way to write the loop. What's wrong with just foreach (e; writeme) put(e);?

Member

And on that note, why not just use foreach (c; writeme) put(c); for both branches of the static if? In which case we could just fold it into a single generic loop. AFAICT, the only real difference is that we have foreach (c; writeme) rather than foreach (Elem c; writeme), which would force auto-decoding in the case of a wide string. Getting rid of the original stuff surrounding Elem should take care of that.

(Though granted, the difference is rather subtle; a comment explaining this might be in order.)

Contributor Author

This seems to be a strange way to write the loop. What's wrong with just

See the ASM here: #6091 (comment)

An unnecessary copy of writeme is made regardless of optimizations.
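
For context, this is a consequence of how D defines `foreach` over a non-array range: it iterates a copy-initialized local. A rough sketch of the lowering (not Phobos code; `put` and `writeme` as in the surrounding diff):

```d
// foreach (e; writeme) put(e);  is lowered approximately to:
{
    auto __r = writeme;              // the range is copied here
    for (; !__r.empty; __r.popFront())
    {
        auto e = __r.front;
        put(e);
    }
}

// The hand-written loop consumes writeme directly, with no copy:
for (; !writeme.empty; writeme.popFront)
    put(writeme.front);
```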

Member

Ah I see. Does gdc/ldc optimize that out? It seems kinda sad that the construct that's supposed to natively support ranges would also introduce such blatant inefficiencies.

(Though I can understand that historically, we've always made a copy of slices when running foreach over them, so that foreach doesn't unexpectedly consume your slice. I suppose the same argument could be made for other range types as well. But the optimizer should be smart enough, at least in theory, to notice that the original value of the range argument isn't needed, so the copy could be eliminated.)

Contributor Author

Does gdc/ldc optimize that out?

I can't make heads or tails of ldc's ASM output. You can see it here https://run.dlang.io/is/ck7lsp

I'm going to assume "no", because there are cases where not making a copy could have subtle behavior changes.

Though I can understand that historically, we've always made a copy of slices when running foreach over them, so that foreach doesn't unexpectedly consume your slice. I suppose the same argument could be made for other range types as well.

Yeah that's the idea, and it makes sense for user code to always use foreach regardless for simplicity's sake.

Member

Haha, it seems ldc's aggressive optimizer has defeated us once again. You might want to put the code into a public function that returns a value, because ldc's optimizer has apparently noticed that the value computed by the loop is never used, so it has completely elided the entire body of main() to just return 0. :-P

Member

(And I just tried changing the code to return the value computed by the loop: again, ldc's optimizer is very aggressive and decided to execute the loop at compile-time, so main() just returns the precomputed value directly without any computation.)

Member

OK, the only way I could convince ldc not to replace the function body with the compile-time evaluated result is to make the range a function parameter, with a forced instantiation via an alias: here.
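
The linked code isn't shown here; a hedged reconstruction of that kind of harness (all names are assumptions) looks like this:

```d
import std.utf : byCodeUnit;

// Summing the code units gives the loop an observable result.
uint sumCodeUnits(R)(R r)
{
    uint total;
    foreach (c; r)
        total += c;
    return total;
}

// Forcing an instantiation through an alias makes the compiler emit a
// standalone body for this concrete range type, whose asm can be inspected;
// the range arrives as a runtime parameter, so nothing can be precomputed
// at compile time.
alias sumForByCodeUnit = sumCodeUnits!(typeof("".byCodeUnit));
```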

The function body uses a whole bunch of fancy XMM vectorized operation setup, then a very tight loop summing the values:

.cfi_startproc
	testq	%rdi, %rdi
	je	.LBB14_1
	cmpq	$8, %rdi
	jae	.LBB14_4
	xorl	%eax, %eax
	jmp	.LBB14_12
.LBB14_1:
	xorl	%eax, %eax
	retq
.LBB14_4:
	movq	%rdi, %r8
	andq	$-8, %r8
	leaq	-8(%r8), %rdx
	shrq	$3, %rdx
	leal	1(%rdx), %eax
	andl	$1, %eax
	testq	%rdx, %rdx
	je	.LBB14_5
	leaq	-1(%rax), %rcx
	subq	%rdx, %rcx
	pxor	%xmm2, %xmm2
	xorl	%edx, %edx
	pxor	%xmm0, %xmm0
	pxor	%xmm1, %xmm1
	.p2align	4, 0x90
.LBB14_7:
	movd	(%rsi,%rdx), %xmm3
	movd	4(%rsi,%rdx), %xmm4
	punpcklbw	%xmm2, %xmm3
	punpcklwd	%xmm2, %xmm3
	paddd	%xmm0, %xmm3
	punpcklbw	%xmm2, %xmm4
	punpcklwd	%xmm2, %xmm4
	paddd	%xmm1, %xmm4
	movd	8(%rsi,%rdx), %xmm0
	movd	12(%rsi,%rdx), %xmm1
	punpcklbw	%xmm2, %xmm0
	punpcklwd	%xmm2, %xmm0
	paddd	%xmm3, %xmm0
	punpcklbw	%xmm2, %xmm1
	punpcklwd	%xmm2, %xmm1
	paddd	%xmm4, %xmm1
	addq	$16, %rdx
	addq	$2, %rcx
	jne	.LBB14_7
	testq	%rax, %rax
	je	.LBB14_10
.LBB14_9:
	movd	4(%rsi,%rdx), %xmm2
	pxor	%xmm3, %xmm3
	punpcklbw	%xmm3, %xmm2
	punpcklwd	%xmm3, %xmm2
	paddd	%xmm2, %xmm1
	movd	(%rsi,%rdx), %xmm2
	punpcklbw	%xmm3, %xmm2
	punpcklwd	%xmm3, %xmm2
	paddd	%xmm2, %xmm0
.LBB14_10:
	paddd	%xmm1, %xmm0
	pshufd	$78, %xmm0, %xmm1
	paddd	%xmm0, %xmm1
	pshufd	$229, %xmm1, %xmm0
	paddd	%xmm1, %xmm0
	movd	%xmm0, %eax
	cmpq	%r8, %rdi
	je	.LBB14_13
	subq	%r8, %rdi
	addq	%r8, %rsi
	.p2align	4, 0x90
.LBB14_12:
	movzbl	(%rsi), %ecx
	addl	%ecx, %eax
	incq	%rsi
	decq	%rdi
	jne	.LBB14_12
.LBB14_13:
	retq
.LBB14_5:
	pxor	%xmm0, %xmm0
	xorl	%edx, %edx
	pxor	%xmm1, %xmm1
	testq	%rax, %rax
	jne	.LBB14_9
	jmp	.LBB14_10

This was with a foreach loop. With a manual for loop, I get:

.cfi_startproc
	testq	%rdi, %rdi
	je	.LBB14_1
	cmpq	$8, %rdi
	jae	.LBB14_4
	xorl	%eax, %eax
	jmp	.LBB14_12
.LBB14_1:
	xorl	%eax, %eax
	retq
.LBB14_4:
	movq	%rdi, %r8
	andq	$-8, %r8
	leaq	-8(%r8), %rdx
	shrq	$3, %rdx
	leal	1(%rdx), %eax
	andl	$1, %eax
	testq	%rdx, %rdx
	je	.LBB14_5
	leaq	-1(%rax), %rcx
	subq	%rdx, %rcx
	pxor	%xmm2, %xmm2
	xorl	%edx, %edx
	pxor	%xmm0, %xmm0
	pxor	%xmm1, %xmm1
	.p2align	4, 0x90
.LBB14_7:
	movd	(%rsi,%rdx), %xmm3
	movd	4(%rsi,%rdx), %xmm4
	punpcklbw	%xmm2, %xmm3
	punpcklwd	%xmm2, %xmm3
	paddd	%xmm0, %xmm3
	punpcklbw	%xmm2, %xmm4
	punpcklwd	%xmm2, %xmm4
	paddd	%xmm1, %xmm4
	movd	8(%rsi,%rdx), %xmm0
	movd	12(%rsi,%rdx), %xmm1
	punpcklbw	%xmm2, %xmm0
	punpcklwd	%xmm2, %xmm0
	paddd	%xmm3, %xmm0
	punpcklbw	%xmm2, %xmm1
	punpcklwd	%xmm2, %xmm1
	paddd	%xmm4, %xmm1
	addq	$16, %rdx
	addq	$2, %rcx
	jne	.LBB14_7
	testq	%rax, %rax
	je	.LBB14_10
.LBB14_9:
	movd	4(%rsi,%rdx), %xmm2
	pxor	%xmm3, %xmm3
	punpcklbw	%xmm3, %xmm2
	punpcklwd	%xmm3, %xmm2
	paddd	%xmm2, %xmm1
	movd	(%rsi,%rdx), %xmm2
	punpcklbw	%xmm3, %xmm2
	punpcklwd	%xmm3, %xmm2
	paddd	%xmm2, %xmm0
.LBB14_10:
	paddd	%xmm1, %xmm0
	pshufd	$78, %xmm0, %xmm1
	paddd	%xmm0, %xmm1
	pshufd	$229, %xmm1, %xmm0
	paddd	%xmm1, %xmm0
	movd	%xmm0, %eax
	cmpq	%r8, %rdi
	je	.LBB14_13
	subq	%r8, %rdi
	addq	%r8, %rsi
	.p2align	4, 0x90
.LBB14_12:
	movzbl	(%rsi), %ecx
	addl	%ecx, %eax
	incq	%rsi
	decq	%rdi
	jne	.LBB14_12
.LBB14_13:
	retq
.LBB14_5:
	pxor	%xmm0, %xmm0
	xorl	%edx, %edx
	pxor	%xmm1, %xmm1
	testq	%rax, %rax
	jne	.LBB14_9
	jmp	.LBB14_10
.Lfunc_end14:

It's too long to go through manually, but a diff of the two asm blocks showed that they are instruction-for-instruction identical. IOW, changing from foreach to for did not make any difference to ldc.

@JackStouffer JackStouffer force-pushed the lockingtextwriter-put branch 2 times, most recently from 68f1ba7 to a931945 Compare January 29, 2018 19:30
@JackStouffer

std/stdio.d(2891): Warning: statement is not reachable

Damn issue 14835!

@JackStouffer JackStouffer force-pushed the lockingtextwriter-put branch from a931945 to 5795762 Compare January 29, 2018 19:59
@JackStouffer

Had to go back to the non-DRY version.

@quickfur

So basically, with ldc there is no difference between the foreach and for versions of the code, at least as far as a string.byCodeUnit argument is concerned. I suppose if you want to be really thorough, you could try compiling it with a custom range type to see how it does. In general, given what I've seen of ldc's asm output, I'm not expecting there's going to be a difference, except where the range type actually does something different when it's copied, which is a case that I don't think we need to worry about if the concern is performance.

So this boils down to whether we feel it's worth uglifying Phobos for the sake of dmd, or leave it up to better optimizers like ldc/gdc to do the dirty work for us.

@quickfur

P.S., though obviously, eliminating the needless decoding of wide strings to dchar should probably still be done. I'm not sure if even ldc's optimizer is smart enough to do that for us. So at the very least, we should still check that in.

@JackStouffer

I'm not sure if even ldc's optimizer is smart enough to do that for us.

Not possible. The decoding was done explicitly by the foreach in the old code.

@wilzbach

Fighting the optimizer: benchmarks

@JackStouffer @quickfur I noticed that you two were fighting the optimizer lately, but haven't had time to reply. There are two tricks to fight the optimizer for benchmarks: __gshared + doNotOptimizeAway

private void doNotOptimizeAway(T)(auto ref T t)
{
    import core.thread : getpid;
    import core.stdc.stdio : printf;
    // getpid() == 1 is effectively never true at runtime, but the compiler
    // cannot prove that, so it must assume t may be printed and keep it alive.
    if (getpid() == 1)
    {
        printf("%s", *cast(char*) &t);
    }
}

Example: https://gist.github.com/wilzbach/3407d80bfa757d46a3ac59a873d5f085

I actually tried to push it once (#5416) and we should really try to standardize this - it's the most common pitfall when benchmarking.

(There's also the trick of using separate compilation to fight the optimizer, but that's usually more work (and harder to share because it's not a single file anymore), so I'm not going to suggest it. However, if you want to see this trick in practice, see https://github.com/wilzbach/d-benchmarks).
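
A minimal usage sketch of the two tricks together (the names and data here are made up for illustration; doNotOptimizeAway is the helper defined above):

```d
// __gshared keeps the input out of the compiler's hands: its value is not
// a compile-time constant, so the loop below can't be constant-folded.
__gshared ubyte[] data = [1, 2, 3, 4];

void benchmarkSum()
{
    uint total;
    foreach (b; data)
        total += b;
    // The result is (potentially) observed, so the loop can't be elided.
    doNotOptimizeAway(total);
}
```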

Fighting the optimizer: asm

If you look at the ASM, there are two tricks: printf + using (unknown) function arguments.
The optimizer can't kill that.
It's as simple as:

void bar(string foo)
{
    import std.utf;
    import core.stdc.stdio;
    auto m = foo.byCodeUnit;
    foreach (c; m) {
        printf("%c", c);
    }
}

which produces:

void example.bar(immutable(char)[]):
  push r14
  push rbx
  push rax
  mov rbx, rsi
  mov r14, rdi
  test r14, r14
  je .LBB0_2
.LBB0_1:
  movzx edi, byte ptr [rbx]
  call putchar@PLT ; <- inner loop statement
  inc rbx
  dec r14
  jne .LBB0_1 ; <- GOTO of the loop
.LBB0_2:
  add rsp, 8
  pop rbx
  pop r14
  ret

So both byCodeUnit and foreach are completely optimized away by LDC, pretty cool huh?

So this boils down to whether we feel it's worth uglifying Phobos for the sake of dmd, or leave it up to better optimizers like ldc/gdc to do the dirty work for us.

I vote against. Any performance-critical application will use ldc or gdc. dmd code is slow as a snail compared to LDC anyhow, so these few things won't help anyway.

(OT: I wonder whether we finally reached the point to make LDC the default compiler and simply put all dev efforts into one well-maintained backend, but that's a discussion for another day)

@quickfur

Now that @WalterBright has been freed from being hand-tied by the dmd backend license, I wonder if we can convince him to either improve dmd's optimizer, or direct his efforts towards either the ldc or gdc backends.

To be fair to Walter, dmd's optimizer is not bad when you're not hard-pressed to squeeze every last CPU cycle out of your program. Where it fails to deliver is in (1) overly conservative inlining, leading to missed opportunities from the domino effect of one inlining leading to further optimization opportunities, and (2) loop unrolling, which in many common cases also opens up further optimization opportunities down the road. Besides loop unrolling, dmd's loop optimizations are actually pretty decent. But in both cases, what ultimately causes dmd to produce suboptimal code is the many missed opportunities to make a locally small optimization that nevertheless would have led to bigger optimizations.

As long as this situation continues to obtain, I will continue to stick by my recommendation that performance-sensitive code (and by extension, performance measurements and optimization decisions) should be compiled with gdc/ldc, and dmd should only be used if performance is not critical and/or you absolutely need the latest bleeding-edge language features.

@quickfur

P.S. Not to mention that both gdc and ldc are able to target far more architectures than dmd, so if you're working with non-x86 architectures, dmd is not even a choice. And with our current manpower, I can't see wider arch support in dmd anytime in the foreseeable future (not to mention that it would be a huge waste of energy to duplicate what others have already done, energy that could be better spent elsewhere).

@jondegenhardt

My recent performance tests with LDC's LTO and PGO point to the quality of the optimization work coming out of the LLVM community. By all accounts, the same is true of GCC as well. It certainly makes sense to target LDC/GDC for performance benchmarks.

It'd be worth having more tests than just mine, but my tests suggest it might be worth targeting a common setup where LDC's LTO is a build option, with LTO set up to encompass druntime, Phobos, and DUB dependencies in addition to the application code. The big win in such a setup might simply be better inlining across all the module boundaries, but that appears to be significant.

@wilzbach left a comment

OK with doc + encode change, but not okay with the foreach loop changes. As mentioned earlier, please only do such changes if you have looked at LDC's assembly (though the LDC people are pretty fast in fixing bugs/missed optimizations if you really identify one).

 {
     char[4] buf;
-    immutable size = encode!(UseReplacementDchar.yes)(buf, c);
+    immutable size = encode(buf, c);
Contributor

I saw there was some discussion at #6091 (comment) (for the future: it would be great if you could either re-use the PR or reference such discussions at the relevant places in the new PR, as it took me a while to find the discussion and I almost missed it).

The annoyance that auto-decoding makes strings neither nothrow nor @nogc (though the undocumented -dip1008 for `@nogc` exceptions has been merged and might soon be enabled by default) should make us really put an end to the auto-decoding hell.

FWIW there has been DIP76 once:

When the autodecoder encounters invalid UTF, it throws an exception. This proposal changes that so it replaces the invalid UTF with the Replacement Character.

I guess that one never went anywhere :/

Anyhow, I'm still not sure I understand the motivation for this change.
It could also have resulted in a breaking change, as it could have changed the function from nothrow to throwing.
It doesn't, due to the nasty handle_ having its attributes specified manually and thus being neither nothrow nor @nogc:

@property _iobuf* handle_() @trusted { return cast(_iobuf*) file_._p.handle; }

but I would like to see such things pointed out by the submitter rather than having to dig them out myself :/
There can also be a case made that the dchar version doesn't use the UseReplacementDchar version, but again, the one proposing the change should bring up the arguments for it.

Member

Maybe it's time for us to make another big push to kill autodecoding for good. It's been almost universally hated by all, and while we did finally manage to convince Andrei the last time round that it's actually a bad thing (after many years of interminable arguments), we didn't go far enough, IMO -- we only got as far as introducing byCodeUnit, et al, and to try to reduce reliance on autodecoding in various Phobos functions.

But it has been an ongoing source of performance concerns, and now nothrow and @nogc unfriendliness, that perhaps we can make a very strong case for getting rid of it forever. I, for one, plan to celebrate the day the last of autodecoding is finally killed off from Phobos code forever.

Contributor Author

@quickfur Did you see this comment #6073 (comment) ?

Member

RCStr has been talked about since, what, a year ago or more? When is it ever going to materialize? What's the holdup?

Contributor Author

@quickfur DIP1000. I stalled Walter's PRs for it because I believed it wasn't handling GC-ed objects correctly (and Martin agreed #5052 (comment)). There's been no work on DIP1000 since my comments on the matter so I think Walter dropped it for some reason.

Contributor

Here's a list of what modules aren't compatible with DIP1000 as of now: https://github.com/dlang/phobos/projects/6

There has been some recent work on DIP1000 at #5915 and #6041. The first one stalled because DIP1000 uses a different mangling :/

Contributor Author

There has been some recent work on DIP1000 at #5915 and #6041

OK, there's been no work by Walter since.

Member

While dip1000 isn't perfect by any stretch, I still would welcome it as an improvement on the current situation.

std/stdio.d Outdated
// put each element in turn.
for (; !writeme.empty; writeme.popFront)
{
put(writeme.front);
Contributor

No. As pointed out in the discussion, LDC/GDC are smart enough.
The code should be improved to be more high-level and abstract, not the other way around.

}

// put each element in turn.
alias Elem = Unqual!(ElementType!A);
Contributor

That's a welcome change!

std/stdio.d Outdated
-alias Elem = Unqual!(ElementType!A);
-foreach (Elem c; writeme)
+// put each element in turn.
+for (; !writeme.empty; writeme.popFront)
Contributor

No. As pointed out in the discussion, LDC/GDC are smart enough.
The code should be improved to be more high-level and abstract, not the other way around.

Have a look at what LDC is doing: https://godbolt.org/g/HuhD6T
It even inlines put completely:

void example.put(immutable(char)[]):
  mov rcx, qword ptr [rip + int example.put(immutable(char)[]).__foreachbody2(ref dchar)@GOTPCREL]
  xor edx, edx
  jmp _aApplycd1@PLT

int example.put(immutable(char)[]).__foreachbody2(ref dchar):
  push r14
  push rbx
  push rax
  mov edi, dword ptr [rsi]
  cmp edi, 127
  ja .LBB1_2
  xor esi, esi
  call fputc_unlocked@PLT
  jmp .LBB1_6
.LBB1_2:
  lea rsi, [rsp + 4]
  call _D3std3utf__T6encodeVEQu8typecons__T4FlagVAyaa19_7573655265706c6163656d656e744463686172ZQCai0ZQDdFNaNfJG4awZm@PLT
  mov r14, rax
  test r14, r14
  je .LBB1_6
  xor ebx, ebx
.LBB1_4:
  cmp rbx, 3
  ja .LBB1_7
  movzx edi, byte ptr [rsp + rbx + 4]
  xor esi, esi
  call fputc_unlocked@PLT
  inc rbx
  cmp rbx, r14
  jb .LBB1_4
.LBB1_6:
  xor eax, eax
  add rsp, 8
  pop rbx
  pop r14
  ret
.LBB1_7:
  lea rsi, [rip + .L.str]
  mov edi, 63
  mov edx, 39
  call _d_arraybounds@PLT

@CyberShadow

Before too much time is invested in this particular matter, is this even a case where performance matters? This is code that deals with IO and libc, so the for loops here (no matter how they're optimized) are not likely to be a bottleneck, are they?

@JackStouffer

@wilzbach I've stated my (long) disagreements with dip1008 in the NG. It's a change we will come to regret.

Technically, I'm not introducing any new throwing code, because all wstrings with bad data already threw in the old code due to the explicit auto-decoding. So, you could say I'm keeping existing behavior 😃
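
To illustrate the claim (a hedged, self-contained example; the unpaired surrogate is just one way to produce bad UTF-16 data):

```d
import std.exception : assertThrown;
import std.utf : decode, UTFException;

void main()
{
    // An unpaired surrogate is invalid UTF-16, so decoding it throws a
    // UTFException, which is what the old, explicitly auto-decoding code
    // path already did for bad wstring data.
    wstring bad = [cast(wchar) 0xD800];
    size_t index = 0;
    assertThrown!UTFException(decode(bad, index));
}
```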

I'll remove the foreach changes.

* Removed code that decoded wstrings to dchars and then encoded them to chars
* Made the wchar path throw a UTF exception on bad data
* Improved the docs for LockingTextWriter
@dlang-bot dlang-bot merged commit a267043 into dlang:master Jan 30, 2018
@JackStouffer JackStouffer deleted the lockingtextwriter-put branch January 30, 2018 20:03