
Conversation

@JackStouffer JackStouffer commented Jan 29, 2018

  • Removed code that did extra work by decoding wstrings to dchars and then re-encoding them to chars. In the case of writing a wstring to a wide-oriented file, put no longer has to do any encoding or decoding at all, whereas it did a bunch of extra work before.
  • Made the wchar path throw a UTF exception on bad data
  • Improved the docs for LockingTextWriter

CC @CyberShadow

@dlang-bot

Thanks for your pull request, @JackStouffer!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

@JackStouffer JackStouffer force-pushed the lockingtextwriter-put branch from 417c8c5 to 99e7478 Compare January 29, 2018 18:38
auto result = trustedFwrite(file_._p.handle, writeme);
if (result != writeme.length) errnoEnforce(0);
return;
}
Member

Won't the else case (i.e., orientation_ > 0) have a bug, now that you made the original fallthrough code an else to the static if?

Contributor Author

Yes. Fixed

Member

Hmm. Can't say I'm a fan of the copy-pasta. What about inserting a return for the isSomeString!A && C.sizeof > 1 case, and getting rid of the else after that, so that we can merge these two identical for-loops? I feel bad for nitpicking on this level, but the code duplication seems unwarranted in this case.

std/stdio.d Outdated
for (; !writeme.empty; writeme.popFront)
{
put(writeme.front);
}
Member

This seems to be a strange way to write the loop. What's wrong with just foreach (e; writeme) put(e);?

Member

And on that note, why not just use foreach (c; writeme) put(c); for both branches of the static if? In which case we could just fold it into a single generic loop. AFAICT, the only real difference is that we have foreach (c; writeme) rather than foreach (Elem c; writeme), which would force auto-decoding in the case of a wide string. Getting rid of the original stuff surrounding Elem should take care of that.

(Though granted, the difference is rather subtle; a comment explaining this might be in order.)

Contributor Author

This seems to be a strange way to write the loop. What's wrong with just

See the ASM here: #6091 (comment)

An unnecessary copy of writeme is made regardless of optimizations.
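
For context, this is a consequence of how D defines `foreach` over a non-array range: it iterates a copy-initialized local. A rough sketch of the lowering (not Phobos code; `put` and `writeme` as in the surrounding diff):

```d
// foreach (e; writeme) put(e);  is lowered approximately to:
{
    auto __r = writeme;              // the range is copied here
    for (; !__r.empty; __r.popFront())
    {
        auto e = __r.front;
        put(e);
    }
}

// The hand-written loop consumes writeme directly, with no copy:
for (; !writeme.empty; writeme.popFront)
    put(writeme.front);
```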

Member

Ah I see. Does gdc/ldc optimize that out? It seems kinda sad that the construct that's supposed to natively support ranges would also introduce such blatant inefficiencies.

(Though I can understand that historically, we've always made a copy of slices when running foreach over them, so that foreach doesn't unexpectedly consume your slice. I suppose the same argument could be made for other range types as well. But the optimizer should be smart enough, at least in theory, to notice that the original value of the range argument isn't needed, so the copy could be eliminated.)

Contributor Author

Does gdc/ldc optimize that out?

I can't make heads or tails of ldc's ASM output. You can see it here https://run.dlang.io/is/ck7lsp

I'm going to assume "no", because there are cases where not making a copy could have subtle behavior changes.

Though I can understand that historically, we've always made a copy of slices when running foreach over them, so that foreach doesn't unexpectedly consume your slice. I suppose the same argument could be made for other range types as well.

Yeah that's the idea, and it makes sense for user code to always use foreach regardless for simplicity's sake.

Member

Haha, it seems ldc's aggressive optimizer has defeated us once again. You might want to put the code into a public function that returns a value, because ldc's optimizer has apparently noticed that the value computed by the loop is never used, so it has completely elided the entire body of main() to just return 0. :-P

Member

(And I just tried changing the code to return the value computed by the loop: again, ldc's optimizer is very aggressive and decided to execute the loop at compile-time, so main() just returns the precomputed value directly without any computation.)

Member

OK, the only way I could convince ldc not to replace the function body with the compile-time evaluated result is to make the range a function parameter, with a forced instantiation via an alias: here.
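
The linked code isn't shown here; a hedged reconstruction of that kind of harness (all names are assumptions) looks like this:

```d
import std.utf : byCodeUnit;

// Summing the code units gives the loop an observable result.
uint sumCodeUnits(R)(R r)
{
    uint total;
    foreach (c; r)
        total += c;
    return total;
}

// Forcing an instantiation through an alias makes the compiler emit a
// standalone body for this concrete range type, whose asm can be inspected;
// the range arrives as a runtime parameter, so nothing can be precomputed
// at compile time.
alias sumForByCodeUnit = sumCodeUnits!(typeof("".byCodeUnit));
```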

The function body uses a whole bunch of fancy XMM vectorized operation setup, then a very tight loop summing the values:

.cfi_startproc
	testq	%rdi, %rdi
	je	.LBB14_1
	cmpq	$8, %rdi
	jae	.LBB14_4
	xorl	%eax, %eax
	jmp	.LBB14_12
.LBB14_1:
	xorl	%eax, %eax
	retq
.LBB14_4:
	movq	%rdi, %r8
	andq	$-8, %r8
	leaq	-8(%r8), %rdx
	shrq	$3, %rdx
	leal	1(%rdx), %eax
	andl	$1, %eax
	testq	%rdx, %rdx
	je	.LBB14_5
	leaq	-1(%rax), %rcx
	subq	%rdx, %rcx
	pxor	%xmm2, %xmm2
	xorl	%edx, %edx
	pxor	%xmm0, %xmm0
	pxor	%xmm1, %xmm1
	.p2align	4, 0x90
.LBB14_7:
	movd	(%rsi,%rdx), %xmm3
	movd	4(%rsi,%rdx), %xmm4
	punpcklbw	%xmm2, %xmm3
	punpcklwd	%xmm2, %xmm3
	paddd	%xmm0, %xmm3
	punpcklbw	%xmm2, %xmm4
	punpcklwd	%xmm2, %xmm4
	paddd	%xmm1, %xmm4
	movd	8(%rsi,%rdx), %xmm0
	movd	12(%rsi,%rdx), %xmm1
	punpcklbw	%xmm2, %xmm0
	punpcklwd	%xmm2, %xmm0
	paddd	%xmm3, %xmm0
	punpcklbw	%xmm2, %xmm1
	punpcklwd	%xmm2, %xmm1
	paddd	%xmm4, %xmm1
	addq	$16, %rdx
	addq	$2, %rcx
	jne	.LBB14_7
	testq	%rax, %rax
	je	.LBB14_10
.LBB14_9:
	movd	4(%rsi,%rdx), %xmm2
	pxor	%xmm3, %xmm3
	punpcklbw	%xmm3, %xmm2
	punpcklwd	%xmm3, %xmm2
	paddd	%xmm2, %xmm1
	movd	(%rsi,%rdx), %xmm2
	punpcklbw	%xmm3, %xmm2
	punpcklwd	%xmm3, %xmm2
	paddd	%xmm2, %xmm0
.LBB14_10:
	paddd	%xmm1, %xmm0
	pshufd	$78, %xmm0, %xmm1
	paddd	%xmm0, %xmm1
	pshufd	$229, %xmm1, %xmm0
	paddd	%xmm1, %xmm0
	movd	%xmm0, %eax
	cmpq	%r8, %rdi
	je	.LBB14_13
	subq	%r8, %rdi
	addq	%r8, %rsi
	.p2align	4, 0x90
.LBB14_12:
	movzbl	(%rsi), %ecx
	addl	%ecx, %eax
	incq	%rsi
	decq	%rdi
	jne	.LBB14_12
.LBB14_13:
	retq
.LBB14_5:
	pxor	%xmm0, %xmm0
	xorl	%edx, %edx
	pxor	%xmm1, %xmm1
	testq	%rax, %rax
	jne	.LBB14_9
	jmp	.LBB14_10

This was with a foreach loop. With a manual for loop, I get:

.cfi_startproc
	testq	%rdi, %rdi
	je	.LBB14_1
	cmpq	$8, %rdi
	jae	.LBB14_4
	xorl	%eax, %eax
	jmp	.LBB14_12
.LBB14_1:
	xorl	%eax, %eax
	retq
.LBB14_4:
	movq	%rdi, %r8
	andq	$-8, %r8
	leaq	-8(%r8), %rdx
	shrq	$3, %rdx
	leal	1(%rdx), %eax
	andl	$1, %eax
	testq	%rdx, %rdx
	je	.LBB14_5
	leaq	-1(%rax), %rcx
	subq	%rdx, %rcx
	pxor	%xmm2, %xmm2
	xorl	%edx, %edx
	pxor	%xmm0, %xmm0
	pxor	%xmm1, %xmm1
	.p2align	4, 0x90
.LBB14_7:
	movd	(%rsi,%rdx), %xmm3
	movd	4(%rsi,%rdx), %xmm4
	punpcklbw	%xmm2, %xmm3
	punpcklwd	%xmm2, %xmm3
	paddd	%xmm0, %xmm3
	punpcklbw	%xmm2, %xmm4
	punpcklwd	%xmm2, %xmm4
	paddd	%xmm1, %xmm4
	movd	8(%rsi,%rdx), %xmm0
	movd	12(%rsi,%rdx), %xmm1
	punpcklbw	%xmm2, %xmm0
	punpcklwd	%xmm2, %xmm0
	paddd	%xmm3, %xmm0
	punpcklbw	%xmm2, %xmm1
	punpcklwd	%xmm2, %xmm1
	paddd	%xmm4, %xmm1
	addq	$16, %rdx
	addq	$2, %rcx
	jne	.LBB14_7
	testq	%rax, %rax
	je	.LBB14_10
.LBB14_9:
	movd	4(%rsi,%rdx), %xmm2
	pxor	%xmm3, %xmm3
	punpcklbw	%xmm3, %xmm2
	punpcklwd	%xmm3, %xmm2
	paddd	%xmm2, %xmm1
	movd	(%rsi,%rdx), %xmm2
	punpcklbw	%xmm3, %xmm2
	punpcklwd	%xmm3, %xmm2
	paddd	%xmm2, %xmm0
.LBB14_10:
	paddd	%xmm1, %xmm0
	pshufd	$78, %xmm0, %xmm1
	paddd	%xmm0, %xmm1
	pshufd	$229, %xmm1, %xmm0
	paddd	%xmm1, %xmm0
	movd	%xmm0, %eax
	cmpq	%r8, %rdi
	je	.LBB14_13
	subq	%r8, %rdi
	addq	%r8, %rsi
	.p2align	4, 0x90
.LBB14_12:
	movzbl	(%rsi), %ecx
	addl	%ecx, %eax
	incq	%rsi
	decq	%rdi
	jne	.LBB14_12
.LBB14_13:
	retq
.LBB14_5:
	pxor	%xmm0, %xmm0
	xorl	%edx, %edx
	pxor	%xmm1, %xmm1
	testq	%rax, %rax
	jne	.LBB14_9
	jmp	.LBB14_10
.Lfunc_end14:

It's too long to go through manually, but a diff of the two asm blocks showed that they are instruction-for-instruction identical. IOW, changing from foreach to for did not make any difference to ldc.

@JackStouffer JackStouffer force-pushed the lockingtextwriter-put branch 2 times, most recently from 68f1ba7 to a931945 Compare January 29, 2018 19:30
@JackStouffer

std/stdio.d(2891): Warning: statement is not reachable

Damn issue 14835!

@JackStouffer JackStouffer force-pushed the lockingtextwriter-put branch from a931945 to 5795762 Compare January 29, 2018 19:59
@JackStouffer

Had to go back to the non-DRY version.

@quickfur

So basically, with ldc there is no difference between the foreach and for versions of the code, at least as far as a string.byCodeUnit argument is concerned. I suppose if you want to be really thorough, you could try compiling it with a custom range type to see how it does. In general, given what I've seen of ldc's asm output, I'm not expecting there's going to be a difference, except where the range type actually does something different when it's copied, which is a case that I don't think we need to worry about if the concern is performance.

So this boils down to whether we feel it's worth uglifying Phobos for the sake of dmd, or leave it up to better optimizers like ldc/gdc to do the dirty work for us.

@quickfur

P.S., though obviously, eliminating the needless decoding of wide strings to dchar should probably still be done. I'm not sure if even ldc's optimizer is smart enough to do that for us. So at the very least, we should still check that in.

@JackStouffer

I'm not sure if even ldc's optimizer is smart enough to do that for us.

Not possible. The decoding was done explicitly by the foreach in the old code.

@wilzbach

Fighting the optimizer: benchmarks

@JackStouffer @quickfur I noticed that you two were fighting the optimizer lately, but haven't had time to reply. There are two tricks to fight the optimizer for benchmarks: __gshared + doNotOptimizeAway

private void doNotOptimizeAway(T)(auto ref T t)
{
    import core.thread : getpid;
    import core.stdc.stdio : printf;
    // getpid() == 1 is effectively never true at runtime, but the compiler
    // cannot prove that, so it must assume t may be printed and keep it alive.
    if (getpid() == 1)
    {
        printf("%s", *cast(char*) &t);
    }
}

Example: https://gist.github.com/wilzbach/3407d80bfa757d46a3ac59a873d5f085

I actually tried to push it once (#5416) and we should really try to standardize this - it's the most common pitfall when benchmarking.

(There's also the trick of using separate compilation to fight the optimizer, but that's usually more work (and harder to share because it's not a single file anymore), so I'm not going to suggest it. However, if you want to see this trick in practice, see https://github.com/wilzbach/d-benchmarks).
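
A minimal usage sketch of the two tricks together (the names and data here are made up for illustration; doNotOptimizeAway is the helper defined above):

```d
// __gshared keeps the input out of the compiler's hands: its value is not
// a compile-time constant, so the loop below can't be constant-folded.
__gshared ubyte[] data = [1, 2, 3, 4];

void benchmarkSum()
{
    uint total;
    foreach (b; data)
        total += b;
    // The result is (potentially) observed, so the loop can't be elided.
    doNotOptimizeAway(total);
}
```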

Fighting the optimizer: asm

If you look at the ASM, there are two tricks: printf + using (unknown) function arguments.
The optimizer can't kill that.
It's as simple as:

void bar(string foo)
{
    import std.utf;
    import core.stdc.stdio;
    auto m = foo.byCodeUnit;
    foreach (c; m) {
        printf("%c", c);
    }
}

which produces:

void example.bar(immutable(char)[]):
  push r14
  push rbx
  push rax
  mov rbx, rsi
  mov r14, rdi
  test r14, r14
  je .LBB0_2
.LBB0_1:
  movzx edi, byte ptr [rbx]
  call putchar@PLT ; <- inner loop statement
  inc rbx
  dec r14
  jne .LBB0_1 ; <- GOTO of the loop
.LBB0_2:
  add rsp, 8
  pop rbx
  pop r14
  ret

So both byCodeUnit and foreach are completely optimized away by LDC, pretty cool huh?

So this boils down to whether we feel it's worth uglifying Phobos for the sake of dmd, or leave it up to better optimizers like ldc/gdc to do the dirty work for us.

I vote against. Any performance-critical application will use ldc or gdc. dmd code is slow as a snail compared to LDC anyhow, so these few things won't help anyway.

(OT: I wonder whether we finally reached the point to make LDC the default compiler and simply put all dev efforts into one well-maintained backend, but that's a discussion for another day)

@quickfur

Now that @WalterBright has been freed from being hand-tied by the dmd backend license, I wonder if we can convince him to either improve dmd's optimizer, or direct his efforts towards either the ldc or gdc backends.

To be fair to Walter, dmd's optimizer is not bad when you're not hard-pressed to squeeze every last CPU cycle out of your program. Where it fails to deliver is in (1) overly conservative inlining, leading to missed opportunities from the domino effect of one inlining leading to further optimization opportunities, and (2) loop unrolling, which in many common cases also opens up further optimization opportunities down the road. Besides loop unrolling, dmd's loop optimizations are actually pretty decent. But in both cases, what ultimately causes dmd to produce suboptimal code is the many missed opportunities to make a locally small optimization that nevertheless would have led to bigger optimizations.

As long as this situation continues to obtain, I will continue to stick by my recommendation that performance-sensitive code (and by extension, performance measurements and optimization decisions) should be compiled with gdc/ldc, and dmd should only be used if performance is not critical and/or you absolutely need the latest bleeding-edge language features.

@quickfur

P.S. Not to mention that both gdc and ldc are able to target far more architectures than dmd, so if you're working with non-x86 architectures, dmd is not even a choice. And with our current manpower, I can't see wider arch support in dmd anytime in the foreseeable future (not to mention that it would be a huge waste of energy to duplicate what others have already done, energy that could be better spent elsewhere).

@jondegenhardt

My recent performance tests with LDC's LTO and PGO point to the quality of the optimization work coming out of the LLVM community. By all accounts, the same is true of GCC as well. It certainly makes sense to target LDC/GDC for performance benchmarks.

It'd be worth having more tests than just mine, but my tests suggest it might be worth targeting a common setup where LDC's LTO is a build option, with LTO set up to encompass druntime, Phobos, and DUB dependencies in addition to the application code. The big win in such a setup might simply be better inlining across all the module boundaries, but that appears to be significant.

@wilzbach left a comment

OK with doc + encode change, but not okay with the foreach loop changes. As mentioned earlier, please only do such changes if you have looked at LDC's assembly (though the LDC people are pretty fast in fixing bugs/missed optimizations if you really identify one).

 {
     char[4] buf;
-    immutable size = encode!(UseReplacementDchar.yes)(buf, c);
+    immutable size = encode(buf, c);
Contributor

I saw there was some discussion at #6091 (comment) (for the future: it would be great if you could either re-use the PR or reference such discussions at the relevant places in the new PR, as it took me a while to find the discussion and I almost missed it).

The annoyance that auto-decoding makes strings neither nothrow nor @nogc (though the undocumented -dip1008 for `@nogc` exceptions has been merged and might soon be enabled by default) should make us really put an end to the auto-decoding hell.

FWIW there has been DIP76 once:

When the autodecoder encounters invalid UTF, it throws an exception. This proposal changes that so it replaces the invalid UTF with the Replacement Character.

I guess that one never went anywhere :/

Anyhow, I'm still not sure I understand the motivation for this change.
It could also have resulted in a breaking change, as it could have changed the function from nothrow to throwing.
It doesn't, due to the nasty handle_ having its attributes specified manually and thus being neither nothrow nor @nogc:

@property _iobuf* handle_() @trusted { return cast(_iobuf*) file_._p.handle; }

but I would like to see such things pointed out by the submitter rather than having to dig them out myself :/
There can also be a case made that the dchar version doesn't use the UseReplacementDchar version, but again, the one proposing the change should bring up the arguments for it.

Member

Maybe it's time for us to make another big push to kill autodecoding for good. It's been almost universally hated by all, and while we did finally manage to convince Andrei the last time round that it's actually a bad thing (after many years of interminable arguments), we didn't go far enough, IMO -- we only got as far as introducing byCodeUnit, et al, and to try to reduce reliance on autodecoding in various Phobos functions.

But it has been an ongoing source of performance concerns, and now nothrow and @nogc unfriendliness, that perhaps we can make a very strong case for getting rid of it forever. I, for one, plan to celebrate the day the last of autodecoding is finally killed off from Phobos code forever.

Contributor Author

@quickfur Did you see this comment #6073 (comment) ?

Member

RCStr has been talked about since, what, a year ago or more? When is it ever going to materialize? What's the holdup?

Contributor Author

@quickfur DIP1000. I stalled Walter's PRs for it because I believed it wasn't handling GC-ed objects correctly (and Martin agreed #5052 (comment)). There's been no work on DIP1000 since my comments on the matter so I think Walter dropped it for some reason.

Contributor

Here's a list of what modules aren't compatible with DIP1000 as of now: https://github.com/dlang/phobos/projects/6

There has been some recent work on DIP1000 at #5915 and #6041. The first one stalled because DIP1000 uses a different mangling :/

Contributor Author

There has been some recent work on DIP1000 at #5915 and #6041

OK, there's been no work by Walter since.

Member

While dip1000 isn't perfect by any stretch, I still would welcome it as an improvement on the current situation.

std/stdio.d Outdated
// put each element in turn.
for (; !writeme.empty; writeme.popFront)
{
put(writeme.front);
Contributor

No. As pointed out in the discussion, LDC/GDC are smart enough.
The code should be improved to be more high-level and abstract, not the other way around.

}

// put each element in turn.
alias Elem = Unqual!(ElementType!A);
Contributor

That's a welcome change!

std/stdio.d Outdated
-alias Elem = Unqual!(ElementType!A);
-foreach (Elem c; writeme)
+// put each element in turn.
+for (; !writeme.empty; writeme.popFront)
Contributor

No. As pointed out in the discussion, LDC/GDC are smart enough.
The code should be improved to be more high-level and abstract, not the other way around.

Have a look at what LDC is doing: https://godbolt.org/g/HuhD6T
It even inlines put completely:

void example.put(immutable(char)[]):
  mov rcx, qword ptr [rip + int example.put(immutable(char)[]).__foreachbody2(ref dchar)@GOTPCREL]
  xor edx, edx
  jmp _aApplycd1@PLT

int example.put(immutable(char)[]).__foreachbody2(ref dchar):
  push r14
  push rbx
  push rax
  mov edi, dword ptr [rsi]
  cmp edi, 127
  ja .LBB1_2
  xor esi, esi
  call fputc_unlocked@PLT
  jmp .LBB1_6
.LBB1_2:
  lea rsi, [rsp + 4]
  call _D3std3utf__T6encodeVEQu8typecons__T4FlagVAyaa19_7573655265706c6163656d656e744463686172ZQCai0ZQDdFNaNfJG4awZm@PLT
  mov r14, rax
  test r14, r14
  je .LBB1_6
  xor ebx, ebx
.LBB1_4:
  cmp rbx, 3
  ja .LBB1_7
  movzx edi, byte ptr [rsp + rbx + 4]
  xor esi, esi
  call fputc_unlocked@PLT
  inc rbx
  cmp rbx, r14
  jb .LBB1_4
.LBB1_6:
  xor eax, eax
  add rsp, 8
  pop rbx
  pop r14
  ret
.LBB1_7:
  lea rsi, [rip + .L.str]
  mov edi, 63
  mov edx, 39
  call _d_arraybounds@PLT

@CyberShadow

Before too much time is invested in this particular matter, is this even a case where performance matters? This is code that deals with IO and libc, so the for loops here (no matter how they're optimized) are not likely to be a bottleneck, are they?

@JackStouffer

@wilzbach I've stated my (long) disagreements with dip1008 in the NG. It's a change we will come to regret.

Technically, I'm not introducing any new throwing code, because all wstrings with bad data already threw in the old code due to the explicit auto-decoding. So, you could say I'm keeping existing behavior 😃
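
To illustrate the claim (a hedged, self-contained example; the unpaired surrogate is just one way to produce bad UTF-16 data):

```d
import std.exception : assertThrown;
import std.utf : decode, UTFException;

void main()
{
    // An unpaired surrogate is invalid UTF-16, so decoding it throws a
    // UTFException, which is what the old, explicitly auto-decoding code
    // path already did for bad wstring data.
    wstring bad = [cast(wchar) 0xD800];
    size_t index = 0;
    assertThrown!UTFException(decode(bad, index));
}
```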

I'll remove the foreach changes.

* Removed code that decoded wstrings to dchars and then encoded them to chars
* Made the wchar path throw a UTF exception on bad data
* Improved the docs for LockingTextWriter
@dlang-bot dlang-bot merged commit a267043 into dlang:master Jan 30, 2018
@JackStouffer JackStouffer deleted the lockingtextwriter-put branch January 30, 2018 20:03