Modified the behavior of std.stdio.LockingTextWriter.put #6092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for your pull request, @JackStouffer!

Bugzilla references: Your PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.
Force-pushed 417c8c5 to 99e7478 (Compare)
```d
auto result = trustedFwrite(file_._p.handle, writeme);
if (result != writeme.length) errnoEnforce(0);
return;
}
```
Won't the else case (i.e., orientation_ > 0) have a bug, now that you made the original fallthrough code an else to the static if?
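For readers following along, here is a self-contained sketch of the control flow being discussed (the real code lives in `std.stdio.LockingTextWriter.put`; `putAll`, `isNarrowStringCase`, and the bodies are illustrative stand-ins):

```d
// Illustrative sketch only, not the Phobos code. The point: if the old
// fallthrough code becomes an `else` of the `static if`, then a narrow
// string written to a wide-oriented file (orientation_ > 0) never reaches
// the generic path, which is the bug being pointed out.
import std.stdio;

enum isNarrowStringCase = true; // stands in for the static if condition

void putAll(string writeme, int orientation_)
{
    static if (isNarrowStringCase)
    {
        if (orientation_ <= 0)
        {
            writeln("fast path: raw fwrite of ", writeme.length, " bytes");
            return;
        }
        // no `else`: fall through so orientation_ > 0 reaches the loop
    }
    foreach (c; writeme)
        write(c);       // generic element-by-element path
    writeln();
}

void main()
{
    putAll("hi", 0); // takes the fast path
    putAll("hi", 1); // must still hit the generic path
}
```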
Yes. Fixed
Hmm. Can't say I'm a fan of the copy-pasta. What about inserting a return for the isSomeString!A && C.sizeof > 1 case, and getting rid of the else after that, so that we can merge these two identical for-loops? I feel bad for nitpicking on this level, but the code duplication seems unwarranted in this case.
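A compilable sketch of the shape being suggested (names and bodies illustrative): give the wide-string case an early `return`, then let every other case share a single loop instead of duplicating it in an `else`:

```d
// Sketch of the suggested refactor, not the actual Phobos code.
import std.stdio;
import std.traits : isSomeString;

void putAll(A)(A writeme)
{
    alias C = typeof(writeme[0]);
    static if (isSomeString!A && C.sizeof > 1)
    {
        writeln("wide-string special case");
        return; // early return: no `else`, no duplicated loop below
    }
    // single shared loop for all remaining cases:
    foreach (e; writeme)
        write(e, ' ');
    writeln();
}

void main()
{
    putAll("abc");     // narrow string: shared loop
    putAll("abc"w);    // wide string: special case
    putAll([1, 2, 3]); // generic input: shared loop
}
```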
std/stdio.d
Outdated
```d
for (; !writeme.empty; writeme.popFront)
{
    put(writeme.front);
}
```
This seems to be a strange way to write the loop. What's wrong with just `foreach (e; writeme) put(e);`?
And on that note, why not just use `foreach (c; writeme) put(c);` for both branches of the `static if`? In which case we could just fold it into a single generic loop. AFAICT, the only real difference is that we have `foreach (c; writeme)` rather than `foreach (Elem c; writeme)`, which would force auto-decoding in the case of a wide string. Getting rid of the original stuff surrounding `Elem` should take care of that.
(Though granted, the difference is rather subtle; a comment explaining this might be in order.)
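The subtle difference can be demonstrated directly: with an inferred loop variable, `foreach` over a `string` iterates `char` code units, while an explicit `dchar` variable forces auto-decoding:

```d
import std.stdio;

void main()
{
    string s = "héllo"; // 'é' occupies two UTF-8 code units

    size_t codeUnits;
    foreach (c; s)       // c is inferred as char: no decoding
        ++codeUnits;

    size_t codePoints;
    foreach (dchar c; s) // explicit dchar forces auto-decoding
        ++codePoints;

    writeln(codeUnits, " code units, ", codePoints, " code points");
    // prints: 6 code units, 5 code points
}
```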
> This seems to be a strange way to write the loop. What's wrong with just `foreach (e; writeme) put(e);`?

See the ASM here: #6091 (comment)

An unnecessary copy of `writeme` is made regardless of optimizations.
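The copy is a consequence of `foreach`'s lowering, which iterates a copy of the range so the original isn't consumed; a small demonstration:

```d
import std.range.primitives : empty, front, popFront;
import std.stdio;

void main()
{
    auto r = [1, 2, 3];

    foreach (e; r) {}   // foreach iterates a copy of the slice
    writeln(r.length);  // prints 3: r itself was not advanced

    for (; !r.empty; r.popFront()) {} // manual loop consumes r in place
    writeln(r.length);  // prints 0
}
```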
Ah I see. Does gdc/ldc optimize that out? It seems kinda sad that the construct that's supposed to natively support ranges would also introduce such blatant inefficiencies.
(Though I can understand that historically, we've always made a copy of slices when running foreach over them, so that foreach doesn't unexpectedly consume your slice. I suppose the same argument could be made for other range types as well. But the optimizer should be smart enough, at least in theory, to notice that the original value of the range argument isn't needed, so the copy could be eliminated.)
> Does gdc/ldc optimize that out?

I can't make heads or tails of ldc's ASM output. You can see it here: https://run.dlang.io/is/ck7lsp
I'm going to assume "no", because there are cases where not making a copy could have subtle behavior changes.
> Though I can understand that historically, we've always made a copy of slices when running foreach over them, so that foreach doesn't unexpectedly consume your slice. I suppose the same argument could be made for other range types as well.
Yeah that's the idea, and it makes sense for user code to always use foreach regardless for simplicity's sake.
Haha, it seems ldc's aggressive optimizer has defeated us once again. You might want to put the code into a public function that returns a value, because ldc's optimizer has apparently noticed that the value computed by the loop is never used, so it has completely elided the entire body of main() to just return 0. :-P
(And I just tried changing the code to return the value computed by the loop: again, ldc's optimizer is very aggressive and decided to execute the loop at compile-time, so main() just returns the precomputed value directly without any computation.)
OK, the only way I could convince ldc not to replace the function body with the compile-time evaluated result is to make the range a function parameter, with a forced instantiation via an alias: here.
The function body uses a whole bunch of fancy XMM vectorized operation setup, then a very tight loop summing the values:
.cfi_startproc
testq %rdi, %rdi
je .LBB14_1
cmpq $8, %rdi
jae .LBB14_4
xorl %eax, %eax
jmp .LBB14_12
.LBB14_1:
xorl %eax, %eax
retq
.LBB14_4:
movq %rdi, %r8
andq $-8, %r8
leaq -8(%r8), %rdx
shrq $3, %rdx
leal 1(%rdx), %eax
andl $1, %eax
testq %rdx, %rdx
je .LBB14_5
leaq -1(%rax), %rcx
subq %rdx, %rcx
pxor %xmm2, %xmm2
xorl %edx, %edx
pxor %xmm0, %xmm0
pxor %xmm1, %xmm1
.p2align 4, 0x90
.LBB14_7:
movd (%rsi,%rdx), %xmm3
movd 4(%rsi,%rdx), %xmm4
punpcklbw %xmm2, %xmm3
punpcklwd %xmm2, %xmm3
paddd %xmm0, %xmm3
punpcklbw %xmm2, %xmm4
punpcklwd %xmm2, %xmm4
paddd %xmm1, %xmm4
movd 8(%rsi,%rdx), %xmm0
movd 12(%rsi,%rdx), %xmm1
punpcklbw %xmm2, %xmm0
punpcklwd %xmm2, %xmm0
paddd %xmm3, %xmm0
punpcklbw %xmm2, %xmm1
punpcklwd %xmm2, %xmm1
paddd %xmm4, %xmm1
addq $16, %rdx
addq $2, %rcx
jne .LBB14_7
testq %rax, %rax
je .LBB14_10
.LBB14_9:
movd 4(%rsi,%rdx), %xmm2
pxor %xmm3, %xmm3
punpcklbw %xmm3, %xmm2
punpcklwd %xmm3, %xmm2
paddd %xmm2, %xmm1
movd (%rsi,%rdx), %xmm2
punpcklbw %xmm3, %xmm2
punpcklwd %xmm3, %xmm2
paddd %xmm2, %xmm0
.LBB14_10:
paddd %xmm1, %xmm0
pshufd $78, %xmm0, %xmm1
paddd %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0
paddd %xmm1, %xmm0
movd %xmm0, %eax
cmpq %r8, %rdi
je .LBB14_13
subq %r8, %rdi
addq %r8, %rsi
.p2align 4, 0x90
.LBB14_12:
movzbl (%rsi), %ecx
addl %ecx, %eax
incq %rsi
decq %rdi
jne .LBB14_12
.LBB14_13:
retq
.LBB14_5:
pxor %xmm0, %xmm0
xorl %edx, %edx
pxor %xmm1, %xmm1
testq %rax, %rax
jne .LBB14_9
jmp .LBB14_10
This was with a foreach loop. With a manual for loop, I get:
.cfi_startproc
testq %rdi, %rdi
je .LBB14_1
cmpq $8, %rdi
jae .LBB14_4
xorl %eax, %eax
jmp .LBB14_12
.LBB14_1:
xorl %eax, %eax
retq
.LBB14_4:
movq %rdi, %r8
andq $-8, %r8
leaq -8(%r8), %rdx
shrq $3, %rdx
leal 1(%rdx), %eax
andl $1, %eax
testq %rdx, %rdx
je .LBB14_5
leaq -1(%rax), %rcx
subq %rdx, %rcx
pxor %xmm2, %xmm2
xorl %edx, %edx
pxor %xmm0, %xmm0
pxor %xmm1, %xmm1
.p2align 4, 0x90
.LBB14_7:
movd (%rsi,%rdx), %xmm3
movd 4(%rsi,%rdx), %xmm4
punpcklbw %xmm2, %xmm3
punpcklwd %xmm2, %xmm3
paddd %xmm0, %xmm3
punpcklbw %xmm2, %xmm4
punpcklwd %xmm2, %xmm4
paddd %xmm1, %xmm4
movd 8(%rsi,%rdx), %xmm0
movd 12(%rsi,%rdx), %xmm1
punpcklbw %xmm2, %xmm0
punpcklwd %xmm2, %xmm0
paddd %xmm3, %xmm0
punpcklbw %xmm2, %xmm1
punpcklwd %xmm2, %xmm1
paddd %xmm4, %xmm1
addq $16, %rdx
addq $2, %rcx
jne .LBB14_7
testq %rax, %rax
je .LBB14_10
.LBB14_9:
movd 4(%rsi,%rdx), %xmm2
pxor %xmm3, %xmm3
punpcklbw %xmm3, %xmm2
punpcklwd %xmm3, %xmm2
paddd %xmm2, %xmm1
movd (%rsi,%rdx), %xmm2
punpcklbw %xmm3, %xmm2
punpcklwd %xmm3, %xmm2
paddd %xmm2, %xmm0
.LBB14_10:
paddd %xmm1, %xmm0
pshufd $78, %xmm0, %xmm1
paddd %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0
paddd %xmm1, %xmm0
movd %xmm0, %eax
cmpq %r8, %rdi
je .LBB14_13
subq %r8, %rdi
addq %r8, %rsi
.p2align 4, 0x90
.LBB14_12:
movzbl (%rsi), %ecx
addl %ecx, %eax
incq %rsi
decq %rdi
jne .LBB14_12
.LBB14_13:
retq
.LBB14_5:
pxor %xmm0, %xmm0
xorl %edx, %edx
pxor %xmm1, %xmm1
testq %rax, %rax
jne .LBB14_9
jmp .LBB14_10
.Lfunc_end14:
It's too long to go through manually, but a diff of the two asm blocks showed that they are instruction-for-instruction identical. IOW, changing from foreach to for did not make any difference to ldc.
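For reference, the kind of function that produces the vectorized byte-summing loop above would look roughly like this (a sketch; the actual benchmark code is behind the links above):

```d
// Summing the bytes of a runtime-supplied slice; taking the data as a
// function parameter keeps ldc from folding the whole computation at
// compile time, as described in the discussion.
uint sumBytes(const(ubyte)[] data)
{
    uint total;
    foreach (b; data) // a manual for loop compiles to identical code here
        total += b;
    return total;
}

void main()
{
    import std.stdio : writeln;
    auto data = new ubyte[](100);
    foreach (i, ref b; data)
        b = cast(ubyte) i;
    writeln(sumBytes(data)); // 0 + 1 + ... + 99 = 4950
}
```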
Force-pushed 68f1ba7 to a931945 (Compare)
Damn issue 14835!
Force-pushed a931945 to 5795762 (Compare)
Had to go back to the non-DRY version.
So basically, with ldc there is no difference between the `foreach` and the manual `for` loop. So this boils down to whether we feel it's worth uglifying Phobos for the sake of dmd, or leaving it up to better optimizers like ldc/gdc to do the dirty work for us.
P.S. Though obviously, eliminating the needless decoding of wide strings to `dchar` would still be nice.
Not possible. It's done explicitly by the foreach in the past code. |
**Fighting the optimizer: benchmarks**

@JackStouffer @quickfur I noticed that you two were fighting the optimizer lately, but haven't had time to reply. There are two tricks to fight the optimizer for benchmarks:

```d
private void doNotOptimizeAway(T)(auto ref T t)
{
    import core.thread : getpid;
    import core.stdc.stdio;
    if (getpid() == 1)
    {
        printf("%s", *cast(char*) &t);
    }
}
```

Example: https://gist.github.com/wilzbach/3407d80bfa757d46a3ac59a873d5f085

I actually tried to push it once (#5416) and we should really try to standardize this - it's the most common pitfall when benchmarking. (There's also the trick of using separate compilation to fight the optimizer, but that's usually more work (and harder to share because it's not a single file anymore), so I'm not going to suggest it. However, if you want to see this trick in practice, see https://github.com/wilzbach/d-benchmarks.)

**Fighting the optimizer: asm**

If you look at the ASM there are two tricks:

```d
void bar(string foo)
{
    import std.utf;
    import core.stdc.stdio;
    auto m = foo.byCodeUnit;
    foreach (c; m)
    {
        printf("%c", c);
    }
}
```

which produces:

```asm
void example.bar(immutable(char)[]):
        push    r14
        push    rbx
        push    rax
        mov     rbx, rsi
        mov     r14, rdi
        test    r14, r14
        je      .LBB0_2
.LBB0_1:
        movzx   edi, byte ptr [rbx]
        call    putchar@PLT        ; <- inner loop statement
        inc     rbx
        dec     r14
        jne     .LBB0_1            ; <- GOTO of the loop
.LBB0_2:
        add     rsp, 8
        pop     rbx
        pop     r14
        ret
```

So both the inner loop statement and the loop's backward jump survive in the generated code.
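A usage sketch of the `doNotOptimizeAway` trick above in a benchmark (`std.datetime.stopwatch` assumed available; the measured loop is purely illustrative):

```d
import core.thread : getpid;
import core.stdc.stdio : printf;

// Same helper as above: getpid() == 1 never holds in practice, but the
// compiler cannot prove it, so the value (and the work that produced it)
// must be kept alive.
private void doNotOptimizeAway(T)(auto ref T t)
{
    if (getpid() == 1)
        printf("%s", *cast(char*) &t);
}

void main()
{
    import std.datetime.stopwatch : benchmark;
    import std.stdio : writeln;

    auto results = benchmark!({
        int sum;
        foreach (i; 0 .. 1_000_000)
            sum += i;
        doNotOptimizeAway(sum); // without this, the loop may be elided
    })(100);
    writeln(results[0]); // total time for 100 runs
}
```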
I vote against. Any performance-critical application will use ldc or gdc; dmd code is slow as a snail compared to LDC anyhow, so these few things don't help anyway. (OT: I wonder whether we've finally reached the point to make LDC the default compiler and simply put all dev efforts into one well-maintained backend, but that's a discussion for another day.)
Now that @WalterBright has been freed from being hand-tied by the dmd backend license, I wonder if we can convince him to either improve dmd's optimizer, or direct his efforts towards the ldc or gdc backends.

To be fair to Walter, dmd's optimizer is not bad when you're not hard-pressed to squeeze every last CPU cycle out of your program. Where it fails to deliver are (1) overly conservative inlining, leading to missed opportunities for the domino effect of one inlining enabling further optimizations, and (2) loop unrolling, which in many common cases also opens up further optimization opportunities down the road. Besides loop unrolling, dmd's loop optimizations are actually pretty decent. But in both cases, what ultimately causes dmd to produce suboptimal code is the many missed opportunities to make a locally small optimization that nevertheless would have led to bigger optimizations.

As long as this situation continues to obtain, I will continue to stick by my recommendation that performance-sensitive code (and by extension, performance measurements and optimization decisions) should be compiled with gdc/ldc, and dmd should only be used if performance is not critical and/or you absolutely need the latest bleeding-edge language features.
P.S. Not to mention that both gdc and ldc are able to target far more architectures than dmd, so if you're working with non-x86 architectures, dmd is not even a choice. And with our current manpower, I can't see wider arch support in dmd anytime in the foreseeable future (not to mention that it would be a huge waste of energy to duplicate what others have already done, energy that could be better spent elsewhere).
My recent performance tests with LDC's LTO and PGO point to the quality of the optimization work coming out of the LLVM community. By all accounts, the same is true of GCC as well. It certainly makes sense to target LDC/GDC for performance benchmarks. It'd be worth having more tests than just mine, but my tests suggest it might be worth targeting a common setup where LDC's LTO is a build option, setting up LTO to encompass druntime, phobos, and DUB dependencies in addition to the application code. The big win in such a setup might simply be better inlining across all the module boundaries, but this appears to be a significant win.
wilzbach
left a comment
OK with doc + encode change, but not okay with the foreach loop changes. As mentioned earlier, please only do such changes if you have looked at LDC's assembly (though the LDC people are pretty fast in fixing bugs/missed optimizations if you really identify one).
```diff
  {
      char[4] buf;
-     immutable size = encode!(UseReplacementDchar.yes)(buf, c);
+     immutable size = encode(buf, c);
```
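The behavioral difference implied by the diff, sketched: with `UseReplacementDchar.yes`, invalid input is encoded as U+FFFD, while plain `encode` throws a `UTFException` on bad data:

```d
import std.exception : assertThrown;
import std.utf : encode, UseReplacementDchar, UTFException;

void main()
{
    char[4] buf;
    dchar bad = cast(dchar) 0xD800; // lone surrogate: invalid UTF

    // old behavior: silently substitute the replacement character
    immutable size = encode!(UseReplacementDchar.yes)(buf, bad);
    assert(buf[0 .. size] == "\uFFFD");

    // new behavior: throw on bad data
    assertThrown!UTFException(encode(buf, bad));
}
```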
I saw there was some discussion at #6091 (comment) (for the future: it would be great if you could either re-use the PR or reference such discussions at the relevant place in the new PR, as it took me a while to find them and I almost missed them).
The annoyance that auto-decoding makes strings neither nothrow nor @nogc (though the undocumented -dip1008 for `@nogc` exceptions has been merged and might soon be enabled by default) should really make us put an end to the auto-decoding hell.
FWIW there has been DIP76 once:

> When the autodecoder encounters invalid UTF, it throws an exception. This proposal changes that so it replaces the invalid UTF with the Replacement Character.
I guess that one never went anywhere :/
Anyhow, I'm still not sure I understand the motivation for this change.
Also, it could have resulted in a breaking change, as it could have changed the function from nothrow to throwing. It doesn't, due to the nasty `handle_` having its attributes specified manually and thus being neither nothrow nor @nogc (std/stdio.d, line 2825 in afa7d58):

```d
@property _iobuf* handle_() @trusted { return cast(_iobuf*) file_._p.handle; }
```

but I would like to see such things pointed out by the submitter instead of me needing to dig them out myself :/
A case can also be made that the dchar version doesn't use the UseReplacementDchar version, but again, the one proposing the change should bring up the arguments for it.
Maybe it's time for us to make another big push to kill autodecoding for good. It's been almost universally hated, and while we did finally manage to convince Andrei last time round that it's actually a bad thing (after many years of interminable arguments), we didn't go far enough, IMO: we only got as far as introducing byCodeUnit et al., and trying to reduce reliance on autodecoding in various Phobos functions.
But it has been an ongoing source of performance concerns, and now nothrow and @nogc unfriendliness, so perhaps we can make a very strong case for getting rid of it forever. I, for one, plan to celebrate the day the last of autodecoding is finally killed off from Phobos code forever.
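For the record, the escape hatch that already exists: `byCodeUnit` wraps a string in a range that never auto-decodes, which keeps iteration `nothrow` and `@nogc` (a minimal sketch):

```d
import std.utf : byCodeUnit;

// Iterating code units never decodes, so there is no UTFException to
// throw and no GC allocation; the attributes below are accepted.
size_t countUnits(string s) nothrow @nogc @safe
{
    size_t n;
    foreach (c; s.byCodeUnit)
        ++n;
    return n;
}

void main()
{
    import std.stdio : writeln;
    writeln(countUnits("héllo")); // prints 6 (code units, not code points)
}
```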
@quickfur Did you see this comment #6073 (comment) ?
RCStr has been talked about since, what, a year ago or more? When is it ever going to materialize? What's the holdup?
@quickfur DIP1000. I stalled Walter's PRs for it because I believed it wasn't handling GC-ed objects correctly (and Martin agreed #5052 (comment)). There's been no work on DIP1000 since my comments on the matter so I think Walter dropped it for some reason.
Here's a list of what modules aren't compatible with DIP1000 as of now: https://github.com/dlang/phobos/projects/6
There has been some recent work on DIP1000 at #5915 and #6041. The first one stalled because DIP1000 uses a different mangling :/
While DIP1000 isn't perfect by any stretch, I would still welcome it as an improvement on the current situation.
std/stdio.d
Outdated
```d
// put each element in turn.
for (; !writeme.empty; writeme.popFront)
{
    put(writeme.front);
```
No. As pointed out in the discussion, LDC/GDC are smart enough.
The code should be improved to be more high-level and abstract, not the other way around.
```d
}
// put each element in turn.
alias Elem = Unqual!(ElementType!A);
```
That's a welcome change!
std/stdio.d
Outdated
| alias Elem = Unqual!(ElementType!A); | ||
| foreach (Elem c; writeme) | ||
| // put each element in turn. | ||
| for (; !writeme.empty; writeme.popFront) |
No. As pointed out in the discussion, LDC/GDC are smart enough.
The code should be improved to be more high-level and abstract, not the other way around.
Have a look at what LDC is doing: https://godbolt.org/g/HuhD6T
It even inlines put completely:
```asm
void example.put(immutable(char)[]):
        mov     rcx, qword ptr [rip + int example.put(immutable(char)[]).__foreachbody2(ref dchar)@GOTPCREL]
        xor     edx, edx
        jmp     _aApplycd1@PLT

int example.put(immutable(char)[]).__foreachbody2(ref dchar):
        push    r14
        push    rbx
        push    rax
        mov     edi, dword ptr [rsi]
        cmp     edi, 127
        ja      .LBB1_2
        xor     esi, esi
        call    fputc_unlocked@PLT
        jmp     .LBB1_6
.LBB1_2:
        lea     rsi, [rsp + 4]
        call    _D3std3utf__T6encodeVEQu8typecons__T4FlagVAyaa19_7573655265706c6163656d656e744463686172ZQCai0ZQDdFNaNfJG4awZm@PLT
        mov     r14, rax
        test    r14, r14
        je      .LBB1_6
        xor     ebx, ebx
.LBB1_4:
        cmp     rbx, 3
        ja      .LBB1_7
        movzx   edi, byte ptr [rsp + rbx + 4]
        xor     esi, esi
        call    fputc_unlocked@PLT
        inc     rbx
        cmp     rbx, r14
        jb      .LBB1_4
.LBB1_6:
        xor     eax, eax
        add     rsp, 8
        pop     rbx
        pop     r14
        ret
.LBB1_7:
        lea     rsi, [rip + .L.str]
        mov     edi, 63
        mov     edx, 39
        call    _d_arraybounds@PLT
```
Before too much time is invested in this particular matter, is this even a case where performance matters? This is code that deals with IO and libc, so the for loops here (no matter how they're optimized) are not likely to be a bottleneck, are they?
@wilzbach I've stated my (long) disagreements with dip1008 in the NG. It's a change we will come to regret. Technically, I'm not introducing any new throwing code. In any case, I'll remove the foreach changes.
Force-pushed 5795762 to 4f6b6fd (Compare)
* Removed code that caused wstrings to decode to dchars, and then encoded to chars
* Made the wchar path throw a UTF exception on bad data
* Improved the docs for LockingTextWriter
Force-pushed 4f6b6fd to 88c1e4f (Compare)
This removes code that caused `wstring`s to decode to `dchar`s, and then encoded them to `char`s. In the case of writing a `wstring` to a wide file, `put` no longer has to do any encoding or decoding at all, whereas it did a bunch of extra work before. The `wchar` path now throws a UTF exception on bad data.

CC @CyberShadow