Serialize incr comp structures to file via fixed-size buffer #80463
Conversation
r? @oli-obk (rust-highfive has picked a reviewer for you, use r? to override)
@rustbot label T-compiler A-incr-comp I-compiletime I-compilemem
Btw, if #80115 merges and this PR is rebased on it and adapted for it, that should also reduce the impact on instruction count.
Force-pushed from 1dd1955 to d2eb5f5.
I would be curious to see how much this helps with #79103.
@bors try @rust-timer queue
Awaiting bors try build completion.
⌛ Trying commit d2eb5f51983c8d7e73d8a14a8e25bc3e540268b9 with merge 9afe98c1ed87fbe0ee26f93e23a265c09372baf6...
☀️ Try build successful - checks-actions
Queued 9afe98c1ed87fbe0ee26f93e23a265c09372baf6 with parent d75f48e, future comparison URL. @rustbot label: +S-waiting-on-perf
Finished benchmarking try commit (9afe98c1ed87fbe0ee26f93e23a265c09372baf6): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying `@bors rollup-`. Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never
The results are more extreme than what I see in local benchmarks. E.g. the 20% reduction I mentioned in the summary turned into a 30% reduction, but the instruction counts are worse. I'll investigate.
Force-pushed from d2eb5f5 to 27dadaf.
I was hoping to reproduce the above perf results locally, but it's too much of a chore to figure out what's causing the difference. It's not that important anyway, provided it's not actually a correctness issue. I got some assurance that it isn't UB that's somehow causing the perf difference by running some tests locally under Miri.

I pushed a couple of changes: a fix to an existing bug in the signed LEB128 decoding (it had no perf impact, and may not have been encountered in practice), and an increase to the buffer size.

Regarding performance, I'd like to see how #80115 changes things before trying anything else. Though if anyone can see something that will make buffered file writing more competitive with the previous implementation, I'd gladly take a hint.
Have you considered using mmap directly? That would need some unsafe code, but it shouldn't require an intermediate copy at all.
I hadn't. I have some reservations, but it's worth experimenting with.
I implemented an mmap-based version to try it out.

The numbers weren't very encouraging -- up to 7% worse in instruction count. The increase is probably due to some complexity in handling data that ends up straddling the boundary between one mapping and the next.

I'm sure it could result in an improvement if optimized, but I just don't want to go down that path, personally.
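To make the straddling problem concrete, here is a minimal sketch of what such an mmap-backed encoder might look like. This is not the PR's actual experiment: the names (`MmapEncoder`, `MAP_LEN`, `map_next_window`) are made up, it assumes a Unix target and the `libc` crate, and it omits cleanup (unmapping on drop, truncating the file to its final length).

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

const MAP_LEN: usize = 1 << 20; // size of each mapping window (a page multiple)

struct MmapEncoder {
    file: File,
    map: *mut u8, // current mapping window
    map_off: u64, // file offset at which the window starts
    pos: usize,   // write position within the window
}

impl MmapEncoder {
    fn new(file: File) -> io::Result<Self> {
        let mut enc = MmapEncoder { file, map: std::ptr::null_mut(), map_off: 0, pos: 0 };
        enc.map_next_window()?;
        Ok(enc)
    }

    /// Unmap the current window (if any) and map the next one.
    fn map_next_window(&mut self) -> io::Result<()> {
        if !self.map.is_null() {
            unsafe { libc::munmap(self.map.cast(), MAP_LEN) };
        }
        // The file must be extended to back the new window.
        self.file.set_len(self.map_off + MAP_LEN as u64)?;
        let ptr = unsafe {
            libc::mmap(
                std::ptr::null_mut(),
                MAP_LEN,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED,
                self.file.as_raw_fd(),
                self.map_off as libc::off_t,
            )
        };
        if ptr == libc::MAP_FAILED {
            return Err(io::Error::last_os_error());
        }
        self.map = ptr.cast();
        Ok(())
    }

    /// Copy `data` into the file via the mapping. Writes that straddle
    /// the end of a window must be split -- the extra complexity
    /// mentioned above.
    fn write(&mut self, mut data: &[u8]) -> io::Result<()> {
        while !data.is_empty() {
            let n = data.len().min(MAP_LEN - self.pos);
            unsafe {
                std::ptr::copy_nonoverlapping(data.as_ptr(), self.map.add(self.pos), n);
            }
            self.pos += n;
            data = &data[n..];
            if self.pos == MAP_LEN {
                self.map_off += MAP_LEN as u64;
                self.pos = 0;
                self.map_next_window()?;
            }
        }
        Ok(())
    }
}
```

Every write has to test for the window boundary, and a value that would have been a single copy into a `Vec` can become two copies plus a remap, which could plausibly account for the instruction count increase.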
Force-pushed from 7de6bd2 to 44be53b.
@bors try @rust-timer queue
Awaiting bors try build completion.
⌛ Trying commit cce99b3bdd4c695026fda94fea711e5f0d7c7ee5 with merge 007744ee92eae311ea94065ead37aa138ea30e30...
☀️ Try build successful - checks-actions
Queued 007744ee92eae311ea94065ead37aa138ea30e30 with parent 7cf2056, future comparison URL. @rustbot label: +S-waiting-on-perf
Finished benchmarking try commit (007744ee92eae311ea94065ead37aa138ea30e30): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying `@bors rollup-`. Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never
Perf results confirm that a larger buffer doesn't help.
I'm looking into the overhead of error handling. From the assembly, it seems like there's unnecessary overhead in the `Result` handling: on first glance, it looks like creation of the `Result` is less efficient than it could be.

Here is the assembly for one of the emit functions:

```asm
push %rbp
push %rbx
push %rax
mov %esi,%ebp
mov 0x8(%rdi),%rbx
mov 0x10(%rbx),%rax
lea 0x5(%rax),%rcx
cmp 0x8(%rbx),%rcx
/-------- ja
/--|-------> mov (%rbx),%rcx
| | add %rax,%rcx
| | mov $0x1,%edx
| | cmp $0x80,%ebp
| | /----- jb
| | | mov %ebp,%esi
| | | /-> mov %esi,%edi
| | | | or $0x80,%bpl
| | | | mov %bpl,(%rcx)
| | | | shr $0x7,%esi
| | | | add $0x1,%rcx
| | | | add $0x1,%rdx
| | | | mov %esi,%ebp
| | | | cmp $0x3fff,%edi
| | | \-- ja
| | | mov %esi,%ebp
| | \----> mov %bpl,(%rcx)
| | add %rax,%rdx
| | mov %rdx,0x10(%rbx)
| | mov $0x3,%al // Start of Result creation in Ok path
| | /----> shld $0x8,%rcx,%rdx // Continuation of Result creation in Ok and Err paths
| | | shl $0x8,%rcx
| | | movzbl %al,%eax
| | | or %rcx,%rax
| | | add $0x8,%rsp
| | | pop %rbx
| | | pop %rbp
| | | retq
| \--|----> mov %rbx,%rdi
| | callq *0x168146d(%rip) # <<rustc_serialize::opaque::FileEncoder>::flush@@Base+0xe7b158>
| | cmp $0x3,%al
| | /-- jne
| | | xor %eax,%eax
\-----|--|-- jmp
| \-> mov %rax,%rcx // Start of Result creation in Err path (when flush() fails)
| shrd $0x8,%rdx,%rcx
| shr $0x8,%rdx
\----- jmp
```

The result is a `Result` that gets packed into registers for the return.

The formation of the `Ok` `Result`:

```asm
| | mov $0x3,%al // Set Ok variant, I assume.
| | /----> shld $0x8,%rcx,%rdx // Seems irrelevant to Ok Result.
| | | shl $0x8,%rcx // Seems irrelevant to Ok Result.
| | | movzbl %al,%eax // eax = Ok variant, zero-extended.
| | | or %rcx,%rax // rax = Ok variant plus seemingly irrelevant stuff in rcx.
```

I think the above could be replaced with something much simpler that just sets the `Ok` variant.

The formation of the `Err` `Result` (when `flush()` fails):

```asm
// ...just called flush, and found that result in rax was Err...
| \-> mov %rax,%rcx
| shrd $0x8,%rdx,%rcx
| shr $0x8,%rdx
\----- jmp
// ...jumps to the shld from previously quoted block...
| | /----> shld $0x8,%rcx,%rdx
| | | shl $0x8,%rcx
| | | movzbl %al,%eax
| | | or %rcx,%rax
```

IIUC, this sequence doesn't change the value at all: the shifts in the `Err` path are exactly undone by the shared tail, so the `Err` returned by `flush()` passes through unmodified.

If all the above is correct, fixing this would remove 4 instructions from the hot path, and 7 from the cold. That alone would help, since the hot code is really hot, but it might also unlock other optimizations.

Anyway, I'm hoping my reasoning is right, and I'll continue looking into this.
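For anyone who wants to poke at this codegen, it can be reproduced with a small standalone type of the same shape: a buffered write path returning `io::Result<()>` with an out-of-line flush. This is a minimal sketch, not the actual `rustc_serialize` code; `BufEncoder` and its fields are hypothetical names:

```rust
use std::io::{self, Write};

pub struct BufEncoder<W: Write> {
    buf: Vec<u8>, // treated as fixed-capacity storage
    sink: W,
}

impl<W: Write> BufEncoder<W> {
    pub fn new(sink: W, cap: usize) -> Self {
        BufEncoder { buf: Vec::with_capacity(cap), sink }
    }

    /// Hot path: one capacity check, then an infallible byte write.
    /// A `Result` only has to be created here and in `flush_buf`.
    #[inline]
    pub fn emit_byte(&mut self, b: u8) -> io::Result<()> {
        if self.buf.len() == self.buf.capacity() {
            self.flush_buf()?; // the only point where an `Err` can arise
        }
        self.buf.push(b);
        Ok(())
    }

    /// Cold path: drain the buffer to the underlying writer.
    #[cold]
    fn flush_buf(&mut self) -> io::Result<()> {
        self.sink.write_all(&self.buf)?;
        self.buf.clear();
        Ok(())
    }
}
```

Building this with `rustc -O --emit=asm` makes it easy to inspect how the `Result` is packed into registers at the function exit, and whether the redundant shuffling described above appears.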
Should we merge this PR for the memory improvement, and leave the perf side for another PR?
I'm fine with that, but if @tgnottingham still has motivation to munch away at the perf part, I'm not going to stop them 😆
Reduce a large memory spike that happens during serialization by writing the incr comp structures to file by way of a fixed-size buffer, rather than an unbounded vector.

Effort was made to keep the instruction count close to that of the previous implementation. However, buffered writing to a file inherently has more overhead than writing to a vector, because each write may result in a handleable error. To reduce this overhead, arrangements are made so that each LEB128-encoded integer can be written to the buffer with only one capacity and error check. Higher-level optimizations in which entire composite structures can be written with one capacity and error check are possible, but would require much more work.

The performance is mostly on par with the previous implementation, with small to moderate instruction count regressions. The memory reduction is significant, however, so it seems like a worthwhile trade-off.
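As a sketch of the "one capacity and error check" arrangement the description refers to (illustrative code, not the PR's; the names are made up): reserve the worst-case LEB128 length up front, then encode with no further checks. A `u32` needs at most ceil(32/7) = 5 bytes, which matches the `lea 0x5(%rax)` reservation visible in the assembly quoted earlier:

```rust
const MAX_U32_LEB128_LEN: usize = 5; // ceil(32 / 7)

/// Encode `value` as unsigned LEB128 into `buf` at `*used`.
/// Returns false if the worst case might not fit, in which case the
/// caller would flush the buffer to the file and retry.
fn write_u32_leb128(buf: &mut [u8], used: &mut usize, mut value: u32) -> bool {
    // The single capacity check: cover the worst case once, so the
    // encoding loop below needs no per-byte checks.
    if buf.len() - *used < MAX_U32_LEB128_LEN {
        return false;
    }
    loop {
        if value < 0x80 {
            buf[*used] = value as u8;
            *used += 1;
            return true;
        }
        buf[*used] = (value as u8 & 0x7f) | 0x80;
        *used += 1;
        value >>= 7;
    }
}
```

In safe Rust the indexing still carries bounds checks; the single up-front check is the only one that can actually fail, and a real implementation could use raw pointer writes to eliminate the rest.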
The signed LEB128 decoding function used a hardcoded constant of 64 instead of the number of bits in the type of integer being decoded, which resulted in incorrect results for some inputs. Fix this, make the decoding more consistent with the unsigned version, and increase the LEB128 encoding and decoding test coverage.
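To illustrate the bug this commit message describes, here is a sketch of signed LEB128 decoding specialized to `i16` for concreteness (illustrative; the real code is generic over the integer types). The sign-extension test must use the bit width of the type being decoded; with a hardcoded 64, a full-width `i16` encoding (3 bytes, `shift` = 21) would wrongly take the sign-extension branch and shift by more than the type's width:

```rust
/// Decode a signed LEB128 value into an i16. Assumes well-formed input.
fn read_i16_leb128(data: &[u8], pos: &mut usize) -> i16 {
    let mut result: i16 = 0;
    let mut shift: u32 = 0;
    loop {
        let byte = data[*pos];
        *pos += 1;
        result |= ((byte & 0x7f) as i16) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            // Sign-extend when the encoding stopped short of the full
            // width and the sign bit (0x40) of the last byte is set.
            // The bound must be the type's bit width (16 here); a
            // hardcoded 64 would take this branch even after a
            // full-width encoding and shift by 16 or more bits.
            if shift < i16::BITS && (byte & 0x40) != 0 {
                result |= !0 << shift;
            }
            return result;
        }
    }
}
```

For example, the one-byte encoding `[0x7f]` decodes to -1: the low seven bits are all ones, and the set sign bit triggers the extension.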
Force-pushed from cce99b3 to f15fae8.
Rebased and returned to using the default buffer size.
I'd love that. :) I'll create an issue for the `Result` codegen problem.
@bors r=oli-obk rollup=never
📌 Commit f15fae8 has been approved by `oli-obk`
☀️ Test successful - checks-actions
Created #81146 for the `Result` codegen problem.