Direct copy specialization #65
Conversation
`StorageBuffer`, and make `GpuArrayBuffer` use it. `EncasedBufferVec` is like `BufferVec`, but it doesn't require the type to be `Pod`. Alternatively, it's like `StorageBuffer<Vec<T>>`, except it doesn't allow CPU access to the data after it's been pushed. `GpuArrayBuffer` already doesn't allow CPU access to the data, so switching it to use `EncasedBufferVec` doesn't regress any functionality and offers higher performance.

Shutting off CPU access eliminates the need to copy to a scratch buffer, which results in significantly higher performance. *Note that this needs teoxoy/encase#65 from @james7132 to achieve end-to-end performance benefits*, because without that patch `encase` is rather slow at encoding data, swamping the benefits of avoiding the copy. With that patch applied, and `#[inline]` added to `encase`'s `derive` implementation of `write_into` on structs, this results in a *16% overall speedup on `many_cubes --no-frustum-culling`*.

I've verified that the generated code is now close to optimal. The only reasonable potential improvement I see is to eliminate the zeroing in `push`. That requires unsafe code, however, so I'd prefer to leave it to a follow-up.
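A rough sketch of the push-time encoding described above (the `EncodedVec` shape, its field names, and the use of `encase`'s internal `Writer` are illustrative, not Bevy's exact `EncasedBufferVec`): each pushed value is encoded straight into the byte vector that will be uploaded, so no CPU-side `Vec<T>` is retained and no scratch-buffer copy is needed.

```rust
use encase::{
    internal::{WriteInto, Writer},
    ShaderType,
};

// Illustrative container: holds only the encoded bytes, never a Vec<T>.
struct EncodedVec<T> {
    bytes: Vec<u8>, // encoded element data, ready to hand to the GPU queue
    len: usize,     // number of elements encoded so far
    _marker: std::marker::PhantomData<T>,
}

impl<T: ShaderType + WriteInto> EncodedVec<T> {
    /// Encodes `value` at the end of the byte buffer and returns its index.
    fn push(&mut self, value: T) -> usize {
        let element_size = u64::from(T::min_size()) as usize;
        let offset = self.bytes.len();
        // Zero-extend to make room for one element; the follow-up mentioned
        // above would use unsafe code to avoid this zeroing.
        self.bytes.resize(offset + element_size, 0);
        // Slice the destination once up front so later bounds checks can be
        // elided, then let encase encode the value into it.
        let mut dest = &mut self.bytes[offset..offset + element_size];
        let mut writer = Writer::new(&value, &mut dest, 0)
            .expect("slice is exactly one element long");
        value.write_into(&mut writer);
        self.len += 1;
        self.len - 1
    }
}
```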
2901ec1 to 4ee8377 (Compare)
I think this is about as clean/low-unsafe as I can get this without negatively impacting the performance again. @teoxoy this should be ready for review now.
@james7132 I pushed a commit to the PR; could you give it a look?
I actually tried to deduplicate code like this, but apparently the optimizer doesn't like implicit if-returns as replacements for if/elses. I'll check the assembly output of this later.
I think the Pod changes are fine, but the compiler really doesn't like the cleanup in how the branching is handled: james7132/encase_asm_tests@f936005#diff-ce2d3c1fb388fe83071754963a5343392f61ff8b7d7465aa6d2ba68dfe31993b. This shows a regression in codegen for Vec-like writes. Though I think this comes down to the fact that it cannot inline the reservation, and thus its assertion that there's enough space does not carry over, unlike the slice writes.
Ah, I think I see why: it views the non-memcpy writes/reads as potential side effects, and thus cannot treat the return as a replacement for an else on the if.
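To make the inlining/bounds-check point above concrete, here is a small illustrative example (not `encase` code): when the "reservation" of the destination is inlined, the compiler sees the slice length up front and can drop the per-write checks; when it sits behind a call that isn't inlined, each write keeps its own check.

```rust
// Illustrative only. Slicing the destination to a known length up front is
// the "assertion that there's enough space"; once it is visible to the
// optimizer, the ranged copies below need no further bounds checks.
fn write_vec4(dst: &mut [u8], v: [f32; 4]) {
    // Single bounds check (panics here if dst is too short).
    let dst = &mut dst[..16];
    for (i, x) in v.iter().enumerate() {
        // With dst.len() known to be 16 and i < 4, these checks can be elided.
        dst[i * 4..(i + 1) * 4].copy_from_slice(&x.to_le_bytes());
    }
}
```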
@james7132 I pushed another commit to address the issue. Let me know what you think!
Doesn't seem to make a tangible difference: james7132/encase_asm_tests@4399891. Might be due to one of the
Mostly resolves the issues found in #53. This PR specializes writes, reads, and creates of types by directly using `slice::copy_from_slice` on them if compiling for a little-endian target, and only if the type has no internal padding. Checking the generated assembly, this eliminates a huge number of branches and even starts using vectorized copies and calls to `memcpy` directly for larger types.
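A hedged sketch of that specialization (the `NoPadding` marker and `write_direct` are illustrative names, not `encase`'s internals): on a little-endian target, a type with no internal padding already has the byte layout the shader expects, so a write can collapse into a single slice copy that the backend lowers to `memcpy`.

```rust
use std::{mem, slice};

/// Illustrative marker: the implementor promises it contains no padding
/// bytes, so every byte of its in-memory representation is initialized.
unsafe trait NoPadding: Copy {}

#[derive(Clone, Copy)]
#[repr(C)]
struct Vec4 {
    x: f32,
    y: f32,
    z: f32,
    w: f32,
}

// Four consecutive f32 fields under #[repr(C)]: no padding, so the marker holds.
unsafe impl NoPadding for Vec4 {}

// Direct-copy write path, valid only when the in-memory byte order already
// matches the byte order the shader expects (i.e. little-endian targets).
#[cfg(target_endian = "little")]
fn write_direct<T: NoPadding>(dst: &mut [u8], value: &T) {
    let bytes = unsafe {
        // Sound for NoPadding types: no uninitialized padding bytes are read.
        slice::from_raw_parts((value as *const T).cast::<u8>(), mem::size_of::<T>())
    };
    // One bounds check and one bulk copy instead of a branchy per-field encode.
    dst[..bytes.len()].copy_from_slice(bytes);
}
```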