Optimize memcpy, memmove and memset #405
Would it be possible to use the align_offset method here to avoid bit-tricks?
From align_offset's doc:
So that rules out the use of align_offset. Also note that this is a very hot path, and I would prefer simple bit tricks rather than relying on LLVM to optimise away a very complex function.
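Roughly, the bit-trick in question boils down to a single mask on the address. A minimal sketch (hypothetical names, assuming the word size is a power of two; not the exact code from this PR):

```rust
use core::mem;

const WORD_SIZE: usize = mem::size_of::<usize>();

/// Bytes from `ptr` to the next word boundary, computed with a mask
/// instead of `align_offset`. Relies on WORD_SIZE being a power of two.
fn bytes_until_aligned(ptr: *const u8) -> usize {
    (ptr as usize).wrapping_neg() & (WORD_SIZE - 1)
}
```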
Could you try benchmarking to see if there is overhead?
It's not related to performance; it simply can't be used, for correctness reasons.
Out of curiosity, have you tested simply copying using ptr::read_unaligned and ptr::write_unaligned? That way alignment wouldn't be an issue, but I don't know the performance impact that would have on various platforms.
Well, first of all, ptr::read_unaligned and ptr::write_unaligned call ptr::copy_nonoverlapping, which translates to memcpy, so I would like to avoid the possibility of a recursion if LLVM doesn't optimise them away.

Secondly, ptr::read_unaligned has really poor performance. On platforms without misaligned memory access support it will translate to size_of::<usize>() byte loads.

This branch is necessary because we don't want to bear the burden of all the shifts and checks needed for misaligned loads if dest and src are perfectly co-aligned.
Have you tried this out and seen infinite recursion? Have you tried it out and seen if it's slower?
There'll be an infinite recursion if compiled in debug mode.
On architectures that do not support misaligned loads (so most ISAs other than armv8 and x86/x86_64) the performance is much slower, because it generates eight byte loads and 16 bit ops rather than one word load and three bit ops.
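For a concrete picture of the "one word load and three bit ops" claim, here is a sketch of a shift-based misaligned inner loop (hypothetical code, little-endian only, first/last-word edge handling elided), as opposed to letting read_unaligned fall back to byte loads:

```rust
use core::mem;

const WORD_SIZE: usize = mem::size_of::<usize>();
const WORD_BITS: usize = WORD_SIZE * 8;

/// Copy `words` machine words to word-aligned `dest` from misaligned `src`.
/// Each iteration does one aligned load plus two shifts and an OR, reusing
/// the previously loaded word (little-endian; the edge handling a real
/// implementation needs is elided here).
unsafe fn copy_words_misaligned(dest: *mut usize, src: *const u8, words: usize) {
    let offset = (src as usize) % WORD_SIZE; // misalignment in bytes
    debug_assert!(offset != 0, "co-aligned case takes the plain word loop");
    let shift = offset * 8;                  // misalignment in bits
    let aligned = src.sub(offset) as *const usize;
    let mut prev = *aligned;                 // word containing src's first bytes
    for i in 0..words {
        let next = *aligned.add(i + 1);
        // Reassemble the misaligned source word from two aligned words.
        *dest.add(i) = (prev >> shift) | (next << (WORD_BITS - shift));
        prev = next;
    }
}
```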