-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zcommon: Refactor FPU state handling in fletcher4 #14600
zcommon: Refactor FPU state handling in fletcher4 #14600
Conversation
Not sure I did the right thing with the abi file, the diff seems pretty huge. But at least checkstyle is happy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me, but we have a minor optimization opportunity that we probably should exploit as part of this. See my comment on the struct.
@@ -126,6 +126,7 @@ typedef struct fletcher_4_func { | |||
fletcher_4_fini_f fini_byteswap; | |||
fletcher_4_compute_f compute_byteswap; | |||
boolean_t (*valid)(void); | |||
boolean_t uses_fpu; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurs to me that we could mark this structure as requiring 64-byte alignment so that subsequent accesses are L1 cache accesses.
As a more general change, we could also rename ->init_native
and ->fini_native
to ->init
and ->fini
respectively and drop the separate init_byteswap()
/fini_byteswap()
functions, since we only need to have different compute functions to support different endianness. That should be a separate commit, but it is entirely optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I thought about that, it would squeeze the size of the struct below 64 bytes again, currently it is 72 bytes. Let me have a more thorough look tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fitting the entire structure in a cacheline does not matter that much here because the name field is not accessed during normal operation, so just fitting everything but that is fine. I just thought it would be nice to simplify this code because the extra init and fini functions make it slightly harder to understand.
Aligning the structure to a cacheline on the other hand should avoid another memory access whenever the fields needed during normal operation would have spanned two cachelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove init_byteswap()/fini_byteswap()
in a follow up PR. Just wondering why they were implemented in the first place.
ops->compute_byteswap(&ctx, buf, size); | ||
if (ops->uses_fpu == B_TRUE) { | ||
kfpu_end(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kfpu_begin()
should be called before init_byteswap()
and kfpu_end()
should be called after fini_byteswap()
. That way the init and fini functions are allowed to use vector instructions. A future patch to compile GNU C Vector code likely will make use of this.
The same applies to the native versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention was to minimize the time we're running with interrupts and preemption disabled. So this is a trade-off between ease of future changes and an actual advantage. I'm fine either way, let's see what other reviewers think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
59493b6 already made a decent reduction in the total time that we spend with interrupts disabled not that long ago by not doing context save/restore multiple times per ABD. At the same time, it did kfpu_begin()
before the rest of the init function and kfpu_end()
after the rest of the fini function, which passed review.
An extra microsecond or two with interrupts disabled will not matter in the slightest since the systems we support do not give real time guarantees. If we were running on a system that tried to give real time guarantees and it were so pedantic to give guarantees that were accurate to a level where a microsecond or two difference would matter, we probably would not be allowed to run with interrupts disabled at all, since disabling interrupts to use SIMD could cause the real time guarantee to be violated. I am not sure if even real-time Linux is that strict.
Anyway, what is important here is not minimizing time with interrupts disabled, but minimizing the total time spent doing processing. If the interrupts are disabled slightly longer than absolutely necessary to reduce the total CPU time spent, then that is preferable. In this case, doing it this way is mostly to prepare for a future small reduction in CPU time. Since we know that there is a possibility of that here, I would prefer us to do it that way so that we do not need another patch. We really are not gaining anything from an extra microsecond or two with interrupts enabled per checksum operation, so there is no need to be frugal on that level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's sound reasoning, will change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
You can grab an updated ABI file here. It should only include updates for the interfaces which changed. https://github.com/openzfs/zfs/actions/runs/4377972107 |
@behlendorf Yeah, I did, still $ git diff HEAD~1 HEAD lib/libzfs/libzfs.abi| wc -l
9512 |
I wonder if perhaps the last time the ABI was generated it was under ubuntu 20.04. We've moved to ubuntu 22.04 for the checkstyle builder now and we've certainly seen updating the environment and version of libabigail vastly changed the exact output. |
Perhaps the commit updating the ABI should be a separate commit prior to this one, so that this commit stays small. |
@behlendorf we use ghcr.io/openzfs/libabigail image, so libabigail should be the same, but I'm not sure what can affect it in this case. Ah, maybe the fact that we're compiling in different release now with different deps. |
That seems to be the case, at least it would explain e.g 'libcrypto.so.1.1 -> libcrypto.so.3. It would also explain that the tar file contained all .abi files. So, what's the way to proceed? |
Give me 1 hour, I'll generate ABI files from pure master and make a PR to get meaningful diffs later in this patch. Done: #14605 |
Thanks, no reason to hurry, take your time. EDIT: Actually you were faster in fixing than me in replying. |
Merged #14605 |
34d4a59
to
bcab66d
Compare
Addressed review feedback, added ABI file and updated PR description to indicate an ABI change. |
bcab66d
to
fff528d
Compare
One nit about the title. Instead of writing |
Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in openzfs#14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS. To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites. Signed-off-by: Attila Fülöp <attila@fueloep.org>
fff528d
to
5d45089
Compare
Agreed, changed the commit message and the PR title. |
Ahhh hah, so close :)
Since |
Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in openzfs#14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS. To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes openzfs#14600
Motivation and Context
Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in #14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS.
Please see also discussion #14574.
Description
To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites.
How Has This Been Tested?
Parallel runs of 2 x
fio
and amprime -t
for 30 minutes.mprime -t
usually fails almost immediately in case of FPU save and restore problems.Types of changes
Checklist:
Signed-off-by
.