Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zcommon: Refactor FPU state handling in fletcher4 #14600

Merged
merged 1 commit into from
Mar 14, 2023

Conversation

AttilaFueloep
Copy link
Contributor

@AttilaFueloep AttilaFueloep commented Mar 9, 2023

Motivation and Context

Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in #14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS.

Please see also discussion #14574.

Description

To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites.

How Has This Been Tested?

Parallel runs of 2 x fio and a mprime -t for 30 minutes. mprime -t usually fails almost immediately in case of FPU save and restore problems.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@AttilaFueloep AttilaFueloep mentioned this pull request Mar 9, 2023
13 tasks
@AttilaFueloep
Copy link
Contributor Author

AttilaFueloep commented Mar 9, 2023

Not sure I did the right thing with the abi file, the diff seems pretty huge. But at least checkstyle is happy.

Copy link
Contributor

@ryao ryao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, but we have a minor optimization opportunity that we probably should exploit as part of this. See my comment on the struct.

@@ -126,6 +126,7 @@ typedef struct fletcher_4_func {
fletcher_4_fini_f fini_byteswap;
fletcher_4_compute_f compute_byteswap;
boolean_t (*valid)(void);
boolean_t uses_fpu;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that we could mark this structure as requiring 64-byte alignment so that subsequent accesses are L1 cache accesses.

As a more general change, we could also rename ->init_native and ->fini_native to ->init and ->fini respectively and drop the separate init_byteswap()/fini_byteswap() functions, since we only need to have different compute functions to support different endianness. That should be a separate commit, but it is entirely optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about that, it would squeeze the size of the struct below 64 bytes again, currently it is 72 bytes. Let me have a more thorough look tomorrow.

Copy link
Contributor

@ryao ryao Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fitting the entire structure in a cacheline does not matter that much here because the name field is not accessed during normal operation, so just fitting everything but that is fine. I just thought it would be nice to simplify this code because the extra init and fini functions make it slightly harder to understand.

Aligning the structure to a cacheline on the other hand should avoid another memory access whenever the fields needed during normal operation would have spanned two cachelines.

Copy link
Contributor Author

@AttilaFueloep AttilaFueloep Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove init_byteswap()/fini_byteswap() in a follow up PR. Just wondering why they were implemented in the first place.

ops->compute_byteswap(&ctx, buf, size);
if (ops->uses_fpu == B_TRUE) {
kfpu_end();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kfpu_begin() should be called before init_byteswap() and kfpu_end() should be called after fini_byteswap(). That way the init and fini functions are allowed to use vector instructions. A future patch to compile GNU C Vector code likely will make use of this.

The same applies to the native versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to minimize the time we're running with interrupts and preemption disabled. So this is a trade-off between ease of future changes and an actual advantage. I'm fine either way, let's see what other reviewers think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

59493b6 already made a decent reduction in the total time that we spend with interrupts disabled not that long ago by not doing context save/restore multiple times per ABD. At the same time, it did kfpu_begin() before the rest of the init function and kfpu_end() after the rest of the fini function, which passed review.

An extra microsecond or two with interrupts disabled will not matter in the slightest since the systems we support do not give real time guarantees. If we were running on a system that tried to give real time guarantees and it were so pedantic to give guarantees that were accurate to a level where a microsecond or two difference would matter, we probably would not be allowed to run with interrupts disabled at all, since disabling interrupts to use SIMD could cause the real time guarantee to be violated. I am not sure if even real-time Linux is that strict.

Anyway, what is important here is not minimizing time with interrupts disabled, but minimizing the total time spent doing processing. If the interrupts are disabled slightly longer than absolutely necessary to reduce the total CPU time spent, then that is preferable. In this case, doing it this way is mostly to prepare for a future small reduction in CPU time. Since we know that there is a possibility of that here, I would prefer us to do it that way so that we do not need another patch. We really are not gaining anything from an extra microsecond or two with interrupts enabled per checksum operation, so there is no need to be frugal on that level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sound reasoning, will change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@lundman lundman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@behlendorf
Copy link
Contributor

Not sure I did the right thing with the abi file, the diff seems pretty huge. But at least checkstyle is happy.

You can grab an updated ABI file here. It should only include updates for the interfaces which changed.

https://github.com/openzfs/zfs/actions/runs/4377972107

@AttilaFueloep
Copy link
Contributor Author

@behlendorf Yeah, I did, still

$ git diff HEAD~1 HEAD lib/libzfs/libzfs.abi| wc -l
9512

@behlendorf
Copy link
Contributor

I wonder if perhaps the last time the ABI was generated it was under ubuntu 20.04. We've moved to ubuntu 22.04 for the checkstyle builder now and we've certainly seen updating the environment and version of libabigail vastly changed the exact output.

@ryao
Copy link
Contributor

ryao commented Mar 10, 2023

I wonder if perhaps the last time the ABI was generated it was under ubuntu 20.04. We've moved to ubuntu 22.04 for the checkstyle builder now and we've certainly seen updating the environment and version of libabigail vastly changed the exact output.

Perhaps the commit updating the ABI should be a separate commit prior to this one, so that this commit stays small.

@gmelikov
Copy link
Member

@behlendorf we use ghcr.io/openzfs/libabigail image, so libabigail should be the same, but I'm not sure what can affect it in this case. Ah, maybe the fact that we're compiling in different release now with different deps.

@AttilaFueloep
Copy link
Contributor Author

Ah, maybe the fact that we're compiling in different release now with different deps.

That seems to be the case, at least it would explain e.g 'libcrypto.so.1.1 -> libcrypto.so.3. It would also explain that the tar file contained all .abi files.

So, what's the way to proceed?

@gmelikov
Copy link
Member

gmelikov commented Mar 10, 2023

Give me 1 hour, I'll generate ABI files from pure master and make a PR to get meaningful diffs later in this patch.

Done: #14605

@AttilaFueloep
Copy link
Contributor Author

AttilaFueloep commented Mar 10, 2023

Thanks, no reason to hurry, take your time.

EDIT: Actually you were faster in fixing than me in replying.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Mar 11, 2023
@behlendorf
Copy link
Contributor

Merged #14605

@AttilaFueloep AttilaFueloep force-pushed the zfs-refactor-fletcher branch from 34d4a59 to bcab66d Compare March 11, 2023 03:08
@AttilaFueloep
Copy link
Contributor Author

Addressed review feedback, added ABI file and updated PR description to indicate an ABI change.

@AttilaFueloep AttilaFueloep force-pushed the zfs-refactor-fletcher branch from bcab66d to fff528d Compare March 11, 2023 18:40
@ryao
Copy link
Contributor

ryao commented Mar 12, 2023

One nit about the title. Instead of writing module/zcommon: ..., it would be more consistent to write zcommon: ....

Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in openzfs#14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.

To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.

Signed-off-by: Attila Fülöp <attila@fueloep.org>
@AttilaFueloep AttilaFueloep changed the title module/zcommon: Refactor FPU state handling in fletcher4 zcommon: Refactor FPU state handling in fletcher4 Mar 12, 2023
@AttilaFueloep AttilaFueloep force-pushed the zfs-refactor-fletcher branch from fff528d to 5d45089 Compare March 12, 2023 21:57
@AttilaFueloep
Copy link
Contributor Author

Agreed, changed the commit message and the PR title.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 14, 2023
@behlendorf behlendorf merged commit 78289b8 into openzfs:master Mar 14, 2023
@lundman
Copy link
Contributor

lundman commented Mar 16, 2023

Ahhh hah, so close :)

	if (ops->uses_fpu == B_TRUE) {
		kfpu_begin();
	}
	ops->init_native(&ctx);
	ops->compute_native(&ctx, buf, size);
	ops->fini_native(&ctx, zcp);
	if (ops->uses_fpu == B_TRUE) {
		kfpu_end();
	}

Since kfpu_begin() declares the variables, local to the body of the if conditional, naturally kfpu_end() won't see them. Obviously a much smaller issue to address, just got to think of a good solution.

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 16, 2023
Currently calls to kfpu_begin() and kfpu_end() are split between
the init() and fini() functions of the particular SIMD
implementation. This was done in openzfs#14247 as an optimization measure
for the ABD adapter. Unfortunately the split complicates FPU
handling on platforms that use a local FPU state buffer, like
Windows and macOS.

To ease porting, we introduce a boolean struct member in
fletcher_4_ops_t, indicating use of the FPU, and move the FPU state
handling from the SIMD implementations to the call sites.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes openzfs#14600
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants