
Add __nodiscard__ and DBResidueMask() use cases and cleanup #377

Open

dlmiles wants to merge 4 commits into master from master-upstream-20250216-nodiscard
Conversation

dlmiles (Contributor) commented Feb 18, 2025

CodeQL picked this up (the function has no side-effect and the return value is not used):

TileTypeBitMask *DBResidueMask(TileType type);
/* NOTE: candidate for using a const return */

Added __nodiscard__ to the function for compiler assistance.
The main problem was fixed via another recent commit that removes the
excessive DBResidueMask() call.
Reordered things in this function.
Added 'const' in key places to give the compiler the extra hint that,
for the purpose of this computation, we don't change the value and the
value never changes externally, even across function calls.

I'm sure the compiler (given the macro implementation of the TTMaskXxxxx() calls
and its visibility of the data being changed) already optimises the function in
exactly the way the reorder does.

This should also have the side-effect of exposing more auto-vectorization
opportunities to the compiler, or potentially allow replacing the loop
with (a tail-call or inline of):

simd_TTMaskSetMask_residues(lmask, rmask, TT_TECHDEPBASE, DBNumUserLayers);

which would be a hand-optimized form that probably has an 'l_residues'
layout favouring SIMD use (a second-order copy of the source of truth in
a different data layout, such as a contiguous array of TileTypeBitMask
indexed from 0, with the TileType as the index).

if (type < DBNumUserLayers)
{
TTMaskSetMask(rmask, &li->l_residues);
TTMaskSetMask(rmask, lmask);
dlmiles (Contributor Author) commented on the hunk above:
This should be TTMaskCopy() rmask = *lmask;

dlmiles (Contributor Author) commented:
Fixed in db3f272

@dlmiles dlmiles force-pushed the master-upstream-20250216-nodiscard branch from 94eeb9a to db3f272 Compare February 19, 2025 14:46
dlmiles commented Feb 19, 2025

Just some background on why I picked on this. I am looking at code-gen (the assembly we actually get from the compiler), both in the areas of auto-vectorization and to consider the introduction of SIMD intrinsics.

The rearrangement streamlines the function for the compiler by making the intended transform clearer (and so giving the compiler less work to do): previously the code did a SetZero and then a mask OR (via TTMaskSetMask), but this is obviously just a copy operation.

The introduction of TTMaskCopy() is to convey semantic intention; a struct assignment would do, but the label allows the introduction of inline SIMD intrinsics that can perform the same operation.

So the reordering tries to help the compiler see an auto-vectorization opportunity, and if that fails it is still possible to force the SIMD intrinsics because the label is present.

This is just background interest and many months away from results, but it helps if the upstream tree can take a clean patch of just the #ifdef rather than a re-arrangement of the function at the same time.

dlmiles commented Feb 19, 2025

In case you are wondering what this might look like:

SSE4_2, which aligns with the new Linux system ABI x86-64-v2 (run /lib64/ld-linux-x86-64.so.2 --help to see if your system supports it); this is currently the RHEL9 default ABI.

/* DEST = DEST AND (NOT MASK) */
void __simd_inline__
SIMD_FUNCNAME(simd_TTMaskClearMask)(TileTypeBitMask *dest, const TileTypeBitMask *mask)
{
    ASSERT_ALIGNED(dest);
    ASSERT_ALIGNED(mask);

    __m128i x0 = _mm_load_si128(memaddr0(dest));
    __m128i x1 = _mm_load_si128(memaddr0(mask));
    x0 = _mm_andnot_si128(x1, x0);      // first argument gets NOT applied
    _mm_store_si128(memaddr0(dest), x0);

    __m128i x2 = _mm_load_si128(memaddr1(dest));
    __m128i x3 = _mm_load_si128(memaddr1(mask));
    x2 = _mm_andnot_si128(x3, x2);         // first argument gets NOT applied
    _mm_store_si128(memaddr1(dest), x2);
}

Which comes out like:

00000000000001b0 <simd_TTMaskClearMask_func>:
 1b0:   c5 f9 6f 0e             vmovdqa (%rsi),%xmm1
 1b4:   c5 f1 df 07             vpandn (%rdi),%xmm1,%xmm0
 1b8:   c5 f9 7f 07             vmovdqa %xmm0,(%rdi)
 1bc:   c5 f9 6f 46 10          vmovdqa 0x10(%rsi),%xmm0
 1c1:   c5 f9 df 47 10          vpandn 0x10(%rdi),%xmm0,%xmm0
 1c6:   c5 f9 7f 47 10          vmovdqa %xmm0,0x10(%rdi)
 1cb:   c3                      retq   
 1cc:   0f 1f 40 00             nopl   0x0(%rax)

dlmiles commented Feb 19, 2025

AVX2, which aligns with the new Linux system ABI x86-64-v3 (run /lib64/ld-linux-x86-64.so.2 --help to see if your system supports it):

/* DEST = DEST AND (NOT MASK) */
void __simd_inline__
SIMD_FUNCNAME(simd_TTMaskClearMask)(TileTypeBitMask *dest, const TileTypeBitMask *mask)
{
    ASSERT_ALIGNED(dest);
    ASSERT_ALIGNED(mask);

    __m256i x0 = _mm256_load_si256(memaddr0(dest));
    __m256i x1 = _mm256_load_si256(memaddr0(mask));
    x0 = _mm256_andnot_si256(x1, x0);   // first argument gets NOT applied
    _mm256_store_si256(memaddr0(dest), x0);
}

Which comes out like:

0000000000000110 <simd_TTMaskClearMask_func>:
 110:   c5 fd 6f 06             vmovdqa (%rsi),%ymm0
 114:   c5 fd df 07             vpandn (%rdi),%ymm0,%ymm0
 118:   c5 fd 7f 07             vmovdqa %ymm0,(%rdi)
 11c:   c5 f8 77                vzeroupper 
 11f:   c3                      retq   

VZEROUPPER is there to reset the penalty for switching between different SIMD modes, since the target ABI is not x86-64-v3. I believe that if it were, the instruction would not be emitted.

Note the compiler is clever enough to elide loads/stores (between calls to inline SIMD operations) and to re-order SIMD instructions, i.e. it can see through to the described intention and make better decisions about exact instruction ordering and scheduling to fill pipelines.

So this is the reason why I am picking on an area such as this: to see how I can make use of what is already available with current compilers.

dlmiles commented Feb 20, 2025

I would appreciate a review of the change-set as-is, to confirm that for all intents and purposes it looks ok and no obvious errors were introduced.


Merge Status: on hold, pending a comprehensive testing framework to manage very-high-impact changes relating to manipulation operations.

The purpose of the exercise so far (picking on a function and creating this PR) was to understand exactly what the testing framework needs to look like.

My current requirements are:

  • Needs to isolate a function (or a limited group of functions), plucking them from the code base
  • Needs to create a harness around the function (codegen, test runner, managing the API call and returned data)
  • Needs to serialize input/output/argument/condition data
  • Needs to allow randomization (and therefore fuzzing), to log output, and to provide re-creatable failures (from the serialized data)
  • Needs to be able to compare arbitrary magic versions on demand (checkout, extract target function, encapsulate in tester, run test)
  • Needs to test the same compiler with multiple build options: -g -O -O3 -fomit-frame-pointer etc.
  • Needs to test different compilers: gcc-11 gcc-13 clang-17 etc.
  • Needs to allow swapping of the original macros, SIMD, and -march=target to confirm the codegen / compiler output is also good

I have some elements of this already (it has been used to validate SIMD SSE4_2/AVX2 against the standard macros), but this PR's use-case needs to be formalized into a mechanism that can validate first.
This is needed to qualify any future SIMD work as production ready.

This in itself is a bunch of work, but it is the only way I can see to test changes in data manipulation and reach my own quality bar for considering it usable in production. There is obviously also the option of running magic twice (non-SIMD and SIMD) to perform complex work and compare that the final outputs generated are the same.

Merge Status: merge on hold (review status in 2 months)
Quality: ready to merge (the code as-is is just untested but otherwise sound with no further development on it expected)
Risk: very-high-impact (data manipulation operation change)
Level of Testing: none (needs extensive test as above)

RTimothyEdwards (Owner) left a comment:

I agree with the assessment that DBResidueMask() was probably meant to be DBFullResidueMask( ) to begin with, and is redundant and should be removed. Should I go ahead and make that change now?

dlmiles commented Feb 23, 2025

> I agree with the assessment that DBResidueMask() was probably meant to be DBFullResidueMask( ) to begin with, and is redundant and should be removed. Should I go ahead and make that change now?

Yes, make the change directly to the tree and I shall rebase (drop the commit) from this PR.

RTimothyEdwards (Owner) commented:

@dlmiles : Done; change will be mirrored to github by tomorrow.

RTimothyEdwards added a commit that referenced this pull request Feb 24, 2025
@dlmiles dlmiles force-pushed the master-upstream-20250216-nodiscard branch from db3f272 to ff6bcec Compare February 24, 2025 10:04
dlmiles commented Feb 24, 2025

Dropped the 1 file change in 1 commit (as 48708c5), which resolves this matter.

Merge is still on hold pending the testing needed to support the SIMD work.
