Integration of Clear Linux patches #164

Open
InBetweenNames opened this issue Nov 2, 2018 · 63 comments

@InBetweenNames
Owner

Clear Linux maintains a number of performance related patches for open source projects. There are quite a few: https://github.com/clearlinux-pkgs

It would be interesting to integrate these into GentooLTO somehow, either synced into the overlay or through a user install mechanism.

@aw1cks
Contributor

aw1cks commented Nov 2, 2018

I didn't look thoroughly, but if I'm seeing it right, these are just patches in their git repo? It would be very easy to put the patches into /etc/portage/patches, is there some downside to this?

@InBetweenNames
Owner Author

It appears that way, however I'm guessing they also fine tune compiler options as well. Also, I'm unsure if they use icc or gcc for their builds. I'm guessing it depends on the package. At the very least, we could include the source code patches, I think.

@InBetweenNames
Owner Author

Interesting flag: -fno-semantic-interposition. I'll be adding this one to the defaults I think.
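As a rough illustration of what this flag changes, here is a minimal sketch in C. The function names `scale` and `scale_twice` are hypothetical and exist only for this example. The assumption is a shared object built by GCC with -fPIC: by default, an exported function may be replaced at load time by another definition, so GCC has to be conservative about inlining it into callers within the same library.

```c
#include <assert.h>

/* Exported function with default visibility. In a shared object built
   with -fPIC, another library (or an LD_PRELOAD shim) may provide a
   replacement definition that wins at load time. */
int scale(int x)
{
    assert(x > -1000000 && x < 1000000);  /* keep 3 * x well inside int range */
    return 3 * x;
}

/* Caller in the same translation unit. Under GCC's default semantics
   for -fPIC code, the compiler must assume scale() could be interposed,
   which limits inlining and other interprocedural optimization here.
   With -fno-semantic-interposition, GCC may assume the definition
   above is the one that will be called. */
int scale_twice(int x)
{
    return scale(scale(x));
}
```

The result is the same either way (`scale_twice(2)` is 18); only the freedom the optimizer has differs. Code that relies on replacing exported symbols at load time is the case to watch for when enabling the flag globally.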

@InBetweenNames
Owner Author

Other interesting things: they appear to enable -ffast-math on certain packages, or flags that -ffast-math enables at the very least. Also interesting: they force all functions to be aligned on 32 byte boundaries with -falign-functions=32. I wonder if they do that for AVX512 compatibility? It'd be interesting to see what the benefits are for overaligning functions, if any.

@InBetweenNames
Owner Author

It appears that -falign-functions=32 has some kind of impact on autovectorization. By default on x86_64, it is set to 16.

@InBetweenNames
Owner Author

I just tested an architecture that has AVX512 instructions:

> gcc -march=skylake-avx512 -flto -Ofast -Q --help=optimizer | grep falign-functions
  -falign-functions                     [disabled]
  -falign-functions=                    16

It seems that by default on x86_64, -falign-functions is always set to 16 (as long as -O2 or higher is specified).
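To see what alignment a function actually ended up with in a built binary, something like the following sketch could be used. The helper `start_alignment` and the function `hot_path` are hypothetical names for this example. The `aligned` attribute is a GCC/Clang extension used here to force a 32-byte boundary for one function, so the check is meaningful without passing any -falign-functions option.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns the largest power of two (capped at 4096)
   that divides the start address of fn. Converting a function pointer to
   uintptr_t is not strictly portable ISO C, but works with GCC and Clang
   on x86_64. */
static unsigned long start_alignment(void (*fn)(void))
{
    uintptr_t addr = (uintptr_t)fn;
    unsigned long align = 1;

    assert(addr != 0);
    while (align < 4096 && (addr % (align * 2)) == 0)
        align *= 2;
    return align;
}

/* Forced onto a 32-byte boundary with a GCC/Clang attribute. This is a
   per-function stand-in for what -falign-functions=32 requests globally. */
__attribute__((aligned(32))) void hot_path(void)
{
}
```

Calling `start_alignment(hot_path)` should report at least 32. Running the same helper on functions without the attribute, across builds with different -falign-functions values, shows what the compiler and linker really produced.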

@Althorion
Contributor

Phoronix claims that they build using GCC/Clang (differs from package to package).

@InBetweenNames
Owner Author

Makes sense - icc probably doesn't support a lot of the GNU extensions that are used out there in the wild. Not to mention, GCC is highly competitive with ICC when the right options are used. I'm thinking I may update the recommendations for -falign-functions, not adding it by default but mentioning it in make.conf.lto. The reason being, I want to support more than just Intel processors, or even just x86_64, and this feels very much like an Intel-specific thing.

@funghetto
Contributor

I heard that also Solus is using some of their optimizations.

@wolfwood
Contributor

wolfwood commented Nov 2, 2018

@InBetweenNames one thing that clear linux does that isn't really necessary for us is function multiversioning. I think the linker links different functions based on eg. AVX support. since everyone here is probably building with -march=native we can get smaller binaries than Clear can, may have better LTO opportunities, etc.

It would be interesting to see if we can get Michael to bench lto-overlay vs. Clear, esp. once we steal some of their fancy tricks.

regarding -falign-functions I'd expect this to be about aligned jumps/instruction cache line reads. maybe RIP relative addressing? but once you are executing in the function instruction alignment is going to be off no matter what you do thanks to variable length instructions. even if AVX cared about instruction alignment, I'm not sure this would help.

is it possible that there is a more compact way to load immediates if they are 32-aligned, so function call sites are smaller?

@InBetweenNames
Owner Author

Agreed -- in fact we should do even better than function multi versioning since we're compiling our system exactly tailored for the system it's running on. This means more opportunities for LTO all around. Not to mention, this should be highly portable across many architectures.

If one could link their system using mainly static libraries, I bet the LTO benefits would be even more profound. You can link-optimize across static library boundaries, and you can't do that with shared objects. I don't believe this is possible as-is however, since Portage seems to really prefer shared objects, and configure scripts, etc, also prefer shared objects.

I was wondering the same about -falign-functions today, but there are other -falign-* flags that affect those other cases you mention. I hadn't considered immediate operands however. It might make a difference if there's some static storage for a function as well. I've been looking around all day for more uses of -falign-functions=32 and I've been having serious trouble. I found a slide deck:

http://hpac.rwth-aachen.de/teaching/sem-accg-16/slides/08.Schmitz-GGC_Autovec.pdf

I also found a StackOverflow question that indirectly touches on it:

https://stackoverflow.com/questions/19470873/why-does-gcc-generate-15-20-faster-code-if-i-optimize-for-size-instead-of-speed

If I pass g++ -O2 -falign-functions=16 -falign-loops=16 then everything is back to normal: I always get the fastest case and the time isn't sensitive to the -fno-omit-frame-pointer flag anymore. I can pass g++ -O2 -falign-functions=32 -falign-loops=32 or any multiples of 16, the code is not sensitive to that either.

Without delving in the GCC internals, I can't find many resources that recommend this flag. I'll see if the Intel guys will shed some light on it.

@InBetweenNames
Owner Author

More goodies: -fno-common

Line 481: https://github.com/clearlinux/autospec/blame/master/autospec/specfiles.py

In that commit, -fno-common is mentioned.
Detailed here: https://gcc.gnu.org/onlinedocs/gcc/Code-Gen-Options.html -- seems beneficial when it would work.

I find it interesting they enable -fno-math-errno, -fno-trapping-math by default. It's not -ffast-math, but it's partway there.

>gcc -march=skylake-avx512 -flto -Ofast -Q --help=common | grep fcommon
  -fcommon                              [enabled]

Even with the most aggressive optimization package, this is on by default.
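For context, -fcommon concerns uninitialized globals such as a bare `int request_count;`. With -fcommon, the same tentative definition can appear in several translation units and the linker merges them. With -fno-common, each one is a real definition, so duplicates become a link error. A minimal sketch of the pattern that works under either setting, using the hypothetical names `request_count` and `record_request`:

```c
#include <assert.h>
#include <limits.h>

/* Declaration only: this is the line that belongs in a shared header. */
extern int request_count;

/* The single real definition, living in exactly one source file.
   Code written this way links the same with -fcommon and -fno-common. */
int request_count = 0;

int record_request(void)
{
    assert(request_count < INT_MAX);  /* avoid signed overflow */
    return ++request_count;
}
```

Packages that fail to link with -fno-common are usually ones that put a bare `int x;` in a header instead of an `extern` declaration.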

@InBetweenNames
Owner Author

Found when it was added: https://github.com/clearlinux/autospec/blame/a5260d7ce751774d46e0a957786d179456a14275/autospec/buildpattern.py

It was added by @fenrus75 for "high speed cases". Interesting.

@InBetweenNames
Owner Author

I notice that on packages that are optimized for size, they enable -ffunction-sections and -fdata-sections for dead code removal, along with a -Wl,--gc-sections. However, these are two flags I want to research more before enabling by default -- I'm unsure how these interact with LTO. I assumed that LTO kind of did dead code elimination on its own, since the entire program would be visible at link time (minus definitions in shared objects).

@InBetweenNames
Owner Author

After researching a bit more, it looks like -Wl,--gc-sections is a weak form of LTO, and it is often compared to full LTO like GentooLTO uses. I'm not sure if there's a benefit to using both at the same time.

https://lwn.net/Articles/741494/

@InBetweenNames
Owner Author

OK -- so locally, I have enabled -fno-common and -fno-semantic-interposition and have started building a few packages with them. I'll try them out for a few days before pushing them. I've also emailed the Clear Linux developers about -falign-functions=32. If it turns out to be beneficial for some systems, I will add it as a recommendation but I won't enable it by default in the overlay -- it will be opt-in behaviour.

@InBetweenNames
Owner Author

OK -- I think I figured it out:

https://software.intel.com/en-us/forums/intel-c-compiler/topic/635646

For more info:

https://lkml.org/lkml/2015/5/19/1009

It looks like the historical reason for -falign-functions=16 is:

The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle

From Agner Fog's docs:

https://www.agner.org/optimize/microarchitecture.pdf

See "Instruction Fetch" sections for details.

However, consider that cache lines are usually 64 bytes long -- depending on your processor. From Ingo's post:

So based on those measurements, I think we should do the exact
opposite of my original patch that reduced alignment to 1 bytes, and
increase kernel function address alignment from 16 bytes to the
natural cache line size (64 bytes on modern CPUs).

As for why -falign-functions=32 was chosen? I have a feeling it's actually a compromise. See this reply by Linus: https://lkml.org/lkml/2015/5/19/1142

Is there some way to get gcc to take the size of the function into
account? Because aligning a 16-byte or 32-byte function on a 64-byte
alignment is just criminally nasty and wasteful.

So, for functions that are larger than the cache line size, aligning on a cache line boundary makes the most sense. For functions that are smaller than the cache line size, this isn't ideal as it wastes I$ space. Of course, when inlining is taken into account, which is much more likely since we are using system-wide LTO, this whole discussion becomes moot. However, this is still a problem for shared objects.

Ideally, GCC/ld would be smarter about how it aligns functions.

Work has been done to this end: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240
It has been merged in trunk!

So, looking at -falign-functions once again:

Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance, -falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 aligns to the next 32-byte boundary only if this can be done by skipping 23 bytes or less.

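In other words, -falign-functions=24 aligns a function to the next 32-byte boundary unless that would take more than 23 bytes of padding, i.e. unless the function would otherwise start 1 to 8 bytes past a 32-byte boundary. Note that the limit applies to the padding, not to the size of the function.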
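The skip rule from the quoted documentation can be modelled in a few lines of C. This is only a sketch of the primary `.p2align log,,max_skip` directive; it ignores the secondary `.p2align 3` that the x86 backend adds, and the function names `p2align` and `align_functions` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the assembler directive ".p2align log,,max_skip":
   advance addr to the next 2^log boundary, but only if that takes
   at most max_skip bytes of padding. Otherwise leave addr alone. */
static uint64_t p2align(uint64_t addr, unsigned log, uint64_t max_skip)
{
    assert(log < 63);
    uint64_t boundary = (uint64_t)1 << log;
    uint64_t padding = (boundary - (addr % boundary)) % boundary;
    return padding <= max_skip ? addr + padding : addr;
}

/* Model of -falign-functions=n as described in the quoted text:
   round n up to a power of two and allow up to n - 1 bytes of padding. */
static uint64_t align_functions(uint64_t addr, unsigned n)
{
    unsigned log = 0;

    assert(n >= 1);
    while (((uint64_t)1 << log) < n)
        log++;
    return p2align(addr, log, n - 1);
}
```

Under this model, `align_functions(0x1008, 24)` leaves the address unchanged, because reaching `0x1020` would take 24 bytes of padding, while `align_functions(0x1009, 24)` moves the start to `0x1020` with 23 bytes of padding.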

And another goodie -flimit-function-alignment:

If this option is enabled, the compiler tries to avoid unnecessarily overaligning functions. It attempts to instruct the assembler to align by the amount specified by -falign-functions, but not to skip more bytes than the size of the function.

This flag is off by default!

Delving in the GCC source code in file gcc/config/i386/x86-64.h:

#define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP)                    \
  do {                                                                  \
    if ((LOG) != 0) {                                                   \
      if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG));  \
      else {                                                            \
        fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP));     \
        /* Make sure that we have at least 8 byte alignment if > 8 byte \
           alignment is preferred.  */                                  \
        if ((LOG) > 3                                                   \
            && (1 << (LOG)) > ((MAX_SKIP) + 1)                          \
            && (MAX_SKIP) >= 7)                                         \
          fputs ("\t.p2align 3\n", (FILE));                             \
      }                                                                 \
    }                                                                   \
  } while (0)

and the calling code in gcc/varasm.c:

...
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
      int align_log = align_functions_log;
#endif
      int max_skip = align_functions - 1;
      if (flag_limit_function_alignment && crtl->max_insn_address > 0
          && max_skip >= crtl->max_insn_address)
        max_skip = crtl->max_insn_address - 1;

#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, align_log, max_skip);
#else
      ASM_OUTPUT_ALIGN (asm_out_file, align_functions_log);
#endif
    }

So in the worst case, we still get 8-byte function alignment for functions that are smaller than the -falign-functions value. With the default, you get at most 16-byte alignment, and at least 8-byte alignment with -flimit-function-alignment. It would probably make more sense to align to the L1 cache line size by default, and to at least 16 bytes with -flimit-function-alignment. This is a pretty trivial change to make:

#define ASM_OUTPUT_MAX_SKIP_ALIGN(FILE,LOG,MAX_SKIP)                    \
  do {                                                                  \
    if ((LOG) != 0) {                                                   \
      if ((MAX_SKIP) == 0) fprintf ((FILE), "\t.p2align %d\n", (LOG));  \
      else {                                                            \
      fprintf ((FILE), "\t.p2align %d,,%d\n", (LOG), (MAX_SKIP));       \
        if ((1 << (LOG)) > ((MAX_SKIP) + 1))                            \
        {                                                               \
        /* Make sure that we have at least 16 byte alignment            \
           if > 16 byte alignment is preferred.  */                     \
          if ((LOG) > 4 && (MAX_SKIP) >= 15)                            \
            fputs ("\t.p2align 4\n", (FILE));                           \
        /* Make sure that we have at least 8 byte alignment if > 8 byte \
           alignment is preferred.  */                                  \
          else if ((LOG) > 3 && (MAX_SKIP) >= 7)                        \
            fputs ("\t.p2align 3\n", (FILE));                           \
        }                                                               \
      }                                                                 \
    }                                                                   \
  } while (0)

The above should guarantee the following, for a function that takes b bytes, with -falign-functions=n and -flimit-function-alignment:

  • If b >= n: will always be aligned to n
  • If n > 16 and 16 <= b < n: will be aligned to at least a 16-byte boundary
  • Otherwise, if n > 8 and 8 <= b < n: will be aligned to at least an 8-byte boundary

The check is done in this order to prevent wasting space.
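The guarantees listed above can be checked against a small model of the patched macro plus the max_skip clamp from the calling code. This is a sketch under assumptions: `guaranteed_alignment` is a hypothetical name, `n` is taken to be a power of two, and `b` stands in for `crtl->max_insn_address` as the function size.

```c
#include <assert.h>

/* Minimum start alignment (in bytes) that the patched
   ASM_OUTPUT_MAX_SKIP_ALIGN would guarantee for a b-byte function
   built with -falign-functions=n -flimit-function-alignment. */
static unsigned guaranteed_alignment(unsigned n, unsigned b)
{
    unsigned log = 0;

    assert(n >= 1 && (n & (n - 1)) == 0);  /* power of two only */
    while ((1u << log) < n)
        log++;
    if (log == 0)
        return 1;                 /* the macro emits nothing when LOG == 0 */

    unsigned max_skip = n - 1;
    if (b > 0 && max_skip >= b)   /* the -flimit-function-alignment clamp */
        max_skip = b - 1;

    /* MAX_SKIP == 0 emits an unconditional ".p2align LOG", and
       MAX_SKIP == n - 1 can always reach the boundary. */
    if (max_skip == 0 || max_skip == n - 1)
        return n;
    if (log > 4 && max_skip >= 15)
        return 16;                /* secondary ".p2align 4" */
    if (log > 3 && max_skip >= 7)
        return 8;                 /* secondary ".p2align 3" */
    return 1;                     /* only the opportunistic primary align */
}
```

With n = 64 this gives 64 for a 100-byte function, 16 for a 32-byte function, and 8 for a 15-byte function. One quirk of the macro as written: a 1-byte function clamps max_skip to 0, which takes the unconditional branch and aligns fully to n.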

So, it seems to me we should be using -falign-functions=${L1ICACHELINESIZE} -flimit-function-alignment.

I will test out my GCC patch and if it works OK, I will submit it upstream.

@wolfwood
Contributor

wolfwood commented Nov 3, 2018 via email

@InBetweenNames
Owner Author

Heh, of course the GCC devs beat me to the punch.

gcc-mirror/gcc@bc9f52f

It looks like they are reworking the -flimit-function-alignment stuff in the next GCC version, given the commit message.

@InBetweenNames
Owner Author

InBetweenNames commented Nov 3, 2018

Thanks! In GCC trunk, we have this nice thing:

  {
    /* N2[:M2] is not specified.  This arch has a default for N2.
       Before -falign-foo=N:M:N2:M2 was introduced, x86 had a tweak.
       -falign-functions=N with N > 8 was adding secondary alignment.
       -falign-functions=10 was emitting this before every function:
      .p2align 4,,9
      .p2align 3
       Now this behavior (and more) can be explicitly requested:
       -falign-functions=16:10:8
       Retain old behavior if N2 is missing: */

So, we may be able to say something like -falign-functions=64:48:16:8 which should:

  • Align to 64 bytes if it can be done by skipping 47 bytes or less
  • Align to 16 bytes if it can be done by skipping 7 bytes or less

Obviously these values would need to be tweaked. But it would give the desired result at least.

This doesn't appear to be documented anywhere however.

@InBetweenNames
Owner Author

Okay, it is documented and I simply didn't look hard enough. It's hard to retain the old behaviour with the new method, since the secondary alignment will only be triggered if -flimit-function-alignment is not passed in. Sigh.

@InBetweenNames
Owner Author

Got a response from Arjan van de Ven!

without going into too many cpu microarchitecture details... Intel cpus like hot code to start at a 32 byte boundary.

Very interesting. So, Ingo's findings confirm this to a degree, and suggest even stronger alignment requirements are beneficial. He says it best here: https://lkml.org/lkml/2015/5/21/443
I think, with -falign-functions=n and -flimit-function-alignment we get 90% of the way there, actually:

  • Functions greater than n bytes are aligned to an n byte boundary
  • Functions smaller than n bytes are tightly packed, unless they would cross an n-byte boundary

For fun, here's an attempt to restore the functionality in my previous patch on GCC trunk:

  /* Handle a user-specified function alignment.
     Note that we still need to align to DECL_ALIGN, as above,
     because ASM_OUTPUT_MAX_SKIP_ALIGN might not do any alignment at all.  */
  if (! DECL_USER_ALIGN (decl)
      && align_functions.levels[0].log > align
      && optimize_function_for_speed_p (cfun))
    {
#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
      int max_skip1 = align_functions.levels[0].maxskip;
      int max_skip2 = align_functions.levels[1].maxskip;
      if (flag_limit_function_alignment)
      {
        if (crtl->max_insn_address > 0
          && max_skip1 >= crtl->max_insn_address)
        max_skip1 = crtl->max_insn_address - 1;
        
        if (crtl->max_insn_address > 0
          && max_skip2 >= crtl->max_insn_address)
        max_skip2 = crtl->max_insn_address - 1;
      }
      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
                           align_functions.levels[0].log,
                           max_skip1);
      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
                           align_functions.levels[1].log,
                           max_skip2);
#else
      ASM_OUTPUT_ALIGN (asm_out_file, align_functions.levels[0].log);
#endif
    }

So, with m < n, -falign-functions=n:n:m:m -flimit-function-alignment, for a b byte function would:

  • If n <= b, will be at least aligned to an n byte boundary
  • If m <= b < n, will be at least aligned to an m-byte boundary
  • If b < m, if the function would cross the boundary m, it will be aligned to m
  • Otherwise, will use target default function alignment (unknown what this is defined as in GCC, but I suspect for x86_64 it is either 8 or 16 -- if anyone knows please let me know). If this is 0, then it's tightly packed.

Examples for the above would be n = 64 or n = 32 and m = 16

Obviously such a scheme would need benchmarks to show it's worth doing over the default. It could potentially waste space, too, since a function with m <= b < n may align to an n boundary, instead of a potentially closer m boundary. It's too bad Ingo's scheme is too hard to implement in a quick patch, as I'd love to test his out.

Regardless of whether the default schemes or the one I posted above is used, based on what we have seen, n should be either 64 or 32 for Intel processors, and we may or may not want to tightly pack small functions with -flimit-function-alignment. We'd need benchmarks to show for sure what's worth enabling, but I think it's safe to go with Arjan van de Ven's choice of -falign-functions=32 for Intel processors and not tightly packing functions in the meantime. I will update README.md accordingly.

As I find this issue to be very interesting, I'd like to leave it up for discussion, especially in the hopes we get some benchmarks using combinations of these flags. My diff against GCC trunk for my own alignment scheme is below, in case anyone wants to try it on GCC trunk:

diff --git a/gcc/varasm.c b/gcc/varasm.c
index 545e13fef6a..6ed87298ec9 100644
--- a/gcc/varasm.c
+++ b/gcc/varasm.c
@@ -1809,19 +1809,24 @@ assemble_start_function (tree decl, const char *fnname)
       && optimize_function_for_speed_p (cfun))
     {
 #ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-      int align_log = align_functions.levels[0].log;
-#endif
-      int max_skip = align_functions.levels[0].maxskip;
-      if (flag_limit_function_alignment && crtl->max_insn_address > 0
-	  && max_skip >= crtl->max_insn_address)
-	max_skip = crtl->max_insn_address - 1;
+      int max_skip1 = align_functions.levels[0].maxskip;
+      int max_skip2 = align_functions.levels[1].maxskip;
+      if (flag_limit_function_alignment)
+      {
+        if (crtl->max_insn_address > 0
+          && max_skip1 >= crtl->max_insn_address)
+        max_skip1 = crtl->max_insn_address - 1;
 
-#ifdef ASM_OUTPUT_MAX_SKIP_ALIGN
-      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file, align_log, max_skip);
-      if (max_skip == align_functions.levels[0].maxskip)
-	ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
-				   align_functions.levels[1].log,
-				   align_functions.levels[1].maxskip);
+        if (crtl->max_insn_address > 0
+          && max_skip2 >= crtl->max_insn_address)
+        max_skip2 = crtl->max_insn_address - 1;
+      }
+      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
+                           align_functions.levels[0].log,
+                           max_skip1);
+      ASM_OUTPUT_MAX_SKIP_ALIGN (asm_out_file,
+                           align_functions.levels[1].log,
+                           max_skip2);
 #else
       ASM_OUTPUT_ALIGN (asm_out_file, align_functions.levels[0].log);
 #endif

InBetweenNames added a commit that referenced this issue Nov 4, 2018
Enable -fno-semantic-interposition by default
Add off by default define for -fno-common
Reference #164
@sjnewbury

In addition to the function alignment, there's also data-alignment which takes a cacheline option:
-malign-data=cacheline

I also build my system with:
-mtls-dialect=gnu2

@sjnewbury

The Clear Linux "fast-math" options can be very beneficial to auto-vectorization, it gives many more opportunities than the default IEEE754 strict compliance.

@InBetweenNames
Owner Author

@sjnewbury:
-mtls-dialect is a nice one!
-malign-data=cacheline too.

I got the impression that -malign-data, if changed, may not be compatible with code compiled with GCC 4.8 or older. Do you know if this one affects binary compatibility? If so, this would mostly affect closed source software, and possibly users of -bin packages. I see a number of recommendations for high performance code to use malign-data=cacheline, so this may be a non-issue at this time.

I've been hemming and hawing about the strict IEEE compliance myself, and I've decided we can support it as an opt-in enhancement. I know some users of this overlay are using it for scientific computations, and I don't want to interfere with that automatically.

@InBetweenNames
Owner Author

One more thing: is malign-data documented in detail anywhere? I can't find much in the official GCC docs. I see references to Clear Linux using -malign-data=abi at one point. If necessary, I'll go look through the GCC code again.

@justanerd

justanerd commented Nov 5, 2018

I use the code from this bug report to benchmark -falign-functions: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58863
I just compile it with my cflags and then time ./align 32 32
When I see a difference I compile php 7.2 and run phpbench to see if I'm getting any benefit.
For a 7900X and gcc 8.2 (with clear linux patches applied) -falign-functions=8 seems to be the fastest.

Also you should look into --param inline-unit-growth=5 --param max-unroll-times=2
http://hubicka.blogspot.com/2014/04/linktime-optimization-in-gcc-2-firefox.html
https://www.phoronix.com/forums/forum/software/programming-compilers/47966-intel-broadwell-gcc-4-9-vs-llvm-clang-3-5-compiler-benchmarks/page2
These two are still providing some benefit.
-funroll-loops should be set with --param max-unroll-times=2 to get the improvement

@InBetweenNames
Owner Author

Well, I don't want to use -funroll-loops and friends because those override the compiler's judgement. Even when you add in --param max-unroll-times=2 --param inline-unit-growth=5, you're still telling the compiler to do something unconditionally. Certain packages may benefit, but the idea is we should be letting the compiler decide what to do. Hence why -falign-functions=32 is documented as being an optional thing for Intel chips, based on my conversation with a Clear Linux developer.

Now of course, we may want to enable these flags on a per-package basis where improvement has been proven via benchmarks (as with your php example). Otherwise, it should be the defaults with possibly a few optional tweaks on a per package basis.

@gcs-github
Contributor

Interesting. I didn't get any error using the GNU ld (BFD), but I'm getting an error too on eudev if I switch my linker to gold. Not reproducing the issue on python however.

@jelinekto
Contributor

Ah, of course switching the linker didn't occur to me. btrfs-progs and eudev indeed do build for me with -fuse-ld=bfd.

@jelinekto
Contributor

As for python, turns out I get the error only when combining gold and pgo.

@InBetweenNames
Owner Author

This is all very good to know. Definitely will hold off on -mtls-dialect=gnu2 by default for a bit.

@sjnewbury

sjnewbury commented Nov 21, 2018

FWIW -mtls-dialect=gnu2 also requires a glibc patch to make it work with a patched prelink... Yeah, I'm the last user of prelink! ;-)

(Currently building gentooLTO+x32+auto-prelink+autopar+jemalloc)

@aw1cks
Contributor

aw1cks commented Dec 4, 2018

Just a heads up, I'm considering creating some ebuilds with the Clear Linux patches in my overlay. I tested their kernel patches today to great success (reduced boot time by about 25%) so I will create an ebuild for at least the patched kernel and maybe some other packages, depending on how much success I have with them.

@sjnewbury

@aw1cks I see they change the perf_bias to default to performance instead of normal. How much does this account for? This setting makes a big difference on my Ivy Bridge laptop, but I have it set to toggle on power events to maximise battery life.

@aw1cks
Contributor

aw1cks commented Dec 5, 2018

@sjnewbury I don't use it on a laptop. I haven't got round to rebuilding my laptop with gentoo yet, but on my desktop I didn't use all the patches but rather the ones relating to boot speed & performance without much regard for the patches claiming to reduce wakelocks. As far as I can tell, in their use case with Clear Linux the difference in power is more than offset by the other tweaks which they have made. How much of this extra battery life comes from their kernel, I couldn't say, seeing as they have custom patches applied to many userland applications and even gcc itself. You would have to benchmark it to know for sure.

If you want to test the kernel yourself, you can use any 4.19 series kernel and put the patches into /etc/portage/patches/sys-kernel/${KERNEL_SOURCE_PKG_NAME}-${KERNEL_VERSION}/. The patches are available here (they also have their boot parameters in a text file in this repository, worth trying perhaps).

Just as a note, I do hope to eventually create ebuilds which DO include their userland patches for various programs available as a useflag, and then in that case we can make a fair comparison. Additionally, I do wonder if the use of systemd vs openRC could play a role here (in my experience, I have had higher power consumption when using init systems other than systemd - maybe it's something I'm doing wrong, I couldn't tell you). If you do find anything out, please let me know as I'm quite interested in this myself for my laptop.

@fenrus75

fenrus75 commented Dec 5, 2018 via email

@aw1cks
Contributor

aw1cks commented Dec 5, 2018

@fenrus75 thanks for pointing in the right direction. How can I build this package without Clear Linux userspace tools? I can't find any binaries in the repository, nor any of the releases, and the Makefile references a file not included in the repository.

@fenrus75

fenrus75 commented Dec 5, 2018 via email

@aw1cks
Contributor

aw1cks commented Dec 5, 2018

Great, thanks. However I'm having an issue with autoconf.

[alex@xps13 clr-power-tweaks-174]$ autoconf
configure.ac:6: error: possibly undefined macro: AM_INIT_AUTOMAKE
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
configure.ac:7: error: possibly undefined macro: AM_SILENT_RULES

Some missing include maybe?

@gcs-github
Contributor

Leaving a link to this new Phoronix post here, benchmarking some of the performance gains from Clear Linux, to give some extra context to this issue and get some idea of what we can hope for from following up: https://www.phoronix.com/scan.php?page=article&item=clear-faster-blas&num=1

@javashin

javashin commented Apr 9, 2019

FWIW -mtls-dialect=gnu2 also requires a glibc patch to make it work with a patched prelink... Yeah, I'm the last user of prelink! ;-)

(Currently building gentooLTO+x32+auto-prelink+autopar+jemalloc)

I'm prelinking Gentoo too, with my new install: Gentoo nomultilib, LTO, nopie, nossp.

@jelinekto
Contributor

@InBetweenNames Did you by any chance look into -fdata-sections -ffunction-sections -Wl,--gc-sections further? I enabled those globally a couple of months ago and while I'm not sure there's a clean benefit (couldn't directly compare binary sizes with my previous build as I changed some other things as well), my system does not appear to be broken.

Even if it's not worth enabling globally, perhaps packages that can't be built with full LTO could benefit from something like /"${FLTO}"/"${GCSECTIONS}"?

@elsandosgrande
Contributor

@jelinekto Umm, https://stackoverflow.com/questions/4274804/query-on-ffunction-section-fdata-sections-options-of-gcc . This is not as straightforward as you might think.

jiblime added a commit to jiblime/gentooLTO that referenced this issue Sep 11, 2020
Integration of Clear Linux's 'multi-thread-default.patch'

By default, zstd uses one core for compression. This patch
makes zstd use all physical cores detected for compression,
increasing performance and reducing compression time.

Below's results are from using zstd's built-in benchmark,
showing a decrease of 78.13% in compression time with -T0.

Default (-T1)
	19#linux-5.8.tar     : 983869440 -> 121381009 (8.106),  4.05 MB/s ,1771.3 MB/s
	zstd -T1 -b19 -i0 --priority=rt linux-5.8.tar  244.99s user 0.46s system 99% cpu 4:05.47 total

Patched default (-T0)
	19#linux-5.8.tar     : 983869440 -> 121384544 (8.105),  19.2 MB/s ,1756.7 MB/s
	zstd -T0 -b19 -i0 --priority=rt linux-5.8.tar  297.19s user 0.63s system 554% cpu 53.692 total

Source: https://github.com/clearlinux-pkgs/zstd/blob/1.4.5-60/multi-thread-default.patch
References: InBetweenNames#164
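The 78.13% figure in the commit message can be reproduced from the two wall-clock totals in the benchmark output (4:05.47, i.e. 245.47 s, versus 53.692 s). A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>

/* Percentage by which `after` is lower than `before`. */
static double percent_decrease(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before * 100.0;
}

/* How many times faster the `after` run finished. */
static double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}
```

With these inputs, `percent_decrease(245.47, 53.692)` comes out at roughly 78.13 and `speedup(245.47, 53.692)` at roughly 4.57. The figures refer to wall-clock time only; total user CPU time in the same output went up, from 244.99 s to 297.19 s.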
jiblime added a commit to jiblime/gentooLTO that referenced this issue Sep 11, 2020
Integration of Clear Linux's 'multi-thread-default.patch'

By default, zstd uses one core for compression. This patch
makes zstd use all physical cores detected for compression,
increasing performance and reducing compression time.

Below's results are from using zstd's built-in benchmark,
showing a decrease of 78.13% in compression time with -T0.
The change is from 1 core to 6 (physical) cores and will
differ based on machine and file contents.

Default (-T1)
	19#linux-5.8.tar     : 983869440 -> 121381009 (8.106),  4.05 MB/s ,1771.3 MB/s
	zstd -T1 -b19 -i0 --priority=rt linux-5.8.tar  244.99s user 0.46s system 99% cpu 4:05.47 total

Patched default (-T0)
	19#linux-5.8.tar     : 983869440 -> 121384544 (8.105),  19.2 MB/s ,1756.7 MB/s
	zstd -T0 -b19 -i0 --priority=rt linux-5.8.tar  297.19s user 0.63s system 554% cpu 53.692 total

Source: https://github.com/clearlinux-pkgs/zstd/blob/1.4.5-60/multi-thread-default.patch
References: InBetweenNames#164
InBetweenNames pushed a commit that referenced this issue Sep 11, 2020
Integration of Clear Linux's 'multi-thread-default.patch'

By default, zstd uses one core for compression. This patch
makes zstd use all physical cores detected for compression,
increasing performance and reducing compression time.

The results below are from zstd's built-in benchmark
and show a 78.13% decrease in compression time with -T0.
The benefit is only apparent if compression is CPU-bound
and will differ based on machine and file contents.

Default (-T1)
	19#linux-5.8.tar     : 983869440 -> 121381009 (8.106),  4.05 MB/s ,1771.3 MB/s
	zstd -T1 -b19 -i0 --priority=rt linux-5.8.tar  244.99s user 0.46s system 99% cpu 4:05.47 total

Patched default (-T0)
	19#linux-5.8.tar     : 983869440 -> 121384544 (8.105),  19.2 MB/s ,1756.7 MB/s
	zstd -T0 -b19 -i0 --priority=rt linux-5.8.tar  297.19s user 0.63s system 554% cpu 53.692 total

Source: https://github.com/clearlinux-pkgs/zstd/blob/1.4.5-60/multi-thread-default.patch
References: #164
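
The 78.13% figure in the commit message can be checked from the two wall-clock totals it quotes. The following is a minimal sketch of that arithmetic, not part of the patch; the helper name `to_seconds` is made up for illustration, and only the two "total" timings from the benchmark output above are used.

```python
def to_seconds(wall: str) -> float:
    """Convert a wall-clock string such as '4:05.47' or '53.692' to seconds."""
    seconds = 0.0
    for part in wall.split(":"):
        # Each colon-separated field is worth 60x the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


# Wall-clock totals quoted in the commit message above.
baseline = to_seconds("4:05.47")  # default, -T1
patched = to_seconds("53.692")    # patched default, -T0

decrease_pct = (baseline - patched) / baseline * 100
speedup = baseline / patched

print(f"wall time decrease: {decrease_pct:.2f}%")  # 78.13%
print(f"speedup: {speedup:.2f}x")                  # 4.57x
```

Note that the user CPU time in the same output goes up (244.99s to 297.19s), so the gain is in elapsed time, not in total work done.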
@eternal-sorrow

So, let's get this straight: is -falign-functions=32 beneficial on AMD CPUs, or only on Intel?

@JustArchi

JustArchi commented Mar 6, 2022

I'm late to the party but I was having fun with your awesome LTO patches and various flags today, testing them with sysbench.

On my Intel i7-7700K, -falign-functions=32 degrades sysbench cpu run results (total number of events) from around 15.5k to barely 13.9-14k. For comparison, I also tried -falign-functions=8, which gave around 14.3k: still a significant drop, but a smaller one than with 32. I made triple sure I was testing and interpreting things correctly, re-emerging sysbench (and only sysbench) after every change and running it several times while keeping background activity as quiet as possible. The flags I used were the current gentooLTO defaults as of today with only -march=native added, so implicitly -O3 and all LTO/Graphite optimizations.

Now I know this is one very specific, maybe even a bit stupid benchmark to test those flags with, but it shows that it's not universally true that newer Intel CPUs should use 32 globally. Maybe there are benchmarks or other apps where it's beneficial, I don't doubt that, but there is at least one place (and from what I've read, more than one) where it heavily degrades performance, so much so that the result drops all the way down to that of plain -O2 (without -march), which clocks in at around 13.9-14k as well.

Just my 3 cents, maybe it'll help somebody, maybe it won't. I suggest running benchmarks to verify whether the flag actually helps. Personally, I dug this up because after applying the LTO flags the benchmark dropped by approx. 10% compared to just -O2 -march=native, and I was looking for the cause; it turned out to be -falign-functions=32. There is a chance that this benchmark is flawed and is the exception rather than the rule, but I doubt it. Once I get some time and motivation I might run other benchmarks to compare the results.
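
For anyone who wants to put numbers on the above, here is a small sketch of the relative-change arithmetic using the approximate event counts quoted in this comment (15.5k, 13.9-14k, 14.3k). These are rounded figures from a single machine, and the helper name `relative_change_pct` is just for illustration.

```python
def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline (negative = slower)."""
    return (candidate - baseline) / baseline * 100


# Approximate sysbench "total number of events" figures quoted above.
baseline_events = 15_500  # around 15.5k without the alignment override
runs = {
    "-falign-functions=32 (low end)": 13_900,
    "-falign-functions=32 (high end)": 14_000,
    "-falign-functions=8": 14_300,
}

for label, events in runs.items():
    change = relative_change_pct(baseline_events, events)
    print(f"{label}: {change:+.1f}%")
```

With these inputs the drop for 32-byte alignment comes out at roughly 10%, which matches the "approx 10%" mentioned above, and the drop for 8-byte alignment is a bit under 8%.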

@RaphMad

RaphMad commented Nov 22, 2022

@JustArchi Very interesting observation. I was contemplating today whether -falign-functions=32 is worth it for my i7-4790K.

It may just be a fluke of sysbench's testing patterns, but even with all the research @InBetweenNames has done, it seems that this optimization really depends on the combination of workload and CPU-internal optimizations and alignment/cache assumptions.

@firasuke

firasuke commented May 1, 2023

Is -falign-functions=32 actually profitable? Some research led me to:
