NUMA memory replication for NNUE weights #5285

Sopel97 · 2024-05-23T13:06:24Z

This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all the machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. The old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, such as making some classes unmovable and replacing raw pointers with unique_ptr.

Windows 7 and Windows 10 are partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.

A new UCI option NumaPolicy is introduced. It can take the following values:

system - gathers NUMA node information from the system (lscpu or the Windows API) and binds each thread to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - the default value; depends on the number of set threads and NUMA nodes, and only enables binding on multinode systems when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] - an explicit description of the NUMA nodes:
  ':'-separated NUMA nodes
  ','-separated cpu indices within each node
  supports "first-last" range syntax for cpu indices,
  for example '0-15,32-47:16-31,48-63'
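
To make the custom syntax concrete, here is a minimal C++ sketch of how such a string could be turned into one CPU set per NUMA node. It is not the parser from this patch: `split` and `parse_numa_nodes` are hypothetical names, and error handling is limited to what `std::stoi` provides plus a check for descending ranges.

```cpp
#include <cstddef>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the patch): splits `s` on `delim`.
static std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = s.find(delim, start);
        parts.push_back(s.substr(start, end == std::string::npos ? end : end - start));
        if (end == std::string::npos)
            break;
        start = end + 1;
    }
    return parts;
}

// Parses e.g. "0-15,32-47:16-31,48-63" into one set of cpu indices per node.
std::vector<std::set<int>> parse_numa_nodes(const std::string& spec) {
    std::vector<std::set<int>> nodes;
    for (const std::string& nodeStr : split(spec, ':')) {
        std::set<int> cpus;
        for (const std::string& item : split(nodeStr, ',')) {
            if (item.empty())
                continue;
            // "first-last" is a range; a bare number is a single cpu index.
            const std::size_t dash  = item.find('-');
            const int         first = std::stoi(item.substr(0, dash));
            const int         last =
              dash == std::string::npos ? first : std::stoi(item.substr(dash + 1));
            if (last < first)
                throw std::invalid_argument("descending cpu range: " + item);
            for (int c = first; c <= last; ++c)
                cpus.insert(c);
        }
        nodes.push_back(std::move(cpus));
    }
    return nodes;
}
```

With the example string above, this yields two nodes of 32 processors each: cpus 0-15 and 32-47 in the first, 16-31 and 48-63 in the second.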

Setting NumaPolicy forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.

The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.
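
The fill-based distribution can be sketched as follows. This is only an illustration of the idea, assuming each new thread goes to the node with the lowest occupied/size ratio and that ties go to the lower node index; `distribute_threads` is a hypothetical name and the patch's actual tie-breaking may differ.

```cpp
#include <cstddef>
#include <vector>

// nodeSizes[i] is the number of processors in NUMA node i (assumed > 0).
// Returns, for each thread in creation order, the node index it is assigned to.
std::vector<std::size_t> distribute_threads(const std::vector<std::size_t>& nodeSizes,
                                            std::size_t                     numThreads) {
    std::vector<std::size_t> occupancy(nodeSizes.size(), 0);
    std::vector<std::size_t> assignment;
    if (nodeSizes.empty())
        return assignment;

    for (std::size_t t = 0; t < numThreads; ++t) {
        std::size_t best = 0;
        for (std::size_t n = 1; n < nodeSizes.size(); ++n) {
            // Compare occupancy[n]/nodeSizes[n] < occupancy[best]/nodeSizes[best]
            // by cross-multiplication, which avoids floating point.
            if (occupancy[n] * nodeSizes[best] < occupancy[best] * nodeSizes[n])
                best = n;
        }
        ++occupancy[best];
        assignment.push_back(best);
    }
    return assignment;
}
```

For two equally sized nodes this degenerates to plain round-robin (0, 1, 0, 1, ...); with unequal nodes the larger node receives proportionally more threads.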

Special care is taken that maximum memory usage on systems that do not require memory replication stays the same as before; that is, unnecessary copies are avoided.

On Linux the process's processor affinity is respected. This means that if you, for example, use taskset to restrict Stockfish to a single NUMA node, then the system and auto settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.

We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on Linux, or using appropriate custom allocators on Windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on the first-touch policy. Because of this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on Linux, but results may vary.
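
As a rough illustration of relying on first touch, the sketch below binds a helper thread to a set of processors and only then allocates and writes the copy, so the pages are first touched by the bound thread. It is a simplified stand-in, not the patch's code: `bind_current_thread` and `replicate_on_node` are hypothetical names, binding is implemented only for Linux via `sched_setaffinity`, and whether the pages really end up on the intended node still depends on the allocator handing out untouched memory, as noted above.

```cpp
#include <thread>
#include <vector>

#if defined(__linux__)
    #include <sched.h>
#endif

// Best-effort binding of the calling thread to the given cpu indices.
// Returns false when binding is unsupported on this platform or fails.
bool bind_current_thread(const std::vector<int>& cpus) {
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus)
        if (cpu >= 0 && cpu < CPU_SETSIZE)
            CPU_SET(cpu, &set);
    // pid 0 means "the calling thread" for sched_setaffinity.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void) cpus;
    return false;
#endif
}

// Copies `src` on a helper thread that is bound before it allocates,
// so the allocation and the first write both happen on the bound thread.
std::vector<float> replicate_on_node(const std::vector<float>& src,
                                     const std::vector<int>&   nodeCpus) {
    std::vector<float> copy;
    std::thread        worker([&] {
        bind_current_thread(nodeCpus);         // failure is ignored: the copy is still valid
        copy.assign(src.begin(), src.end());  // allocate + first touch here
    });
    worker.join();
    return copy;
}
```

The same ordering (bind first, allocate afterwards) is what makes per-thread data local to the node as well.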

MacOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway.

Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022, NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set a thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinities spanning processor groups, so this splitting is not done and the behaviour is pretty much like on Linux.
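
The manual splitting can be pictured with the sketch below. It assumes, purely for illustration, that a processor's group is its index divided by 64 (a Windows processor group holds at most 64 logical processors); the real mapping comes from the Windows API, and `split_by_processor_group` is a hypothetical name.

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

constexpr int kProcessorsPerGroup = 64;  // capacity of a Windows processor group

// Splits one NUMA node's cpu set so that no piece spans two processor groups.
// Illustrative assumption: group index == cpu index / 64.
std::vector<std::set<int>> split_by_processor_group(const std::set<int>& nodeCpus) {
    std::map<int, std::set<int>> byGroup;
    for (int cpu : nodeCpus)
        byGroup[cpu / kProcessorsPerGroup].insert(cpu);

    std::vector<std::set<int>> pieces;
    for (auto& entry : byGroup)
        pieces.push_back(std::move(entry.second));
    return pieces;
}
```

A node covering cpus 60-67 would be split into two pieces, 60-63 and 64-67, each of which can be expressed as a single-group affinity.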

Linux is supported, without libnuma requirement. lscpu is expected.

Thanks to @Disservin for making this codebase actually nice to work on. I did not swear once.

vondele · 2024-05-23T18:39:30Z

Some quite extensive testing here, showing a very big impact on the relevant systems I can access. I use a 1 minute search from the startpos as a reference and test in 3 configurations: one in which the default system NUMA policies for memory allocation are used (i.e. if you execute as a user from the command line), and two in which we execute SF under numactl for some added understanding.

First is a dual socket (8 numa domain) system, 256 threads, measured average nps (18 runs, 40GB hash, 60s search):

===== default setup =====
master:  34204093
numa  :  92395667
===== under numactl --interleave =====
master:  90750582
numa  :  90127273
===== under numactl --localalloc =====
master:  34118968
numa  :  92882137

So 2.70x speedup for the average user (default setup master vs default setup numa), and more or less equivalent to what an experienced user with numactl would achieve. The gain over the explicit numactl is small (2%, might be noise).

Second is a quad socket (4 numa domain), 288 threads system, measured average nps (10 runs, 40GB hash, 60s search):

===== default setup =====
master:  107007301
numa  :  200065338
===== under numactl --interleave =====
master:  176905043
numa  :  168109122
===== under numactl --localalloc =====
master:  107393278
numa  :  200693449

Here the speedup is 1.87x for the average user, and 1.13x compared to a user that explicitly uses numactl.

So, these are impressive results. They might not translate to all systems though; most likely the system needs to be memory bandwidth limited (i.e. many cores, multiple sockets).

vondele · 2024-05-23T18:55:08Z

So, some work on the PR is still needed:

some debug output is still left (i.e. information on what's going on)
some CI failures (32-bit systems, IWYU)
formatting
possibly some additional tests in the CI testing that exercise at least the none, auto, and system options.
We also need to measure what the impact of the PR is on systems without NUMA (i.e. fishtest). Given the very large positive impact on some systems, it might be sufficient to just measure the Elo impact instead of our usual non-regression SPRT.
Right now, it looks like a failure to execute lscpu and parse its output leads to an exit, and the user explicitly has to set NumaPolicy to none. I think it is better to default to none on error. lscpu is Linux specific; some other OSes (e.g. standard unix/posix) might not have it.
Additional testing on old and other systems (e.g. older Windows, or Android).

src/numa.h

mstembera · 2024-05-25T22:26:39Z

If I understand this duplicates the instances of the global weights for each numa node. Does it also make sure the accumulator cache refreshTable (stored in the Worker class) is allocated on the same numa node as the corresponding thread accessing that refreshTable? The reason I ask is because it looks like the TCEC slowdown started after the Finny caches were added.

Sopel97 · 2024-05-25T22:34:15Z

> If I understand this duplicates the instances of the global weights for each numa node. Does it also make sure the accumulator cache refreshTable (stored in the Worker class) is allocated on the same numa node as the corresponding thread accessing that refreshTable? The reason I ask is because it looks like the TCEC slowdown started after the Finny caches were added.

The Worker in a thread is now allocated after the thread is bound (if it's bound) to a NUMA node, so under first touch local alloc policy it will be local to the NUMA node.

vondele · 2024-05-27T05:51:05Z

> If I understand this duplicates the instances of the global weights for each numa node. Does it also make sure the accumulator cache refreshTable (stored in the Worker class) is allocated on the same numa node as the corresponding thread accessing that refreshTable? The reason I ask is because it looks like the TCEC slowdown started after the Finny caches were added.

Prior to this PR, we had quite some discussion on Discord. The slowdown kicks in when the traffic between the NUMA domains becomes unsustainable, which depends strongly on the hardware. We did some measurements before this patch on dual socket epyc, with a 'proxy' NUMA SF, namely running a single multithreaded SF on all cores (and measuring the nodes searched in 100s), and running multiple SF (at proportionally reduced threads and hash), each bound to a subset of cores (using taskset) and summing their independently searched node count (replicated and distributed in the table below). As you can see, this slowdown has been ongoing for a while; basically all net arch changes increased the memory traffic. The accumulator caches did as well, and maybe that tipped the balance on the TCEC hardware, but it is not the sole cause.

                        sha                                             date                       replicated    distributed
                        dcb02337844d71e56df57b9a8ba17646f953711c        2024-05-15T16:27:03+02:00   8285760750   3228951636
                        49ef4c935a5cb0e4d94096e6354caa06b36b3e3c        2024-04-24T18:38:20+02:00   8512650027   3205113760
                        0716b845fdef8a20102b07eaec074b8da8162523        2024-04-02T08:49:48+02:00   8031071051   3947825697
                        bd579ab5d1a931a09a62f2ed33b5149ada7bc65f        2024-03-07T19:53:48+01:00   9402191002   5628777307
                        e67cc979fd2c0e66dfc2b2f2daa0117458cfc462        2024-02-24T18:15:04+01:00   9385258889   5637709761
                        8e75548f2a10969c1c9211056999efbcebe63f9a        2024-02-17T17:11:46+01:00   9370448391   6229551045
                        6deb88728fb141e853243c2873ad0cda4dd19320        2024-01-08T18:34:36+01:00   9270829179   6294669956
                        f12035c88c58a5fd568d26cde9868f73a8d7b839        2023-12-30T11:08:03+01:00   9298668878   6480843963
                        afe7f4d9b0c5e1a1aa224484d2cd9e04c7f099b9        2023-09-29T22:30:27+02:00   9116545925   7621648799
                        70ba9de85cddc5460b1ec53e0a99bee271e26ece        2023-09-22T19:26:16+02:00   9232522499   7934577768
                        3d1b067d853d6e8cc22cf18c1abb4cd9833dd38f        2023-09-11T22:37:39+02:00  10745625065  10080548329
                        e699fee513ce26b3794ac43d08826c89106e10ea        2023-07-06T23:03:58+02:00  10314206653   9200909264
                        915532181f11812c80ef0b57bc018de4ea2155ec        2023-07-01T13:34:30+02:00  10172981818   8601868764
                        ef94f77f8c827a2395f1c40f53311a3b1f20bc5b        2023-07-01T12:59:28+02:00  12178040219  12383770094
                        68e1e9b3811e16cad014b590d7443b9063b3eb52        2023-06-29T08:00:10+02:00  12224782828  12547603344
                        a49b3ba7ed5d9be9151c8ceb5eed40efe3387c75        2023-06-22T10:33:19+02:00  12121675593  12534934893
                        932f5a2d657c846c282adcf2051faef7ca17ae15        2023-06-11T15:23:52+02:00  10867654008  11345689421
                        373359b44d0947cce2628a9a8c9b432a458615a8        2023-06-06T21:17:36+02:00  10815824714  11234294735
                        c1fff71650e2f8bf5a2d63bdc043161cdfe8e460        2023-05-31T08:51:22+02:00  10701105666  11037076947
                        41f50b2c83a0ba36a2b9c507c1783e57c9b13485        2023-04-25T08:19:00+02:00  13868959994  14372295914
                        758f9c9350abee36a5865ec701560db8ea62004d        2022-12-04T14:17:15+01:00  13717955194  14086251908
                        e6e324eb28fd49c1fc44b3b65784f85a773ec61c        2022-04-18T22:03:20+02:00  12800300761  12783531121
                        7262fd5d14810b7b495b5038e348a448fda1bcc3        2021-10-28T07:38:19+02:00  13316077078  13356783925

see also https://discord.com/channels/435943710472011776/813919248455827515/1240909395433619497

vondele · 2024-05-27T06:16:50Z

Some more discussion in discord, some vtune measurements to show obvious improvement. (sqrmax)
https://discord.com/channels/435943710472011776/813919248455827515/1244454210028834866

Good improvement in numa BW usage

Disservin · 2024-05-27T19:17:35Z

Fixed some CI issues and made some code style more consistent. Also made std::thread::hardware_concurrency() be saved in SYSTEM_THREADS_NB and aliased any mention of the number 64 with regard to Windows as PROC_IN_GRP_NB. If you have better-suited names in mind, let me know (or just change it).

vondele · 2024-05-27T21:26:22Z

on hardware where this patch is really beneficial

passed 60+1 @ 256t 16000MB hash
https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0

vondele · 2024-05-28T05:12:59Z

normal case, where the patch should not kick in, passes no regression:

passed SMP STC
https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57

passed STC
https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416

However, some issues compiling (profile-building) the code showed up:
https://tests.stockfishchess.org/actions?action=failed_task&user=&text=sopel97
https://tests.stockfishchess.org/actions?action=failed_task&user=linrock&text=sopel97

The latter need to be understood before merging.

snicolet · 2024-05-28T07:15:20Z

great job, congrats Sopel !

Sopel97 · 2024-05-28T11:54:35Z

Final touches made. Android excluded. NumaReplicatedBase no longer requires storing a unique ID. Made the condition for doBindThreads clearer by making it structural. A few comments.

Leaving this commit separately for now for easier review. Will squash everything once there's approval.

vondele · 2024-05-28T12:57:52Z

looks good to me, please fix the include and squash, ideally with a commit message ready for merging that includes the testing results.

I'll try a bit later today if android and the corner case we identified above are working as advertised, but otherwise this seems ready for a merge.

Sopel97 · 2024-05-28T13:17:56Z

> However, some issues compiling (profile-building) the code showed up:
> https://tests.stockfishchess.org/actions?action=failed_task&user=&text=sopel97
> https://tests.stockfishchess.org/actions?action=failed_task&user=linrock&text=sopel97

regarding this, can't see the actual error, so no way to fix it really, whatever it is

IWYU wants me to include a non-standard header, what to do?

Disservin · 2024-05-28T13:41:21Z

does including #include <system_error> fix it?

Sopel97 · 2024-05-28T13:42:30Z

> does including #include <system_error> fix it?

it is included

misc.cpp should add these lines:
#include <__system_error/errc.h>  // for std::errc
misc.cpp should remove these lines:
- #include <system_error>  // lines 60-60

Disservin · 2024-05-28T14:02:52Z

if it keeps complaining about something silly keep the import with // IWYU pragma: keep behind the include line
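
For reference, the pragma is written as a trailing comment on the include line itself. A minimal sketch follows; the `describe` function is only a hypothetical user of `std::errc` so that the include is actually needed, and whether the pragma silences this particular suggestion depends on the IWYU version and mapping files in use.

```cpp
#include <string>
#include <system_error>  // IWYU pragma: keep

// std::errc is declared by the standard header <system_error>; the internal
// libc++ header suggested by the tool is not meant to be included directly.
std::string describe(std::errc e) {
    return std::make_error_code(e).message();
}
```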

vondele · 2024-05-28T14:03:07Z

> > However, some issues compiling (profile-building) the code showed up:
> > https://tests.stockfishchess.org/actions?action=failed_task&user=&text=sopel97
> > https://tests.stockfishchess.org/actions?action=failed_task&user=linrock&text=sopel97
>
> regarding this, can't see the actual error, so no way to fix it really, whatever it is

on discord:


$ taskset --cpu-list 280-287 ~/Stockfish/src/stockfish uci
Stockfish dev-20240527-ae1140d9 by the Stockfish developers (see AUTHORS file)
$

seems to happen when taskset limits the cpus to only cpus that are not on numa domain 0. In that case, the code seems to exit.

Sopel97 · 2024-05-28T14:59:35Z

sorry for making a bit of a mess, didn't realize I was pushing to the PR branch for testing

> However, some issues compiling (profile-building) the code showed up:
> https://tests.stockfishchess.org/actions?action=failed_task&user=&text=sopel97
> https://tests.stockfishchess.org/actions?action=failed_task&user=linrock&text=sopel97
>
> The latter need to be understood before merging.

The bug was caused by empty NUMA nodes. Instead of handling them (tried it, gets messy, and still couldn't hunt some issues) I decided it would be better to completely remove empty NUMA nodes, as we don't guarantee the indices match anyway.

I can't replicate the issue any longer. It used to std::exit due to inconsistent state whenever taskset was used to run on a specific node, causing node 0 to appear empty.
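
The effect of dropping empty nodes can be sketched like this. It is an illustration rather than the code in the patch: `restrict_to_affinity` is a hypothetical name, and the node layout in the usage note is made up.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Keeps only the processors allowed by the affinity mask and drops every node
// that ends up empty, so the remaining nodes are renumbered from 0.
std::vector<std::set<int>> restrict_to_affinity(const std::vector<std::set<int>>& nodes,
                                                const std::set<int>&              allowed) {
    std::vector<std::set<int>> result;
    for (const std::set<int>& node : nodes) {
        std::set<int> kept;
        std::set_intersection(node.begin(), node.end(), allowed.begin(), allowed.end(),
                              std::inserter(kept, kept.begin()));
        if (!kept.empty())
            result.push_back(std::move(kept));
    }
    return result;
}
```

For example, with nodes {0-3} and {4-7} and an affinity mask of {4, 5}, the result is a single node {4, 5} at index 0, which is why node indices are not guaranteed to match the system's numbering.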

Allow for NUMA memory replication for NNUE weights. Bind threads to ensure execution on a specific NUMA node.

This patch introduces NUMA memory replication, currently only utilized for the NNUE weights. Along with it comes all machinery required to identify NUMA nodes and bind threads to specific processors/nodes. It also comes with small changes to Thread and ThreadPool to allow easier execution of custom functions on the designated thread. Old thread binding (WinProcGroup) machinery is removed because it's incompatible with this patch. Small changes to unrelated parts of the code were made to ensure correctness, like some classes being made unmovable, raw pointers replaced with unique_ptr, etc.

Windows 7 and Windows 10 are partially supported. Windows 11 is fully supported. Linux is fully supported, with explicit exclusion of Android. No additional dependencies.

-----------------

A new UCI option `NumaPolicy` is introduced. It can take the following values:

```
system - gathers NUMA node information from the system (lscpu or windows api), binds each thread to a single NUMA node
none - assumes there is 1 NUMA node, never binds threads
auto - this is the default value, depends on the number of set threads and NUMA nodes, will only enable binding on multinode systems and when the number of threads reaches a threshold (dependent on node size and count)
[[custom]] -
  // ':'-separated numa nodes
  // ','-separated cpu indices
  // supports "first-last" range syntax for cpu indices,
  for example '0-15,32-47:16-31,48-63'
```

Setting `NumaPolicy` forces recreation of the threads in the ThreadPool, which in turn forces the recreation of the TT.

The threads are distributed among NUMA nodes in a round-robin fashion based on fill percentage (i.e. it will strive to fill all NUMA nodes evenly). Threads are bound to NUMA nodes, not specific processors, because that's our only requirement and the OS can schedule them better.

Special care is taken that maximum memory usage on systems that do not require memory replication stays as previously, that is, unnecessary copies are avoided.

On linux the process' processor affinity is respected. This means that if you for example use taskset to restrict Stockfish to a single NUMA node then the `system` and `auto` settings will only see a single NUMA node (more precisely, the processors included in the current affinity mask) and act accordingly.

-----------------

We can't ensure that a memory allocation takes place on a given NUMA node without using libnuma on linux, or using appropriate custom allocators on windows (https://learn.microsoft.com/en-us/windows/win32/memory/allocating-memory-from-a-numa-node), so to avoid complications the current implementation relies on first-touch policy. Due to this we also rely on the memory allocator to give us a new chunk of untouched memory from the system. This appears to work reliably on linux, but results may vary.

MacOS is not supported, because AFAIK it's not affected, and implementation would be problematic anyway.

Windows is supported since Windows 7 (https://learn.microsoft.com/en-us/windows/win32/api/processtopologyapi/nf-processtopologyapi-setthreadgroupaffinity). Until Windows 11/Server 2022 NUMA nodes are split such that they cannot span processor groups. This is because before Windows 11/Server 2022 it's not possible to set thread affinity spanning processor groups. The splitting is done manually in some cases (required after Windows 10 Build 20348). Since Windows 11/Server 2022 we can set affinities spanning processor groups so this splitting is not done, so the behaviour is pretty much like on linux.

Linux is supported, **without** libnuma requirement. `lscpu` is expected.

-----------------

Passed 60+1 @ 256t 16000MB hash:
https://tests.stockfishchess.org/tests/view/6654e443a86388d5e27db0d8
```
LLR: 2.95 (-2.94,2.94) <0.00,10.00>
Total: 278 W: 110 L: 29 D: 139
Ptnml(0-2): 0, 1, 56, 82, 0
```

Passed SMP STC:
https://tests.stockfishchess.org/tests/view/6654fc74a86388d5e27db1cd
```
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 67152 W: 17354 L: 17177 D: 32621
Ptnml(0-2): 64, 7428, 18408, 7619, 57
```

Passed STC:
https://tests.stockfishchess.org/tests/view/6654fb27a86388d5e27db15c
```
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 131648 W: 34155 L: 34045 D: 63448
Ptnml(0-2): 426, 13878, 37096, 14008, 416
```

fixes official-stockfish#5253
closes official-stockfish#5285

No functional change

Official release version of Stockfish 17

Bench: 1484730

---

Stockfish 17

Today we have the pleasure to announce a new major release of Stockfish. As always, you can freely download it at https://stockfishchess.org/download and use it in the GUI of your choice.

Don’t forget to join our Discord server[1] to get in touch with the community of developers and users of the project!

*Quality of chess play*

In tests against Stockfish 16, this release brings an Elo gain of up to 46 points[2] and wins up to 4.5 times more game pairs[3] than it loses. In practice, high-quality moves are now found in less time, with a user upgrading from Stockfish 14 being able to analyze games at least 6 times[4] faster with Stockfish 17 while maintaining roughly the same quality.

During this development period, Stockfish won its 9th consecutive first place in the main league of the Top Chess Engine Championship (TCEC)[5], and the 24th consecutive first place in the main events (bullet, blitz, and rapid) of the Computer Chess Championship (CCC)[6].

*Update highlights*

*Improved engine lines*

This release introduces principal variations (PVs) that are more informative for mate and decisive table base (TB) scores. In both cases, the PV will contain all moves up to checkmate. For mate scores, the PV shown is the best variation known to the engine at that point, while for table base wins, it follows, based on the TB, a sequence of moves that preserves the game outcome to checkmate.

*NUMA performance optimization*

For high-end computers with multiple CPUs (typically a dual-socket architecture with 100+ cores), this release automatically improves performance with a `NumaPolicy` setting that optimizes non-uniform memory access (NUMA). Although typical consumer hardware will not benefit, speedups of up to 2.8x[7] have been measured.

*Shoutouts*

*ChessDB*

During the past 1.5 years, hundreds of cores have been continuously running Stockfish to grow a database of analyzed positions. This chess cloud database[8] now contains well over 45 billion positions, providing excellent coverage of all openings and commonly played lines. This database is already integrated into GUIs such as En Croissant[9] and Nibbler[10], which access it through the public API.

*Leela Chess Zero*

Generally considered to be the strongest GPU engine, it continues to provide open data which is essential for training our NNUE networks. They released version 0.31.1[11] of their engine a few weeks ago, check it out!

*Website redesign*

Our website has undergone a redesign in recent months, most notably in our home page[12], now featuring a darker color scheme and a more modern aesthetic, while still maintaining its core identity. We hope you'll like it as much as we do!

*Thank you*

The Stockfish project builds on a thriving community of enthusiasts (thanks everybody!) who contribute their expertise, time, and resources to build a free and open-source chess engine that is robust, widely available, and very strong.

We would like to express our gratitude for the 11k stars[13] that light up our GitHub project! Thank you for your support and encouragement – your recognition means a lot to us.

We invite our chess fans to join the Fishtest testing framework[14] to contribute compute resources needed for development. Programmers can contribute to the project either directly to Stockfish[15] (C++), to Fishtest[16] (HTML, CSS, JavaScript, and Python), to our trainer nnue-pytorch[17] (C++ and Python), or to our website[18] (HTML, CSS/SCSS, and JavaScript).

The Stockfish team

[1] https://discord.gg/GWDRS3kU6R
[2] https://tests.stockfishchess.org/tests/view/66d738ba9de3e7f9b33d159a
[3] https://tests.stockfishchess.org/tests/view/66d738f39de3e7f9b33d15a0
[4] https://github.com/official-stockfish/Stockfish/wiki/Useful-data#equivalent-time-odds-and-normalized-game-pair-elo
[5] https://en.wikipedia.org/wiki/Stockfish_(chess)#Top_Chess_Engine_Championship
[6] https://en.wikipedia.org/wiki/Stockfish_(chess)#Chess.com_Computer_Chess_Championship
[7] official-stockfish#5285
[8] https://chessdb.cn/queryc_en/
[9] https://encroissant.org/
[10] https://github.com/rooklift/nibbler
[11] https://github.com/LeelaChessZero/lc0/releases/tag/v0.31.1
[12] https://stockfishchess.org/
[13] https://github.com/official-stockfish/Stockfish/stargazers
[14] https://github.com/official-stockfish/fishtest/wiki/Running-the-worker
[15] https://github.com/official-stockfish/Stockfish
[16] https://github.com/official-stockfish/fishtest
[17] https://github.com/official-stockfish/nnue-pytorch
[18] https://github.com/official-stockfish/stockfish-web

Sopel97 force-pushed the numareplicationv3 branch 6 times, most recently from 6bd321f to 91a511d (May 25, 2024 10:39)

Disservin reviewed May 25, 2024

Sopel97 force-pushed the numareplicationv3 branch 3 times, most recently from a841c27 to c6ca84d (May 25, 2024 19:54)

Sopel97 marked this pull request as ready for review May 28, 2024 11:54

Sopel97 changed the title ~~[WIP] NUMA memory replication for NNUE weights~~ NUMA memory replication for NNUE weights May 28, 2024

Sopel97 force-pushed the numareplicationv3 branch from 245fb86 to e837721 (May 28, 2024 13:13)

Sopel97 force-pushed the numareplicationv3 branch 2 times, most recently from a027ef9 to f525045 (May 28, 2024 13:34)

Sopel97 force-pushed the numareplicationv3 branch 3 times, most recently from a9799d6 to 842d5ca (May 28, 2024 13:53)

Sopel97 force-pushed the numareplicationv3 branch from d4f8fa2 to 0047103 (May 28, 2024 14:57)

Sopel97 force-pushed the numareplicationv3 branch from 0047103 to 5325384 (May 28, 2024 15:05)

vondele force-pushed the numareplicationv3 branch 3 times, most recently from e43a34d to a1086b7 (May 28, 2024 16:06)

WIP

59ae284

vondele force-pushed the numareplicationv3 branch from a1086b7 to 59ae284 (May 28, 2024 16:07)

ci shall be fixed

934d6e3

vondele closed this in a169c78 May 28, 2024

Disservin mentioned this pull request Aug 20, 2024

Split Glue File & Prepare for Sf 17 lichess-org/lila-stockfish-web#2

Merged
