MPI/Cluster implementation for Stockfish #1571
Conversation
Based on Peter Österlund's "Lazy Cluster" algorithm, but with some simplifications. To compile, point COMPCXX to the MPI C++ compiler wrapper (mpicxx). To run, execute mpiexec -n <NODES> /path/to/stockfish. To ensure that one process is launched per node or per socket/NUMA node (to use threads locally), it may be necessary to add implementation-specific options (e.g. --map-by ppr:1:node in Open MPI, or -ppn 1 for MPICH). This isn't quite the cleanest, but it appears to work.
I'm still working on testing the changes—I can't use fishtest directly, since it isn't set up to compile with MPI and doesn't have access to a cluster. I could use cutechess-cli to run a bunch of tests, if I'm given the correct commands (equivalent to what is done in fishtest). I'd appreciate some help in evaluating the performance improvement of using multiple nodes!
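For concreteness, a hypothetical launch on a small cluster might look like the following sketch (the node count and binary path are placeholders; the mapping flag is the Open MPI syntax mentioned above):

    # Hypothetical example: 4 MPI ranks, one per node (Open MPI syntax).
    # Each rank can then use its local cores via the usual Threads UCI option.
    mpiexec -n 4 --map-by ppr:1:node /path/to/stockfish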
Interesting! Since it appears to compile just fine in the absence of MPI, what about doing a first run in fishtest to measure the elo difference (e.g. due to overhead) with respect to master? This could be done with a fixed number of games (e.g. 60000) if you expect overhead and want to know how much, or SPRT [-3,1] if you're sure there is no overhead (maybe even [-4,0] for this kind of change) and you want to get a pass/fail answer. To test using cutechess-cli you can use something like:
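A plausible invocation along those lines (engine names, paths, and game counts are placeholders, not the exact command from the comment) might be:

    cutechess-cli \
      -engine name=master cmd=./stockfish-master \
      -engine name=patch cmd=./stockfish-patch \
      -each proto=uci tc=10+0.1 option.Hash=16 option.Threads=1 \
      -openings file=2moves_v1.pgn format=pgn order=random \
      -games 2 -rounds 500 -repeat -concurrency 4 \
      -pgnout out.pgn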
(the opening book 2moves_v1.pgn can be found on github, or in the fishtest/worker/testing directory after it participated in games). An interesting MPI test would be to see how performance (elo) improves, comparing master on one node (or socket) against your branch on N nodes. Ideally you get a roughly 50 elo improvement going from 1 to 2 nodes (and a similar, but decreasing, gain for every further doubling of nodes).
There should be no overhead (all the functions are inlined and do nothing, or are constexpr and return a constant value—the compiler should be able to optimize it out completely). However, using fishtest to confirm this is probably a good idea. The test I want to run is indeed to compare performance with MPI on N nodes against master on 1 node. Scaling should be weaker than with threads (as there will be some duplicated work and communication overhead), but the overall performance should (hopefully) be better than the 1-node case (if it isn't, that means there's unacceptable overhead).
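To illustrate the zero-overhead claim, here is a minimal sketch of the pattern being described; the macro and function names are purely illustrative, not the patch's actual interface. Without MPI, the cluster calls collapse to no-ops and compile-time constants that the optimizer can remove entirely.

    // Hypothetical sketch, not the real cluster interface from the patch.
    #ifdef USE_MPI
    namespace Cluster {
      void init(int argc, char* argv[]);   // real MPI implementation elsewhere
      int  size();                         // number of MPI ranks
      int  rank();                         // this process's rank
    }
    #else
    namespace Cluster {
      inline    void init(int, char*[]) {} // does nothing without MPI
      constexpr int  size() { return 1; }  // a single "node"
      constexpr int  rank() { return 0; }  // always the root process
    }
    #endif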
I did a (quick) 10-game match of master versus MPI. cutechess-cli output warnings of the type:
I'm wondering why that is—and if it can just be ignored.
One issue I'm clearly seeing (on a 100-game match) is that the move reduction must be rethought. MPI is often losing on time or drawing by 3-fold repetition (which I believe should be much rarer than I'm seeing—about 15% of the games). MPI hasn't won once, so I think something is wrong. One thing I did change from the earlier test is that I'm using 24 threads instead of 8 (the nodes have 2-socket 12-core Intel Xeons). In the prior 10-game test, the final result was 3-3-4.
@omor1 the warning isn't good... SF normally has no illegal moves in the principal variation (PV). I'm guessing this implies a bug somewhere. You might want to recompile once with the make option debug=yes (after a make clean). The high 3-fold rate is presumably OK; I don't know the number, but that doesn't sound impossible. Losing on time... well, at the short time control (10s+0.1s) this should actually be expected for an MPI parallel code. I guess if SF internally signals a stop, it might take far longer than 30 ms to return a move (a single-threaded code already takes ~10 ms on a non-idle machine). So you might want to set the UCI option 'Move Overhead' to something like 1000 (=1s) instead of 30.
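For reference, that option can be set over the UCI interface before play starts (the option name is standard in Stockfish; 1000 is just the value suggested above), and cutechess-cli can pass it through its option.NAME=VALUE syntax:

    setoption name Move Overhead value 1000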
Is there some chance you prove that this is not totally useless (as I think it is)? SPRT[-3, 1] to validate 300 lines of useless code seems like a joke to me, but I prefer to stay quiet in these times; I am confident in Stephan.
As it is currently written, it's almost definitely broken. I think that in isolation each move is correct, but that there is some cumulative effect over time. There's also the issue that @vondele mentioned—that the move overhead is higher, so that without setting the move overhead correctly it can lose on time. I haven't yet had the time to look at it in detail to work out the kinks. The SPRT[-3, 1] is to verify that it has no effect when NOT compiling with MPI, so that it doesn't affect current users. I've submitted a test to fishtest to verify this. If you think it isn't useful to test it right now (since the code is broken anyway), you can go ahead and cancel it.
Obviously current fishtest won't be able to run your MPI build and measure performance. |
@noobpwnftw it was never meant to—the current test is to ensure that there isn't a regression in performance as compared to master when the cluster code is compiled without MPI. There shouldn't be any performance impact, but it's important to make sure that that is indeed the case.
@omor1 the point is that you verify there is no regression when not using MPI, without first describing why this patch is supposed to be valuable. In your description I can't find:
You miss all of the above and just say, "Hey, I have this fancy 300 lines of new stuff, this does not regress! Great news, please apply". You started from the end instead of from the beginning.
@mcostalba obviously the value in the patch lies in going beyond shared-memory parallelism (i.e. when threading can't work, on a cluster). So this has nothing to do with 'NUMA' as we currently have it. What it does (potentially) allow is to run SF on 1000s of cores, given the cluster hardware. Obviously fishtest can't be used to test the value of this patch; at best it can be used to verify there is no impact for those who don't compile with MPI enabled. I started the test yesterday (before this discussion), and it doesn't regress at [-4,0] bounds. Now, I agree that this PR doesn't contain any data yet to show that the current patch is correct in parallel (the PV warning above shows it likely is not), nor that the performance is good in parallel. Once it looks like the code is bug-free, I'd be happy to play a few matches to see how it scales.
Again, I can hardly see a reason to merge this patch directly into this repository rather than into a fork. It has to be bug-free before any meaningful performance tests take place. To run performance tests, you could start with a few nodes and get a grip on how the test procedure should go (it has to communicate with the engines remotely, I guess); if things look promising, I can arrange some tests on a few hundred nodes.
Let me answer this for you.
The title of the PR says it all.
It has nothing to do with NUMA.
I would like to add that this has nothing to do with NUMA, but I also doubt that it is a good idea to merge this fork into SF master. A cluster version of SF is probably better off on its own.
@vdbergh if you had written "Google for it", you would have addressed it in exactly the same way.
I think that Bujun Guo's suggestion to use a fork (or a separate branch in official-stockfish) is the wisest thing to do here. This could be a nice research project to construct a massively parallel Stockfish running on a Chinese cluster, in order, for instance, to win the World Computer Chess Championships in Stockholm (July 2018), see https://icga.leidenuniv.nl/?p=2281. Or a rematch against AlphaGo, anyone? But I suppose that in both cases it will need some debugging before being really competitive :-)
@omor1 I have created a new branch on official-stockfish with this state of the MPI/Cluster code (basically the code in the pull request rebased on the master commit "Tweak the connected[] array value for pawns on rank 5" of May 3rd, 2018). The branch is named cluster: https://github.com/official-stockfish/Stockfish/tree/cluster. This is the method we have used in the past when developing important new features; for instance, the LazySMP implementation was developed in a parallel branch for months before being incorporated into the master branch. Here are some git commands for people interested in testing this MPI/Cluster idea:
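A plausible sequence for fetching and building the branch (not the exact commands from the comment; the ARCH value is only an example) might be:

    git clone https://github.com/official-stockfish/Stockfish.git
    cd Stockfish
    git checkout cluster
    cd src
    make clean
    make build ARCH=x86-64-modern COMPCXX=mpicxx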
I have added a section in the FAQ wiki about this MPI/Cluster branch, with the same instructions, to help people pull the branch. I understand that this is only a very minimalist start for such a nice idea; the branch definitely deserves some more documentation and exposure to get into a mature state. Please provide any ideas to help Omri :-)