MPI/Cluster implementation for Stockfish #1571
Conversation
Based on Peter Österlund's "Lazy Cluster" algorithm, but with some simplifications. To compile, point COMPCXX to the MPI C++ compiler wrapper (mpicxx). To run, execute mpiexec -n <NODES> /path/to/stockfish. To ensure that one process is launched per node or per socket/NUMA node (to use threads locally), it may be necessary to add implementation-specific options (e.g. --map-by ppr:1:node in Open MPI, or -ppn 1 for MPICH). This isn't quite the cleanest, but it appears to work.
I'm still working on testing the changes—I can't use fishtest directly, since it isn't set up to compile with MPI and doesn't have access to a cluster. I could use cutechess-cli to run a bunch of tests, if I'm given the correct commands (equivalent to what is done in fishtest). I'd appreciate some help in evaluating the performance improvement of using multiple nodes!
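For concreteness, a hypothetical launch on a small cluster might look like the following sketch (the node count and binary path are placeholders; the mapping flag is the Open MPI syntax mentioned above):

    # Hypothetical example: 4 MPI ranks, one per node (Open MPI syntax).
    # Each rank can then use its local cores via the usual Threads UCI option.
    mpiexec -n 4 --map-by ppr:1:node /path/to/stockfish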
Interesting! Since it appears to compile just fine in the absence of MPI, what about doing a first run in fishtest to measure the elo difference (e.g. due to overhead) with respect to master? This could be done with a fixed number of games (e.g. 60000) if you expect overhead and want to know how much, or SPRT [-3,1] if you're sure there is no overhead (maybe even [-4,0] for this kind of change) and you want to get a pass/fail answer. To test using cutechess-cli you can use something like:
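A plausible invocation along those lines (engine names, paths, and game counts are placeholders, not the exact command from the comment) might be:

    cutechess-cli \
      -engine name=master cmd=./stockfish-master \
      -engine name=patch cmd=./stockfish-patch \
      -each proto=uci tc=10+0.1 option.Hash=16 option.Threads=1 \
      -openings file=2moves_v1.pgn format=pgn order=random \
      -games 2 -rounds 500 -repeat -concurrency 4 \
      -pgnout out.pgn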
(the opening book 2moves_v1.pgn can be found on github, or in the fishtest/worker/testing directory after it participated in games). An interesting MPI test would be to see how performance (elo) improves, comparing master on one node (or socket) against your branch on N nodes. Ideally you get a roughly 50 elo improvement going from 1 to 2 nodes (and a similar, but decreasing, gain for every further doubling of nodes).
There should be no overhead (all the functions are inlined and do nothing, or are constexpr and return a constant value—the compiler should be able to optimize it out completely). However, using fishtest to confirm this is probably a good idea. The test I want to run is indeed to compare performance with MPI on N nodes against master on 1 node. Scaling should be weaker than with threads (as there will be some duplicated work and communication overhead), but the overall performance should (hopefully) be better than the 1-node case (if it isn't, that means there's unacceptable overhead).
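To illustrate the zero-overhead claim, here is a minimal sketch of the pattern being described; the macro and function names are purely illustrative, not the patch's actual interface. Without MPI, the cluster calls collapse to no-ops and compile-time constants that the optimizer can remove entirely.

    // Hypothetical sketch, not the real cluster interface from the patch.
    #ifdef USE_MPI
    namespace Cluster {
      void init(int argc, char* argv[]);   // real MPI implementation elsewhere
      int  size();                         // number of MPI ranks
      int  rank();                         // this process's rank
    }
    #else
    namespace Cluster {
      inline    void init(int, char*[]) {} // does nothing without MPI
      constexpr int  size() { return 1; }  // a single "node"
      constexpr int  rank() { return 0; }  // always the root process
    }
    #endif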
I did a (quick) 10-game match of master versus MPI. cutechess-cli output warnings of the type:
I'm wondering why that is—and if it can just be ignored.
One issue I'm clearly seeing (on a 100-game match) is that the move reduction must be rethought. MPI is often losing on time or drawing by 3-fold repetition (which I believe should be much rarer than I'm seeing—about 15% of the games). MPI hasn't won once, so I think something is wrong. One thing I did change from the earlier test is that I'm using 24 threads instead of 8 (the nodes have 2-socket 12-core Intel Xeons). In the prior 10-game test, the final result was 3-3-4.
@omor1 the warning isn't good... SF normally has no illegal moves in the principal variation (PV). I'm guessing this implies a bug somewhere. You might want to recompile once with the make option debug=yes (after a make clean). The high 3-fold rate is presumably OK; I don't know the number, but that doesn't sound impossible. Losing on time... well, at the short time control (10s+0.1s) this should actually be expected for an MPI parallel code. I guess if SF internally signals a stop, it might take far longer than 30 ms to return a move (a single-threaded code already takes ~10 ms on a non-idle machine). So you might want to set the UCI option 'Move Overhead' to something like 1000 (=1s) instead of 30.
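For reference, that option can be set over the UCI interface before play starts (the option name is standard in Stockfish; 1000 is just the value suggested above), and cutechess-cli can pass it through its option.NAME=VALUE syntax:

    setoption name Move Overhead value 1000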
Is there some chance you prove that this is not totally useless (as I think it is)? SPRT[-3, 1] to validate 300 lines of useless code seems like a joke to me, but I prefer to stay quiet in these times; I am confident in Stephan.
As it is currently written, it's almost definitely broken. I think that in isolation each move is correct, but that there is some cumulative effect over time. There's also the issue that @vondele mentioned—that the move overhead is higher, so that without setting the move overhead correctly it can lose on time. I haven't yet had the time to look at it in detail to work out the kinks. The SPRT[-3, 1] is to verify that it has no effect when NOT compiling with MPI, so that it doesn't affect current users. I've submitted a test to fishtest to verify this. If you think it isn't useful to test it right now (since the code is broken anyway), you can go ahead and cancel it.
Obviously current fishtest won't be able to run your MPI build and measure performance. |
@noobpwnftw it was never meant to—the current test is to ensure that there isn't a regression in performance as compared to master when the cluster code is compiled without MPI. There shouldn't be any performance impact, but it's important to make sure that that is indeed the case.
@omor1 the point is that you verify there is no regression when not using MPI, without first describing why this patch is supposed to be valuable. In your description I can't find:
You miss all of the above and just say, "Hey, I have this fancy 300 lines of new stuff, this does not regress! Great news, please apply". You started from the end instead of from the beginning.
@mcostalba obviously the value in the patch lies in going beyond shared-memory parallelism (i.e. when threading can't work, on a cluster). So this has nothing to do with 'NUMA' as we currently have it. What it does (potentially) allow is to run SF on 1000s of cores, given the cluster hardware. Obviously fishtest can't be used to test the value of this patch; at best it can be used to verify there is no impact for those who don't compile with MPI enabled. I started the test yesterday (before this discussion), and it doesn't regress at [-4,0] bounds. Now, I agree that this PR doesn't contain any data yet to show that the current patch is correct in parallel (the PV warning above shows it likely is not), nor that the performance is good in parallel. Once it looks like the code is bug-free, I'd be happy to play a few matches to see how it scales.
Again, I can hardly see a reason to merge this patch directly into this repository rather than into a fork. It has to be bug-free before any meaningful performance tests take place. To run performance tests, you could start with a few nodes and get a grip on how the test procedure should go (it has to communicate with the engines remotely, I guess); if things look promising, I can arrange some tests on a few hundred nodes.
Let me answer this for you.
The title of the PR says it all.
It has nothing to do with NUMA.
I would like to add that this has nothing to do with NUMA, but I also doubt that it is a good idea to merge this fork into SF master. A cluster version of SF is probably better off on its own.
@vdbergh if you had written "Google for it", you would have addressed it in exactly the same way.
I think that Bujun Guo's suggestion to use a fork (or a separate branch in official-stockfish) is the wisest thing to do here. This could be a nice research project to construct a massively parallel Stockfish running on a Chinese cluster, in order, for instance, to win the World Computer Chess Championships in Stockholm (July 2018), see https://icga.leidenuniv.nl/?p=2281. Or a rematch against AlphaGo, anyone? But I suppose that in both cases it will need some debugging before being really competitive :-)
@omor1 I have created a new branch on official-stockfish with this state of the MPI/Cluster code (basically the code in the pull request rebased on the master commit "Tweak the connected[] array value for pawns on rank 5" of May 3rd, 2018). The branch is named cluster: https://github.com/official-stockfish/Stockfish/tree/cluster. This is the method we have used in the past when developing important new features; for instance, the LazySMP implementation was developed in a parallel branch for months before being incorporated into the master branch. Here are some git commands for people interested in testing this MPI/Cluster idea:
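A plausible sequence for fetching and building the branch (not the exact commands from the comment; the ARCH value is only an example) might be:

    git clone https://github.com/official-stockfish/Stockfish.git
    cd Stockfish
    git checkout cluster
    cd src
    make clean
    make build ARCH=x86-64-modern COMPCXX=mpicxx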
I have added a section in the FAQ wiki about this MPI/Cluster branch, with the same instructions, to help people pull the branch. I understand that this is only a very minimalist start for such a nice idea; the branch definitely deserves some more documentation and exposure to get into a mature state. Please provide any ideas to help Omri :-)