
Neural Networks

Lyudmil Antonov edited this page Feb 20, 2017 · 1 revision

tthsqe12 commented 26 days ago

A cluster version has little to do with asm programming. What matters there is linking different computers so that tasks can be distributed and results can be collected. A cluster version would also not be tied to a specific UCI engine unless some special sync commands were introduced.

Ipmanchess commented 26 days ago (edited)

Many people don't know me, but this is what I wanted to read. For years I have been a private tester, always in the background. What I like is getting all the top engines for testing; I have built up this trust with these chess programmers over more than 20 years. What I have done I can't write here; many would be surprised or simply not believe it, but that doesn't matter. I just wanted chess to get stronger and to enjoy my hobby! When I see someone new programming in the chess world, I follow him, and I remember very well how much criticism you got, or plain disbelief that a complete rewrite of Stockfish into assembler was possible. Well, you did it!! Every time I saw some progress in your asmFish I tried it; always crashes and bugs, but still you continued and didn't give up. I asked you for NUMA; you gave me a test version, and two versions later I had a fully working asmFish NUMA without touching this dual Xeon system!! The same happened with Texel NUMA from Peter Österlund: just a few test versions and it worked! The next step would be Cluster; it is just a name, I know, but if you know a way to link different computers with each other that would be great, and then to come to this AI or these neural networks!! Fantastic project!! If I had not been pushing and asking you for NUMA, then Stockfish and other engines would not have it at all! No one was interested when I talked about it during TCEC, but now the people with dual Xeon systems are very happy with it; the system now gets stressed to 100%. Before, people thought they saw high nodes/sec, but the system was not used optimally, since chess engines didn't have NUMA, except Houdini, which also needed an update. I can only give you my appreciation & support for what you are doing for the chess world; just amazing ;) Thank you! Ipman.

lantonov (Owner) commented 26 days ago

@Ipman, I learned about you 3 years ago from your famous "Ipman compiles", which were the fastest SF available.

hero2017 commented 26 days ago

I'm all for what Ipman has said. There are times we get $hit on for working hard, and other times we do well but don't get the recognition or reward that is deserved. Sometimes even a simple "Great job! Thank you!" would suffice for all our hard work.

So I too would like to express my thanks for your hard work because I know it can't be easy keeping up with all the updates that the SF dev guys release.

Hopefully asmFish is here to stay forever! It is absolutely the fastest engine; no other engine, not even the SF dev releases, can compare. NUMA was definitely worth the wait! Being a 36-core owner myself, I cannot appreciate it enough.

So keep up the great work you're doing, because while we may not always say it, we certainly appreciate every byte you put into it! Thank you, sir!

hero2017 commented 22 days ago (edited)

As much as I'd love a cluster, the other side of me says it would be bad for chess, because it would create these problems for us chess lovers:

Those who are rich, or have access to a cluster of the fastest and most expensive computers out there, would win most games against their poorer opponents, essentially eliminating poor souls like me from winning tournaments in correspondence chess.

It is a big step towards killing chess altogether, as it would be that much closer to solving it and would only create more draws, although I suppose this is debatable.

I like the Neural64 project more. We can set up one or more PCs to have asmFish play against other top engines and "learn" from its mistakes, or perhaps to create a monster opening book.

Here's another idea. Maybe add an engine parameter: when doing infinite analysis (IA), if the option is enabled, the engine would run a very fast game from its current position and make the move based on the one that has the best win/loss/draw ratio. For example, while analyzing the position after 20. Nc7, if there are two or more alternative moves that seem just as good, have the engine play a game after each alternative move and then decide which is the best move based on the games that had the most wins, the fewest losses, and the fewest draws. Kind of like the Rybka 3 Monte Carlo feature, but done mid-game, especially when the engine can't decide which move is better. I realize you might say this is more of a feature for a chess GUI, but if the engine can have multiple options for deciding which is the better move, the better the engine will be, and it would further set asmFish apart from being just an SF clone.
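
A minimal sketch of this playout-based tie-breaking idea in Python, assuming the python-chess library; the engine path, per-move time limit, and number of playout games are hypothetical:

    import chess
    import chess.engine

    ENGINE_PATH = "./asmfish"                 # hypothetical path to a UCI engine binary
    FAST = chess.engine.Limit(time=0.05)      # very fast per-move limit for playouts

    def playout_score(engine, board, perspective, games=10):
        """Play several fast engine-vs-engine games from `board` and return
        wins minus losses from `perspective` (a simplified W/L/D ratio)."""
        score = 0
        for _ in range(games):
            b = board.copy()
            while not b.is_game_over():
                b.push(engine.play(b, FAST).move)
            result = b.result()
            if result == "1-0":
                score += 1 if perspective == chess.WHITE else -1
            elif result == "0-1":
                score += 1 if perspective == chess.BLACK else -1
        return score

    def pick_by_playouts(fen, candidate_moves):
        """Among candidate moves that analysis considers equally good,
        pick the one whose playouts score best for the side to move."""
        board = chess.Board(fen)
        mover = board.turn
        engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
        try:
            scored = []
            for uci in candidate_moves:
                b = board.copy()
                b.push(chess.Move.from_uci(uci))
                scored.append((playout_score(engine, b, mover), uci))
        finally:
            engine.quit()
        return max(scored)[1]

    # hypothetical usage: two candidate moves that the search rates equally
    # print(pick_by_playouts("r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3",
    #                        ["f1b5", "f1c4"]))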

https://en.chessbase.com/post/rybka-s-monte-carlo-analysis

tthsqe12 (Collaborator) commented 22 days ago

So I was thinking about trying a neural net for the evaluation of specific material combinations. For example, can the lookup table for KPvK endgames be replaced by a simple net of no more than a dozen nodes?

lantonov (Owner) commented 22 days ago (edited)

Nodes are input (Ni), output (No), and hidden (Nh). We can start with one hidden layer. Ni equals the number of columns in the lookup table, No is usually 1, and Nh = (Ni + No) / 2. So if we have, for example, 6 columns in the lookup table, then Ni = 6, No = 1, and Nh = 3.5 ~ 4, giving 6 + 1 + 4 = 11 nodes; we can add 1 hidden bias neuron to make 12. If we have a vector function, the nodes are more than this: in NeuroChess, for example, Ni = No = 175 and Nh = 165. The final number of nodes is found by trial and error, and if we have too many, we can prune the ones that do not react. For KPvK, the 6 columns (features) may be: strong pieces, weak side, weak pieces, strong pawn, strong king, weak king.
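
A minimal sketch of such a 6-4-1 net (one hidden layer plus a bias unit) in Python with numpy; the weights and the example feature vector are hypothetical placeholders, not trained values:

    import numpy as np

    rng = np.random.default_rng(0)

    # 6 inputs, 4 hidden units (plus bias), 1 output: one hidden layer
    W1 = rng.normal(scale=0.5, size=(4, 6))   # input -> hidden weights
    b1 = np.zeros(4)                          # hidden bias ("bias neuron")
    W2 = rng.normal(scale=0.5, size=(1, 4))   # hidden -> output weights
    b2 = np.zeros(1)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def evaluate_kpvk(features):
        """features: 6 numbers (e.g. strong pieces, weak side, weak pieces,
        strong pawn, strong king, weak king), scaled to roughly [0, 1]."""
        h = sigmoid(W1 @ features + b1)
        return (W2 @ h + b2)[0]               # scalar evaluation

    # hypothetical example input
    print(evaluate_kpvk(np.array([1.0, 0.0, 0.0, 0.6, 0.4, 0.7])))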

lantonov (Owner) commented 21 days ago (edited)

This is from the Giraffe thesis:

[figure: evaluator network architecture]

The evaluator network is a 4-layer network (three hidden layers plus an output layer), where all hidden nodes use Rectified Linear activation (ReLU) [Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP volume 15, pages 315-323, 2011; see also Wikipedia], a modern choice that is much more recent than the traditional hyperbolic tangent and logistic activation functions, especially for networks with more than two hidden layers. To constrain the output between -1 and 1, the output node uses hyperbolic tangent activation. When the feature representation includes features of different modalities, it is often beneficial to mix them only in higher, more abstract layers. At very low levels, data from different modalities cannot be mixed usefully, and the extra connections offer no benefit (and in fact make the model more susceptible to overfitting). In this case there are three modalities: piece-centric, square-centric, and position-centric. In the first two layers, features from different modalities are kept separate. Only the final two layers are fully connected, to capture interactions between high-level concepts derived from features of the different modalities. The figure above illustrates the network architecture and the limited-connectivity scheme, where an arrow means all nodes in the first group are connected to all nodes in the second group.
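
A minimal numpy sketch of this limited-connectivity wiring: the three modalities pass through their own first two layers, and only the last two layers are fully connected. The group sizes and weights here are hypothetical, not Giraffe's actual dimensions:

    import numpy as np

    rng = np.random.default_rng(1)
    relu = lambda x: np.maximum(0.0, x)

    # hypothetical sizes of the three input modalities
    sizes = {"piece": 40, "square": 32, "position": 16}

    # first two layers: each modality has its own private weights
    W1 = {m: rng.normal(scale=0.1, size=(24, n)) for m, n in sizes.items()}
    W2 = {m: rng.normal(scale=0.1, size=(16, 24)) for m in sizes}

    # last two layers: fully connected across the concatenated groups
    W3 = rng.normal(scale=0.1, size=(32, 16 * 3))
    W4 = rng.normal(scale=0.1, size=(1, 32))

    def evaluate(features):
        """features: dict mapping modality name -> 1-D numpy array."""
        h2 = [relu(W2[m] @ relu(W1[m] @ features[m])) for m in sizes]
        h3 = relu(W3 @ np.concatenate(h2))
        return np.tanh(W4 @ h3)[0]            # output squashed to (-1, 1)

    x = {m: rng.random(n) for m, n in sizes.items()}
    print(evaluate(x))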

The Giraffe source code is also freely available: https://bitbucket.org/waterreaction/giraffe/get/default.zip

lantonov (Owner) commented 21 days ago (edited)

@tthsqe12 I think your approach is very good: first build small networks with only one hidden layer for particular feature combos (like KPvK), train them separately to be at least as good as the conventional evals, and as these small networks accumulate, link them into a second hidden layer. This is similar to the Giraffe architecture. There is a recent, very interesting Stockfish Evaluation Guide which gives a breakdown of all SF evaluation features. We can also try various features and feature combinations to see which of them train better. For example:

weak pawn (can be captured), strong pawn (can promote), weak king (can be mated), strong king (can pass the pawn), stalemate threats. This one can have only 10 (or 9) nodes: 5 Ni, 4 (or 3) Nh, 1 No.

lantonov (Owner) commented 20 days ago (edited)

Another idea for a very simple KPvK net: only 7 nodes (3 Ni, 3 Nh, 1 No), where the Ni take vector inputs. Ni1 takes the piece vector, Ni2 takes the squares vector, and Ni3 takes a vector of additional features such as side to move, immediate promotion, immediate stalemate. For example, the position 8/2k5/8/2P5/3K4/8/8/8 w - - 0 1 can be input as: Ni1 <-- (wK, wP, bK), Ni2 <-- (d4, c5, c7), Ni3 <-- (w, 0, 0). For the reference output used to calculate the error for back propagation, we compare against the static SF evaluation, which for this position is 174. It is preferable, however, that input nodes take numbers (scalars) and not vectors. The above vectors can be encoded into numbers in one of several ways, such as observing the positions of the elements, multiplication by different constants, or a combination of these. If necessary, several additional input nodes can be added, such as splitting black and white pieces into separate Ni, or additional Ni for positional features like pawn attacked, pawn defended, etc.

lantonov (Owner) commented 19 days ago (edited)

If it is required that each input node takes an elementary input, the network can contain 385 Ni. There are six unique chess pieces and 64 squares on the board, so for every square we take 6 nodes (one per piece type), 64 * 6 = 384. If there is a white piece, the input value is 1; if there is a black piece, the value is -1; and if there is no piece of that sort on that square, the value is 0. In addition, there should be 1 node for the player to move: if it is White's turn the input value is 1, and if it is Black's turn the value is -1. The output is a single numerical value (1 No); the higher the value, the better the position for White. Nh will be of the order of (385 + 1) / 2 = 193 ~ 200. I think this is a more realistic NN than the previous one (easier to train).

lantonov (Owner) commented 15 days ago

http://syprog.blogspot.bg/2012/03/trivial-artificial-neural-network-in.html I give this one at the risk of offending you, as it is super elementary. I include it mainly because it shows how to code the logistic function in assembly on the math co-processor. Neural64 above doesn't know how to implement this function and uses something else. BTW, we have several alternatives for the activation function, and the logistic is not necessarily the best one.
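
A small numpy sketch of the logistic function and its derivative (the function that the linked post codes in assembly on the x87 co-processor); tanh or ReLU could be dropped in at the same place:

    import numpy as np

    def logistic(x):
        # f(x) = 1 / (1 + e^-x), squashes any input into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def logistic_deriv(x):
        # convenient form used in back propagation: f'(x) = f(x) * (1 - f(x))
        f = logistic(x)
        return f * (1.0 - f)

    xs = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
    print(logistic(xs))        # [0.018  0.269  0.5  0.731  0.982]
    print(logistic_deriv(xs))  # peaks at 0.25 for x = 0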

lantonov (Owner) commented 14 days ago (edited)

The above plan with 385 Ni has two flaws:

1. The input represents a sparse matrix of 0, 1, and -1, which brings little information to be processed by Nh, i.e. a waste of input nodes.
2. The sum of the inputs is not proportional to the expected output; we give the computer no hint about the rough estimate of the position.

A better alternative is to use about 70 Ni, 64 of which encode the position in the following way. Each of these position Ni is labeled with its square (a1 to h8), and the pieces have the common Reinfeld values:

          White     Black
 Empty    0.0        0.0
 Pawn     1.0       -1.0
 Knight   3.0       -3.0
 Bishop   3.0       -3.0
 Rook     5.0       -5.0
 Queen    9.0       -9.0
 King     100       -100

The squares have the following values from White's perspective (for Black, the perspective is reversed):

1.6 1.7 1.8 1.9 1.9 1.8 1.7 1.6
1.5 1.6 1.7 1.8 1.8 1.7 1.6 1.5
1.4 1.5 1.6 1.7 1.7 1.6 1.5 1.4
1.3 1.4 1.5 1.6 1.6 1.5 1.4 1.3
1.2 1.3 1.4 1.5 1.5 1.4 1.3 1.2
1.1 1.2 1.3 1.4 1.4 1.3 1.2 1.1
1.0 1.1 1.2 1.3 1.3 1.2 1.1 1.0
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

You see that these square values get larger towards the higher ranks and towards the centre, which gives the pieces an incentive to move forward and to the centre. The value fed into the respective node is the product piece * square * 100 centipawns. For example, Qh3 is 990 cp and Qe4 is 1350 cp; if we have a white Qh4 and a black Qh5, the queens sum to 1080 - 1080 = 0 cp, etc. Thus the input evaluation of the position is roughly proportional to the expected output. The remaining 5-6 Ni can input such features of the position as the right to move, castling rights, check, mate, stalemate, attacked and defended pieces, etc. (A sketch of this piece * square encoding follows the training log below.)

tthsqe12 (Collaborator) commented 12 days ago

The nnet at https://board.flatassembler.net/topic.php?t=6020 didn't seem to be general purpose, so I made a more general framework and tested the standard MNIST dataset (60,000 training, 10,000 testing images) on a net with one hidden layer of 32 nodes. It seems to be converging to the expected error rate, but 1.3 GFLOPS is shameful performance for this CPU.

    asmFishW_2017-02-05_bmi2
    epoch: 0.00  error: 88.65%   1287.97 MFLOPS
    epoch: 0.13  error: 78.72%   1287.97 MFLOPS
    epoch: 0.26  error: 50.02%   1287.16 MFLOPS
    epoch: 0.40  error: 25.46%   1287.36 MFLOPS
    epoch: 0.53  error: 24.25%   1287.48 MFLOPS
    epoch: 0.66  error: 20.53%   1287.36 MFLOPS
    epoch: 0.80  error: 10.92%   1287.45 MFLOPS
    epoch: 0.93  error: 9.16%   1287.67 MFLOPS
    epoch: 1.06  error: 8.73%   1287.70 MFLOPS
    epoch: 1.20  error: 8.44%   1287.60 MFLOPS
    epoch: 1.33  error: 8.34%   1287.53 MFLOPS
    epoch: 1.46  error: 8.30%   1287.56 MFLOPS
    epoch: 1.60  error: 7.86%   1287.60 MFLOPS
    epoch: 1.73  error: 7.35%   1287.53 MFLOPS
    epoch: 1.86  error: 7.87%   1287.48 MFLOPS
    epoch: 2.00  error: 7.21%   1287.51 MFLOPS
    epoch: 2.13  error: 7.37%   1287.54 MFLOPS
    epoch: 2.26  error: 7.23%   1287.50 MFLOPS
    epoch: 2.40  error: 7.21%   1287.52 MFLOPS
    epoch: 2.53  error: 7.24%   1287.54 MFLOPS
    epoch: 2.66  error: 6.79%   1287.56 MFLOPS
    epoch: 2.80  error: 6.94%   1287.58 MFLOPS
    epoch: 2.93  error: 6.87%   1287.60 MFLOPS
    epoch: 3.06  error: 6.74%   1287.61 MFLOPS
    epoch: 3.20  error: 6.52%   1287.58 MFLOPS
    epoch: 3.33  error: 6.65%   1287.50 MFLOPS
    epoch: 3.46  error: 6.36%   1287.52 MFLOPS
    epoch: 3.60  error: 6.76%   1287.45 MFLOPS
    epoch: 3.73  error: 6.47%   1287.47 MFLOPS
    epoch: 3.86  error: 6.62%   1287.48 MFLOPS

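Returning to the 70-input scheme above, a minimal Python sketch of the piece * square * 100 cp encoding; the board representation (a dict from square to piece) is a hypothetical choice for illustration:

    # Reinfeld piece values (negated for Black) and the square table above
    PIECE = {"P": 1.0, "N": 3.0, "B": 3.0, "R": 5.0, "Q": 9.0, "K": 100.0}

    SQUARE = [  # index 0 = rank 1 (a1..h1), index 7 = rank 8
        [0.5] * 8,
        [1.0, 1.1, 1.2, 1.3, 1.3, 1.2, 1.1, 1.0],
        [1.1, 1.2, 1.3, 1.4, 1.4, 1.3, 1.2, 1.1],
        [1.2, 1.3, 1.4, 1.5, 1.5, 1.4, 1.3, 1.2],
        [1.3, 1.4, 1.5, 1.6, 1.6, 1.5, 1.4, 1.3],
        [1.4, 1.5, 1.6, 1.7, 1.7, 1.6, 1.5, 1.4],
        [1.5, 1.6, 1.7, 1.8, 1.8, 1.7, 1.6, 1.5],
        [1.6, 1.7, 1.8, 1.9, 1.9, 1.8, 1.7, 1.6],
    ]

    def node_value(piece, white, square):
        """piece: 'P','N','B','R','Q','K'; square: e.g. 'e4'.
        For Black the ranks are mirrored (reversed perspective)."""
        file = ord(square[0]) - ord("a")
        rank = int(square[1]) - 1
        if not white:
            rank = 7 - rank                   # Black sees the board flipped
        sign = 1.0 if white else -1.0
        return sign * PIECE[piece] * SQUARE[rank][file] * 100.0

    def encode(board):
        """board: dict square -> (piece, is_white); returns the 64 position inputs."""
        inputs = []
        for rank in "12345678":
            for file in "abcdefgh":
                sq = file + rank
                piece = board.get(sq)
                inputs.append(node_value(piece[0], piece[1], sq) if piece else 0.0)
        return inputs

    # examples from the text: Qe4 = 1350 cp, Qh3 = 990 cp
    print(node_value("Q", True, "e4"), node_value("Q", True, "h3"))
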
lantonov (Owner) commented 12 days ago

Glad that you are acting on it in your usual way: much work and little talk. Can't wait to test/train such a beauty.

Ipmanchess commented 12 days ago

Count me in for testing, if needed ;)

tthsqe12 (Collaborator) commented 10 days ago

So I'm training a 256 - 128 - 64 - 1 net from Stockfish's evaluation function. The first two layers have a 'ramp' activation function, and the last layer (which produces the output) has none. After a couple of 'go depth 25' runs, the net is averaging 60 cp from Stockfish's evaluation. Every time evaluate is called, the net is trained, which seems to slow the search from 2 Mnps to 130 Knps. Not sure if this is the right approach.
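
A minimal numpy sketch of this kind of online training step (forward pass plus one back-propagation update towards the Stockfish evaluation); the input encoding, learning rate, and the max(0, x) ramp used here are assumptions for illustration only:

    import numpy as np

    rng = np.random.default_rng(2)
    sizes = [256, 128, 64, 1]
    Ws = [rng.normal(scale=np.sqrt(2.0 / m), size=(n, m))
          for m, n in zip(sizes[:-1], sizes[1:])]
    LR = 1e-4

    ramp = lambda x: np.maximum(0.0, x)          # placeholder ramp activation
    ramp_deriv = lambda x: (x > 0).astype(float)

    def train_on_eval(x, sf_eval_cp):
        """One online step: forward pass, then back-propagate the error
        between the net output and Stockfish's evaluation (in centipawns)."""
        z1 = Ws[0] @ x;  a1 = ramp(z1)
        z2 = Ws[1] @ a1; a2 = ramp(z2)
        out = (Ws[2] @ a2)[0]                    # last layer is linear
        err = out - sf_eval_cp

        g2 = err * Ws[2][0]                      # gradient w.r.t. a2
        d2 = g2 * ramp_deriv(z2)
        d1 = (Ws[1].T @ d2) * ramp_deriv(z1)
        Ws[2] -= LR * err * a2[None, :]
        Ws[1] -= LR * np.outer(d2, a1)
        Ws[0] -= LR * np.outer(d1, x)
        return out, err

    x = rng.random(256)                          # hypothetical encoded position
    print(train_on_eval(x, 35.0))                # target: SF static eval of +35 cp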

Ipmanchess commented 10 days ago

I don't know if it's useful for you, but there are some interesting discussions on Talkchess about parallel speedup & deep learning (Deep Pink): http://www.talkchess.com/forum/viewforum.php?f=7 Is it possible, for example, to let a graphics card do this training, using all those cores (CUDA), so that asmFish still has full CPU power?

lantonov (Owner) commented 10 days ago (edited)

Just to get on the same page: this is a network with 256 input nodes, 2 hidden layers (128 and 64 nodes) and an output layer (1 node), and the ramp functions act between the input layer and the first hidden layer, and between the first hidden layer and the second hidden layer. Is that right? For the speed and effectiveness of training, the logical structuring of the input, especially with respect to the expected output, is of utmost importance; the GIGO problem in NNs is huge. Input can be structured in indefinitely many ways, some better than others. Previously I listed some possible inputs, without any claim that those are the best. The type of activation function is also important. If by 'ramp' function is understood the function f(x) = max(0, x), it is the same as the Rectified Linear Unit (ReLU) mentioned above in the Matthew Lai approach. ReLU is preferred over the common logistic / tanh functions because of the following features, which allow faster and more efficient training of deep architectures on large and complex datasets:

- One-sided (0, ∞), compared to the antisymmetry of tanh (-1, 1).
- Sparse activation: in a randomly initialized network, for example, only about 50% of hidden units are activated (have a non-zero output).
- Efficient gradient propagation: no vanishing or exploding gradient problems.
- Efficient computation: only comparison, addition and multiplication.
- Scale-invariant: max(0, ax) = a max(0, x) for a >= 0.

However, there are also potential problems with ReLU:

- Non-differentiable at zero: it is, however, differentiable everywhere else, including points arbitrarily close to (but not equal to) zero.
- Not zero-centered.
- Unbounded: could potentially blow up.
- The dying ReLU problem at high learning rates: if all inputs put the ReLU on the flat side, the weights have no hope of changing and the node is dead. A ReLU may be alive and then die when a gradient step for some input batch drives the weights to smaller values, making the subsequent inputs negative for all examples. A large learning rate amplifies this problem.

The slowdown of the NN in terms of Mnps during training is of no particular concern compared to the evaluation error, when one takes into account that the NN does a static evaluation (input a position, output a value, one pass), while SF evaluation at depth 25 is dynamic: it searches 25 plies deep and internally crunches many intermediate evaluations of the resulting positions in a way essentially similar to a NN (forward and backward propagation). Increasing NN speed scales linearly with processing power, while increasing search depth scales exponentially.

As concerns GPUs, it has been shown both in theory and in practice that they can greatly speed up neural networks. Assembly may have problems with GPUs, however, because they differ from CPUs (different registers, different ALU, different FPU, etc.), and there may be portability problems between GPUs from different manufacturers.
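
A small numpy sketch of the dying-ReLU point above, together with the common leaky-ReLU variant (a standard mitigation, not something proposed in the thread):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        # zero gradient on the whole flat side: if every input lands there,
        # the weights feeding this unit never change again (a "dead" unit)
        return (x > 0).astype(float)

    def leaky_relu(x, alpha=0.01):
        # small slope for x < 0 keeps a non-zero gradient everywhere
        return np.where(x > 0, x, alpha * x)

    xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(relu(xs), relu_grad(xs))
    print(leaky_relu(xs))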

Ipmanchess commented 10 days ago

Maybe interesting: http://www.cs.tau.ac.il/~wolf/papers/deepchess.pdf

lantonov (Owner) commented 10 days ago (edited)

I was just reading it. A very complicated and slow NN, but we may arrive at it eventually, God forbid. The idea in that thread to use tablebases for training is very good, though: a TB eval (-1, 0, or 1) is as reliable as can be. Of course, TB positions would be in addition to opening, middlegame and early endgame positions.

tthsqe12 (Collaborator) commented 10 days ago

So I originally put kings on the same 64 inputs as pawns, but I just didn't like that aesthetically. So now there are 320 inputs, with queens combined with the rooks and bishops. I don't like +1 and -1 inputs, so I might also try 640 inputs.

[image]

lantonov (Owner) commented 10 days ago

I will study these some more. I just made a commit for the last SF change "Simplify scale factor computation", with only the removal of Evaluate.asm:1703-1704. Can you please check it? Bench didn't change, as it should be.

lantonov (Owner) commented 9 days ago

The activation function that you use is not a ramp; it is very close to the logistic in its properties. The ramp function looks like a ramp, whereas the function f(x) = (|x| + x + 1) / (2 (|x| + 1)) is a sigmoid, very similar to the logistic function.

The derivative of f(x) = (|x| + x + 1) / (2 (|x| + 1)) is f'(x) = 1 / (2 (|x| + 1)^2), which is roughly similar to the derivative of the logistic, f'(x) = exp(x) / (exp(x) + 1)^2.

The first-layer matrix shows some periodicity, which almost disappears in the second-layer matrix and in the vector fed to the output. This, if anything, shows that the NN is working towards some transformation of the input.

tthsqe12 (Collaborator) commented 9 days ago

Sorry, I accidentally put my logistic function (which I used for the MNIST data) there instead of my ramp function. This is the corrected version:

[image]

After many games and training on each call to evaluate, the net output was still around 60 cp from SF's evaluation, so I can say the net is basically producing junk.

As for the latest patch, I do not like it, as the comment "remove one condition which can never be satisfied" is false. The one who made this patch failed to realize that the condition is sometimes true (about 5% of the time on bench), but that when it is true the result of the scale factor computation is multiplied by zero. It is this kind of carelessness that is worrying about the Stockfish project.

lantonov (Owner) commented 9 days ago

f(x) = (|x| + 1) / ((|x| - x + 1)^2 + 1) is indeed a ramp function. Unlike the usual ramp function f(x) = max(0, x), this function is smooth (it has a derivative on the whole domain). The derivative is (-2x|x| + |x|^3 + |x| - x^3 + 2x^2) / (2|x| (-x|x| + |x| + x^2 - x + 1)^2).
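
A small numpy sketch that evaluates the smooth sigmoid and smooth ramp discussed above and checks the stated derivatives by finite differences:

    import numpy as np

    def smooth_sigmoid(x):
        # f(x) = (|x| + x + 1) / (2 (|x| + 1)), close to the logistic
        return (np.abs(x) + x + 1.0) / (2.0 * (np.abs(x) + 1.0))

    def smooth_ramp(x):
        # f(x) = (|x| + 1) / ((|x| - x + 1)^2 + 1), a smooth ramp-like function
        return (np.abs(x) + 1.0) / ((np.abs(x) - x + 1.0) ** 2 + 1.0)

    def numeric_deriv(f, x, h=1e-6):
        return (f(x + h) - f(x - h)) / (2.0 * h)

    xs = np.array([-3.0, -1.0, -0.1, 0.1, 1.0, 3.0])
    # stated derivative of the sigmoid: 1 / (2 (|x| + 1)^2)
    print(np.allclose(numeric_deriv(smooth_sigmoid, xs),
                      1.0 / (2.0 * (np.abs(xs) + 1.0) ** 2)))     # True
    # for x > 0 the smooth ramp reduces to (x + 1) / 2, so its slope is 1/2 there
    print(numeric_deriv(smooth_ramp, np.array([1.0, 2.0, 5.0])))  # ~[0.5 0.5 0.5]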

Don't be discouraged by the non-convergence on the first try. I don't know of any NN that has succeeded from the start; all require many modifications and much parameter tuning.

The last patch didn't change bench so I will revert it right away.

tthsqe12 (Collaborator) commented 9 days ago

No no, you don't need to revert it! It is actually a little faster. I was just saying the comment was misleading.

lantonov (Owner) commented 9 days ago (edited)

Sorry, I acted too quickly. Reinstating:

    git reset --hard HEAD^
    git push origin master -f

lantonov (Owner) commented 8 days ago

The greatest impact on performance comes from the structure of the input layer. The board representation is very similar to the basic digit-recognition example: the digit images are coded in an 8x8 matrix, and the squares of the board correspond exactly to such a matrix. The shade of the "pixels" / squares can correspond to pieces, and on the input nodes (not more than 70) you can put floating-point numbers like 0.0 (empty), 1.0 (pawn), 2.0 (knight), 3.0 (bishop), 4.0 (rook), 5.0 (queen), 6.0 (king). The only difference from image recognition is that this is a regressor and not a classifier. I think 320 or 640 input nodes are too many; instead of giving the NN the board pattern, they confuse it with too many uninformative connections. Training should start not with games but with feeding boards to the input and reference evaluations to the output, to allow the network to build its proper weights through propagation.
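
A minimal numpy sketch of that training scheme: boards encoded as signed piece codes on 64 inputs, trained as a regressor against reference evaluations. The encoding details, network size, and learning rate are hypothetical:

    import numpy as np

    rng = np.random.default_rng(3)

    PIECE_CODE = {"": 0.0, "P": 1.0, "N": 2.0, "B": 3.0, "R": 4.0, "Q": 5.0, "K": 6.0}

    def encode(board8x8):
        """board8x8: 8x8 list of piece letters ('' for empty, lower case for Black).
        Returns a flat 64-vector of signed piece codes."""
        x = np.zeros(64)
        for r in range(8):
            for c in range(8):
                p = board8x8[r][c]
                sign = -1.0 if p.islower() else 1.0
                x[r * 8 + c] = sign * PIECE_CODE[p.upper()]
        return x

    # one-hidden-layer regressor: 64 inputs -> 32 hidden -> 1 evaluation
    W1 = rng.normal(scale=0.1, size=(32, 64))
    W2 = rng.normal(scale=0.1, size=(1, 32))
    LR = 1e-3

    def train(positions, ref_evals, epochs=10):
        """positions: list of 8x8 boards; ref_evals: matching reference evaluations."""
        global W1, W2
        for _ in range(epochs):
            for board, target in zip(positions, ref_evals):
                x = encode(board)
                h = np.tanh(W1 @ x)
                out = (W2 @ h)[0]
                err = out - target
                dh = err * W2[0] * (1.0 - h ** 2)
                W2 -= LR * err * h[None, :]
                W1 -= LR * np.outer(dh, x)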
