New task for adding scalar values (0 or 1) #4
Impressive work! What do you think about taking it up a notch? I've just pushed new updates to the code that include optimizations in both memory usage and execution time, so you can leave it training for more iterations and still finish sooner. I'm looking forward to seeing your results with this!
Hello, @Mostafa-Samir. You can get the code of the adding task without tf.reduce_sum() here: https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/adding/train_v2.py. But I'm afraid that removing tf.reduce_sum() leaves the model unable to generalize with a fixed memory size as it did before. In this new version of the code, the model can still learn to solve any sequence of 0 and 1 sums, but it fails when we apply the learned model to sequences longer than those used in training.
I think that's because the original version I pulled here uses tf.reduce_sum() as an accumulator. I think the model learns an algorithm like this: function(X): And later, tf.reduce_sum() computes the correct sum over the whole output sequence. The output is close to 1 for each [ 1. 0. 0.] input vector and close to 0 otherwise, so tf.reduce_sum() gives the correct answer no matter how long the input is. And I think it is because this little "if else" f(x) algorithm is easy to learn that the model can generalize to arbitrarily long input sequences X with a fixed memory size.
As soon as we remove tf.reduce_sum(), as in the version I made following your instructions, this trick no longer works and the model has to learn a more complex and less generalizable algorithm than the f(x) I described above. What do you think, @Mostafa-Samir? Regards,
Here is a short excerpt of a real training result of the new version (https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/adding/train_v2.py):
Avg. Cross-Entropy: 0.0231753
Iteration 1001/1001
Testing generalization...
Iteration 0/1000
Iteration 1/1000
Iteration 2/1000
@Mostafa-Samir, given the great improvement in the core of your DNC implementation, I've developed another task for testing the project. I've made a model that successfully learns an argmax function over an input. The model is fed a vector of one-hot encoded integer values, and the target output is the index within the vector that holds the maximum value. I'm glad to tell you that your DNC is able to learn this function using just a feedforward controller and, even better, it is able to generalize to vectors larger than those used in training! You can see my code here: https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/argmax/train_v2.py. And here you can see some results:
Iteration 10000/10001 Saving Checkpoint ... Done! Testing generalization... Iteration 0/10000 Iteration 1/10000 Iteration 2/10000 Iteration 3/10000 Iteration 4/10000 Iteration 5/10000 Iteration 6/10000 Iteration 7/10000 Iteration 8/10000 Iteration 9/10000 Iteration 10/10000 Iteration 11/10000 Iteration 12/10000 Iteration 13/10000 Iteration 14/10000 Iteration 15/10000 Iteration 16/10000 Iteration 17/10000 Iteration 18/10000 Iteration 19/10000 Iteration 20/10000 Iteration 21/10000 Iteration 22/10000 Iteration 23/10000 Iteration 24/10000 Iteration 25/10000 Iteration 26/10000 Iteration 27/10000 Iteration 28/10000 Iteration 29/10000 Iteration 30/10000 Iteration 31/10000 Iteration 32/10000 Iteration 33/10000 Iteration 34/10000 Iteration 35/10000 Iteration 36/10000 Iteration 37/10000 Iteration 38/10000 Iteration 39/10000 Iteration 40/10000 Iteration 41/10000 Iteration 42/10000 Iteration 43/10000 Iteration 44/10000 Iteration 45/10000 Iteration 46/10000 Iteration 47/10000 Iteration 48/10000 Iteration 49/10000
I don't know how the model figures out where the highest value is in the sequence of one-hot encoded input values, but it does, and it even generalizes this learned method to sequences twice as long as those used in training, without using more memory.
DeepMind has found something big with this DNC, and they are improving it with a sparse version that uses fewer resources: https://arxiv.org/pdf/1610.09027v1.pdf Regards,
Great work Samu @Zeta36!
Regarding the adding task, your loss is currently: loss = tf.reduce_mean(tf.square((loss_weights * output) - ncomputer.target_output)) while you should be using: loss = tf.reduce_mean(loss_weights * tf.square(output - ncomputer.target_output))
Remember, you're weighting the contribution of each step's loss, not the significance of each step on its own. Mathematically it's written as $\frac{1}{N}\sum_t w_t (o_t - y_t)^2$, not $\frac{1}{N}\sum_t (w_t o_t - y_t)^2$, where $w_t$, $o_t$ and $y_t$ stand for loss_weights, output and ncomputer.target_output at step $t$. I don't really know how you generate the output vector, but the 1st formulation can easily overestimate your loss value. Try adopting this change and see if it has any effect on the model.
You should also test the generalization of the adding task by using the same trained model but with a larger memory matrix (more locations), just as you can find in the visualization notebook of the copy task.
It'd also be a good idea to move the generalization tests into scripts separate from the training one, and to use a single descriptive statistic (like the percentage of correct answers, or the percentage of error, or whatever you decide) to describe your results. That way, instead of dumping the entire log in the README, you can add one or two examples from the log and describe your results with that statistic!
I'll be happy then to merge your contributions into the repo!
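To make the difference between the two formulations concrete, here is a minimal pure-Python sketch that uses plain lists instead of tensors. The helper names `loss_weight_inside` and `loss_weight_outside` and the example values are hypothetical and not taken from the repository; the sketch also assumes, purely for illustration, that only the last step carries a non-zero weight.

```python
def mean(values):
    return sum(values) / len(values)

def loss_weight_inside(weights, output, target):
    # Mirrors tf.reduce_mean(tf.square((loss_weights * output) - target)):
    # the weight scales the output before the difference is taken.
    return mean([(w * o - t) ** 2 for w, o, t in zip(weights, output, target)])

def loss_weight_outside(weights, output, target):
    # Mirrors tf.reduce_mean(loss_weights * tf.square(output - target)):
    # the weight scales each step's squared error.
    return mean([w * (o - t) ** 2 for w, o, t in zip(weights, output, target)])

# Hypothetical example: only the last step is supposed to count.
weights = [0.0, 0.0, 1.0]
output = [0.5, 0.5, 2.0]
target = [1.0, 1.0, 2.0]

inside = loss_weight_inside(weights, output, target)
outside = loss_weight_outside(weights, output, target)
print(inside, outside)
```

In this example the last step is predicted exactly, so the second formulation gives a loss of zero, while the first still charges the squared target value at every masked step. That is one way the first formulation can overestimate the loss.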
Hi @Zeta36 and @Mostafa-Samir,
For this reason, I am trying to implement a further task by myself. I am interested in understanding whether a DNC can solve it. I would really appreciate any feedback from you, thanks.
Task description
The task is to count the total number of repeated numbers in a list. For example:
The pseudo code the DNC should learn is:
I am wondering if the DNC can manage by itself the
Settings
Assuming that the DNC can solve the task (I suppose a simple LSTM net can), I would structure the data as follows:
What do you think? Do you think that it would be feasible for the DNC to solve the task? Thanks,
Common Settings
The model is trained with a 2-layer feedforward controller (with hidden sizes 128 and 256 respectively) and the following set of hyperparameters:
A squared loss function of the form (y - y_)**2 is used, where both 'y' and 'y_' are scalars.
The input is a (1, random_length, 3) tensor, where the last dimension holds a one-hot encoded vector of size 3:
010 is a '0'
100 is a '1'
001 is the end mark
So, an example of an input of length 10 is the following 3D tensor:
[[[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]]
This input is a representation of a sequence of 0 or 1 values being added, in the form:
0 + 0 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + (end_mark)
The target output is a 3D tensor holding the result of this addition. In the example above:
[[[2.0]]]
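The encoding described above can be sketched in a few lines of pure Python. This is only an illustration using nested lists rather than tensors; the function name `encode_sequence` and the constants are hypothetical and not taken from the repository.

```python
# One-hot codes as described above (names are illustrative).
ZERO = [0.0, 1.0, 0.0]      # 010 is a '0'
ONE = [1.0, 0.0, 0.0]       # 100 is a '1'
END_MARK = [0.0, 0.0, 1.0]  # 001 is the end mark

def encode_sequence(bits):
    """Return (input, target) as nested lists shaped (1, len(bits) + 1, 3) and (1, 1, 1)."""
    steps = [list(ONE) if b == 1 else list(ZERO) for b in bits]
    steps.append(list(END_MARK))
    target = [[[float(sum(bits))]]]
    return [steps], target

# The nine values of the example above, followed by the end mark.
bits = [0, 0, 0, 1, 0, 0, 1, 0, 0]
input_tensor, target_output = encode_sequence(bits)
print(len(input_tensor[0]), target_output)
```

For the example sequence this yields ten time steps (nine values plus the end mark) and a target of [[[2.0]]].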
The DNC output is a 3D-tensor of shape (1, random_length, 1). For example:
[[[ 0.45]
[ -0.11]
[ 1.3]
[ 5.0]
[ 0.5]
[ 0.1]
[ 1.0]
[ -0.5]
[ 0.33]
[ 0.12]]]
The target output and the DNC output are both then reduced with tf.reduce_sum() so we end up with two scalar values. For example:
Target_output: 2.0
DNC_output: 8.19
We then apply the squared loss function:
loss = (Target_o - DNC_o)**2
and finally the gradient update.
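As a worked illustration of the reduction and loss steps, the sketch below sums the ten example DNC output values listed above and applies the squared loss against the target of 2.0. It uses Python's built-in `sum` as a stand-in for tf.reduce_sum() and is not the repository's training code.

```python
# The ten per-step values from the example DNC output above.
dnc_output = [0.45, -0.11, 1.3, 5.0, 0.5, 0.1, 1.0, -0.5, 0.33, 0.12]
target_output = [2.0]

# sum() plays the role of tf.reduce_sum(): each tensor collapses to a scalar.
dnc_sum = sum(dnc_output)
target_sum = sum(target_output)

# Squared loss on the two scalars.
loss = (target_sum - dnc_sum) ** 2
print(round(dnc_sum, 2), round(loss, 4))
```

With these values the summed output comes to about 8.19, so the squared loss is about 38.32.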
Results
The model receives as input a random-length sequence of 0 or 1 values like:
Input: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1
Then it returns a scalar value for this addition. For example, the DNC will output something like: 3.98824.
This value is the predicted result for the input sequence (we keep only the integer part of the result):
DNC prediction: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 3 [3.98824]
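The integer predictions shown in the training log are consistent with flooring the raw output; this is an inference from the logged values, not something confirmed from the code. The sketch below compares flooring with rounding to the nearest integer on three raw outputs taken from the log. Rounding is an alternative suggested here, not what the original script does.

```python
import math

# Raw DNC outputs copied from the log, with the true sums for comparison.
raw_outputs = [3.98824, -0.0118587, 43.5211]
real_values = [4, 0, 44]

floored = [math.floor(v) for v in raw_outputs]
rounded = [int(round(v)) for v in raw_outputs]

for raw, real, f, r in zip(raw_outputs, real_values, floored, rounded):
    print(f"raw={raw} real={real} floor={f} round={r}")
```

On these three samples flooring lands one below the true sum each time, while rounding recovers it, so part of the apparent error in the log may come from the integer conversion rather than from the model.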
Once we train the model with:
we can see that the model learns to compute this adding function in fewer than 1000 iterations, and the loss drops from:
Iteration 0/1000
Avg. Logistic Loss: 24.9968
to:
Iteration 1000/1000
Avg. Logistic Loss: 0.0076
It seems like the DNC model is able to learn this pseudo-code:
function(x):
if (x == [ 1. 0. 0.])
return (near) 1.0 (float values)
else
return (near) 0.0 (float values)
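The pseudo-code above, combined with the final sum over the output sequence, can be written as a small runnable sketch. It is idealized: it returns exactly 1.0 or 0.0, whereas the trained model only produces values near them, and the function names are hypothetical.

```python
def per_step_output(x):
    # Idealized version of what the model seems to learn for each time step.
    return 1.0 if x == [1.0, 0.0, 0.0] else 0.0

def add_sequence(steps):
    # Summing the per-step outputs acts as the accumulator,
    # which is the role tf.reduce_sum() plays in the training script.
    return sum(per_step_output(x) for x in steps)

# 0 + 1 + 1 followed by the end mark, in the one-hot encoding described earlier.
steps = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
total = add_sequence(steps)
print(total)
```

Because each step is handled independently and the sum does the accumulating, nothing in this scheme depends on the sequence length.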
Generalization test
We set sequence_max_length = 100 for the model, but during training we only use random-length sequences of up to 10 values (sequence_max_length/10). Once training is finished, we let the trained model generalize to random-length sequences of up to 100 values (sequence_max_length).
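The split between training and test lengths can be sketched as follows. Only sequence_max_length and the factor of 10 come from the description above; the helper `sample_bits`, the minimum length of 1, and the uniform sampling are assumptions made for illustration.

```python
import random

sequence_max_length = 100
train_max_length = sequence_max_length // 10  # 10, as described above

def sample_bits(max_length, rng):
    # Hypothetical sampler: a uniform random length in [1, max_length],
    # then uniform random 0/1 values.
    length = rng.randint(1, max_length)
    return [rng.randint(0, 1) for _ in range(length)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
train_example = sample_bits(train_max_length, rng)
test_example = sample_bits(sequence_max_length, rng)
print(len(train_example), len(test_example))
```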
Results show that the model successfully generalizes the adding task even to sequences 10 times longer than the training ones.
These are real data outputs:
Building Computational Graph ... Done!
Initializing Variables ... Done!
Iteration 0/1000
Avg. Logistic Loss: 24.9968
Real value: 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 = 5
Predicted: 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 = 0 [0.000319847]
Iteration 100/1000
Avg. Logistic Loss: 5.8042
Real value: 0 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 = 5
Predicted: 0 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 = 6 [6.1732]
Iteration 200/1000
Avg. Logistic Loss: 0.7492
Real value: 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 9
Predicted: 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 8 [8.91952]
Iteration 300/1000
Avg. Logistic Loss: 0.0253
Real value: 0 + 1 + 1 = 2
Predicted: 0 + 1 + 1 = 2 [2.0231]
Iteration 400/1000
Avg. Logistic Loss: 0.0089
Real value: 0 + 1 + 0 + 0 + 0 + 1 + 1 = 3
Predicted: 0 + 1 + 0 + 0 + 0 + 1 + 1 = 2 [2.83419]
Iteration 500/1000
Avg. Logistic Loss: 0.0444
Real value: 1 + 0 + 1 + 1 = 3
Predicted: 1 + 0 + 1 + 1 = 2 [2.95937]
Iteration 600/1000
Avg. Logistic Loss: 0.0093
Real value: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 4
Predicted: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 3 [3.98824]
Iteration 700/1000
Avg. Logistic Loss: 0.0224
Real value: 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 = 6
Predicted: 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 = 5 [5.93554]
Iteration 800/1000
Avg. Logistic Loss: 0.0115
Real value: 0 + 0 = 0
Predicted: 0 + 0 = -1 [-0.0118587]
Iteration 900/1000
Avg. Logistic Loss: 0.0023
Real value: 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 = 5
Predicted: 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 = 4 [4.97147]
Iteration 1000/1000
Avg. Logistic Loss: 0.0076
Real value: 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 = 4
Predicted: 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 = 4 [4.123]
Saving Checkpoint ... Done!
Testing generalization...
Iteration 0/1000
Real value: 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 0 = 6
Predicted: 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 0 = 6 [6.24339]
Iteration 1/1000
Real value: 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 11
Predicted: 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 11 [11.1931]
Iteration 2/1000
Real value: 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 1 = 33
Predicted: 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 1 = 32 [32.9866]
Iteration 3/1000
Real value: 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 = 16
Predicted: 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 = 16 [16.1541]
Iteration 4/1000
Real value: 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 = 44
Predicted: 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 = 43 [43.5211]