Setting all the optimizers to have useLocking = True #310
Conversation
I opened an issue for the non-determinism on tensorflow/tensorflow - tensorflow/tensorflow#48855
@Craigacp can you please rebase once more so we can give the quick-build another try?
Force-pushed from 03402b3 to c0fc351 (head commit: …ing a determinism test that's currently failing.)
Done.
Looks like everything passed this time.
Thanks @Craigacp, I've approved it but left a few minor comments here and there, if you want to take a look.
// This test fails due to initialization and gradient issues. It should not, but it seems to be a
// problem
// in TF-core.
reformat this comment?
@@ -42,6 +43,9 @@
public static final float LEARNING_RATE_DEFAULT = 0.001f;
public static final float INITIAL_ACCUMULATOR_DEFAULT = 0.01f;

private static final ApplyAdagrad.Options[] opts = new ApplyAdagrad.Options[]{
ApplyAdagrad.updateSlots(true),ApplyAdagrad.useLocking(true)};
Super nit: any formatter will probably complain about the missing space after a comma.
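For reference, a formatted version of that declaration (same options, just with conventional spacing and wrapping) would be something like:

```java
private static final ApplyAdagrad.Options[] opts =
    new ApplyAdagrad.Options[] {
      ApplyAdagrad.updateSlots(true), ApplyAdagrad.useLocking(true)
    };
```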
}

for (int i = 1; i < numRuns; i++) {
assertEquals(initialLoss[0],initialLoss[i]);
Super nit: spaces after commas.
.fetch(outputWeightName)
.fetch(outputBiasName)
.run());
System.out.println("Initialized - " + ndArrToString((TFloat32)initialized.get(i).get(3)));
Can we avoid the verbosity in the unit test? Aren't the equality checks enough to validate it? I'm fine with just commenting out these printlns.
I'll clean up the test on Monday and split it into two. There should be one test for the outputs returning a reference rather than a copy of the weights (as this might allow you to mutate the weights directly, which seems bad), and the current test, which should just check the gradient behaviour. It's all conflated in that single test because I spent hours trying to figure out what was going on, so it has all the print statements and other stuff I needed to track it down. I'll clean up the formatting issues at the same time.
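For the reference-vs-copy half, the check would be roughly this shape (a sketch with placeholder op names and a placeholder `buildModel` helper; the real test already builds the model and knows its op names):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;
import org.tensorflow.Graph;
import org.tensorflow.Session;
import org.tensorflow.types.TFloat32;

public class WeightFetchSketch {

  // Placeholder op names standing in for whatever GradientDescentTest builds.
  private static final String WEIGHT_NAME = "output_weight";
  private static final String TRAIN_OP = "train";

  @Test
  public void fetchedWeightsShouldBeCopies() {
    try (Graph graph = buildModel();
        Session session = new Session(graph)) {
      // Snapshot a weight value from the first fetch.
      TFloat32 fetched = (TFloat32) session.runner().fetch(WEIGHT_NAME).run().get(0);
      float before = fetched.getFloat(0, 0);

      // Run one training step, then re-read the tensor returned by the *first* fetch.
      session.runner().addTarget(TRAIN_OP).run();
      float after = fetched.getFloat(0, 0);

      // If fetch handed back a live reference to the variable rather than a
      // copy, the value will have changed underneath us and this fails.
      assertEquals(before, after);
    }
  }

  private static Graph buildModel() {
    // Placeholder: graph construction and variable initialization elided.
    throw new UnsupportedOperationException("model construction elided");
  }
}
```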
I'm running into determinism issues when training models using TF-Java. This is one area which could be causing it, as in TF 2 all the optimizers have `useLocking=true`. We don't currently set this in TF-Java, and I'm worried about that code path (the docs say the behaviour may be undefined but faster with `useLocking=false`).

This doesn't resolve my non-determinism issue completely, but it seems a little better.
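Concretely, this means passing the locking option through when the optimizers build their raw apply ops. A rough sketch of what that looks like for Adagrad, using the `ApplyAdagrad` options shown in the review thread above (the real optimizer classes construct these ops internally, so the exact wiring differs):

```java
import org.tensorflow.Operand;
import org.tensorflow.op.Ops;
import org.tensorflow.op.core.Variable;
import org.tensorflow.op.train.ApplyAdagrad;
import org.tensorflow.types.TFloat32;

public class AdagradLockingSketch {

  /**
   * Builds the Adagrad update for one variable with useLocking turned on, so
   * concurrent updates to the same variable are serialised instead of being
   * left undefined (and potentially non-deterministic).
   */
  static Operand<TFloat32> applyUpdate(
      Ops tf,
      Variable<TFloat32> var,
      Variable<TFloat32> accumulator,
      Operand<TFloat32> learningRate,
      Operand<TFloat32> gradient) {
    return tf.train.applyAdagrad(
        var,
        accumulator,
        learningRate,
        gradient,
        // Mirrors the options added in this PR.
        ApplyAdagrad.updateSlots(true),
        ApplyAdagrad.useLocking(true));
  }
}
```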
I've added a test to GradientDescentTest which checks that the models produced are identical. This test fails randomly, for two reasons, both of which are confusing. First, it seems that when we fetch the weights from a model we get a pointer to the weights, not a copy of them, so as training proceeds the "copies" I'm saving out as the initialized values are being updated too. Second, and this is the real issue, the gradient updates can differ between identical models on identical data for no apparent reason. I think this is happening somewhere in the C API, as I can't see where it could be happening in our code and the models use identical GraphDefs.
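The determinism check itself is roughly this shape (a sketch; `buildIdenticalGraph` and `trainOnFixedData` are hypothetical stand-ins for the model construction and the fixed training loop in GradientDescentTest):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.tensorflow.Graph;
import org.tensorflow.Session;

public class DeterminismSketch {

  public void identicalRunsShouldMatch() {
    int numRuns = 10;
    float[] finalLoss = new float[numRuns];

    for (int i = 0; i < numRuns; i++) {
      // Each run builds the same graph (same GraphDef, same seeds) and trains
      // it on the same data for the same number of steps.
      try (Graph graph = buildIdenticalGraph();
          Session session = new Session(graph)) {
        finalLoss[i] = trainOnFixedData(session);
      }
    }

    // With identical graphs, identical data, and useLocking=true on every
    // optimizer op, each run should end on exactly the same loss.
    for (int i = 1; i < numRuns; i++) {
      assertEquals(finalLoss[0], finalLoss[i]);
    }
  }

  private static Graph buildIdenticalGraph() {
    // Placeholder: deterministic graph construction elided.
    throw new UnsupportedOperationException();
  }

  private static float trainOnFixedData(Session session) {
    // Placeholder: variable init, fixed training steps, and loss fetch elided.
    throw new UnsupportedOperationException();
  }
}
```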