
Setting all the optimizers to have useLocking = True #310


Merged: 3 commits, May 4, 2021

Conversation

Craigacp (Collaborator)

I'm running into determinism issues when training models using TF-Java. This is one area which could be causing it, as in TF 2 all the optimizers have useLocking=true. We don't currently set this in TF-Java, and I'm worried the unlocked code path is degrading the results (the docs say the behaviour may be undefined but faster with useLocking=false).

This doesn't resolve my non-determinism issue completely, but it seems a little better.

I've added a test to GradientDescentTest which checks that the models produced are identical. This test fails randomly, for two reasons, both of which are confusing. First it seems like when we fetch the weights from a model, we get a pointer to the weights, not a copy of them. This means that when they are trained the "copies" I'm saving out as the initialized ones are being updated. Second, and this is the real issue, the gradient updates can be different on identical models for identical data, for no apparent reason. I think this is happening somewhere in the C API, as I can't see where it could be happening in our code and the models use identical GraphDefs.
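For context, TF-Java's optimizers build on the generated apply ops, and useLocking is exposed through each op's Options, the same pattern this PR uses for ApplyAdagrad in the diff below. A minimal sketch of the idea, not the actual optimizer code (tf, var and grad are placeholder names):

// Sketch only: emit a gradient-descent update that takes the locked code path.
// Assumes `tf` is an org.tensorflow.op.Ops bound to the training graph,
// `var` is the Operand<TFloat32> variable and `grad` its gradient.
// Uses org.tensorflow.op.train.ApplyGradientDescent.
ApplyGradientDescent<TFloat32> update =
    ApplyGradientDescent.create(
        tf.scope(),
        var,                          // variable updated in place
        tf.constant(0.01f),           // learning rate
        grad,                         // gradient for this step
        ApplyGradientDescent.useLocking(true));  // lock the variable during the update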

Craigacp (Collaborator, Author)

I opened an issue for the non-determinism on tensorflow/tensorflow - tensorflow/tensorflow#48855

karllessard (Collaborator)

@rnett, looks like the quick build is failing after merging #306, can you please take a look at it?

karllessard (Collaborator)

@Craigacp can you please rebase once more so we can give the quick-build another try?

Craigacp force-pushed the locking-optimizer-fixes branch from 03402b3 to c0fc351 on May 1, 2021 at 00:56
Craigacp (Collaborator, Author) commented May 1, 2021

Done.

Craigacp (Collaborator, Author) commented May 1, 2021

Looks like everything passed this time.

karllessard (Collaborator) previously approved these changes on May 1, 2021 and left a comment:


Thanks @Craigacp, I've approved it but left a few minor comments here and there, if you want to take a look.


// This test fails due to initialization and gradient issues. It should not, but it seems to be a
// problem
// in TF-core.
karllessard (Collaborator):

reformat this comment?

@@ -42,6 +43,9 @@
public static final float LEARNING_RATE_DEFAULT = 0.001f;
public static final float INITIAL_ACCUMULATOR_DEFAULT = 0.01f;

private static final ApplyAdagrad.Options[] opts = new ApplyAdagrad.Options[]{
ApplyAdagrad.updateSlots(true),ApplyAdagrad.useLocking(true)};
karllessard (Collaborator):

Super nit: any formatter will probably complain about the missing space after a comma.

}

for (int i = 1; i < numRuns; i++) {
assertEquals(initialLoss[0],initialLoss[i]);
karllessard (Collaborator):

Super nit: spaces after commas.

.fetch(outputWeightName)
.fetch(outputBiasName)
.run());
System.out.println("Initialized - " + ndArrToString((TFloat32)initialized.get(i).get(3)));
karllessard (Collaborator):

Can we avoid the verbosity in the unit test? Aren't the equality checks enough to validate it? I'm fine with just commenting out these printlns.

Craigacp (Collaborator, Author) commented May 2, 2021

I'll clean up the test on Monday, and split it into two. There should be one test for the outputs returning a reference rather than a copy of the weights (as this might allow you to mutate the weights directly, which seems bad), and the current test which should just check the gradient behaviour.

It's all conflated in that single test because I spent hours trying to figure out what was going on, so it still has all the print statements and other scaffolding I needed to track it down.

I'll clean up the formatting issues at the same time.
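For what it's worth, the reference-vs-copy half of that split could snapshot the fetched weights before training, roughly like this (session and weightName are placeholder names, and this assumes the fetched TFloat32 exposes the NdArray copy API):

// Sketch only: deep-copy the fetched weights so later training steps can't silently mutate them.
TFloat32 fetched = (TFloat32) session.runner().fetch(weightName).run().get(0);
FloatNdArray snapshot = NdArrays.ofFloats(fetched.shape());
fetched.copyTo(snapshot);  // copy into independent heap-backed storage
// ... run a training step on the same session ...
// If fetch hands back a live reference, `fetched` now differs from `snapshot`,
// and the snapshot is what the initialization check should compare against.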
