trvl-mask-layers parallel NNPDF4.0 fit broken #1838
Comments
The first issue comes from the fact that if a NaN is encountered in the first epoch, the fit history is still empty, but the handler (a print statement) wants to use it for printing. I guess we can check in the print statement whether there was a successful epoch to report on, and otherwise print something else (alarming). Now as to why a NaN is encountered in the first epoch, that seems to be another bug... |
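A minimal sketch of the guard suggested above (hypothetical names, not the actual n3fit stopping/logging code): only report a best epoch if the history actually contains a finite entry, and print a warning otherwise.

```python
import math


def report_best_epoch(history, logger=print):
    """Summarise the best epoch recorded so far.

    `history` is assumed to be a list of per-epoch records
    (dicts with "epoch" and "chi2" entries); it can be empty
    if a NaN was hit during the very first epoch.
    """
    # Keep only epochs with a finite loss; a NaN in the first
    # epoch can leave nothing to report on.
    valid = [h for h in history if math.isfinite(h["chi2"])]
    if not valid:
        logger("WARNING: no successful epoch to report on, "
               "the loss was non-finite from the first epoch")
        return
    best = min(valid, key=lambda h: h["chi2"])
    logger(f"Best epoch {best['epoch']} with chi2 = {best['chi2']:.4f}")
```

Calling this with an empty history then prints the warning instead of failing on the empty list.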
I am very confused! I thought that with the current state of #1788 (which already includes the stopping refactoring #1792, but not the hyperopt fixes #1820), you were able to reproduce the exact numerical values of the baseline (modulo the TF versions)? Are these errors dataset/replica-dependent? I am afraid I don't understand exactly how they arise. |
Hi @Radonirinaunimi, that was for a test with 10 replicas, but at higher replica counts problems seem to pop up... |
That is then very weird. Conceptually and implementation-wise there shouldn't be a difference between 10 and 50. |
Hmm, on my laptop the 50-replica parallel fit does run (the memory leaks are now definitely history)... perhaps again an issue with the cluster software stack... |
@goord, could you please send me the run card that you are using, so that I can try this on our cluster? |
When launching many-replica parallel fits from the trvl-mask-layers branch on the CPU, a couple of things seem to be broken. With 10 replicas things run fine, but with 50 replicas the following error occurs at the training stage:
which seems like a side effect of the stopping refactoring.
Surprisingly, when running 100 parallel replicas, the training step fails with a different error:
Apparently the mask layer fails for the BCDMSD_dw_ite dataset, but I have seen the same issue occur for other datasets too.
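For context, a rough, hypothetical sketch of what a per-replica training/validation mask layer does (simplified; the actual implementation lives on the trvl-mask-layers branch): each replica carries its own boolean mask over the points of a dataset, and only the masked predictions enter the training loss. A mismatch between the mask shape and the per-replica prediction shape for a given dataset would produce exactly this kind of dataset-specific failure.

```python
import tensorflow as tf


class TrvlMask(tf.keras.layers.Layer):
    """Apply a fixed boolean training/validation mask per replica.

    `masks` is assumed to have shape (n_replicas, n_points) for a
    single dataset; `call` keeps only the training points of each
    replica (a simplification of the real layer).
    """

    def __init__(self, masks, **kwargs):
        super().__init__(**kwargs)
        self.masks = tf.constant(masks, dtype=bool)

    def call(self, predictions):
        # predictions: (n_replicas, n_points); if the mask was built
        # for a different number of points (or replicas), this is
        # where a shape error would surface.
        return tf.ragged.boolean_mask(predictions, self.masks)
```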