trvl-mask-layers parallel NNPDF4.0 fit broken #1838

Closed
goord opened this issue Nov 8, 2023 · 8 comments · Fixed by #1788

@goord
Collaborator

goord commented Nov 8, 2023

When launching many-replica parallel fits on the trvl-mask-layers branch on the CPU, a couple of things seem to be broken. With 10 replicas, things run fine. With 50 replicas, the following error occurs at the training stage:

[WARNING]:  > NaN found, stopping activated
[CRITICAL]: Bug in n3fit ocurred. Please report it.
Traceback (most recent call last):
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/scripts/n3fit_exec.py", line 286, in run
    super().run()
  File "/gpfs/home6/gijstest/src/nnpdf/validphys2/src/validphys/app.py", line 152, in run
    super().run()
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/app.py", line 380, in run
    rb.execute_sequential()
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 166, in execute_sequential
    result = self.get_result(callspec.function,
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 175, in get_result
    fres =  function(**kwdict)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/performfit.py", line 266, in performfit
    log.info("Stopped at epoch=%d", stopping_object.stop_epoch)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/stopping.py", line 387, in stop_epoch
    return self._history.final_epoch + 1
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

which seems like a side effect of the stopping refactoring.

Surprisingly, when running 100 parallel replicas, the training step fails with a different error:

[CRITICAL]: Bug in n3fit ocurred. Please report it.
Traceback (most recent call last):
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/scripts/n3fit_exec.py", line 286, in run
    super().run()
  File "/gpfs/home6/gijstest/src/nnpdf/validphys2/src/validphys/app.py", line 152, in run
    super().run()
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/app.py", line 380, in run
    rb.execute_sequential()
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 166, in execute_sequential
    result = self.get_result(callspec.function,
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 175, in get_result
    fres =  function(**kwdict)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/performfit.py", line 262, in performfit
    result = pdf_gen_and_train_function(parameters)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/model_trainer.py", line 942, in hyperparametrizable
    passed = self._train_and_fit(models["training"], stopping_object, epochs=epochs,)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/model_trainer.py", line 743, in _train_and_fit
    training_model.perform_fit(
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/MetaModel.py", line 170, in perform_fit
    history = super().fit(x=x_params, y=y, epochs=epochs, **kwargs)
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'meta_model/trmask_BCDMSD_dw_ite/boolean_mask/GatherV2' defined at (most recent call last):
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/bin/n3fit", line 33, in <module>
      sys.exit(load_entry_point('n3fit', 'console_scripts', 'n3fit')())
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/scripts/n3fit_exec.py", line 298, in main
      a.main()
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/app.py", line 395, in main
      self.run()
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/scripts/n3fit_exec.py", line 286, in run
      super().run()
    File "/gpfs/home6/gijstest/src/nnpdf/validphys2/src/validphys/app.py", line 152, in run
      super().run()
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/app.py", line 380, in run
      rb.execute_sequential()
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 166, in execute_sequential
      result = self.get_result(callspec.function,
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 175, in get_result
      fres =  function(**kwdict)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/performfit.py", line 262, in performfit
      result = pdf_gen_and_train_function(parameters)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/model_trainer.py", line 942, in hyperparametrizable
      passed = self._train_and_fit(models["training"], stopping_object, epochs=epochs,)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/model_trainer.py", line 743, in _train_and_fit
      training_model.perform_fit(
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/MetaModel.py", line 170, in perform_fit
      history = super().fit(x=x_params, y=y, epochs=epochs, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 1384, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 1021, in train_function
      return step_function(self, iterator)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 1010, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 1000, in run_step
      outputs = model.train_step(data)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 859, in train_step
      y_pred = self(x, training=True)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/functional.py", line 451, in call
      return self._run_internal_graph(
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/functional.py", line 589, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/layers/mask.py", line 49, in call
      if self.mask is not None:
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/layers/mask.py", line 50, in call
      flat_res = op.boolean_mask(ret, self.mask, axis=self.axis)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/operations.py", line 225, in boolean_mask
      return tf.boolean_mask(*args, **kwargs)
Node: 'meta_model/trmask_BCDMSD_dw_ite/boolean_mask/GatherV2'
indices[88] = -4936617473272247041 is not in [0, 24800)
	 [[{{node meta_model/trmask_BCDMSD_dw_ite/boolean_mask/GatherV2}}]] [Op:__inference_train_function_323127]

Apparently the mask layer fails for the BCDMSD_dw_ite dataset, but I have seen the same issue occur for other datasets too.
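
For readers unfamiliar with the error above: a boolean mask is applied by turning the mask into integer positions and then gathering at those positions, and the gather step rejects any index outside `[0, size)`. The following is a minimal pure-Python sketch of that idea, not the TensorFlow implementation; `mask_to_indices`, `checked_gather` and `boolean_mask` are hypothetical names used only for illustration. Indices produced from a mask in this way are in range by construction, so a value like the one in the log would have to come from somewhere other than the mask contents (an assumption, not something confirmed in this thread).

```python
from typing import List, Sequence


def mask_to_indices(mask: Sequence[bool]) -> List[int]:
    """Return the positions at which the mask is True."""
    return [i for i, keep in enumerate(mask) if keep]


def checked_gather(values: Sequence[float], indices: Sequence[int]) -> List[float]:
    """Gather values at the given indices, rejecting anything outside [0, len(values))."""
    size = len(values)
    for pos, idx in enumerate(indices):
        if not 0 <= idx < size:
            # Same shape of message as the error reported in the traceback.
            raise IndexError(f"indices[{pos}] = {idx} is not in [0, {size})")
    return [values[idx] for idx in indices]


def boolean_mask(values: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Illustrative boolean mask: convert the mask to indices, then gather."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return checked_gather(values, mask_to_indices(mask))
```

With a valid mask the gather cannot fail; passing a corrupted index list directly to `checked_gather` reproduces the form of the reported message.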

@goord
Collaborator Author

goord commented Nov 8, 2023

First issue comes from the fact that if a NaN is encountered in the first epoch, the fit history is still empty but the handler (print statement) wants to use it for printing. I guess we can check in the print statement whether there was a successful epoch to report on, otherwise print something else (alarming).
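
A minimal sketch of the check suggested here, assuming a history object whose `final_epoch` stays `None` until an epoch has completed (as the `TypeError` in the traceback implies). `FitHistory`, `stop_epoch` and `stop_message` are hypothetical stand-ins for illustration, not the actual n3fit classes:

```python
from typing import Optional


class FitHistory:
    """Hypothetical stand-in for the history held by the stopping logic."""

    def __init__(self) -> None:
        # Stays None until at least one epoch has been recorded.
        self.final_epoch: Optional[int] = None


def stop_epoch(history: FitHistory) -> Optional[int]:
    """Return the 1-based stopping epoch, or None if no epoch completed."""
    if history.final_epoch is None:
        return None
    return history.final_epoch + 1


def stop_message(history: FitHistory) -> str:
    """Build the log message without assuming that an epoch was recorded."""
    epoch = stop_epoch(history)
    if epoch is None:
        return "Stopped before any epoch completed (NaN in the first epoch?)"
    return f"Stopped at epoch={epoch}"
```

Returning `None` instead of raising keeps the failure visible in the log while letting the caller decide how loudly to report it.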

Now as to why a NaN is encountered in the first epoch, that seems another bug...

@Radonirinaunimi
Member

Radonirinaunimi commented Nov 8, 2023

> First issue comes from the fact that if a NaN is encountered in the first epoch, the fit history is still empty but the handler (print statement) wants to use it for printing. I guess we can check in the print statement whether there was a successful epoch to report on, otherwise print something else (alarming).
>
> Now as to why a NaN is encountered in the first epoch, that seems another bug...

I am very confused! I thought that with the current state of #1788 (which already includes the stopping refactoring #1792, but not the hyperopt fixes #1820), you were able to reproduce the exact numerical values as in the baseline (modulo the TF versions)?

Are these errors dataset/replica-dependent? I am afraid I don't understand how exactly these arise.

@goord
Collaborator Author

goord commented Nov 8, 2023

Hi @Radonirinaunimi that was for the test with 10 replicas, but at higher replica counts, problems seem to pop up...

@Radonirinaunimi
Member

> Hi @Radonirinaunimi that was for the test with 10 replicas, but at higher replica counts, problems seem to pop up...

That is then very weird. Conceptually and implementation-wise there shouldn't be a difference between 10 and 50.

@goord
Collaborator Author

goord commented Nov 8, 2023

hmm on my laptop the 50 replica parallel fit does run (memory leaks are now definitely history)... again an issue with the cluster software stack perhaps...

@Radonirinaunimi
Member

@goord, could you please send me the run card that you are using such that I can try this on our cluster?

@goord
Collaborator Author

goord commented Nov 15, 2023

> @goord, could you please send me the run card that you are using such that I can try this on our cluster?

NNPDF40_nnlo_as_01180_1000.yml.txt

Radonirinaunimi linked a pull request on Nov 28, 2023 that will close this issue

@Radonirinaunimi
Member

> > @goord, could you please send me the run card that you are using such that I can try this on our cluster?
>
> NNPDF40_nnlo_as_01180_1000.yml.txt

Hi @goord, just to confirm that I did check on our cluster and the attached run card works fine.
