trvl-mask-layers parallel NNPDF4.0 fit broken #1838

goord · 2023-11-08T08:14:28Z

When launching many-replica parallel fits on the trvl-mask-layers branch on the CPU, a couple of things seem to be broken. With 10 replicas, things run fine. With 50 replicas, following error occurs at the training stage:

[WARNING]:  > NaN found, stopping activated
[CRITICAL]: Bug in n3fit ocurred. Please report it.
Traceback (most recent call last):
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/scripts/n3fit_exec.py", line 286, in run
    super().run()
  File "/gpfs/home6/gijstest/src/nnpdf/validphys2/src/validphys/app.py", line 152, in run
    super().run()
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/app.py", line 380, in run
    rb.execute_sequential()
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 166, in execute_sequential
    result = self.get_result(callspec.function,
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 175, in get_result
    fres =  function(**kwdict)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/performfit.py", line 266, in performfit
    log.info("Stopped at epoch=%d", stopping_object.stop_epoch)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/stopping.py", line 387, in stop_epoch
    return self._history.final_epoch + 1
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

which seems like a side effect of the stopping refactoring.

Surprisingly, when running 100 parallel replicas, the training step fails with a different error:

[CRITICAL]: Bug in n3fit ocurred. Please report it.
Traceback (most recent call last):
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/scripts/n3fit_exec.py", line 286, in run
    super().run()
  File "/gpfs/home6/gijstest/src/nnpdf/validphys2/src/validphys/app.py", line 152, in run
    super().run()
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/app.py", line 380, in run
    rb.execute_sequential()
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 166, in execute_sequential
    result = self.get_result(callspec.function,
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 175, in get_result
    fres =  function(**kwdict)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/performfit.py", line 262, in performfit
    result = pdf_gen_and_train_function(parameters)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/model_trainer.py", line 942, in hyperparametrizable
    passed = self._train_and_fit(models["training"], stopping_object, epochs=epochs,)
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/model_trainer.py", line 743, in _train_and_fit
    training_model.perform_fit(
  File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/MetaModel.py", line 170, in perform_fit
    history = super().fit(x=x_params, y=y, epochs=epochs, **kwargs)
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'meta_model/trmask_BCDMSD_dw_ite/boolean_mask/GatherV2' defined at (most recent call last):
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/bin/n3fit", line 33, in <module>
      sys.exit(load_entry_point('n3fit', 'console_scripts', 'n3fit')())
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/scripts/n3fit_exec.py", line 298, in main
      a.main()
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/app.py", line 395, in main
      self.run()
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/scripts/n3fit_exec.py", line 286, in run
      super().run()
    File "/gpfs/home6/gijstest/src/nnpdf/validphys2/src/validphys/app.py", line 152, in run
      super().run()
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/app.py", line 380, in run
      rb.execute_sequential()
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 166, in execute_sequential
      result = self.get_result(callspec.function,
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/reportengine/resourcebuilder.py", line 175, in get_result
      fres =  function(**kwdict)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/performfit.py", line 262, in performfit
      result = pdf_gen_and_train_function(parameters)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/model_trainer.py", line 942, in hyperparametrizable
      passed = self._train_and_fit(models["training"], stopping_object, epochs=epochs,)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/model_trainer.py", line 743, in _train_and_fit
      training_model.perform_fit(
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/MetaModel.py", line 170, in perform_fit
      history = super().fit(x=x_params, y=y, epochs=epochs, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 1384, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 1021, in train_function
      return step_function(self, iterator)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 1010, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 1000, in run_step
      outputs = model.train_step(data)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/training.py", line 859, in train_step
      y_pred = self(x, training=True)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/functional.py", line 451, in call
      return self._run_internal_graph(
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/functional.py", line 589, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/engine/base_layer.py", line 1096, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/gijstest/.conda/envs/nnpdf-dev-cpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/layers/mask.py", line 49, in call
      if self.mask is not None:
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/layers/mask.py", line 50, in call
      flat_res = op.boolean_mask(ret, self.mask, axis=self.axis)
    File "/gpfs/home6/gijstest/src/nnpdf/n3fit/src/n3fit/backends/keras_backend/operations.py", line 225, in boolean_mask
      return tf.boolean_mask(*args, **kwargs)
Node: 'meta_model/trmask_BCDMSD_dw_ite/boolean_mask/GatherV2'
indices[88] = -4936617473272247041 is not in [0, 24800)
	 [[{{node meta_model/trmask_BCDMSD_dw_ite/boolean_mask/GatherV2}}]] [Op:__inference_train_function_323127]

Apparently the mask layer fails for the BCDMSD_dw_ite dataset, but I have seen the same issue occur for other datasets too.

The text was updated successfully, but these errors were encountered:

goord · 2023-11-08T10:13:51Z

First issue comes from the fact that if a NaN is encountered in the first epoch, the fit history is still empty but the handler (print statement) wants to use it for printing. I guess we can check in the print statement whether there was a successful epoch to report on, otherwise print something else (alarming).

Now as to why a NaN is encounetered in the first epoch, that seems another bug...

Radonirinaunimi · 2023-11-08T11:10:23Z

First issue comes from the fact that if a NaN is encountered in the first epoch, the fit history is still empty but the handler (print statement) wants to use it for printing. I guess we can check in the print statement whether there was a successful epoch to report on, otherwise print something else (alarming).

Now as to why a NaN is encounetered in the first epoch, that seems another bug...

I am very confused! I thought that with the current state of #1788 (which includes already the stopping refactoring #1792, but the not the hyperopt fixes #1820), you were able to reproduce the exact numerical values as in the baseline (modulo the TF versions)?

Are these errors dataset/replica-dependent? I am afraid I don't understand how exactly these arise.

goord · 2023-11-08T11:11:49Z

Hi @Radonirinaunimi that was for test with 10 replicas, but at higher replica counts, problems seem to pop up...

Radonirinaunimi · 2023-11-08T11:14:41Z

Hi @Radonirinaunimi that was for test with 10 replicas, but at higher replica counts, problems seem to pop up...

That is then very weird. Conceptually and implementation-wise there shouldn't be a difference between 10 and 50.

goord · 2023-11-08T20:38:35Z

hmm on my laptop the 50 replica parallel fit does run (memory leaks are now definitely history)... again an issue with the cluster software stack perhaps...

Radonirinaunimi · 2023-11-14T17:23:13Z

@goord, could you please send me the run card that you are using such that I can try this on our cluster?

goord · 2023-11-15T12:33:44Z

@goord, could you please send me the run card that you are using such that I can try this on our cluster?

NNPDF40_nnlo_as_01180_1000.yml.txt

Radonirinaunimi · 2023-11-28T20:45:04Z

@goord, could you please send me the run card that you are using such that I can try this on our cluster?

NNPDF40_nnlo_as_01180_1000.yml.txt

Hi @goord, just to confirm that I did check on our cluster and the attached run card works fine.

goord assigned goord and APJansen Nov 8, 2023

Radonirinaunimi linked a pull request Nov 28, 2023 that will close this issue

Parallel replicas with varying tr-vl masks #1788

Merged

RoyStegeman added the escience label Nov 29, 2023

scarlehoff closed this as completed in #1788 Feb 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

trvl-mask-layers parallel NNPDF4.0 fit broken #1838

trvl-mask-layers parallel NNPDF4.0 fit broken #1838

goord commented Nov 8, 2023 •

edited

Loading

goord commented Nov 8, 2023

Radonirinaunimi commented Nov 8, 2023 •

edited

Loading

goord commented Nov 8, 2023

Radonirinaunimi commented Nov 8, 2023

goord commented Nov 8, 2023

Radonirinaunimi commented Nov 14, 2023

goord commented Nov 15, 2023

Radonirinaunimi commented Nov 28, 2023

trvl-mask-layers parallel NNPDF4.0 fit broken #1838

trvl-mask-layers parallel NNPDF4.0 fit broken #1838

Comments

goord commented Nov 8, 2023 • edited Loading

goord commented Nov 8, 2023

Radonirinaunimi commented Nov 8, 2023 • edited Loading

goord commented Nov 8, 2023

Radonirinaunimi commented Nov 8, 2023

goord commented Nov 8, 2023

Radonirinaunimi commented Nov 14, 2023

goord commented Nov 15, 2023

Radonirinaunimi commented Nov 28, 2023

goord commented Nov 8, 2023 •

edited

Loading

Radonirinaunimi commented Nov 8, 2023 •

edited

Loading