
KerasNLP Bug/Error at Docstring/Documentation class example provided. #1784

Open
Humbulani1234 opened this issue Aug 19, 2024 · 0 comments
Labels
type:Bug Something isn't working

Comments


Describe the bug

I encountered an error/bug while trying to execute a docstring code example from the file keras_nlp.src.models.gpt2.causal_lm.py. The example code is reproduced below:

import keras_nlp

features = ["a quick fox.", "a fox quick."]
vocab = {"<|endoftext|>": 0, "a": 4, "Ġquick": 5, "Ġfox": 6}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick"]
merges += ["Ġ f", "o x", "Ġf ox"]

tokenizer = keras_nlp.models.GPT2Tokenizer(
    vocabulary=vocab,
    merges=merges,
)
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor(
    tokenizer=tokenizer,
    sequence_length=128,
)
backbone = keras_nlp.models.GPT2Backbone(
    vocabulary_size=30552,
    num_layers=4,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM(
    backbone=backbone,
    preprocessor=preprocessor,
)
gpt2_lm.fit(x=features, batch_size=2)
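
As context, a quick diagnostic (my own addition, not part of the docstring example) suggests where the -1 comes from: the toy vocab has no entry for ".", so the tokenizer appears to map it to an out-of-range id that then ends up in the labels. A minimal sketch, assuming the preprocessor can be called directly on raw strings and returns (x, y, sample_weight) as in other KerasNLP examples:

# Diagnostic sketch (my own addition). Assumes the preprocessor is callable on
# raw strings and returns (x, y, sample_weight), with x a dict of "token_ids"
# and "padding_mask", as in other KerasNLP examples.
x, y, sample_weight = preprocessor(features)
print(x["token_ids"][:, :6])        # token ids fed to the backbone
print(y[:, :6])                     # labels fed to the loss; the -1 shows up here
print(tokenizer(["a quick fox."]))  # "." is not in vocab, hence the suspect id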

Below is a description of the error, traced with pdb, followed by the full traceback:

> /home/humbulani/keras-master/env/lib/python3.10/site-packages/keras_nlp/src/models/causal_lm.py(79)__init__()
-> super().__init__(*args, **kwargs)
(Pdb) c
> /home/humbulani/keras-master/env/lib/python3.10/site-packages/keras_nlp/src/models/causal_lm.py(140)compile()
-> super().compile(
(Pdb) c
> /home/humbulani/keras-master/nlp_example.py(94)<module>()
-> gpt2_lm.fit(x=features, batch_size=2)
(Pdb) c
> /home/humbulani/keras-master/env/lib/python3.10/site-packages/keras_nlp/src/utils/pipeline_model.py(196)fit()
-> return super().fit(
(Pdb) c
> /home/humbulani/keras-master/keras/src/backend/tensorflow/trainer.py(269)fit()
-> self._assert_compile_called("fit")
(Pdb) c
> /home/humbulani/keras-master/keras/src/trainers/epoch_iterator.py(66)__init__()
-> self.data_adapter = data_adapters.get_data_adapter(
(Pdb) c
> /home/humbulani/keras-master/keras/src/backend/tensorflow/trainer.py(331)fit()
-> logs = self.train_function(iterator)
(Pdb) c
> /home/humbulani/keras-master/keras/src/backend/tensorflow/trainer.py(125)one_step_on_iterator()
-> """Runs a single training step given a Dataset iterator."""
(Pdb) c
> /home/humbulani/keras-master/keras/src/backend/tensorflow/trainer.py(50)train_step()
-> x, y, sample_weight = data_adapter_utils.unpack_x_y_sample_weight(data)
(Pdb) c
> /home/humbulani/keras-master/keras/src/losses/losses.py(1724)sparse_categorical_crossentropy()
-> res = ops.sparse_categorical_crossentropy(
(Pdb) c
2024-08-19 12:03:15.742874: W tensorflow/core/framework/op_kernel.cc:1840] OP_REQUIRES failed at sparse_xent_op.cc:103 : INVALID_ARGUMENT: Received a label value of -1 which is outside the valid range of [0, 30552).  Label values: 4 5 6 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 6 5 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2024-08-19 12:03:15.742930: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: INVALID_ARGUMENT: Received a label value of -1 which is outside the valid range of [0, 30552).  Label values: 4 5 6 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 6 5 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Traceback (most recent call last):
  File "/home/humbulani/keras-master/nlp_example.py", line 94, in <module>
    gpt2_lm.fit(x=features, batch_size=2)
  File "/home/humbulani/keras-master/env/lib/python3.10/site-packages/keras_nlp/src/utils/pipeline_model.py", line 196, in fit
    return super().fit(
  File "/home/humbulani/keras-master/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/humbulani/keras-master/env/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 5983, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__SparseSoftmaxCrossEntropyWithLogits_device_/job:localhost/replica:0/task:0/device:CPU:0}} Received a label value of -1 which is outside the valid range of [0, 30552).  Label values: 4 5 6 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 6 5 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 [Op:SparseSoftmaxCrossEntropyWithLogits] name: 

The error is clear: the -1 label value. I've traced it to the following function in the file keras.src.backend.tensorflow.trainer:

@tf.autograph.experimental.do_not_convert
def one_step_on_iterator(iterator):
    """Runs a single training step given a Dataset iterator."""
    data = next(iterator)
    outputs = self.distribute_strategy.run(
        one_step_on_data, args=(data,)
    )
    outputs = reduce_per_replica(
        outputs,
        self.distribute_strategy,
        reduction="auto",
    )
    return outputs
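
Before tracing where the data comes from, the failing loss call itself can be reproduced in isolation. This is a minimal sketch of my own, assuming the default SparseCategoricalCrossentropy(from_logits=True) loss that CausalLM appears to compile with; on the TensorFlow backend it should raise the same INVALID_ARGUMENT error:

import tensorflow as tf
import keras

# Minimal repro of the failing loss call (my own sketch, not from the trace).
y_true = tf.constant([[4, 5, 6, -1]])      # -1 copied from the traceback
y_pred = tf.random.uniform((1, 4, 30552))  # fake logits, vocabulary size 30552
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss(y_true, y_pred)  # InvalidArgumentError: label value of -1 outside [0, 30552)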

The line data = next(iterator) produces the batch, including the labels, so the -1 value already exists at this point. The iterator argument is a TensorFlow OwnedIterator executed from the file tensorflow.python.data.ops.iterator_ops, and the relevant function is reproduced below:

def _next_internal(self):
    autograph_status = autograph_ctx.control_status_ctx().status
    autograph_disabled = autograph_status == autograph_ctx.Status.DISABLED
    if not context.executing_eagerly() and autograph_disabled:
      self._get_next_call_count += 1
      if self._get_next_call_count > GET_NEXT_CALL_ERROR_THRESHOLD:
        raise ValueError(GET_NEXT_CALL_ERROR_MESSAGE)

    if not context.executing_eagerly():
      # TODO(b/169442955): Investigate the need for this colocation constraint.
      with ops.colocate_with(self._iterator_resource):
        ret = gen_dataset_ops.iterator_get_next(
            self._iterator_resource,
            output_types=self._flat_output_types,
            output_shapes=self._flat_output_shapes)
      return structure.from_compatible_tensor_list(self._element_spec, ret)

This in turn calls gen_dataset_ops.iterator_get_next from the file tensorflow.python.data.ops.gen_dataset_ops, which dispatches to the underlying op execution. I did not trace further, since this leads into C++ code.
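
To illustrate that the iterator merely forwards whatever the upstream pipeline produced, here is a small self-contained sketch (my own construction, not part of the trace): next() on a tf.data iterator dispatches to the iterator_get_next op and returns the batch unchanged, so the -1 must originate in the preprocessing step rather than in the iterator itself.

import tensorflow as tf

# Stand-in labels shaped like the ones in the traceback (my own example data).
labels = tf.constant([[4, 5, 6, -1], [4, 6, 5, -1]])
dataset = tf.data.Dataset.from_tensor_slices(labels).batch(2)

iterator = iter(dataset)  # creates an OwnedIterator
batch = next(iterator)    # dispatches to the iterator_get_next op
print(batch.numpy())      # the -1 values come back exactly as produced upstream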

Environment

Linux 6.5.0-26-generic #26~22.04.1-Ubuntu
keras - 3.5.0
python - 3.10.12
tensorflow - 2.17.0
keras-nlp - 0.14.4
Additional TensorFlow info:
2024-08-19 12:20:02.135293: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-19 12:20:02.154198: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-19 12:20:02.159831: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-19 12:20:02.174579: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-19 12:20:03.092334: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-08-19 12:20:04.517556: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

To Reproduce

Link to a Colab Notebook

Expected behavior

I expected the model to train normally: fit() should run without errors and return a History object.

Would you like to help us fix it?

@mehtamansi29 mehtamansi29 added the type:Bug Something isn't working label Sep 20, 2024