Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Getting error when loading back a saved session-based model #1132

Closed
rnyak opened this issue Jun 2, 2023 · 6 comments · Fixed by #1147
Closed

[BUG] Getting error when loading back a saved session-based model #1132

rnyak opened this issue Jun 2, 2023 · 6 comments · Fixed by #1147
Assignees
Labels
bug Something isn't working P1
Milestone

Comments

@rnyak
Copy link
Contributor

rnyak commented Jun 2, 2023

Bug description

I am getting different errors when I try to load back a saved session-based model.

Error 1:

...
ValueError: Unable to restore custom object of type _tf_keras_metric. Please make sure that any custom layers are included in the `custom_objects` arg when calling `load_model()` and make sure that all layers implement `get_config` and `from_config`.

This error goes away if I add import merlin.models.tf as mm after I import tensorflow, and I get another error:

Error 2:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 3
      1 import tensorflow as tf
      2 import merlin.models.tf as mm
----> 3 loaded_model = tf.keras.models.load_model('/models/examples/saved_model')

File /usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/core/combinators.py:129, in SequentialBlock.build(self, input_shape)
    121 """Builds the sequential block
    122 
    123 Parameters
   (...)
    126     The input shape, by default None
    127 """
    128 self._maybe_propagate_context(input_shape)
--> 129 build_sequentially(self, self.layers, input_shape)

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/core/combinators.py:859, in build_sequentially(self, layers, input_shape)
    854             v = TypeError(
    855                 f"Couldn't build {layer}, "
    856                 f"did you forget to add aggregation to {last_layer}?"
    857             )
    858         six.reraise(t, v, tb)
--> 859     input_shape = layer.compute_output_shape(input_shape)
    860     last_layer = layer
    861 self.built = True

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/blocks/mlp.py:256, in _Dense.compute_output_shape(self, input_shape)
    253     agg = tabular_aggregation_registry.parse(self.pre_aggregation)
    254     input_shape = agg.compute_output_shape(input_shape)
--> 256 return self.dense.compute_output_shape(input_shape)

ValueError: The last dimension of the input shape of a Dense layer should be defined. Found None. Received: input_shape=(None, None)

Steps/Code to reproduce bug

import os
import itertools

import numpy as np
import tensorflow as tf

import merlin.models.tf as mm
from merlin.dataloader.ops.embeddings import EmbeddingOperator
from merlin.io import Dataset
from merlin.schema import Tags
from merlin.datasets.synthetic import generate_data

sequence_testing_data = generate_data("sequence-testing", num_rows=100)
sequence_testing_data.schema = sequence_testing_data.schema.select_by_tag(
        Tags.SEQUENCE
    ).select_by_tag(Tags.CATEGORICAL)
seq_schema = sequence_testing_data.schema

item_id_name = seq_schema.select_by_tag(Tags.ITEM).first.properties['domain']['name']

target = sequence_testing_data.schema.select_by_tag(Tags.ITEM_ID).column_names[0]

query_schema = seq_schema
output_schema = seq_schema.select_by_name(target)

d_model = 48
BATCH_SIZE = 32

dmodel = int(os.environ.get("dmodel", '48'))

input_block = mm.InputBlockV2(
    query_schema,    
    embeddings=mm.Embeddings(
        seq_schema.select_by_tag(Tags.CATEGORICAL), 
        sequence_combiner=None,
        dim=dmodel
        ))

xlnet_block = mm.XLNetBlock(d_model=dmodel, n_head=2, n_layer=2)

def get_output_block(schema, input_block=None):
    candidate_table = input_block["categorical"][item_id_name]
    to_call = candidate_table
    outputs = mm.CategoricalOutput(to_call=to_call)
    return outputs

output_block = get_output_block(seq_schema, input_block=input_block)

projection = mm.MLPBlock(
    [128, output_block.to_call.table.dim],
    no_activation_last_layer=True,
)
session_encoder = mm.Encoder(
    input_block,
    mm.MLPBlock([128, dmodel], no_activation_last_layer=True),
    xlnet_block,
    projection,
)
model = mm.RetrievalModelV2(query=session_encoder, output=output_block)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.005,
)

loss = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True
)
model.compile(
    run_eagerly=False, 
    optimizer=optimizer, 
    loss=loss,
    metrics=mm.TopKMetricsAggregator.default_metrics(top_ks=[10])
)
model.fit(
    sequence_testing_data, 
    batch_size=32, 
    epochs=1, 
    pre=mm.SequenceMaskRandom(schema=seq_schema, target=target, masking_prob=0.3, transformer=xlnet_block)
)

model.save('./saved_model')

Once the model is saved, please restart the kernel, and load back the model with the following script:

import tensorflow as tf
import merlin.models.tf as mm
loaded_model = tf.keras.models.load_model('./saved_model')

Expected behavior

We should be able to load back the model and then do offline evaluation or predictions, accordingly.

Environment details

  • Merlin version:
  • Platform: Docker image
  • Python version:
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?): merlin-tensorflow:23.04 (I am on dev branch of all the repos).

Additional context

@oliverholworthy
Copy link
Member

I've narrowed it down to this PR #1022 that started causing this error on reload (in an different python process to the one that saved the model) ValueError: The last dimension of the input shape of a Dense layer should be defined

@rnyak
Copy link
Contributor Author

rnyak commented Jun 7, 2023

I get the same error if I run this code as a python script, instead of in a jupyter nb.

@sararb
Copy link
Contributor

sararb commented Jun 12, 2023

I debugged the reloading process of the save transformer model and noticed that when reloading the Jupyter notebook, the step of building the transformer block from the configuration (here) is skipped, resulting in the loss of input shape information. This lead to passing an input shape of (None, None) to the top MLP layer.
The reason behind this is that the build method of the transformer is skipped because it fails to find the custom object TFXLNetMainLayer, which is a class from the external Hugging Face library transformers.

I managed to avoid the error by importing transformers as well:

import merlin.models.tf as mm
import transformers
model = tf.keras.models.load_model('./saved_model')

@oliverholworthy
Copy link
Member

Looks like import merlin.models.tf as mm has a side-effect of importing the transformers module too, so this should already be loaded when calling the load_model function in this example.

from merlin.models.utils.dependencies import is_transformers_available

def is_transformers_available() -> bool:
try:
import transformers
except ImportError:
transformers = None
return transformers is not None

@rnyak
Copy link
Contributor Author

rnyak commented Jun 12, 2023

  • transformerblock is not loaded properly
  • user needs to do import transformers as well --> a quick, workaround solution.
  • can we make this automatic? can we fix is_transformers_available function?

@rnyak
Copy link
Contributor Author

rnyak commented Jun 13, 2023

Notes based on @oliverholworthy 's debugging:

  • This merged PR partially solves the issue. So it seems to be failing in TensorFlow versions 2.11, 2.10 (and earlier probably), but working with 2.12.
  • we run our test suite currently against TensorFlow versions {2.9, 2.10, 2.11, 2.12}
  • we dont know yet if this is a bug in TensorFlow or if there is still something we can change in our code to make this work in earlier TensorFlow versions.

@rnyak rnyak closed this as completed Oct 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants