
Fix broken PoC notebook due to mismatched output names between config file and saved ranking model #112

Closed

Conversation

Contributor

@rnyak rnyak commented Jun 2, 2022

Currently, the 02-Deploying-multi-stage-RecSys-with-Merlin-Systems.ipynb notebook is broken due to a ranking model output name mismatch between the config files and the saved model.

We get the following error:

ValueError: Missing columns ['output_1'] found in operatorSubsetColumns during compute_input_schema.

This can be fixed by setting the proper output name click/binary_classification_task in the following lines:

ordering = combined_features["item_id"] >> SoftmaxSampling(
    # relevance_col must reference the ranking model's actual output column
    relevance_col=ranking["click"], topk=top_k, temperature=20.0
)

However, this is not enough: Triton also complains about output names and therefore cannot load the 5_predicttensorflow model.

This PR proposes a solution to that issue.
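
For context, the mismatch Triton reports can be confirmed by inspecting the serving signature of the exported SavedModel. A minimal sketch, assuming TensorFlow is installed; the repository path below is illustrative and should point at the exported predict model:

import tensorflow as tf

# Illustrative path into the exported Triton model repository.
model_path = "poc_ensemble/5_predicttensorflow/1/model.savedmodel"

loaded = tf.saved_model.load(model_path)
signature = loaded.signatures["serving_default"]

# The keys printed here are the output names the model actually exposes;
# they must match the output names listed in the generated config.pbtxt.
print(signature.structured_outputs)

If the names printed here and the names in config.pbtxt disagree, Triton refuses to load the model, which is exactly the failure seen in the CI logs below.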

@rnyak rnyak requested review from karlhigley and jperez999 June 2, 2022 22:35
@rnyak rnyak added the bug Something isn't working label Jun 3, 2022
Member

@benfred benfred left a comment


Do you have an example of the output column names that will trigger this? Also, can we add a unit test that will test this out?

@rnyak
Contributor Author

rnyak commented Jun 3, 2022

Do you have an example of the output column names that will trigger this? Also, can we add a unit test that will test this out?

Yes, the unit test is already available; it was just not in the correct folder. PR NVIDIA-Merlin/Merlin#364 moves it to the unit folder.

The output column name that triggers this is click/binary_classification_task; it was output_1 before.
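
To illustrate the underlying mechanism with a self-contained sketch (plain Keras, not the actual Merlin Models code): the output name in the SavedModel signature comes from the model's output layer name, so whatever the ranking model names its task head is what the config must reference:

import tensorflow as tf

# Toy model whose output layer carries an explicit name; that layer
# name becomes the model's output name in the saved signature.
inp = tf.keras.Input(shape=(4,), name="features")
out = tf.keras.layers.Dense(1, name="output")(inp)
model = tf.keras.Model(inp, out)

print(model.output_names)  # ['output']

In the same way, the Merlin ranking model's task head yields click/binary_classification_task, so any config or downstream operator still expecting output_1 will fail.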

@nvidia-merlin-bot

CI Results
GitHub pull request #112 of commit 3b85d25ea82d561a3edecb64c553bb7c3e72bbad, no merge conflicts.
Running as SYSTEM
Setting status of 3b85d25ea82d561a3edecb64c553bb7c3e72bbad to PENDING with url https://10.20.13.93:8080/job/merlin_systems/64/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_systems
using credential fce1c729-5d7c-48e8-90cb-b0c314b1076e
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/systems # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/systems
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems user + githubtoken
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/systems +refs/pull/112/*:refs/remotes/origin/pr/112/* # timeout=10
 > git rev-parse 3b85d25ea82d561a3edecb64c553bb7c3e72bbad^{commit} # timeout=10
Checking out Revision 3b85d25ea82d561a3edecb64c553bb7c3e72bbad (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 3b85d25ea82d561a3edecb64c553bb7c3e72bbad # timeout=10
Commit message: "fix output names"
 > git rev-list --no-walk bb98249cfd0f00b4b5ce8e0b130aca93b3834121 # timeout=10
[merlin_systems] $ /bin/bash /tmp/jenkins4224079119260067482.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_systems/systems, configfile: pyproject.toml
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 18 items / 1 skipped

tests/unit/test_version.py . [ 5%]
tests/unit/systems/test_ensemble.py FF.F [ 27%]
tests/unit/systems/test_ensemble_ops.py .. [ 38%]
tests/unit/systems/test_export.py . [ 44%]
tests/unit/systems/test_graph.py . [ 50%]
tests/unit/systems/test_inference_ops.py .. [ 61%]
tests/unit/systems/test_op_runner.py .... [ 83%]
tests/unit/systems/test_tensorflow_inf_op.py ... [100%]

=================================== FAILURES ===================================
______________ test_workflow_tf_e2e_config_verification[parquet] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_config_ve0')
dataset = <merlin.io.dataset.Dataset object at 0x7f15e675daf0>
engine = 'parquet'

@pytest.mark.skipif(not TRITON_SERVER_PATH, reason="triton server not found")
@pytest.mark.parametrize("engine", ["parquet"])
def test_workflow_tf_e2e_config_verification(tmpdir, dataset, engine):
    # Create a Workflow
    schema = dataset.schema
    for name in ["x", "y", "id"]:
        dataset.schema.column_schemas[name] = dataset.schema.column_schemas[name].with_tags(
            [Tags.USER]
        )
    selector = ColumnSelector(["x", "y", "id"])

    workflow_ops = selector >> wf_ops.Rename(postfix="_nvt")
    workflow = Workflow(workflow_ops["x_nvt"])
    workflow.fit(dataset)

    # Create Tensorflow Model
    model = tf.keras.models.Sequential(
        [
            tf.keras.Input(name="x_nvt", dtype=tf.float64, shape=(1,)),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(1, name="output"),
        ]
    )
    model.compile(
        optimizer="adam",
        loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.metrics.SparseCategoricalAccuracy()],
    )

    # Creating Triton Ensemble
    triton_chain = (
        selector >> TransformWorkflow(workflow, cats=["x_nvt"]) >> PredictTensorflow(model)
    )
    triton_ens = Ensemble(triton_chain, schema)

    # Creating Triton Ensemble Config
    ensemble_config, node_configs = triton_ens.export(str(tmpdir))

    config_path = tmpdir / "ensemble_model" / "config.pbtxt"

    # Checking Triton Ensemble Config
    with open(config_path, "rb") as f:
        config = model_config.ModelConfig()
        raw_config = f.read()
        parsed = text_format.Parse(raw_config, config)

        # The config file contents are correct
        assert parsed.name == "ensemble_model"
        assert parsed.platform == "ensemble"
        assert hasattr(parsed, "ensemble_scheduling")

    df = make_df({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0], "id": [7, 8, 9]})

    output_columns = triton_ens.graph.output_schema.column_names
>   response = _run_ensemble_on_tritonserver(str(tmpdir), output_columns, df, triton_ens.name)

tests/unit/systems/test_ensemble.py:113:


tests/unit/systems/utils/triton.py:39: in _run_ensemble_on_tritonserver
with run_triton_server(tmpdir) as client:
/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)


modelpath = '/tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_config_ve0'

@contextlib.contextmanager
def run_triton_server(modelpath):
    """This function starts up a Triton server instance and returns a client to it.

    Parameters
    ----------
    modelpath : string
        The path to the model to load.

    Yields
    ------
    client: tritonclient.InferenceServerClient
        The client connected to the Triton server.

    """
    cmdline = [
        TRITON_SERVER_PATH,
        "--model-repository",
        modelpath,
        "--backend-config=tensorflow,version=2",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"
    with subprocess.Popen(cmdline, env=env) as process:
        try:
            with grpcclient.InferenceServerClient("localhost:8001") as client:
                # wait until server is ready
                for _ in range(60):
                    if process.poll() is not None:
                        retcode = process.returncode
>                       raise RuntimeError(f"Tritonserver failed to start (ret={retcode})")

E RuntimeError: Tritonserver failed to start (ret=-11)

merlin/systems/triton/utils.py:46: RuntimeError
----------------------------- Captured stderr call -----------------------------
2022-06-03 18:48:29.180715: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-03 18:48:30.142476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-06-03 18:48:30.143219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15157 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
I0603 18:48:32.352452 14069 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0603 18:48:32.352535 14069 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0603 18:48:32.352543 14069 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0603 18:48:32.352548 14069 tensorflow.cc:2216] backend configuration:
{"cmdline":{"version":"2"}}
I0603 18:48:32.518372 14069 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f8f66000000' with size 268435456
I0603 18:48:32.519461 14069 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0603 18:48:32.527228 14069 model_repository_manager.cc:997] loading: 0_transformworkflow:1
I0603 18:48:32.627463 14069 model_repository_manager.cc:997] loading: 1_predicttensorflow:1
I0603 18:48:32.629477 14069 backend.cc:46] TRITONBACKEND_Initialize: nvtabular
I0603 18:48:32.629506 14069 backend.cc:53] Triton TRITONBACKEND API version: 1.8
I0603 18:48:32.629515 14069 backend.cc:56] 'nvtabular' TRITONBACKEND API version: 1.8
I0603 18:48:32.629705 14069 backend.cc:76] Loaded libpython successfully
I0603 18:48:32.797830 14069 backend.cc:89] Python interpreter is initialized
I0603 18:48:32.798755 14069 tensorflow.cc:2276] TRITONBACKEND_ModelInitialize: 1_predicttensorflow (version 1)
I0603 18:48:32.799220 14069 model_inst_state.hpp:58] Loading TritonPythonModel from module 'merlin.systems.triton.workflow_model'
I0603 18:48:34.756820 14069 tensorflow.cc:2325] TRITONBACKEND_ModelInstanceInitialize: 1_predicttensorflow (GPU device 0)
I0603 18:48:34.756935 14069 model_repository_manager.cc:1152] successfully loaded '0_transformworkflow' version 1
2022-06-03 18:48:35.840173: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_config_ve0/1_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:35.841433: I tensorflow/cc/saved_model/reader.cc:78] Reading meta graph with tags { serve }
2022-06-03 18:48:35.841459: I tensorflow/cc/saved_model/reader.cc:119] Reading SavedModel debug info (if present) from: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_config_ve0/1_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:35.841572: I tensorflow/core/platform/cpu_feature_guard.cc:152] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-03 18:48:35.849338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 12344 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-06-03 18:48:35.883194: I tensorflow/cc/saved_model/loader.cc:230] Restoring SavedModel bundle.
2022-06-03 18:48:35.912820: I tensorflow/cc/saved_model/loader.cc:214] Running initialization op on SavedModel bundle at path: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_config_ve0/1_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:35.921980: I tensorflow/cc/saved_model/loader.cc:321] SavedModel load for tags { serve }; Status: success: OK. Took 81821 microseconds.
I0603 18:48:35.923326 14069 tensorflow.cc:2363] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0603 18:48:35.923857 14069 tensorflow.cc:2302] TRITONBACKEND_ModelFinalize: delete model state
E0603 18:48:35.923883 14069 model_repository_manager.cc:1155] failed to load '1_predicttensorflow' version 1: Invalid argument: unexpected inference output 'output/BiasAdd:0', allowed outputs are: output
E0603 18:48:35.924106 14069 model_repository_manager.cc:1341] Invalid argument: ensemble 'ensemble_model' depends on '1_predicttensorflow' which has no loaded version
I0603 18:48:35.924222 14069 server.cc:524]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0603 18:48:35.925230 14069 server.cc:551]
+------------+-----------------------------------------------------------------+-----------------------------+
| Backend | Path | Config |
+------------+-----------------------------------------------------------------+-----------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow2/libtriton_tensorflow2.so | {"cmdline":{"version":"2"}} |
| nvtabular | /opt/tritonserver/backends/nvtabular/libtriton_nvtabular.so | {} |
+------------+-----------------------------------------------------------------+-----------------------------+

I0603 18:48:35.925330 14069 server.cc:594]
+---------------------+---------+------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+---------------------+---------+------------------------------------------------------------------------------------------------------------+
| 0_transformworkflow | 1 | READY |
| 1_predicttensorflow | 1 | UNAVAILABLE: Invalid argument: unexpected inference output 'output/BiasAdd:0', allowed outputs are: output |
+---------------------+---------+------------------------------------------------------------------------------------------------------------+

I0603 18:48:35.974061 14069 metrics.cc:651] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0603 18:48:35.975837 14069 tritonserver.cc:1962]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.20.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_config_ve0 |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0603 18:48:35.975876 14069 server.cc:252] Waiting for in-flight requests to complete.
I0603 18:48:35.975889 14069 model_repository_manager.cc:1029] unloading: 0_transformworkflow:1
I0603 18:48:35.975936 14069 server.cc:267] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0603 18:48:35.976036 14069 backend.cc:160] TRITONBACKEND_ModelInstanceFinalize: delete instance state
------------------------------ Captured log call -------------------------------
WARNING tensorflow:load.py:167 No training configuration found in save file, so the model was not compiled. Compile it manually.
__________________ test_workflow_tf_e2e_multi_op_run[parquet] __________________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_0')
dataset = <merlin.io.dataset.Dataset object at 0x7f15e61ad640>
engine = 'parquet'

@pytest.mark.skipif(not TRITON_SERVER_PATH, reason="triton server not found")
@pytest.mark.parametrize("engine", ["parquet"])
def test_workflow_tf_e2e_multi_op_run(tmpdir, dataset, engine):
    # Create a Workflow
    schema = dataset.schema
    for name in ["x", "y", "id"]:
        dataset.schema.column_schemas[name] = dataset.schema.column_schemas[name].with_tags(
            [Tags.USER]
        )

    workflow_ops = ["name-cat"] >> wf_ops.Categorify(cat_cache="host")
    workflow = Workflow(workflow_ops)
    workflow.fit(dataset)

    embedding_shapes_1 = wf_ops.get_embedding_sizes(workflow)

    cats = ["name-string"] >> wf_ops.Categorify(cat_cache="host")
    workflow_2 = Workflow(cats)
    workflow_2.fit(dataset)

    embedding_shapes = wf_ops.get_embedding_sizes(workflow_2)
    embedding_shapes_1.update(embedding_shapes)
    # Create Tensorflow Model
    model = create_tf_model(["name-cat", "name-string"], [], embedding_shapes_1)

    # Creating Triton Ensemble
    triton_chain_1 = ["name-cat"] >> TransformWorkflow(workflow)
    triton_chain_2 = ["name-string"] >> TransformWorkflow(workflow_2)
    triton_chain = (triton_chain_1 + triton_chain_2) >> PredictTensorflow(model)

    triton_ens = Ensemble(triton_chain, schema)

    # Creating Triton Ensemble Config
    ensemble_config, nodes_config = triton_ens.export(str(tmpdir))
    config_path = tmpdir / "ensemble_model" / "config.pbtxt"

    # Checking Triton Ensemble Config
    with open(config_path, "rb") as f:
        config = model_config.ModelConfig()
        raw_config = f.read()
        parsed = text_format.Parse(raw_config, config)

        # The config file contents are correct
        assert parsed.name == "ensemble_model"
        assert parsed.platform == "ensemble"
        assert hasattr(parsed, "ensemble_scheduling")

    df = dataset.to_ddf().compute()[["name-string", "name-cat"]].iloc[:3]
>   response = _run_ensemble_on_tritonserver(str(tmpdir), ["output"], df, triton_ens.name)

tests/unit/systems/test_ensemble.py:166:


tests/unit/systems/utils/triton.py:39: in _run_ensemble_on_tritonserver
with run_triton_server(tmpdir) as client:
/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)


modelpath = '/tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_0'

@contextlib.contextmanager
def run_triton_server(modelpath):
    """This function starts up a Triton server instance and returns a client to it.

    Parameters
    ----------
    modelpath : string
        The path to the model to load.

    Yields
    ------
    client: tritonclient.InferenceServerClient
        The client connected to the Triton server.

    """
    cmdline = [
        TRITON_SERVER_PATH,
        "--model-repository",
        modelpath,
        "--backend-config=tensorflow,version=2",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"
    with subprocess.Popen(cmdline, env=env) as process:
        try:
            with grpcclient.InferenceServerClient("localhost:8001") as client:
                # wait until server is ready
                for _ in range(60):
                    if process.poll() is not None:
                        retcode = process.returncode
>                       raise RuntimeError(f"Tritonserver failed to start (ret={retcode})")

E RuntimeError: Tritonserver failed to start (ret=-11)

merlin/systems/triton/utils.py:46: RuntimeError
----------------------------- Captured stderr call -----------------------------
I0603 18:48:41.086839 14204 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0603 18:48:41.086935 14204 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0603 18:48:41.086943 14204 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0603 18:48:41.086948 14204 tensorflow.cc:2216] backend configuration:
{"cmdline":{"version":"2"}}
I0603 18:48:41.277739 14204 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f7f96000000' with size 268435456
I0603 18:48:41.278454 14204 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0603 18:48:41.282621 14204 model_repository_manager.cc:997] loading: 0_transformworkflow:1
I0603 18:48:41.382872 14204 model_repository_manager.cc:997] loading: 1_transformworkflow:1
I0603 18:48:41.386101 14204 backend.cc:46] TRITONBACKEND_Initialize: nvtabular
I0603 18:48:41.386139 14204 backend.cc:53] Triton TRITONBACKEND API version: 1.8
I0603 18:48:41.386156 14204 backend.cc:56] 'nvtabular' TRITONBACKEND API version: 1.8
I0603 18:48:41.386384 14204 backend.cc:76] Loaded libpython successfully
I0603 18:48:41.483121 14204 model_repository_manager.cc:997] loading: 2_predicttensorflow:1
I0603 18:48:41.562721 14204 backend.cc:89] Python interpreter is initialized
I0603 18:48:41.564143 14204 model_inst_state.hpp:58] Loading TritonPythonModel from module 'merlin.systems.triton.workflow_model'
I0603 18:48:43.526182 14204 model_inst_state.hpp:58] Loading TritonPythonModel from module 'merlin.systems.triton.workflow_model'
I0603 18:48:43.526301 14204 model_repository_manager.cc:1152] successfully loaded '0_transformworkflow' version 1
I0603 18:48:43.531682 14204 tensorflow.cc:2276] TRITONBACKEND_ModelInitialize: 2_predicttensorflow (version 1)
I0603 18:48:43.531822 14204 model_repository_manager.cc:1152] successfully loaded '1_transformworkflow' version 1
I0603 18:48:43.533581 14204 tensorflow.cc:2325] TRITONBACKEND_ModelInstanceInitialize: 2_predicttensorflow (GPU device 0)
2022-06-03 18:48:44.591485: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_0/2_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:44.593337: I tensorflow/cc/saved_model/reader.cc:78] Reading meta graph with tags { serve }
2022-06-03 18:48:44.593362: I tensorflow/cc/saved_model/reader.cc:119] Reading SavedModel debug info (if present) from: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_0/2_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:44.593475: I tensorflow/core/platform/cpu_feature_guard.cc:152] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-03 18:48:44.597595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10318 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-06-03 18:48:44.643721: I tensorflow/cc/saved_model/loader.cc:230] Restoring SavedModel bundle.
2022-06-03 18:48:44.704278: I tensorflow/cc/saved_model/loader.cc:214] Running initialization op on SavedModel bundle at path: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_0/2_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:44.716876: I tensorflow/cc/saved_model/loader.cc:321] SavedModel load for tags { serve }; Status: success: OK. Took 125406 microseconds.
I0603 18:48:44.718627 14204 tensorflow.cc:2363] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0603 18:48:44.719325 14204 tensorflow.cc:2302] TRITONBACKEND_ModelFinalize: delete model state
E0603 18:48:44.719347 14204 model_repository_manager.cc:1155] failed to load '2_predicttensorflow' version 1: Invalid argument: unexpected inference output 'output/Sigmoid:0', allowed outputs are: output
E0603 18:48:44.719414 14204 model_repository_manager.cc:1341] Invalid argument: ensemble 'ensemble_model' depends on '2_predicttensorflow' which has no loaded version
I0603 18:48:44.719554 14204 server.cc:524]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0603 18:48:44.719754 14204 server.cc:551]
+------------+-----------------------------------------------------------------+-----------------------------+
| Backend | Path | Config |
+------------+-----------------------------------------------------------------+-----------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow2/libtriton_tensorflow2.so | {"cmdline":{"version":"2"}} |
| nvtabular | /opt/tritonserver/backends/nvtabular/libtriton_nvtabular.so | {} |
+------------+-----------------------------------------------------------------+-----------------------------+

I0603 18:48:44.720691 14204 server.cc:594]
+---------------------+---------+------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+---------------------+---------+------------------------------------------------------------------------------------------------------------+
| 0_transformworkflow | 1 | READY |
| 1_transformworkflow | 1 | READY |
| 2_predicttensorflow | 1 | UNAVAILABLE: Invalid argument: unexpected inference output 'output/Sigmoid:0', allowed outputs are: output |
+---------------------+---------+------------------------------------------------------------------------------------------------------------+

I0603 18:48:44.768434 14204 metrics.cc:651] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0603 18:48:44.771326 14204 tritonserver.cc:1962]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.20.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_0 |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0603 18:48:44.771379 14204 server.cc:252] Waiting for in-flight requests to complete.
I0603 18:48:44.771402 14204 model_repository_manager.cc:1029] unloading: 1_transformworkflow:1
I0603 18:48:44.771488 14204 model_repository_manager.cc:1029] unloading: 0_transformworkflow:1
I0603 18:48:44.771600 14204 server.cc:267] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0603 18:48:44.771614 14204 backend.cc:160] TRITONBACKEND_ModelInstanceFinalize: delete instance state
------------------------------ Captured log call -------------------------------
WARNING absl:signature_serialization.py:146 Function _wrapped_model contains input name(s) name-cat, name-string with unsupported characters which will be renamed to name_cat, name_string in the SavedModel.
WARNING absl:save.py:133 <nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures object at 0x7f15e215b580> has the same name 'DenseFeatures' as a built-in Keras object. Consider renaming <class 'nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures'> to avoid naming conflicts when loading with tf.keras.models.load_model. If renaming is not possible, pass the object in the custom_objects parameter of the load function.
WARNING tensorflow:load.py:167 No training configuration found in save file, so the model was not compiled. Compile it manually.
WARNING absl:signature_serialization.py:146 Function _wrapped_model contains input name(s) name-cat, name-string with unsupported characters which will be renamed to name_cat, name_string in the SavedModel.
WARNING absl:save.py:133 <nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures object at 0x7f15e215b580> has the same name 'DenseFeatures' as a built-in Keras object. Consider renaming <class 'nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures'> to avoid naming conflicts when loading with tf.keras.models.load_model. If renaming is not possible, pass the object in the custom_objects parameter of the load function.
______________ test_workflow_tf_e2e_multi_op_plus_2_run[parquet] _______________

tmpdir = local('/tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_1')
dataset = <merlin.io.dataset.Dataset object at 0x7f15e23ba910>
engine = 'parquet'

@pytest.mark.skipif(not TRITON_SERVER_PATH, reason="triton server not found")
@pytest.mark.parametrize("engine", ["parquet"])
def test_workflow_tf_e2e_multi_op_plus_2_run(tmpdir, dataset, engine):
    # Create a Workflow
    schema = dataset.schema
    for name in ["x", "y", "id"]:
        dataset.schema.column_schemas[name] = dataset.schema.column_schemas[name].with_tags(
            [Tags.USER]
        )

    workflow_ops = ["name-cat"] >> wf_ops.Categorify(cat_cache="host")
    workflow = Workflow(workflow_ops)
    workflow.fit(dataset)

    embedding_shapes_1 = wf_ops.get_embedding_sizes(workflow)

    cats = ["name-string"] >> wf_ops.Categorify(cat_cache="host")
    workflow_2 = Workflow(cats)
    workflow_2.fit(dataset)

    embedding_shapes = wf_ops.get_embedding_sizes(workflow_2)
    embedding_shapes_1.update(embedding_shapes)
    embedding_shapes_1["name-string_plus_2"] = embedding_shapes_1["name-string"]

    # Create Tensorflow Model
    model = create_tf_model(["name-cat", "name-string_plus_2"], [], embedding_shapes_1)

    # Creating Triton Ensemble
    triton_chain_1 = ["name-cat"] >> TransformWorkflow(workflow)
    triton_chain_2 = ["name-string"] >> TransformWorkflow(workflow_2) >> PlusTwoOp()
    triton_chain = (triton_chain_1 + triton_chain_2) >> PredictTensorflow(model)

    triton_ens = Ensemble(triton_chain, schema)

    # Creating Triton Ensemble Config
    ensemble_config, nodes_config = triton_ens.export(str(tmpdir))
    config_path = tmpdir / "ensemble_model" / "config.pbtxt"

    # Checking Triton Ensemble Config
    with open(config_path, "rb") as f:
        config = model_config.ModelConfig()
        raw_config = f.read()
        parsed = text_format.Parse(raw_config, config)

        # The config file contents are correct
        assert parsed.name == "ensemble_model"
        assert parsed.platform == "ensemble"
        assert hasattr(parsed, "ensemble_scheduling")

    df = dataset.to_ddf().compute()[["name-string", "name-cat"]].iloc[:3]
>   response = _run_ensemble_on_tritonserver(str(tmpdir), ["output"], df, triton_ens.name)

tests/unit/systems/test_ensemble.py:233:


tests/unit/systems/utils/triton.py:39: in _run_ensemble_on_tritonserver
with run_triton_server(tmpdir) as client:
/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)


modelpath = '/tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_1'

@contextlib.contextmanager
def run_triton_server(modelpath):
    """This function starts up a Triton server instance and returns a client to it.

    Parameters
    ----------
    modelpath : string
        The path to the model to load.

    Yields
    ------
    client: tritonclient.InferenceServerClient
        The client connected to the Triton server.

    """
    cmdline = [
        TRITON_SERVER_PATH,
        "--model-repository",
        modelpath,
        "--backend-config=tensorflow,version=2",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"
    with subprocess.Popen(cmdline, env=env) as process:
        try:
            with grpcclient.InferenceServerClient("localhost:8001") as client:
                # wait until server is ready
                for _ in range(60):
                    if process.poll() is not None:
                        retcode = process.returncode
>                       raise RuntimeError(f"Tritonserver failed to start (ret={retcode})")

E RuntimeError: Tritonserver failed to start (ret=-11)

merlin/systems/triton/utils.py:46: RuntimeError
----------------------------- Captured stderr call -----------------------------
I0603 18:48:49.289372 14399 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0603 18:48:49.289547 14399 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.8
I0603 18:48:49.289563 14399 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.8
I0603 18:48:49.289576 14399 tensorflow.cc:2216] backend configuration:
{"cmdline":{"version":"2"}}
I0603 18:48:49.482607 14399 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f102e000000' with size 268435456
I0603 18:48:49.483330 14399 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0603 18:48:49.488304 14399 model_repository_manager.cc:997] loading: 0_transformworkflow:1
I0603 18:48:49.588596 14399 model_repository_manager.cc:997] loading: 3_predicttensorflow:1
I0603 18:48:49.591773 14399 backend.cc:46] TRITONBACKEND_Initialize: nvtabular
I0603 18:48:49.591809 14399 backend.cc:53] Triton TRITONBACKEND API version: 1.8
I0603 18:48:49.591826 14399 backend.cc:56] 'nvtabular' TRITONBACKEND API version: 1.8
I0603 18:48:49.592049 14399 backend.cc:76] Loaded libpython successfully
I0603 18:48:49.688871 14399 model_repository_manager.cc:997] loading: 2_plustwoop:1
I0603 18:48:49.753971 14399 backend.cc:89] Python interpreter is initialized
I0603 18:48:49.754861 14399 tensorflow.cc:2276] TRITONBACKEND_ModelInitialize: 3_predicttensorflow (version 1)
I0603 18:48:49.755335 14399 model_inst_state.hpp:58] Loading TritonPythonModel from module 'merlin.systems.triton.workflow_model'
I0603 18:48:49.791297 14399 model_repository_manager.cc:997] loading: 1_transformworkflow:1
I0603 18:48:51.710991 14399 tensorflow.cc:2325] TRITONBACKEND_ModelInstanceInitialize: 3_predicttensorflow (GPU device 0)
I0603 18:48:51.711132 14399 model_repository_manager.cc:1152] successfully loaded '0_transformworkflow' version 1
2022-06-03 18:48:52.770191: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_1/3_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:52.771707: I tensorflow/cc/saved_model/reader.cc:78] Reading meta graph with tags { serve }
2022-06-03 18:48:52.771730: I tensorflow/cc/saved_model/reader.cc:119] Reading SavedModel debug info (if present) from: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_1/3_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:52.771845: I tensorflow/core/platform/cpu_feature_guard.cc:152] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-03 18:48:52.775931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10318 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-06-03 18:48:52.810275: I tensorflow/cc/saved_model/loader.cc:230] Restoring SavedModel bundle.
2022-06-03 18:48:52.867499: I tensorflow/cc/saved_model/loader.cc:214] Running initialization op on SavedModel bundle at path: /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_1/3_predicttensorflow/1/model.savedmodel
2022-06-03 18:48:52.879185: I tensorflow/cc/saved_model/loader.cc:321] SavedModel load for tags { serve }; Status: success: OK. Took 109008 microseconds.
I0603 18:48:52.880714 14399 tensorflow.cc:2363] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0603 18:48:52.881436 14399 tensorflow.cc:2302] TRITONBACKEND_ModelFinalize: delete model state
E0603 18:48:52.881459 14399 model_repository_manager.cc:1155] failed to load '3_predicttensorflow' version 1: Invalid argument: unexpected inference output 'output/Sigmoid:0', allowed outputs are: output
I0603 18:48:52.882354 14399 model_inst_state.hpp:64] Loading TritonPythonnModel from model.py in path '/tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_1/2_plustwoop/1'
I0603 18:48:52.890094 14399 model_inst_state.hpp:58] Loading TritonPythonModel from module 'merlin.systems.triton.workflow_model'
I0603 18:48:52.890395 14399 model_repository_manager.cc:1152] successfully loaded '2_plustwoop' version 1
I0603 18:48:52.895661 14399 model_repository_manager.cc:1152] successfully loaded '1_transformworkflow' version 1
E0603 18:48:52.895737 14399 model_repository_manager.cc:1341] Invalid argument: ensemble 'ensemble_model' depends on '3_predicttensorflow' which has no loaded version
I0603 18:48:52.895836 14399 server.cc:524]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0603 18:48:52.896716 14399 server.cc:551]
+------------+-----------------------------------------------------------------+-----------------------------+
| Backend | Path | Config |
+------------+-----------------------------------------------------------------+-----------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow2/libtriton_tensorflow2.so | {"cmdline":{"version":"2"}} |
| nvtabular | /opt/tritonserver/backends/nvtabular/libtriton_nvtabular.so | {} |
+------------+-----------------------------------------------------------------+-----------------------------+

I0603 18:48:52.896841 14399 server.cc:594]
+---------------------+---------+------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+---------------------+---------+------------------------------------------------------------------------------------------------------------+
| 0_transformworkflow | 1 | READY |
| 1_transformworkflow | 1 | READY |
| 2_plustwoop | 1 | READY |
| 3_predicttensorflow | 1 | UNAVAILABLE: Invalid argument: unexpected inference output 'output/Sigmoid:0', allowed outputs are: output |
+---------------------+---------+------------------------------------------------------------------------------------------------------------+

I0603 18:48:52.941486 14399 metrics.cc:651] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0603 18:48:52.943102 14399 tritonserver.cc:1962]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.20.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/pytest-of-jenkins/pytest-1/test_workflow_tf_e2e_multi_op_1 |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0603 18:48:52.943130 14399 server.cc:252] Waiting for in-flight requests to complete.
I0603 18:48:52.943141 14399 model_repository_manager.cc:1029] unloading: 2_plustwoop:1
I0603 18:48:52.943184 14399 model_repository_manager.cc:1029] unloading: 1_transformworkflow:1
I0603 18:48:52.943224 14399 model_repository_manager.cc:1029] unloading: 0_transformworkflow:1
I0603 18:48:52.943290 14399 backend.cc:160] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0603 18:48:52.943334 14399 server.cc:267] Timeout 30: Found 3 live models and 0 in-flight non-inference requests
------------------------------ Captured log call -------------------------------
WARNING absl:signature_serialization.py:146 Function _wrapped_model contains input name(s) name-cat, name-string_plus_2 with unsupported characters which will be renamed to name_cat, name_string_plus_2 in the SavedModel.
WARNING absl:save.py:133 <nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures object at 0x7f15e61ad040> has the same name 'DenseFeatures' as a built-in Keras object. Consider renaming <class 'nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures'> to avoid naming conflicts when loading with tf.keras.models.load_model. If renaming is not possible, pass the object in the custom_objects parameter of the load function.
WARNING tensorflow:load.py:167 No training configuration found in save file, so the model was not compiled. Compile it manually.
WARNING absl:signature_serialization.py:146 Function _wrapped_model contains input name(s) name-cat, name-string_plus_2 with unsupported characters which will be renamed to name_cat, name_string_plus_2 in the SavedModel.
WARNING absl:save.py:133 <nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures object at 0x7f15e61ad040> has the same name 'DenseFeatures' as a built-in Keras object. Consider renaming <class 'nvtabular.framework_utils.tensorflow.layers.embedding.DenseFeatures'> to avoid naming conflicts when loading with tf.keras.models.load_model. If renaming is not possible, pass the object in the custom_objects parameter of the load function.
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/nvtabular/framework_utils/__init__.py:18
/usr/local/lib/python3.8/dist-packages/nvtabular/framework_utils/__init__.py:18: DeprecationWarning: The nvtabular.framework_utils module is being replaced by the Merlin Models library. Support for importing from nvtabular.framework_utils is deprecated, and will be removed in a future version. Please consider using the models and layers from Merlin Models instead.
warnings.warn(

tests/unit/systems/test_ensemble.py: 7 warnings
tests/unit/systems/test_export.py: 1 warning
tests/unit/systems/test_inference_ops.py: 2 warnings
tests/unit/systems/test_op_runner.py: 4 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1292: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column x is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column y is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

tests/unit/systems/test_export.py::test_export_run_ensemble_triton[tensorflow-parquet]
/var/jenkins_home/workspace/merlin_systems/systems/merlin/systems/triton/export.py:304: UserWarning: Column id is being generated by NVTabular workflow but is unused in test_name_tf model
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/systems/test_ensemble.py::test_workflow_tf_e2e_config_verification[parquet]
FAILED tests/unit/systems/test_ensemble.py::test_workflow_tf_e2e_multi_op_run[parquet]
FAILED tests/unit/systems/test_ensemble.py::test_workflow_tf_e2e_multi_op_plus_2_run[parquet]
============ 3 failed, 15 passed, 1 skipped, 18 warnings in 52.76s =============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/systems/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_systems] $ /bin/bash /tmp/jenkins9312088608093551036.sh

@rnyak
Contributor Author

rnyak commented Jun 13, 2022

Closing due to #117.

@rnyak rnyak closed this Jun 13, 2022