
Fix unit scaling criteo inference serving #559

Merged
merged 6 commits into NVIDIA-Merlin:main, Aug 25, 2022

Conversation

jperez999 (Collaborator)

Newer versions of Triton do not allow the server to be started from within the context of a notebook, i.e., from within testbook-executed cells. This fix remedies those issues.
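
The pattern that replaces in-notebook startup is to launch `tritonserver` as a subprocess from the test process and poll it until ready, which is what `merlin.systems.triton.utils.run_triton_server` (quoted in the CI trace below) does. A minimal sketch of that pattern, assuming `tritonserver` is on PATH and reusing the model-repository path from the logs:

```python
import subprocess
import time

import tritonclient.grpc as grpcclient

# Model-repository path taken from the CI logs below; adjust as needed.
MODEL_REPO = "/tmp/output/criteo/ensemble/"

# Start Triton outside any notebook kernel, then poll until it is ready.
with subprocess.Popen(["tritonserver", "--model-repository", MODEL_REPO]) as proc:
    try:
        client = grpcclient.InferenceServerClient("localhost:8001")
        for _ in range(60):
            # If the server process died, fail fast instead of polling forever.
            if proc.poll() is not None:
                raise RuntimeError(f"tritonserver exited early (ret={proc.returncode})")
            try:
                if client.is_server_ready():
                    break
            except Exception:
                pass  # gRPC endpoint not accepting connections yet
            time.sleep(1)
        # ... run inference requests against `client` here ...
    finally:
        proc.terminate()
```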

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@jperez999 self-assigned this on Aug 25, 2022
@jperez999 added labels on Aug 25, 2022: bug (Something isn't working), chore (Infrastructure update), breaking (Breaking change), ci
@nvidia-merlin-bot (Contributor)

Click to view CI Results
GitHub pull request #559 of commit 18f8854e9c7149f93bd7447ee1150020c1faf000, no merge conflicts.
Running as SYSTEM
Setting status of 18f8854e9c7149f93bd7447ee1150020c1faf000 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/368/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/559/*:refs/remotes/origin/pr/559/* # timeout=10
 > git rev-parse 18f8854e9c7149f93bd7447ee1150020c1faf000^{commit} # timeout=10
Checking out Revision 18f8854e9c7149f93bd7447ee1150020c1faf000 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 18f8854e9c7149f93bd7447ee1150020c1faf000 # timeout=10
Commit message: "fix unit test for scaling criteo"
 > git rev-list --no-walk d8ab03429179e7dd1467123a0334d9dbf9875576 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins9875516084636874203.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 3 items

tests/unit/test_version.py . [ 33%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py s [ 66%]
tests/unit/examples/test_scaling_criteo_merlin_models.py F [100%]

=================================== FAILURES ===================================
__________________________________ test_func ___________________________________

self = <testbook.client.TestbookNotebookClient object at 0x7f7e30450cd0>
cell = {'id': '426844fb', 'cell_type': 'code', 'metadata': {'execution': {'iopub.status.busy': '2022-08-25T14:43:39.494692Z',...ut/criteo/day_0.parquet')\nvalid.to_ddf().compute().to_parquet('/tmp/input/criteo/day_1.parquet')\n', 'outputs': []}
cell_index = 27, execution_count = None, store_history = True

async def async_execute_cell(
    self,
    cell: NotebookNode,
    cell_index: int,
    execution_count: t.Optional[int] = None,
    store_history: bool = True,
) -> NotebookNode:
    """
    Executes a single code cell.

    To execute all cells see :meth:`execute`.

    Parameters
    ----------
    cell : nbformat.NotebookNode
        The cell which is currently being processed.
    cell_index : int
        The position of the cell within the notebook object.
    execution_count : int
        The execution count to be assigned to the cell (default: Use kernel response)
    store_history : bool
        Determines if history should be stored in the kernel (default: False).
        Specific to ipython kernels, which can store command histories.

    Returns
    -------
    output : dict
        The execution output payload (or None for no output).

    Raises
    ------
    CellExecutionError
        If execution failed and should raise an exception, this will be raised
        with defaults about the failure.

    Returns
    -------
    cell : NotebookNode
        The cell which was just processed.
    """
    assert self.kc is not None

    await run_hook(self.on_cell_start, cell=cell, cell_index=cell_index)

    if cell.cell_type != 'code' or not cell.source.strip():
        self.log.debug("Skipping non-executing cell %s", cell_index)
        return cell

    if self.skip_cells_with_tag in cell.metadata.get("tags", []):
        self.log.debug("Skipping tagged cell %s", cell_index)
        return cell

    if self.record_timing:  # clear execution metadata prior to execution
        cell['metadata']['execution'] = {}

    self.log.debug("Executing cell:\n%s", cell.source)

    cell_allows_errors = (not self.force_raise_errors) and (
        self.allow_errors or "raises-exception" in cell.metadata.get("tags", [])
    )

    await run_hook(self.on_cell_execute, cell=cell, cell_index=cell_index)
    parent_msg_id = await ensure_async(
        self.kc.execute(
            cell.source, store_history=store_history, stop_on_error=not cell_allows_errors
        )
    )
    await run_hook(self.on_cell_complete, cell=cell, cell_index=cell_index)
    # We launched a code cell to execute
    self.code_cells_executed += 1
    exec_timeout = self._get_timeout(cell)

    cell.outputs = []
    self.clear_before_next_output = False

    task_poll_kernel_alive = asyncio.ensure_future(self._async_poll_kernel_alive())
    task_poll_output_msg = asyncio.ensure_future(
        self._async_poll_output_msg(parent_msg_id, cell, cell_index)
    )
    self.task_poll_for_reply = asyncio.ensure_future(
        self._async_poll_for_reply(
            parent_msg_id, cell, exec_timeout, task_poll_output_msg, task_poll_kernel_alive
        )
    )
    try:
>       exec_reply = await self.task_poll_for_reply

E asyncio.exceptions.CancelledError

/usr/local/lib/python3.8/dist-packages/nbclient/client.py:1006: CancelledError

During handling of the above exception, another exception occurred:

def test_func():
    with testbook(
        REPO_ROOT / "examples" / "scaling-criteo" / "02-ETL-with-NVTabular.ipynb",
        execute=False,
        timeout=180,
    ) as tb1:
        tb1.inject(
            """
            import os
            os.environ["BASE_DIR"] = "/tmp/input/criteo/"
            os.environ["INPUT_DATA_DIR"] = "/tmp/input/criteo/"
            os.environ["OUTPUT_DATA_DIR"] = "/tmp/output/criteo/"
            os.system("mkdir -p /tmp/input/criteo")
            os.system("mkdir -p /tmp/output/criteo")

            from merlin.datasets.synthetic import generate_data

            train, valid = generate_data("criteo", int(1000000), set_sizes=(0.7, 0.3))

            train.to_ddf().compute().to_parquet('/tmp/input/criteo/day_0.parquet')
            valid.to_ddf().compute().to_parquet('/tmp/input/criteo/day_1.parquet')
            """
        )
>       tb1.execute()

tests/unit/examples/test_scaling_criteo_merlin_models.py:36:


/usr/local/lib/python3.8/dist-packages/testbook/client.py:147: in execute
super().execute_cell(cell, index)
/usr/local/lib/python3.8/dist-packages/nbclient/util.py:85: in wrapped
return just_run(coro(*args, **kwargs))
/usr/local/lib/python3.8/dist-packages/nbclient/util.py:60: in just_run
return loop.run_until_complete(coro)
/usr/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
return future.result()


self = <testbook.client.TestbookNotebookClient object at 0x7f7e30450cd0>
cell = {'id': '426844fb', 'cell_type': 'code', 'metadata': {'execution': {'iopub.status.busy': '2022-08-25T14:43:39.494692Z',...ut/criteo/day_0.parquet')\nvalid.to_ddf().compute().to_parquet('/tmp/input/criteo/day_1.parquet')\n', 'outputs': []}
cell_index = 27, execution_count = None, store_history = True

async def async_execute_cell(
    self,
    cell: NotebookNode,
    cell_index: int,
    execution_count: t.Optional[int] = None,
    store_history: bool = True,
) -> NotebookNode:
    """
    Executes a single code cell.

    To execute all cells see :meth:`execute`.

    Parameters
    ----------
    cell : nbformat.NotebookNode
        The cell which is currently being processed.
    cell_index : int
        The position of the cell within the notebook object.
    execution_count : int
        The execution count to be assigned to the cell (default: Use kernel response)
    store_history : bool
        Determines if history should be stored in the kernel (default: False).
        Specific to ipython kernels, which can store command histories.

    Returns
    -------
    output : dict
        The execution output payload (or None for no output).

    Raises
    ------
    CellExecutionError
        If execution failed and should raise an exception, this will be raised
        with defaults about the failure.

    Returns
    -------
    cell : NotebookNode
        The cell which was just processed.
    """
    assert self.kc is not None

    await run_hook(self.on_cell_start, cell=cell, cell_index=cell_index)

    if cell.cell_type != 'code' or not cell.source.strip():
        self.log.debug("Skipping non-executing cell %s", cell_index)
        return cell

    if self.skip_cells_with_tag in cell.metadata.get("tags", []):
        self.log.debug("Skipping tagged cell %s", cell_index)
        return cell

    if self.record_timing:  # clear execution metadata prior to execution
        cell['metadata']['execution'] = {}

    self.log.debug("Executing cell:\n%s", cell.source)

    cell_allows_errors = (not self.force_raise_errors) and (
        self.allow_errors or "raises-exception" in cell.metadata.get("tags", [])
    )

    await run_hook(self.on_cell_execute, cell=cell, cell_index=cell_index)
    parent_msg_id = await ensure_async(
        self.kc.execute(
            cell.source, store_history=store_history, stop_on_error=not cell_allows_errors
        )
    )
    await run_hook(self.on_cell_complete, cell=cell, cell_index=cell_index)
    # We launched a code cell to execute
    self.code_cells_executed += 1
    exec_timeout = self._get_timeout(cell)

    cell.outputs = []
    self.clear_before_next_output = False

    task_poll_kernel_alive = asyncio.ensure_future(self._async_poll_kernel_alive())
    task_poll_output_msg = asyncio.ensure_future(
        self._async_poll_output_msg(parent_msg_id, cell, cell_index)
    )
    self.task_poll_for_reply = asyncio.ensure_future(
        self._async_poll_for_reply(
            parent_msg_id, cell, exec_timeout, task_poll_output_msg, task_poll_kernel_alive
        )
    )
    try:
        exec_reply = await self.task_poll_for_reply
    except asyncio.CancelledError:
        # can only be cancelled by task_poll_kernel_alive when the kernel is dead
        task_poll_output_msg.cancel()
>       raise DeadKernelError("Kernel died")

E nbclient.exceptions.DeadKernelError: Kernel died

/usr/local/lib/python3.8/dist-packages/nbclient/client.py:1010: DeadKernelError
----------------------------- Captured stderr call -----------------------------
2022-08-25 14:43:30,908 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-08-25 14:43:30,932 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
terminate called after throwing an instance of 'rmm::out_of_memory'
what(): std::bad_alloc: out_of_memory: CUDA error at: /usr/include/rmm/mr/device/cuda_memory_resource.hpp:70: cudaErrorMemoryAllocation out of memory
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 12 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
------------------------------ Captured log call -------------------------------
ERROR traitlets:client.py:863 Kernel died while waiting for execute reply.
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/examples/test_scaling_criteo_merlin_models.py::test_func - ...
============= 1 failed, 1 passed, 1 skipped, 35 warnings in 20.35s =============
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins7049271451736486476.sh
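
For context on the failure above: the `rmm::out_of_memory` was thrown while the injected cell materialized 1,000,000 synthetic Criteo rows on a GPU whose memory was already partly occupied. One obvious mitigation is simply to generate fewer rows; a hedged sketch with an illustrative count (not necessarily the value this PR settled on):

```python
from merlin.datasets.synthetic import generate_data

# Illustrative row count only; the injected test cell above used 1,000,000.
train, valid = generate_data("criteo", 100_000, set_sizes=(0.7, 0.3))

# Materialize the dask collections and write them where the notebook expects input.
train.to_ddf().compute().to_parquet("/tmp/input/criteo/day_0.parquet")
valid.to_ddf().compute().to_parquet("/tmp/input/criteo/day_1.parquet")
```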

@github-actions

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-559

@jperez999 (Collaborator, Author)

rerun tests

@jperez999 changed the title from "Fix unit multi stage deploy inference serving" to "Fix unit scaling criteo inference serving" on Aug 25, 2022
@nvidia-merlin-bot (Contributor)

Click to view CI Results
GitHub pull request #559 of commit 18f8854e9c7149f93bd7447ee1150020c1faf000, no merge conflicts.
Running as SYSTEM
Setting status of 18f8854e9c7149f93bd7447ee1150020c1faf000 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/369/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/559/*:refs/remotes/origin/pr/559/* # timeout=10
 > git rev-parse 18f8854e9c7149f93bd7447ee1150020c1faf000^{commit} # timeout=10
Checking out Revision 18f8854e9c7149f93bd7447ee1150020c1faf000 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 18f8854e9c7149f93bd7447ee1150020c1faf000 # timeout=10
Commit message: "fix unit test for scaling criteo"
 > git rev-list --no-walk 18f8854e9c7149f93bd7447ee1150020c1faf000 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins7113214151575744866.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 3 items

tests/unit/test_version.py . [ 33%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py s [ 66%]
tests/unit/examples/test_scaling_criteo_merlin_models.py F [100%]

=================================== FAILURES ===================================
__________________________________ test_func ___________________________________

def test_func():
    with testbook(
        REPO_ROOT / "examples" / "scaling-criteo" / "02-ETL-with-NVTabular.ipynb",
        execute=False,
        timeout=180,
    ) as tb1:
        tb1.inject(
            """
            import os
            os.environ["BASE_DIR"] = "/tmp/input/criteo/"
            os.environ["INPUT_DATA_DIR"] = "/tmp/input/criteo/"
            os.environ["OUTPUT_DATA_DIR"] = "/tmp/output/criteo/"
            os.system("mkdir -p /tmp/input/criteo")
            os.system("mkdir -p /tmp/output/criteo")

            from merlin.datasets.synthetic import generate_data

            train, valid = generate_data("criteo", int(1000000), set_sizes=(0.7, 0.3))

            train.to_ddf().compute().to_parquet('/tmp/input/criteo/day_0.parquet')
            valid.to_ddf().compute().to_parquet('/tmp/input/criteo/day_1.parquet')
            """
        )
        tb1.execute()
        assert os.path.isfile("/tmp/output/criteo/train/part_0.parquet")
        assert os.path.isfile("/tmp/output/criteo/valid/part_0.parquet")
        assert os.path.isfile("/tmp/output/criteo/workflow/metadata.json")

    with testbook(
        REPO_ROOT
        / "examples"
        / "scaling-criteo"
        / "03-Training-with-Merlin-Models-TensorFlow.ipynb",
        execute=False,
        timeout=180,
    ) as tb2:
        tb2.inject(
            """
            import os
            os.environ["INPUT_DATA_DIR"] = "/tmp/output/criteo/"
            """
        )
        tb2.execute()
        metrics = tb2.ref("eval_metrics")
        assert set(metrics.keys()) == set(
            [
                "auc",
                "binary_accuracy",
                "loss",
                "precision",
                "recall",
                "regularization_loss",
            ]
        )
        assert os.path.isfile("/tmp/output/criteo/dlrm/saved_model.pb")

    with testbook(
        REPO_ROOT
        / "examples"
        / "scaling-criteo"
        / "04-Triton-Inference-with-Merlin-Models-TensorFlow.ipynb",
        execute=False,
        timeout=180,
    ) as tb3:
        tb3.inject(
            """
            import os
            os.environ["BASE_DIR"] = "/tmp/output/criteo/"
            os.environ["INPUT_FOLDER"] = "/tmp/input/criteo/"
            """
        )
        NUM_OF_CELLS = len(tb3.cells)
        tb3.execute_cell(list(range(0, NUM_OF_CELLS - 5)))
        input_cols = tb3.ref("input_cols")
        outputs = tb3.ref("output_cols")
        # read in data for request
        df_lib = get_lib()
        in_dtypes = {}
        for col in input_cols:
            if col.startswith("C"):
                in_dtypes[col] = "int64"
            if col.startswith("I"):
                in_dtypes[col] = "float64"
        batch = df_lib.read_parquet(
            os.path.join("/tmp/output/criteo/", "valid", "part_0.parquet"),
            num_rows=3,
            columns=input_cols,
        )
        batch = batch.astype(in_dtypes)
        configure_tensorflow()
>       response = run_ensemble_on_tritonserver(
            "/tmp/output/criteo/ensemble/", outputs, batch, "ensemble_model"
        )

tests/unit/examples/test_scaling_criteo_merlin_models.py:103:


/usr/local/lib/python3.8/dist-packages/merlin/systems/triton/utils.py:92: in run_ensemble_on_tritonserver
with run_triton_server(tmpdir) as client:
/usr/lib/python3.8/contextlib.py:113: in __enter__
return next(self.gen)


modelpath = '/tmp/output/criteo/ensemble/'

@contextlib.contextmanager
def run_triton_server(modelpath):
    """This function starts up a Triton server instance and returns a client to it.

    Parameters
    ----------
    modelpath : string
        The path to the model to load.

    Yields
    ------
    client: tritonclient.InferenceServerClient
        The client connected to the Triton server.

    """
    cmdline = [
        TRITON_SERVER_PATH,
        "--model-repository",
        modelpath,
        "--backend-config=tensorflow,version=2",
    ]
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = "0"
    with subprocess.Popen(cmdline, env=env) as process:
        try:
            with grpcclient.InferenceServerClient("localhost:8001") as client:
                # wait until server is ready
                for _ in range(60):
                    if process.poll() is not None:
                        retcode = process.returncode
>                 raise RuntimeError(f"Tritonserver failed to start (ret={retcode})")

E RuntimeError: Tritonserver failed to start (ret=1)

/usr/local/lib/python3.8/dist-packages/merlin/systems/triton/utils.py:46: RuntimeError
----------------------------- Captured stderr call -----------------------------
2022-08-25 14:48:55,752 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-08-25 14:48:55,766 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 27 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
2022-08-25 14:49:13.352448: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-25 14:49:15.459725: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
2022-08-25 14:49:15.459867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-08-25 14:49:15.460678: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 1
2022-08-25 14:49:15.460728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python3.8/logging/init.py", line 2127, in shutdown
h.close()
File "/usr/local/lib/python3.8/dist-packages/absl/logging/init.py", line 934, in close
self.stream.close()
File "/usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py", line 438, in close
self.watch_fd_thread.join()
AttributeError: 'OutStream' object has no attribute 'watch_fd_thread'
2022-08-25 14:49:51.117986: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-25 14:49:53.164650: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
2022-08-25 14:49:53.164796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14880 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-08-25 14:49:53.165632: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 1
2022-08-25 14:49:53.165683: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
I0825 14:50:09.259489 2672 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f1bc0000000' with size 268435456
I0825 14:50:09.260274 2672 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0825 14:50:09.264351 2672 model_repository_manager.cc:1191] loading: 1_predicttensorflow:1
I0825 14:50:09.364700 2672 model_repository_manager.cc:1191] loading: 0_transformworkflow:1
I0825 14:50:09.650113 2672 tensorflow.cc:2204] TRITONBACKEND_Initialize: tensorflow
I0825 14:50:09.650155 2672 tensorflow.cc:2214] Triton TRITONBACKEND API version: 1.10
I0825 14:50:09.650162 2672 tensorflow.cc:2220] 'tensorflow' TRITONBACKEND API version: 1.10
I0825 14:50:09.650168 2672 tensorflow.cc:2244] backend configuration:
{"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","version":"2","default-max-batch-size":"4"}}
I0825 14:50:09.650212 2672 tensorflow.cc:2310] TRITONBACKEND_ModelInitialize: 1_predicttensorflow (version 1)
I0825 14:50:09.655407 2672 tensorflow.cc:2359] TRITONBACKEND_ModelInstanceInitialize: 1_predicttensorflow (GPU device 0)
2022-08-25 14:50:10.011568: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /tmp/output/criteo/ensemble/1_predicttensorflow/1/model.savedmodel
2022-08-25 14:50:10.027707: I tensorflow/cc/saved_model/reader.cc:81] Reading meta graph with tags { serve }
2022-08-25 14:50:10.027759: I tensorflow/cc/saved_model/reader.cc:122] Reading SavedModel debug info (if present) from: /tmp/output/criteo/ensemble/1_predicttensorflow/1/model.savedmodel
2022-08-25 14:50:10.027891: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-25 14:50:10.065314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13776 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-08-25 14:50:10.135588: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-08-25 14:50:10.142843: I tensorflow/cc/saved_model/loader.cc:230] Restoring SavedModel bundle.
2022-08-25 14:50:10.413723: I tensorflow/cc/saved_model/loader.cc:214] Running initialization op on SavedModel bundle at path: /tmp/output/criteo/ensemble/1_predicttensorflow/1/model.savedmodel
2022-08-25 14:50:10.488625: I tensorflow/cc/saved_model/loader.cc:321] SavedModel load for tags { serve }; Status: success: OK. Took 477076 microseconds.
I0825 14:50:10.507852 2672 tensorflow.cc:2397] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0825 14:50:10.507908 2672 tensorflow.cc:2336] TRITONBACKEND_ModelFinalize: delete model state
E0825 14:50:10.507944 2672 model_repository_manager.cc:1348] failed to load '1_predicttensorflow' version 1: Invalid argument: unexpected inference input 'C1', allowed inputs are: args_0, args_0_1, args_0_10, args_0_11, args_0_12, args_0_13, args_0_14, args_0_15, args_0_16, args_0_17, args_0_18, args_0_19, args_0_2, args_0_20, args_0_21, args_0_22, args_0_23, args_0_24, args_0_25, args_0_26, args_0_27, args_0_28, args_0_29, args_0_3, args_0_30, args_0_31, args_0_32, args_0_33, args_0_34, args_0_35, args_0_36, args_0_37, args_0_38, args_0_4, args_0_5, args_0_6, args_0_7, args_0_8, args_0_9
I0825 14:50:10.511201 2672 python_be.cc:1774] TRITONBACKEND_ModelInstanceInitialize: 0_transformworkflow (GPU device 0)
I0825 14:50:14.930794 2672 model_repository_manager.cc:1345] successfully loaded '0_transformworkflow' version 1
E0825 14:50:14.930907 2672 model_repository_manager.cc:1551] Invalid argument: ensemble 'ensemble_model' depends on '1_predicttensorflow' which has no loaded version
I0825 14:50:14.931019 2672 server.cc:556]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0825 14:50:14.931132 2672 server.cc:583]
+------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorflow | /opt/tritonserver/backends/tensorflow2/libtriton_tensorflow2.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","version":"2","default-max-batch-size":"4"}} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
+------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0825 14:50:14.931322 2672 server.cc:626]
+---------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+---------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 0_transformworkflow | 1 | READY |
| 1_predicttensorflow | 1 | UNAVAILABLE: Invalid argument: unexpected inference input 'C1', allowed inputs are: args_0, args_0_1, args_0_10, args_0_11, args_0_12, args_0_13, args_0_14, args_0_15, args_0_16, args_0_17, args_0_18, args_0_19, args_0_2, args_0_20, args_0_21, args_0_22, args_0_23, args_0_24, args_0_25, args_0_26, args_0_27, args_0_28, args_0_29, args_0_3, args_0_30, args_0_31, args_0_32, args_0_33, args_0_34, args_0_35, args_0_36, args_0_37, args_0_38, args_0_4, args_0_5, ar |
| | | gs_0_6, args_0_7, args_0_8, args_0_9 |
+---------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0825 14:50:14.995644 2672 metrics.cc:650] Collecting metrics for GPU 0: Tesla P100-DGXS-16GB
I0825 14:50:14.996540 2672 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.23.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /tmp/output/criteo/ensemble/ |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0825 14:50:14.996575 2672 server.cc:257] Waiting for in-flight requests to complete.
I0825 14:50:14.996583 2672 server.cc:273] Timeout 30: Found 0 model versions that have in-flight inferences
I0825 14:50:14.996594 2672 model_repository_manager.cc:1223] unloading: 0_transformworkflow:1
I0825 14:50:14.996630 2672 server.cc:288] All models are stopped, unloading models
I0825 14:50:14.996639 2672 server.cc:295] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0825 14:50:15.996724 2672 server.cc:295] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
W0825 14:50:16.013750 2672 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
I0825 14:50:16.578288 2672 model_repository_manager.cc:1328] successfully unloaded '0_transformworkflow' version 1
I0825 14:50:16.996857 2672 server.cc:295] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
W0825 14:50:17.013939 2672 metrics.cc:468] Unable to get energy consumption for GPU 0. Status:Success, value:0
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python3.8/logging/init.py", line 2127, in shutdown
h.close()
File "/usr/local/lib/python3.8/dist-packages/absl/logging/init.py", line 934, in close
self.stream.close()
File "/usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py", line 438, in close
self.watch_fd_thread.join()
AttributeError: 'OutStream' object has no attribute 'watch_fd_thread'
=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/examples/test_scaling_criteo_merlin_models.py::test_func - ...
======== 1 failed, 1 passed, 1 skipped, 35 warnings in 96.47s (0:01:36) ========
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins4202871427140035698.sh
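
The root cause in this run is visible in the `model_repository_manager` lines above: the exported TensorFlow SavedModel's serving signature only exposes generic `args_0_*` inputs, while the Triton ensemble feeds it by column name (C1 ... I13), so `1_predicttensorflow` cannot load. A quick way to check what a SavedModel actually exposes, as a diagnostic sketch using paths from the log (not part of this PR's diff):

```python
import tensorflow as tf

# Path taken from the Triton log above; adjust to your model repository.
loaded = tf.saved_model.load(
    "/tmp/output/criteo/ensemble/1_predicttensorflow/1/model.savedmodel"
)
sig = loaded.signatures["serving_default"]

# structured_input_signature is (args, kwargs); the kwargs dict is keyed by input name.
print(sorted(sig.structured_input_signature[1].keys()))
# The failing run exposed args_0, args_0_1, ...; a healthy export would list the
# named Criteo columns, which is consistent with the follow-up commit
# "add back model import" in the next (green) CI run.
```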

@nvidia-merlin-bot (Contributor)

Click to view CI Results
GitHub pull request #559 of commit 3832ea55a7cc44dce1f693d5e718ddc49a12f1a6, no merge conflicts.
Running as SYSTEM
Setting status of 3832ea55a7cc44dce1f693d5e718ddc49a12f1a6 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/370/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/559/*:refs/remotes/origin/pr/559/* # timeout=10
 > git rev-parse 3832ea55a7cc44dce1f693d5e718ddc49a12f1a6^{commit} # timeout=10
Checking out Revision 3832ea55a7cc44dce1f693d5e718ddc49a12f1a6 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 3832ea55a7cc44dce1f693d5e718ddc49a12f1a6 # timeout=10
Commit message: "add back model import"
 > git rev-list --no-walk 18f8854e9c7149f93bd7447ee1150020c1faf000 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins3194441305650325862.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 3 items

tests/unit/test_version.py . [ 33%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py s [ 66%]
tests/unit/examples/test_scaling_criteo_merlin_models.py . [100%]

=============================== warnings summary ===============================
../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33
/usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
DASK_VERSION = LooseVersion(dask.__version__)

../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings
/var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============ 2 passed, 1 skipped, 35 warnings in 111.70s (0:01:51) =============
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins3902650011334290476.sh

@karlhigley merged commit c12dbac into NVIDIA-Merlin:main on Aug 25, 2022
@viswa-nvidia added this to the Merlin 22.09 milestone on Sep 8, 2022