
Move PoC unit test to unit folder and fix POC nb #364

Closed
wants to merge 2 commits

Conversation

rnyak
Contributor

@rnyak rnyak commented Jun 2, 2022

This PR

  • moves POC unit tests to unit directory
  • edits the output column name in the ranking model so that it matches the ranking model's output_names

This requires this PR to be merged first.
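The second bullet (matching the prediction column name to the model's output_names) can be illustrated with a small sketch. This is a hypothetical illustration, not the actual PR diff; the function name `align_output_columns` and the example output name are made up for this sketch:

```python
# Hypothetical sketch: rename a single-output prediction key so that it
# matches the ranking model's output_names, letting downstream code that
# looks columns up by model.output_names find the prediction column.

def align_output_columns(predictions: dict, output_names: list) -> dict:
    """Rename a lone prediction key to the model's declared output name."""
    if len(output_names) == 1 and len(predictions) == 1:
        (old_key,) = predictions.keys()
        return {output_names[0]: predictions[old_key]}
    # Ambiguous (multi-output) case: leave the mapping untouched.
    return predictions

# Example: model.output_names == ["click/binary_classification_task"]
preds = {"output_1": [0.1, 0.7, 0.3]}
aligned = align_output_columns(preds, ["click/binary_classification_task"])
```

With a single output, the prediction key is rewritten in place; multi-output mappings are passed through unchanged rather than guessed at.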


@rnyak rnyak requested a review from jperez999 June 2, 2022 22:18
@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #364 of commit 2757e798a23cff3108a976a56d1e0fd233daf6cb, no merge conflicts.
Running as SYSTEM
Setting status of 2757e798a23cff3108a976a56d1e0fd233daf6cb to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/132/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/364/*:refs/remotes/origin/pr/364/* # timeout=10
 > git rev-parse 2757e798a23cff3108a976a56d1e0fd233daf6cb^{commit} # timeout=10
Checking out Revision 2757e798a23cff3108a976a56d1e0fd233daf6cb (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 2757e798a23cff3108a976a56d1e0fd233daf6cb # timeout=10
Commit message: "delete categories"
 > git rev-list --no-walk 8c5e1345a56d7052d28631a55d3056f4e0877ffc # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins6351894645920927990.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items

tests/unit/test_version.py . [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py F [100%]

=================================== FAILURES ===================================
__________________________________ test_func ___________________________________

tb1 = <testbook.client.TestbookNotebookClient object at 0x7fceefdc02b0>

@testbook(
    REPO_ROOT
    / "examples/Building-and-deploying-multi-stage-RecSys/01-Building-Recommender-Systems-with-Merlin.ipynb",
    execute=False,
)
def test_func(tb1):
    tb1.inject(
        """
        import os
        os.environ["DATA_FOLDER"] = "/tmp/data/"
        os.environ["NUM_ROWS"] = "10000"
        os.system("mkdir -p /tmp/examples")
        os.environ["BASE_DIR"] = "/tmp/examples/"
        """
    )
  tb1.execute()

tests/unit/examples/test_building_deploying_multi_stage_RecSys.py:23:


../../../.local/lib/python3.8/site-packages/testbook/client.py:147: in execute
    super().execute_cell(cell, index)
../../../.local/lib/python3.8/site-packages/nbclient/util.py:84: in wrapped
    return just_run(coro(*args, **kwargs))
../../../.local/lib/python3.8/site-packages/nbclient/util.py:62: in just_run
    return loop.run_until_complete(coro)
/usr/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
    return future.result()
../../../.local/lib/python3.8/site-packages/nbclient/client.py:965: in async_execute_cell
    await self._check_raise_for_error(cell, cell_index, exec_reply)


self = <testbook.client.TestbookNotebookClient object at 0x7fceefdc02b0>
cell = {'cell_type': 'code', 'execution_count': 13, 'id': 'bfe2aa9e', 'metadata': {'execution': {'iopub.status.busy': '2022-0...m', run_eagerly=False, metrics=[tf.keras.metrics.AUC()])\nmodel.fit(train, validation_data=valid, batch_size=16*1024)"}
cell_index = 29
exec_reply = {'buffers': [], 'content': {'ename': 'TypeError', 'engine_info': {'engine_id': -1, 'engine_uuid': '206a701d-f2c1-46fa-...e, 'engine': '206a701d-f2c1-46fa-9a98-91a18400e79b', 'started': '2022-06-02T22:19:23.591571Z', 'status': 'error'}, ...}

async def _check_raise_for_error(
    self, cell: NotebookNode, cell_index: int, exec_reply: t.Optional[t.Dict]
) -> None:

    if exec_reply is None:
        return None

    exec_reply_content = exec_reply['content']
    if exec_reply_content['status'] != 'error':
        return None

    cell_allows_errors = (not self.force_raise_errors) and (
        self.allow_errors
        or exec_reply_content.get('ename') in self.allow_error_names
        or "raises-exception" in cell.metadata.get("tags", [])
    )
    await run_hook(self.on_cell_error, cell=cell, cell_index=cell_index)
    if not cell_allows_errors:
      raise CellExecutionError.from_cell_and_msg(cell, exec_reply_content)

E nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
E ------------------
E model.compile(optimizer='adam', run_eagerly=False, metrics=[tf.keras.metrics.AUC()])
E model.fit(train, validation_data=valid, batch_size=16*1024)
E ------------------
E
E ---------------------------------------------------------------------------
E TypeError                                 Traceback (most recent call last)
E Input In [13], in <cell line: 1>()
E ----> 1 model.compile(optimizer='adam', run_eagerly=False, metrics=[tf.keras.metrics.AUC()])
E       2 model.fit(train, validation_data=valid, batch_size=16*1024)
E
E File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/base.py:332, in Model.compile(self, optimizer, loss, metrics, loss_weights, weighted_metrics, run_eagerly, steps_per_execution, jit_compile, **kwargs)
E     329 if isinstance(_loss[key], str) and _loss[key] in loss_registry:
E     330     _loss[key] = loss_registry.parse(_loss[key])
E --> 332 super(Model, self).compile(
E     333     optimizer=optimizer,
E     334     loss=_loss,
E     335     metrics=_metrics,
E     336     weighted_metrics=weighted_metrics,
E     337     run_eagerly=run_eagerly,
E     338     loss_weights=loss_weights,
E     339     steps_per_execution=steps_per_execution,
E     340     jit_compile=jit_compile,
E     341     **kwargs,
E     342 )
E
E File /usr/local/lib/python3.8/dist-packages/keras/engine/training.py:555, in Model.compile(self, optimizer, loss, metrics, loss_weights, weighted_metrics, run_eagerly, steps_per_execution, **kwargs)
E     550 # When compiling from an already-serialized model, we do not want to
E     551 # reapply some processing steps (e.g. metric renaming for multi-output
E     552 # models, which have prefixes added for each corresponding output name).
E     553 from_serialized = kwargs.pop('from_serialized', False)
E --> 555 self._validate_compile(optimizer, metrics, **kwargs)
E     556 self._run_eagerly = run_eagerly
E     558 self.optimizer = self._get_optimizer(optimizer)
E
E File /usr/local/lib/python3.8/dist-packages/keras/engine/training.py:2717, in Model._validate_compile(self, optimizer, metrics, **kwargs)
E    2715 invalid_kwargs = set(kwargs) - {'sample_weight_mode'}
E    2716 if invalid_kwargs:
E -> 2717     raise TypeError('Invalid keyword argument(s) in compile: %s' %
E    2718                     (invalid_kwargs,))
E    2720 # Model must be created and compiled with the same DistStrat.
E    2721 if self.built and tf.distribute.has_strategy():
E
E TypeError: Invalid keyword argument(s) in compile: {'jit_compile'}

../../../.local/lib/python3.8/site-packages/nbclient/client.py:862: CellExecutionError
----------------------------- Captured stderr call -----------------------------
2022-06-02 22:19:19.414601: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-02 22:19:21.417467: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 0
2022-06-02 22:19:21.417625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-06-02 22:19:21.418381: I tensorflow/core/common_runtime/gpu/gpu_process_state.cc:222] Using CUDA malloc Async allocator for GPU: 1
2022-06-02 22:19:21.418437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15157 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.8/logging/__init__.py", line 2127, in shutdown
    h.close()
  File "/usr/local/lib/python3.8/dist-packages/absl/logging/__init__.py", line 951, in close
    self.stream.close()
  File "/usr/local/lib/python3.8/dist-packages/ipykernel/iostream.py", line 445, in close
    self.watch_fd_thread.join()
AttributeError: 'OutStream' object has no attribute 'watch_fd_thread'
=========================== short test summary info ============================
FAILED tests/unit/examples/test_building_deploying_multi_stage_RecSys.py::test_func
========================= 1 failed, 1 passed in 37.50s =========================
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.github.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins4821672215182737569.sh
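The `TypeError` in the CI log arises because Merlin Models forwards `jit_compile` to Keras' `Model.compile`, and the Keras version installed in the test container does not accept that keyword, so `_validate_compile` rejects it. The following is a sketch of one defensive pattern (an assumption for illustration, not the fix that was actually merged): filter kwargs against the callee's declared signature before forwarding them.

```python
import inspect

def compile_compat(compile_fn, **kwargs):
    """Forward only the keyword arguments that compile_fn declares."""
    params = inspect.signature(compile_fn).parameters
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return compile_fn(**accepted)

# Toy stand-in for an older compile() signature that lacks `jit_compile`:
def old_compile(optimizer=None, run_eagerly=None):
    return {"optimizer": optimizer, "run_eagerly": run_eagerly}

result = compile_compat(old_compile, optimizer="adam",
                        run_eagerly=False, jit_compile=True)
# `jit_compile` is silently dropped instead of raising TypeError.
```

Note the limitation: if the callee itself accepts `**kwargs` (as the real `keras.Model.compile` does), signature inspection cannot tell which extra keywords are safe, so in practice pinning compatible Keras and Merlin Models versions is the more robust option.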

@github-actions

github-actions bot commented Jun 2, 2022

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-364

@rnyak rnyak requested review from benfred and bschifferer June 2, 2022 22:37
@bschifferer
Contributor

I don't think this PR will execute the unit tests on each CI run.

@benfred @jperez999 In my understanding, the nightly CI will execute this bash script:
https://github.com/NVIDIA-Merlin/Merlin/blob/main/ci/test_container.sh

is that correct?

So we need to add a section

## Test Merlin Repository
echo "Run unit tests for Merlin"
cd /Merlin && ci/test_unit.sh $container $devices

and we need to add a test_unit.sh to the Merlin repository. @benfred @jperez999 can you confirm?

@karlhigley
Contributor

The multi-stage recommender example requires both Merlin Models and Merlin Systems, and therefore isn't well suited to a unit test (i.e. a test of a single "unit," like one library). Since it spans multiple libraries, even though the notebooks live in their most relevant repos, it would be better covered by a multi-library integration test in the Merlin repo (where we can safely assume that depending on anything in the Merlin ecosystem is fair game).

@rnyak
Contributor Author

rnyak commented Jun 7, 2022

@karlhigley Thanks for your comment. FWIW, this is a unit test, not an integration test, and it actually runs without issues in my environment (once the error in the PoC's second notebook is fixed). This unit test runs fine because the multi-stage recommender example runs entirely in one container (everything runs in the inference container), which is not the case for the Serving-Ranking-Models-With-Merlin-Systems.ipynb example.
Maybe I am missing something about the multi-library integration test point, but I think that statement applies better to PR #310 instead.

If we do not have a unit test for this release, we still won't be detecting errors in the PoC example.. :/

@jperez999
Collaborator

For this to succeed, we first need this PR merged in systems: NVIDIA-Merlin/systems#117

@rnyak
Contributor Author

rnyak commented Jun 13, 2022

closing due to #385

@rnyak rnyak closed this Jun 13, 2022
@rnyak rnyak deleted the fix_unit_test_and_POC branch June 14, 2022 10:49