Switch over HDFS build/install scripts #434

Merged
jperez999 merged 7 commits into NVIDIA-Merlin:main
Jul 8, 2022

Conversation

bashimao
Contributor

@bashimao bashimao commented Jul 5, 2022

Remove the scripts containing HDFS build/install steps and instead use the ones maintained by the HugeCTR team.

@github-actions

github-actions bot commented Jul 5, 2022

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-434

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #434 of commit 9fbac883e0ba23f656a28adb95b3713984b62641, no merge conflicts.
Running as SYSTEM
Setting status of 9fbac883e0ba23f656a28adb95b3713984b62641 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/226/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/434/*:refs/remotes/origin/pr/434/* # timeout=10
 > git rev-parse 9fbac883e0ba23f656a28adb95b3713984b62641^{commit} # timeout=10
Checking out Revision 9fbac883e0ba23f656a28adb95b3713984b62641 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 9fbac883e0ba23f656a28adb95b3713984b62641 # timeout=10
Commit message: "Use HDFS install scripts from HugeCTR instead."
 > git rev-list --no-walk 971bf7cc12d57c279c87dd64892fa2c9f26cbac1 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins16486705290728548184.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items

tests/unit/test_version.py . [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py . [100%]

======================== 2 passed in 133.61s (0:02:13) =========================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins17864245004698965007.sh

@jperez999 jperez999 added the enhancement (New feature or request) and ci labels Jul 7, 2022
@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #434 of commit 98f7f723fc38e6534be5649d54c1606331823d95, no merge conflicts.
Running as SYSTEM
Setting status of 98f7f723fc38e6534be5649d54c1606331823d95 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/241/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/434/*:refs/remotes/origin/pr/434/* # timeout=10
 > git rev-parse 98f7f723fc38e6534be5649d54c1606331823d95^{commit} # timeout=10
Checking out Revision 98f7f723fc38e6534be5649d54c1606331823d95 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 98f7f723fc38e6534be5649d54c1606331823d95 # timeout=10
Commit message: "Merge branch 'main' into switch-over-docker"
 > git rev-list --no-walk b03563e281d251e7c39ad7d4cd33fab346126673 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins4133322523398397119.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items

tests/unit/test_version.py . [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py . [100%]

======================== 2 passed in 144.61s (0:02:24) =========================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins16281461495801299830.sh

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #434 of commit e53010abc87330db036f4f6c7adf67278b7e040b, no merge conflicts.
Running as SYSTEM
Setting status of e53010abc87330db036f4f6c7adf67278b7e040b to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/246/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/434/*:refs/remotes/origin/pr/434/* # timeout=10
 > git rev-parse e53010abc87330db036f4f6c7adf67278b7e040b^{commit} # timeout=10
Checking out Revision e53010abc87330db036f4f6c7adf67278b7e040b (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f e53010abc87330db036f4f6c7adf67278b7e040b # timeout=10
Commit message: "Merge branch 'main' into switch-over-docker"
 > git rev-list --no-walk bc1862d49e5683bd84a01321a299aea90eae62a5 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins14721996627769857792.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items

tests/unit/test_version.py . [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py F [100%]

=================================== FAILURES ===================================
__________________________________ test_func ___________________________________

self = <testbook.client.TestbookNotebookClient object at 0x7efb38eeb8b0>
cell = {'cell_type': 'code', 'execution_count': 20, 'id': 'd6703d7c-d38f-4d6d-a20a-9ee95ff1e256', 'metadata': {'execution': {...cs=[mm.RecallAt(10), mm.NDCGAt(10)],\n)\nmodel.fit(train_tt, validation_data=valid_tt, batch_size=1024 * 8, epochs=1)'}
cell_index = 41, execution_count = None, store_history = True

async def async_execute_cell(
    self,
    cell: NotebookNode,
    cell_index: int,
    execution_count: t.Optional[int] = None,
    store_history: bool = True,
) -> NotebookNode:
    """
    Executes a single code cell.

    To execute all cells see :meth:`execute`.

    Parameters
    ----------
    cell : nbformat.NotebookNode
        The cell which is currently being processed.
    cell_index : int
        The position of the cell within the notebook object.
    execution_count : int
        The execution count to be assigned to the cell (default: Use kernel response)
    store_history : bool
        Determines if history should be stored in the kernel (default: False).
        Specific to ipython kernels, which can store command histories.

    Returns
    -------
    output : dict
        The execution output payload (or None for no output).

    Raises
    ------
    CellExecutionError
        If execution failed and should raise an exception, this will be raised
        with defaults about the failure.

    Returns
    -------
    cell : NotebookNode
        The cell which was just processed.
    """
    assert self.kc is not None

    await run_hook(self.on_cell_start, cell=cell, cell_index=cell_index)

    if cell.cell_type != 'code' or not cell.source.strip():
        self.log.debug("Skipping non-executing cell %s", cell_index)
        return cell

    if self.skip_cells_with_tag in cell.metadata.get("tags", []):
        self.log.debug("Skipping tagged cell %s", cell_index)
        return cell

    if self.record_timing:  # clear execution metadata prior to execution
        cell['metadata']['execution'] = {}

    self.log.debug("Executing cell:\n%s", cell.source)

    cell_allows_errors = (not self.force_raise_errors) and (
        self.allow_errors or "raises-exception" in cell.metadata.get("tags", [])
    )

    await run_hook(self.on_cell_execute, cell=cell, cell_index=cell_index)
    parent_msg_id = await ensure_async(
        self.kc.execute(
            cell.source, store_history=store_history, stop_on_error=not cell_allows_errors
        )
    )
    await run_hook(self.on_cell_complete, cell=cell, cell_index=cell_index)
    # We launched a code cell to execute
    self.code_cells_executed += 1
    exec_timeout = self._get_timeout(cell)

    cell.outputs = []
    self.clear_before_next_output = False

    task_poll_kernel_alive = asyncio.ensure_future(self._async_poll_kernel_alive())
    task_poll_output_msg = asyncio.ensure_future(
        self._async_poll_output_msg(parent_msg_id, cell, cell_index)
    )
    self.task_poll_for_reply = asyncio.ensure_future(
        self._async_poll_for_reply(
            parent_msg_id, cell, exec_timeout, task_poll_output_msg, task_poll_kernel_alive
        )
    )
    try:
>       exec_reply = await self.task_poll_for_reply

E asyncio.exceptions.CancelledError

../../../.local/lib/python3.8/site-packages/nbclient/client.py:949: CancelledError

During handling of the above exception, another exception occurred:

def test_func():
    with testbook(
        REPO_ROOT
        / "examples"
        / "Building-and-deploying-multi-stage-RecSys"
        / "01-Building-Recommender-Systems-with-Merlin.ipynb",
        execute=False,
    ) as tb1:
        tb1.inject(
            """
            import os
            os.environ["DATA_FOLDER"] = "/tmp/data/"
            os.environ["NUM_ROWS"] = "10000"
            os.system("mkdir -p /tmp/examples")
            os.environ["BASE_DIR"] = "/tmp/examples/"
            """
        )
>       tb1.execute()

tests/unit/examples/test_building_deploying_multi_stage_RecSys.py:31:


../../../.local/lib/python3.8/site-packages/testbook/client.py:147: in execute
super().execute_cell(cell, index)
../../../.local/lib/python3.8/site-packages/nbclient/util.py:84: in wrapped
return just_run(coro(*args, **kwargs))
../../../.local/lib/python3.8/site-packages/nbclient/util.py:62: in just_run
return loop.run_until_complete(coro)
/usr/lib/python3.8/asyncio/base_events.py:616: in run_until_complete
return future.result()


self = <testbook.client.TestbookNotebookClient object at 0x7efb38eeb8b0>
cell = {'cell_type': 'code', 'execution_count': 20, 'id': 'd6703d7c-d38f-4d6d-a20a-9ee95ff1e256', 'metadata': {'execution': {...cs=[mm.RecallAt(10), mm.NDCGAt(10)],\n)\nmodel.fit(train_tt, validation_data=valid_tt, batch_size=1024 * 8, epochs=1)'}
cell_index = 41, execution_count = None, store_history = True

async def async_execute_cell(
    self,
    cell: NotebookNode,
    cell_index: int,
    execution_count: t.Optional[int] = None,
    store_history: bool = True,
) -> NotebookNode:
    """
    Executes a single code cell.

    To execute all cells see :meth:`execute`.

    Parameters
    ----------
    cell : nbformat.NotebookNode
        The cell which is currently being processed.
    cell_index : int
        The position of the cell within the notebook object.
    execution_count : int
        The execution count to be assigned to the cell (default: Use kernel response)
    store_history : bool
        Determines if history should be stored in the kernel (default: False).
        Specific to ipython kernels, which can store command histories.

    Returns
    -------
    output : dict
        The execution output payload (or None for no output).

    Raises
    ------
    CellExecutionError
        If execution failed and should raise an exception, this will be raised
        with defaults about the failure.

    Returns
    -------
    cell : NotebookNode
        The cell which was just processed.
    """
    assert self.kc is not None

    await run_hook(self.on_cell_start, cell=cell, cell_index=cell_index)

    if cell.cell_type != 'code' or not cell.source.strip():
        self.log.debug("Skipping non-executing cell %s", cell_index)
        return cell

    if self.skip_cells_with_tag in cell.metadata.get("tags", []):
        self.log.debug("Skipping tagged cell %s", cell_index)
        return cell

    if self.record_timing:  # clear execution metadata prior to execution
        cell['metadata']['execution'] = {}

    self.log.debug("Executing cell:\n%s", cell.source)

    cell_allows_errors = (not self.force_raise_errors) and (
        self.allow_errors or "raises-exception" in cell.metadata.get("tags", [])
    )

    await run_hook(self.on_cell_execute, cell=cell, cell_index=cell_index)
    parent_msg_id = await ensure_async(
        self.kc.execute(
            cell.source, store_history=store_history, stop_on_error=not cell_allows_errors
        )
    )
    await run_hook(self.on_cell_complete, cell=cell, cell_index=cell_index)
    # We launched a code cell to execute
    self.code_cells_executed += 1
    exec_timeout = self._get_timeout(cell)

    cell.outputs = []
    self.clear_before_next_output = False

    task_poll_kernel_alive = asyncio.ensure_future(self._async_poll_kernel_alive())
    task_poll_output_msg = asyncio.ensure_future(
        self._async_poll_output_msg(parent_msg_id, cell, cell_index)
    )
    self.task_poll_for_reply = asyncio.ensure_future(
        self._async_poll_for_reply(
            parent_msg_id, cell, exec_timeout, task_poll_output_msg, task_poll_kernel_alive
        )
    )
    try:
        exec_reply = await self.task_poll_for_reply
    except asyncio.CancelledError:
        # can only be cancelled by task_poll_kernel_alive when the kernel is dead
        task_poll_output_msg.cancel()
>       raise DeadKernelError("Kernel died")

E nbclient.exceptions.DeadKernelError: Kernel died

../../../.local/lib/python3.8/site-packages/nbclient/client.py:953: DeadKernelError
----------------------------- Captured stderr call -----------------------------
2022-07-07 23:55:23.162885: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-07 23:55:25.175411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB, pci bus id: 0000:07:00.0, compute capability: 6.0
2022-07-07 23:55:25.176143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15153 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB, pci bus id: 0000:08:00.0, compute capability: 6.0
------------------------------ Captured log call -------------------------------
ERROR traitlets:client.py:808 Kernel died while waiting for execute reply.
=========================== short test summary info ============================
FAILED tests/unit/examples/test_building_deploying_multi_stage_RecSys.py::test_func
========================= 1 failed, 1 passed in 28.88s =========================
Terminated
Build was aborted
Aborted by admin
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins1922831465838632623.sh
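
The captured stderr in this run reports GPU:0 as created with only 1627 MB while GPU:1 had 15153 MB. One possible reading, which the log itself does not confirm, is that the notebook kernel died from memory pressure on GPU:0 during `model.fit`. The following is a minimal sketch for flagging such devices in a TensorFlow log, assuming the "Created device" line format shown above; the function name `low_memory_gpus` and the 4096 MB threshold are hypothetical and not part of the CI setup.

```python
import re

# Matches the device index and memory size in TensorFlow "Created device" lines.
# The pattern is an assumption based on the two lines in the captured stderr.
DEVICE_RE = re.compile(r"device:GPU:(\d+) with (\d+) MB memory")


def low_memory_gpus(log_text, min_mb=4096):
    """Return {gpu_index: memory_mb} for GPUs created with less than min_mb.

    min_mb is an illustrative threshold, not a value taken from the CI job.
    """
    flagged = {}
    for match in DEVICE_RE.finditer(log_text):
        index, memory_mb = int(match.group(1)), int(match.group(2))
        if memory_mb < min_mb:
            flagged[index] = memory_mb
    return flagged


# Shortened copies of the two log lines from the failing run.
SAMPLE = (
    "Created device /job:localhost/replica:0/task:0/device:GPU:0 "
    "with 1627 MB memory: -> device: 0, name: Tesla P100-DGXS-16GB\n"
    "Created device /job:localhost/replica:0/task:0/device:GPU:1 "
    "with 15153 MB memory: -> device: 1, name: Tesla P100-DGXS-16GB\n"
)

print(low_memory_gpus(SAMPLE))  # {0: 1627}
```

The next run on the same branch passed, so the failure may have been transient; a check like this would only help narrow down whether another process was occupying GPU:0 at the time.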

@nvidia-merlin-bot
Contributor

Click to view CI Results
GitHub pull request #434 of commit 3593a68a2240153791ff8c96a671756ec9ff0f47, no merge conflicts.
Running as SYSTEM
Setting status of 3593a68a2240153791ff8c96a671756ec9ff0f47 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/247/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/434/*:refs/remotes/origin/pr/434/* # timeout=10
 > git rev-parse 3593a68a2240153791ff8c96a671756ec9ff0f47^{commit} # timeout=10
Checking out Revision 3593a68a2240153791ff8c96a671756ec9ff0f47 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 3593a68a2240153791ff8c96a671756ec9ff0f47 # timeout=10
Commit message: "Merge branch 'main' into switch-over-docker"
 > git rev-list --no-walk e53010abc87330db036f4f6c7adf67278b7e040b # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins16539016535425486066.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: anyio-3.5.0, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 2 items

tests/unit/test_version.py . [ 50%]
tests/unit/examples/test_building_deploying_multi_stage_RecSys.py . [100%]

======================== 2 passed in 146.07s (0:02:26) =========================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins14023875530921467858.sh
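
Each of the four CI comments above ends its test session with a pytest summary line. As a rough sketch for comparing runs, the snippet below parses those lines into outcome counts and a duration. The helper name `parse_summary` is hypothetical, and the regular expression assumes the summary format seen in these logs.

```python
import re

# Captures the outcome list and the duration from a pytest summary line, e.g.
# "===== 1 failed, 1 passed in 28.88s =====" (optionally with "(0:02:13)").
SUMMARY_RE = re.compile(r"=+ (.+?) in ([\d.]+)s(?: \([\d:]+\))? =+")


def parse_summary(line):
    """Return (counts, seconds) for a pytest summary line, or None if no match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    counts = {}
    for part in match.group(1).split(", "):
        number, outcome = part.split(" ", 1)
        counts[outcome] = int(number)
    return counts, float(match.group(2))


# Summary lines copied from the four runs in this thread.
runs = [
    "======================== 2 passed in 133.61s (0:02:13) =========================",
    "======================== 2 passed in 144.61s (0:02:24) =========================",
    "========================= 1 failed, 1 passed in 28.88s =========================",
    "======================== 2 passed in 146.07s (0:02:26) =========================",
]

for line in runs:
    print(parse_summary(line))
```

Read this way, the single failing run stands out both by its `failed` count and by finishing far sooner than the passing runs.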

@jperez999 jperez999 merged commit 91f848c into NVIDIA-Merlin:main Jul 8, 2022
Labels
ci, enhancement (New feature or request)
3 participants