Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove upstream dependencies that have no outputs #107

Merged
merged 2 commits into from
Jul 25, 2022

Conversation

nv-alaiacano
Copy link
Contributor

This resolves the bug reported in NVIDIA-Merlin/NVTabular#1632

The issue was that we were successfully removing parents when calling remove_inputs but the node still showed up in the dependencies of the downstream nodes. We add logic to remove any dependency that does not have an output_schema.

I'll try adding a regression test as well.

@nv-alaiacano nv-alaiacano added the bug Something isn't working label Jul 22, 2022
@nv-alaiacano nv-alaiacano self-assigned this Jul 22, 2022
@nvidia-merlin-bot
Copy link

Click to view CI Results
GitHub pull request #107 of commit c5ffc695caf8ddcb2b9123386c51897534dfe8e2, no merge conflicts.
Running as SYSTEM
Setting status of c5ffc695caf8ddcb2b9123386c51897534dfe8e2 to PENDING with url https://10.20.13.93:8080/job/merlin_core/82/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/107/*:refs/remotes/origin/pr/107/* # timeout=10
 > git rev-parse c5ffc695caf8ddcb2b9123386c51897534dfe8e2^{commit} # timeout=10
Checking out Revision c5ffc695caf8ddcb2b9123386c51897534dfe8e2 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f c5ffc695caf8ddcb2b9123386c51897534dfe8e2 # timeout=10
Commit message: "remove upstream dependencies that have no outputs"
 > git rev-list --no-walk d74519e495606eeac3237ec814db4c9df31c80c6 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins2969909161860032254.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 342 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 11%]
tests/unit/io/test_io.py ............................................... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36103 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34807 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34479 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 45045 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 32789 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 46621 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 342 passed, 1 skipped, 82 warnings in 50.49s =================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins17788858930717798198.sh

if node.input_schema and len(node.input_schema):
output_columns_to_remove = node.remove_inputs(columns_to_remove)

for child in node.children:
nodes_to_process.append((child, to_remove + output_columns_to_remove))
nodes_to_process.append(
(child, list(set(to_remove + output_columns_to_remove)))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line's change is just to de-duplicate the list of output columns. without it, we would end up with something like ["target", "target"] that would still technically work but made less sense when debugging this.

@karlhigley
Copy link
Contributor

LGTM once it has a test

@nvidia-merlin-bot
Copy link

Click to view CI Results
GitHub pull request #107 of commit 74969c31bbe8f94331db230748737996b4814547, no merge conflicts.
Running as SYSTEM
Setting status of 74969c31bbe8f94331db230748737996b4814547 to PENDING with url https://10.20.13.93:8080/job/merlin_core/83/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_core
using credential ce87ff3c-94f0-400a-8303-cb4acb4918b5
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/core # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/core
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems username and pass
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/core +refs/pull/107/*:refs/remotes/origin/pr/107/* # timeout=10
 > git rev-parse 74969c31bbe8f94331db230748737996b4814547^{commit} # timeout=10
Checking out Revision 74969c31bbe8f94331db230748737996b4814547 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 74969c31bbe8f94331db230748737996b4814547 # timeout=10
Commit message: "add regression test for Graph dependencies"
 > git rev-list --no-walk c5ffc695caf8ddcb2b9123386c51897534dfe8e2 # timeout=10
[merlin_core] $ /bin/bash /tmp/jenkins11609524086250452249.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_core/core, configfile: pyproject.toml
plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 343 items / 1 skipped

tests/unit/core/test_dispatch.py .. [ 0%]
tests/unit/dag/test_base_operator.py .... [ 1%]
tests/unit/dag/test_column_selector.py .......................... [ 9%]
tests/unit/dag/test_graph.py . [ 9%]
tests/unit/dag/test_tags.py ...... [ 11%]
tests/unit/dag/ops/test_selection.py ... [ 12%]
tests/unit/io/test_io.py ............................................... [ 25%]
................................................................ [ 44%]
tests/unit/schema/test_column_schemas.py ............................... [ 53%]
........................................................................ [ 74%]
....................................................................... [ 95%]
tests/unit/schema/test_schema.py ...... [ 97%]
tests/unit/schema/test_schema_io.py .. [ 97%]
tests/unit/utils/test_utils.py ........ [100%]

=============================== warnings summary ===============================
tests/unit/dag/test_base_operator.py: 4 warnings
tests/unit/io/test_io.py: 71 warnings
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(

tests/unit/io/test_io.py::test_validate_and_regenerate_dataset
/var/jenkins_home/workspace/merlin_core/core/merlin/io/parquet.py:551: DeprecationWarning: 'ParquetDataset.pieces' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version. Specify 'use_legacy_dataset=False' while constructing the ParquetDataset, and then use the '.fragments' attribute instead.
paths = [p.path for p in pa_dataset.pieces]

tests/unit/utils/test_utils.py::test_nvt_distributed[True-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34359 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[True-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 34577 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 44345 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed[False-False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40567 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[True]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 36575 instead
warnings.warn(

tests/unit/utils/test_utils.py::test_nvt_distributed_force[False]
/usr/local/lib/python3.8/dist-packages/distributed/node.py:180: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 40963 instead
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================= 343 passed, 1 skipped, 82 warnings in 50.97s =================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/core/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_core] $ /bin/bash /tmp/jenkins1304071743833372603.sh

@github-actions
Copy link

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-107

@benfred
Copy link
Member

benfred commented Jul 23, 2022

I just noticed this PR - after I had already submitted a fix of my own #108 =(

@karlhigley karlhigley merged commit 1354dcf into NVIDIA-Merlin:main Jul 25, 2022
@radekosmulski
Copy link

@benfred @nv-alaiacano @karlhigley I would like to thank you all for this super timely fix! 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants