WIP: Fix HDFS linking #151
bashimao commented Mar 15, 2022
- Contains the fixes necessary to resolve the existing HDFS issues for HugeCTR.
- Also meticulously dissects all dependencies and builds them in a (hopefully) sensible manner.
Ready for initial review.
@bashimao can you resolve the merge conflicts?
This looks like it's a substantial change over just updating HDFS, which makes it much harder to review. Can we split this into multiple PRs? The first one would be a smaller PR focused on just HDFS linking, with later ones covering the other improvements.
For instance, this PR changes not just how we build/install HDFS but also how we build NVTabular / Transformers4rec and how we handle dependencies for those libraries. Since it doesn't also update the TF/PT dockerfiles, we would have two different ways of installing these libraries, and we'd have to maintain both for now.
As you mentioned:
I think this should be 2 different PRs. It is quite hard to review. In addition, all dockerfiles should follow the same style. Also, I guess we are going to need a meeting, as there are substantial changes.
Will do. (Sorry, my half of Shanghai was/still is in complete lockdown, with daily testing.)
That might be possible. Let me give it some thought.
I kind of feared you would say that. As I said, I came in from an angle where we had to take the container apart slice by slice to figure out what was wrong. Consequently, I then built the solution by piecing it together slice by slice. Having said that, I can understand that it feels a little heavy as a single change.
Arguably, yes.
5e5a66c to 7ecf3cc
CI Results
GitHub pull request #151 of commit 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8, no merge conflicts.
Running as SYSTEM
Setting status of 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/7/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/151/*:refs/remotes/origin/pr/151/* # timeout=10
 > git rev-parse 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8^{commit} # timeout=10
Checking out Revision 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8 # timeout=10
Commit message: "Less invasive approach to marry HugeCTR and Merlin container config."
 > git rev-list --no-walk 94f230c7cc1d8f95e5c285eb91ed97ed4534e700 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins7458613170408113306.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1 item
@benfred @albert17 I only modified the training docker for now. Please let me know what you think. If you are OK with this approach, I will modify the inference container accordingly. As for the other issue with duplicate files, we can only really solve this by either agreeing to have a specific
@EvenOldridge FYI, feel free to comment.
@benfred @EvenOldridge Both images adjusted. Please re-check.
CI Results
GitHub pull request #151 of commit e44afdad30e6214c0ed4163264090d9f5a5c4914, no merge conflicts.
Running as SYSTEM
Setting status of e44afdad30e6214c0ed4163264090d9f5a5c4914 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/45/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/151/*:refs/remotes/origin/pr/151/* # timeout=10
 > git rev-parse e44afdad30e6214c0ed4163264090d9f5a5c4914^{commit} # timeout=10
Checking out Revision e44afdad30e6214c0ed4163264090d9f5a5c4914 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f e44afdad30e6214c0ed4163264090d9f5a5c4914 # timeout=10
Commit message: "Adjust inference image accordingly."
 > git rev-list --no-walk 22f3f689866b23a96d7f48b817e7713cf8904412 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins5179122541206650314.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1 item
It definitely looks much better now. @jperez999 What do you think?