Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvidia-gpu 2.0 make compatible for k8s #2434

Merged
merged 2 commits into from
Nov 9, 2018

Conversation

CodeJuan
Copy link
Contributor

@CodeJuan CodeJuan commented Nov 6, 2018

Ⅰ. Describe what this PR did

k8s-device-plugin uses environment variables to specify a GPU accelerated container, so we should support it.
In order to be compatible with k8s, pouchd should set nvidia prestart hook if nvidia environment variable was set by user.

Ⅱ. Does this pull request fix one issue?

Ⅲ. Why don't you add test cases (unit test/integration test)? (你真的觉得不需要加测试吗?)

Ⅳ. Describe how to verify it

  1. Prerequisites:
    • An instance with a NVIDIA GPU
    • The appropriate GPU driver is installed
  2. Run GPU container with nvidia env.
pouch run -it -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all centos:7 bash
  1. Exec nvidia-smi in container.
  2. If nvidia-smi return error NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system, then add a link to libnvidia-ml.so
version=`nvidia-smi --help | head -n 1 | awk -F "-- v" '{print $2}'`
ln -s  /usr/lib64/libnvidia-cfg.so.1 /usr/lib64/libnvidia-cfg.so
ln -s /usr/lib64/libnvidia-cfg.so."$version" /usr/lib64/libnvidia-cfg.so.1
ln -s  /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so
ln -s /usr/lib64/libnvidia-ml.so."$version" /usr/lib64/libnvidia-ml.so.1

Ⅴ. Special notes for reviews

Signed-off-by: codejuan <xh@decbug.com>
@codecov
Copy link

codecov bot commented Nov 6, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@401e132). Click here to learn what that means.
The diff coverage is 72.72%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #2434   +/-   ##
=========================================
  Coverage          ?   68.88%           
=========================================
  Files             ?      277           
  Lines             ?    18221           
  Branches          ?        0           
=========================================
  Hits              ?    12552           
  Misses            ?     4250           
  Partials          ?     1419
Flag Coverage Δ
#criv1alpha1test 31.62% <31.81%> (?)
#criv1alpha2test 35.78% <31.81%> (?)
#integrationtest 40.13% <31.81%> (?)
#nodee2etest 33.16% <31.81%> (?)
#unittest 26.73% <72.72%> (?)
Impacted Files Coverage Δ
daemon/mgr/spec_hook.go 22.72% <0%> (ø)
daemon/mgr/spec_nvidia_hook.go 76.19% <76.19%> (ø)

@CLAassistant
Copy link

CLAassistant commented Nov 6, 2018

CLA assistant check
All committers have signed the CLA.

@CodeJuan CodeJuan force-pushed the sync_gpu2.0 branch 2 times, most recently from 4215eb2 to 9b18298 Compare November 6, 2018 07:50
fullname := path.Join(installDir, nvidiaHookName)
os.Remove(fullname)
os.Create(fullname)
os.Chmod(fullname, 0755)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding defer os.Remove(fullname) here? delete the test-nvidia-container-runtime-hook after test finished

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there is defer function at line22.
Sometimes test was broken unexpected, we should delete the mock file at first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry master Xiong, my fault

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take it easy man 😁

)

var (
nvidiaHookName = "nvidia-container-runtime-hook"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add more description to this variable. Is this is local file when executing hook?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both nvidia-container-runtime-hook and nvidia-container-cli was packaged in pouch rpm.

@@ -0,0 +1,36 @@
# PouchContainer with GPU

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the graph, I think we should add a general introduction of the usage of PouchContainer with GPU.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@CodeJuan CodeJuan force-pushed the sync_gpu2.0 branch 4 times, most recently from c1ead44 to 70c2561 Compare November 6, 2018 08:30
@zhuangqh
Copy link
Contributor

zhuangqh commented Nov 6, 2018

UT fails @CodeJuan

=== RUN   Test_setNvidiaHook
--- FAIL: Test_setNvidiaHook (0.00s)
	spec_nvidia_hook_test.go:108: setNvidiaHook = exec: "test-nvidia-container-runtime-hook": executable file not found in $PATH, want <nil>
	spec_nvidia_hook_test.go:111: setNvidiaHook = [], want [{ [ prestart] [] <nil>}]
	spec_nvidia_hook_test.go:108: setNvidiaHook = exec: "test-nvidia-container-runtime-hook": executable file not found in $PATH, want <nil>
	spec_nvidia_hook_test.go:111: setNvidiaHook = [], want [{ [ prestart] [] <nil>}]
FAIL
coverage: 10.8% of statements
FAIL	github.com/alibaba/pouch/daemon/mgr	0.593s
make: *** [unit-test] Error 1

@CodeJuan CodeJuan force-pushed the sync_gpu2.0 branch 3 times, most recently from c54fb26 to bbde654 Compare November 7, 2018 06:52
Signed-off-by: codejuan <xh@decbug.com>
@zhuangqh
Copy link
Contributor

zhuangqh commented Nov 9, 2018

LGTM

@zhuangqh zhuangqh merged commit d16badf into AliyunContainerService:master Nov 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants