nvidia-gpu 2.0 make compatible for k8s #2434

CodeJuan · 2018-11-06T07:27:53Z

Ⅰ. Describe what this PR did

k8s-device-plugin uses environment variables to mark a container as GPU-accelerated, so pouchd should support them.
To be compatible with k8s, pouchd should set the NVIDIA prestart hook whenever the user sets an NVIDIA environment variable.
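
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Hook` type, the `wantsGPU` and `prestartHooks` helpers, and the hook path are hypothetical, and the sketch assumes that `NVIDIA_VISIBLE_DEVICES` is the variable that signals a GPU request. The `prestart` argument matches the hook arguments visible in the unit test output later in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// Hook is a simplified stand-in for an OCI runtime-spec hook entry
// (hypothetical type for this sketch, not pouchd's actual type).
type Hook struct {
	Path string
	Args []string
}

// gpuEnvKey is the variable this sketch treats as the GPU request signal.
// Assumption: the real check may consider other NVIDIA_* variables as well.
const gpuEnvKey = "NVIDIA_VISIBLE_DEVICES"

// wantsGPU reports whether env (KEY=VALUE entries) sets gpuEnvKey to a
// non-empty value.
func wantsGPU(env []string) bool {
	for _, kv := range env {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) == 2 && parts[0] == gpuEnvKey && parts[1] != "" {
			return true
		}
	}
	return false
}

// prestartHooks returns the NVIDIA prestart hook when the container asks
// for a GPU, and nil otherwise.
func prestartHooks(env []string, hookPath string) []Hook {
	if !wantsGPU(env) {
		return nil
	}
	return []Hook{{Path: hookPath, Args: []string{hookPath, "prestart"}}}
}

func main() {
	// The hook path below is illustrative only.
	env := []string{"PATH=/usr/bin", "NVIDIA_VISIBLE_DEVICES=all"}
	hooks := prestartHooks(env, "/usr/bin/nvidia-container-runtime-hook")
	fmt.Println(len(hooks), hooks[0].Args)
}
```

Containers that do not set the variable get no hook, so non-GPU workloads are unaffected in this sketch.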

Ⅱ. Does this pull request fix one issue?

Ⅲ. Why don't you add test cases (unit test/integration test)? (Do you really think no tests are needed?)

Ⅳ. Describe how to verify it

Prerequisites:
- An instance with an NVIDIA GPU
- The appropriate GPU driver is installed

Run a GPU container with the NVIDIA environment variables set:

pouch run -it -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all centos:7 bash

Run nvidia-smi inside the container.
If nvidia-smi returns the error "NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system", add symbolic links for libnvidia-ml.so and libnvidia-cfg.so:

# Extract the driver version from the first line of the nvidia-smi help output.
version=$(nvidia-smi --help | head -n 1 | awk -F "-- v" '{print $2}')
# Link the versioned libraries to the unversioned names.
ln -s /usr/lib64/libnvidia-cfg.so."$version" /usr/lib64/libnvidia-cfg.so.1
ln -s /usr/lib64/libnvidia-cfg.so.1 /usr/lib64/libnvidia-cfg.so
ln -s /usr/lib64/libnvidia-ml.so."$version" /usr/lib64/libnvidia-ml.so.1
ln -s /usr/lib64/libnvidia-ml.so.1 /usr/lib64/libnvidia-ml.so
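
For readers who want the same version extraction in Go, here is a rough equivalent of the awk step. It is a sketch only: `driverVersion` is a hypothetical helper, the sample input line is illustrative, and unlike awk it also trims surrounding whitespace.

```go
package main

import (
	"fmt"
	"strings"
)

// driverVersion returns the text after the "-- v" marker in the first line
// of the nvidia-smi help output, similar to awk -F "-- v" '{print $2}'.
// Unlike awk, it trims surrounding whitespace from the result.
func driverVersion(helpFirstLine string) (string, error) {
	parts := strings.SplitN(helpFirstLine, "-- v", 3)
	if len(parts) < 2 {
		return "", fmt.Errorf("no version marker in %q", helpFirstLine)
	}
	return strings.TrimSpace(parts[1]), nil
}

func main() {
	// Illustrative input; the real first line depends on the installed driver.
	v, err := driverVersion("NVIDIA System Management Interface -- v410.48")
	fmt.Println(v, err)
}
```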

Ⅴ. Special notes for reviews

Signed-off-by: codejuan <xh@decbug.com>

codecov · 2018-11-06T07:27:55Z

Codecov Report

❗ No coverage uploaded for pull request base (master@401e132).
The diff coverage is 72.72%.

@@            Coverage Diff            @@
##             master    #2434   +/-   ##
=========================================
  Coverage          ?   68.88%           
=========================================
  Files             ?      277           
  Lines             ?    18221           
  Branches          ?        0           
=========================================
  Hits              ?    12552           
  Misses            ?     4250           
  Partials          ?     1419

Flag	Coverage Δ
#criv1alpha1test	`31.62% <31.81%> (?)`
#criv1alpha2test	`35.78% <31.81%> (?)`
#integrationtest	`40.13% <31.81%> (?)`
#nodee2etest	`33.16% <31.81%> (?)`
#unittest	`26.73% <72.72%> (?)`

Impacted Files	Coverage Δ
daemon/mgr/spec_hook.go	`22.72% <0%> (ø)`
daemon/mgr/spec_nvidia_hook.go	`76.19% <76.19%> (ø)`

CLAassistant · 2018-11-06T07:28:00Z

All committers have signed the CLA.

HusterWan · 2018-11-06T07:53:03Z

daemon/mgr/spec_nvidia_hook_test.go

+	fullname := path.Join(installDir, nvidiaHookName)
+	os.Remove(fullname)
+	os.Create(fullname)
+	os.Chmod(fullname, 0755)


How about adding defer os.Remove(fullname) here, to delete test-nvidia-container-runtime-hook after the test finishes?

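CodeJuan · 2018-11-06

There is already a deferred cleanup function at line 22.
A test run can be interrupted unexpectedly and leave the mock file behind, so we also delete it before creating it.

HusterWan · 2018-11-06

Sorry, Master Xiong, my fault.

CodeJuan · 2018-11-06

Take it easy, man 😁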

allencloud · 2018-11-06T07:59:58Z

daemon/mgr/spec_nvidia_hook.go

+)
+
+var (
+	nvidiaHookName = "nvidia-container-runtime-hook"


Please add more description to this variable. Is this a local file when the hook is executed?

CodeJuan · 2018-11-06

Both nvidia-container-runtime-hook and nvidia-container-cli are packaged in the pouch RPM.

allencloud · 2018-11-06T08:02:44Z

docs/features/pouch_with_gpu.md

@@ -0,0 +1,36 @@
+# PouchContainer with GPU
+


Before the graph, I think we should add a general introduction to using PouchContainer with GPU.

zhuangqh · 2018-11-06T10:55:44Z

The unit test fails, @CodeJuan:

=== RUN   Test_setNvidiaHook
--- FAIL: Test_setNvidiaHook (0.00s)
	spec_nvidia_hook_test.go:108: setNvidiaHook = exec: "test-nvidia-container-runtime-hook": executable file not found in $PATH, want <nil>
	spec_nvidia_hook_test.go:111: setNvidiaHook = [], want [{ [ prestart] [] <nil>}]
	spec_nvidia_hook_test.go:108: setNvidiaHook = exec: "test-nvidia-container-runtime-hook": executable file not found in $PATH, want <nil>
	spec_nvidia_hook_test.go:111: setNvidiaHook = [], want [{ [ prestart] [] <nil>}]
FAIL
coverage: 10.8% of statements
FAIL	github.com/alibaba/pouch/daemon/mgr	0.593s
make: *** [unit-test] Error 1

zhuangqh · 2018-11-09T07:50:51Z

LGTM

feature: nvidia-gpu 2.0 make compatible for k8s

a52c6f3

Signed-off-by: codejuan <xh@decbug.com>

pouchrobot added the size/XL label Nov 6, 2018

CodeJuan force-pushed the sync_gpu2.0 branch 2 times, most recently from 4215eb2 to 9b18298, November 6, 2018 07:50

HusterWan reviewed Nov 6, 2018

allencloud reviewed Nov 6, 2018

CodeJuan force-pushed the sync_gpu2.0 branch 4 times, most recently from c1ead44 to 70c2561, November 6, 2018 08:30

CodeJuan force-pushed the sync_gpu2.0 branch 3 times, most recently from c54fb26 to bbde654, November 7, 2018 06:52

docs: how to enable nvidia gpu 2.0 in Pouch

727f079

Signed-off-by: codejuan <xh@decbug.com>

CodeJuan force-pushed the sync_gpu2.0 branch from bbde654 to 727f079, November 9, 2018 03:46

zhuangqh merged commit d16badf into AliyunContainerService:master Nov 9, 2018

pouchrobot mentioned this pull request Nov 16, 2018

WeeklyReport of PouchContainer from 2018-11-09 to 2018-11-16 #2471

Closed
