Add fake GPU device generator for scalability testing #1116
The 7 plugin tests that remain in "expected" state show:
What should I do next?
While this adds a new container to the devices project, the program itself has no dependencies outside the Go standard library and golang.org/x/sys/unix. It currently has no unit tests, and I'm not sure what those should cover, as the program is itself a (scalability) testing tool for the GPU plugin. The most relevant test would be whether the GPU plugin reports the expected number of GPUs for the content created by the tool, but because the tool generates device files, running it requires either root or a suitable capability for creating them.
To facilitate GPU plugin scalability testing on a real cluster.

Pre-existing (fake) sysfs & devfs content needs to be removed first:
* Fake devfs directory is mounted from the host so that the OCI runtime can "mount" device files also to workloads requesting fake devices. This means that those files can persist beyond the fake GPU plugin's lifetime, and earlier files need to be removed, as they may not match
* DaemonSet restarts failing init containers, so errors about content created on a previous generator run would prevent getting logs of the real error from the first generator run
* Before removal, check that the content of the directory to be removed is as expected, to avoid accidentally removing host sysfs/devfs content (in case the container was erroneously granted access to the real thing)

Container runtime requires fake device files to be real devices:
* Use NULL devices to represent fake GPU devices: https://www.kernel.org/doc/Documentation/admin-guide/devices.txt
* Give more detailed logging for MkNod() failures, as device node creation is the operation most likely to fail when the container does not have the necessary access rights

Created content is based on a JSON config file (instead of e.g. command line options) so that it (or the configMap providing it) can be updated independently of the pod where the generator is run.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Config file is suitably indented so that it can be directly appended to a suitable configMap header. Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
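Because JSON ignores leading whitespace, a config file indented for embedding in a configMap can still be parsed directly. A minimal sketch of loading such a file follows; the fakeConfig type, the Info and DevCount field names, and the parseConfig helper are hypothetical and chosen only for illustration, so the generator's real schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// fakeConfig uses hypothetical field names for illustration;
// the real schema is whatever the generator defines.
type fakeConfig struct {
	Info     string `json:"Info"`
	DevCount int    `json:"DevCount"`
}

// parseConfig decodes a JSON config, rejecting unknown fields so that
// typos in a configMap are reported instead of silently ignored.
func parseConfig(data string) (fakeConfig, error) {
	var cfg fakeConfig
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&cfg); err != nil {
		return cfg, fmt.Errorf("parsing config: %w", err)
	}
	if cfg.DevCount < 1 {
		return cfg, fmt.Errorf("DevCount must be positive, got %d", cfg.DevCount)
	}
	return cfg, nil
}

func main() {
	// Indented as it might be when appended under a configMap data key.
	indented := `
    {
      "Info": "2 fake GPUs",
      "DevCount": 2
    }
`
	cfg, err := parseConfig(indented)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s: %d devices\n", cfg.Info, cfg.DevCount)
}
```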
Updated the doc and rebased on main now that the GitHub problems are fixed, so that CI tests run for the first time.
Codecov Report
@@           Coverage Diff           @@
##             main    #1116   +/-   ##
=======================================
  Coverage   53.01%   53.01%
=======================================
  Files          40       40
  Lines        4350     4350
=======================================
  Hits         2306     2306
  Misses       1917     1917
  Partials      127      127
@mythi Any comments on this now that the release is done?
Note:
/lgtm
The whole picture and earlier review comments are in the RFC PR #1104, from which this is split off.
Compared to the RFC PR, I've moved / renamed the generator code to gpu_fakedev/gpu_fakedev.go and added documentation + an intel-gpu-fakedev container for it.