Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional gpu tests (New) #1359

Merged
merged 18 commits into from
Aug 1, 2024
Merged

Add additional gpu tests (New) #1359

merged 18 commits into from
Aug 1, 2024

Conversation

pedro-avalos
Copy link
Collaborator

Description

  • Update gpu-setup script to work with arm64
  • Update gpu-setup script to build tests from cuda-samples
  • Add 4 new gpgpu tests from cuda-samples
  • Separate gpgpu automated tests from stress tests

Resolved issues

https://warthogs.atlassian.net/browse/CHECKBOX-967

Documentation

N/A

Tests

To run tests: install nvidia drivers; then run tools/gpu-setup bash script to install cuda-toolkit and make the cuda-sample tests.

Run checkbox-cli run com.canonical.certification::gpgpu-automated.

Copy link

codecov bot commented Jul 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.12%. Comparing base (9639ba7) to head (659a452).
Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1359   +/-   ##
=======================================
  Coverage   45.12%   45.12%           
=======================================
  Files         366      366           
  Lines       39058    39058           
  Branches     6607     6607           
=======================================
  Hits        17626    17626           
  Misses      20758    20758           
  Partials      674      674           
Flag Coverage Δ
provider-gpgpu 57.14% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider the comments below. In general we have to consider distribution when adding dependencies to providers and do so carefully, there are a lot of moving parts.

Consider also rebasing this PR, you will get a new pipeline that builds the packages and checks that everything is allright.

providers/gpgpu/tools/gpu-setup Outdated Show resolved Hide resolved
providers/gpgpu/tools/gpu-setup Outdated Show resolved Hide resolved
@Hook25 Hook25 added the waiting-for-changes The review has been completed but the PR is waiting for changes from the author label Jul 29, 2024
This hardcodes the current gpg key and checks its fingerprint.
NOTE: It seems like x86_64 is the only architecture supported
everywhere. Nvidia seems to support arm64 in *some* cases, but not a
lot. Should we only support x86_64, then?
This commit changes the gpu-setup script behaviour to build the cuda-samples
and gpu-burn projects inside the `build/bin` directory, then copy them
out into the `bin/` and (the necessary data files) into `data/`. For the
cuda-samples executables to work, they need access to the data files,
but they do not take the path to the data dir as an argument; to
circumvent this limitation, I have made wrapper scripts that copy the
necessary file into the temporary working directory that checkbox
creates.

Because of the change in build behaviour, the `gpu-setup` script now runs
mostly as a regular user (to avoid permission issues when cleaning
directories/builds). The expected operation now is to run `./manage.py build`
instead of running the `gpu-setup.sh` script itself. This is more inline
with what is done with the other providers.
For now, we are limiting the gpgpu tests to x86_64 since nvidia only
supports x86_64 consistently across distributions/releases.
Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok the changes are taking shape. Well done!

I've tested the snap builds on your branch and they don't work anymore. This is the "tail" of the log that leads to a failure

Configuring system for GPU Testing
**********************************
*
*  Testing network connectivity
../../tools/gpu-setup: line 10: ping: command not found
ERROR: This script requires internet access to function correctly
make: *** [gpu-setup] Error 1
../../src/Makefile:5: recipe for target 'gpu-setup' failed
Failed to run 'override-build': Exit code was 2.
Build failed

To reproduce this you can do the following

cd checkbox-core-snap; ./prepare.sh series16 
cd series16
snapcraft remote-build

This should fail with the same error I'm giving you. It is crucial that you use remote-build as the image LP uses is slightly different to the one you have in lxd

As to how to fix this, try to add ping to the build-depends of the part. If that works please do also test series18, 20, 22 and 24. Note that for 24 the build process is slightly different (you have to use snapcraft8_prepare.sh and commit the changes, you can then reset HEAD~1 once you are done).

This allows the packaging to complete. The gpgpu provider still fails
due to some issues with setting up the repository, but it does not
prevent the packaging to complete. We may need to look
into vendorizing some of the dependencies...
These paths are now resolved to absolute paths to the gpgpu provider's
subdirectories. I also made sure to clean up left-over data files in the
wrapper scripts.
This reverts commit a7def84.

We will revisit properly packaging the gpgpu provider at a later time.
Copy link
Collaborator

@Hook25 Hook25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per our agreement, this is way too (legacy) broken to fix it all without a follow up and we need to decide what is the best way to do that. For now we land this excluded from the provider building system.

Please do create the cards in jira to highlight the issue at hand and we will discuss them and schedule them in a future pulse

@Hook25 Hook25 assigned Hook25 and unassigned pieqq Aug 1, 2024
@pedro-avalos pedro-avalos merged commit b3df727 into main Aug 1, 2024
40 checks passed
@pedro-avalos pedro-avalos deleted the add-additional-gpu-tests branch August 1, 2024 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-for-changes The review has been completed but the PR is waiting for changes from the author
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants