
Split functional testings via github action matrix #14078

Merged
merged 1 commit into from
Mar 15, 2023

Conversation

mcmilk
Contributor

@mcmilk mcmilk commented Oct 23, 2022

Motivation and Context

The functional tests are timing out more and more often. The maximum execution time is 360 minutes, and this limit seems to be reached. :(

Description

This commit changes the GitHub Actions workflow.

We split the workflow into different parts:

  1. build zfs modules for Ubuntu 20.04 and 22.04 (~25m)
  2. 2x zloop test (~10m) + 2x sanity test (~25m)
  3. functional tests in parts 1..5 (each ~1h)
    • these could be triggered once the sanity tests pass
    • currently they are all started at the same time
  4. cleanup and create summary

When everything is fine, the full run with all tests
should be done in around 2 hours.

The codeql.yml and checkstyle.yml workflows are not part of this cycle.

The tests are also modified a bit:

  • report info about CPU and checksum benchmarks
  • reset the debugging logs for each test
    • when an error occurs, we call dmesg with -c to get
      only the log output for the last failed test
    • we also empty the dbgmsg log
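The split described above maps naturally onto a GitHub Actions matrix. A minimal sketch follows; the job names, script paths, and exact part count are illustrative only (generate-summary.sh is the script added in this PR, the other script names are hypothetical):

```yaml
# Sketch of the split, assuming hypothetical setup/test scripts.
jobs:
  build:
    strategy:
      matrix:
        os: [20.04, 22.04]
    runs-on: ubuntu-${{ matrix.os }}
    steps:
      - uses: actions/checkout@v3
      - run: .github/workflows/scripts/build-zfs.sh   # hypothetical name

  functional:
    needs: build
    strategy:
      fail-fast: false
      matrix:
        os: [20.04, 22.04]
        part: [1, 2, 3, 4, 5]
    runs-on: ubuntu-${{ matrix.os }}
    steps:
      - run: .github/workflows/scripts/run-part.sh ${{ matrix.part }}   # hypothetical name

  summary:
    if: always()
    needs: functional
    runs-on: ubuntu-22.04
    steps:
      - run: .github/workflows/scripts/generate-summary.sh
```

With `fail-fast: false`, one failing part does not cancel the others, so a full run still produces logs for every group.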

The tests are structured this way:

```mermaid
flowchart TB
subgraph CleanUp and Summary
  Part1-20.04-->CleanUp+nice+Summary
  Part2-20.04-->CleanUp+nice+Summary
  PartN-20.04-->CleanUp+nice+Summary
  Part1-22.04-->CleanUp+nice+Summary
  Part2-22.04-->CleanUp+nice+Summary
  PartN-22.04-->CleanUp+nice+Summary
end

subgraph Testings Ubuntu 20.04
  sanity-checks-20.04
  functional-testing-20.04-->Part1-20.04
  functional-testing-20.04-->Part2-20.04
  functional-testing-20.04-->PartN-20.04
  zloop-checks-20.04
end

subgraph Testings Ubuntu 22.04
  sanity-checks-22.04
  functional-testing-22.04-->Part1-22.04
  functional-testing-22.04-->Part2-22.04
  functional-testing-22.04-->PartN-22.04
  zloop-checks-22.04
end

subgraph Code Checking + Building
  codeql.yml
  Build-Ubuntu-20.04-->sanity-checks-20.04
  Build-Ubuntu-22.04-->sanity-checks-22.04
  Build-Ubuntu-20.04-->zloop-checks-20.04
  Build-Ubuntu-22.04-->zloop-checks-22.04
  checkstyle.yml
end
```

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@mcmilk mcmilk force-pushed the split-actions branch 5 times, most recently from 44580e1 to 00a75d1 Compare October 24, 2022 06:33
@gmelikov
Member

1h 49m looks like a great result for full test suite run even as a start!

@mcmilk
Contributor Author

mcmilk commented Oct 24, 2022

1h 49m looks like a great result for full test suite run even as a start!

Yes ;-)

I have removed the Ubuntu 22.04 sanity and functional tests, so this PR could maybe be merged a bit faster.
The commit message should also be fine now, and some ZFS module cleanup was added as a last step.

@gmelikov
Member

Overall thoughts:

  • the speed gain is really a good reason to use it, and we won't run into timeouts in the foreseeable future
  • one workflow file makes the manifest harder to read, but we can improve it in the future by moving common parts into scripts
  • many additional checks in a PR's status worsen its readability, but it looks like there is no other way to parallelize right now https://github.com/orgs/community/discussions/26246

@andrewc12
Contributor

I like this idea of splitting it up in a matrix. I tried it myself, just didn't end up writing a script. :)

@mcmilk
Contributor Author

mcmilk commented Oct 25, 2022

I like this idea of splitting it up in a matrix. I tried it myself, just didn't end up writing a script. :)

We should use this for macOS and Windows as well... if you have problems with this YAML stuff, I could help a bit.

@andrewc12
Contributor

We should use this for macOS and Windows as well... if you have problems with this YAML stuff, I could help a bit.

Heh, I'm usually fine with that stuff, but honestly if you think you could help with anything I'd be glad to take it.

@behlendorf
Contributor

we can exclude sanity tests group from full runners after all

I agree, since the sanity tests are just a subset of the full test run there's not much value in running them, unless they're used to gate a full test run.

If we parallelize the tests like this we could also split them up in some meaningful way. Perhaps "cli tests" (cli_root, cli_user), "feature tests" (initialize, trim, compression, fallocate, quota, send/recv, snapshots, checkpoints, etc), and "resilience tests" (raidz, redundancy, removal, replacement, slog, online_offline, scrub_mirror, etc). Or some other relatively equal split. We'd just need to settle on the group names and add them to the tags sections of the run file.

My inclination would be to try and split it only 3 or 4 ways to cut down a bit on the number of status reports and try and keep it as readable as possible.
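A grouping like the one suggested could be expressed through the tags in the run file; the group names and section membership below are purely illustrative, and the syntax is assumed from the usual runfile format, not taken from the final split:

```ini
# Sketch (assumed syntax): give each section an extra group tag,
# then run one group per matrix job, e.g. zfs-tests.sh -T cli.
[tests/functional/cli_root/zfs]
tests = ['zfs_001_neg', 'zfs_002_pos']
tags = ['functional', 'cli']

[tests/functional/raidz]
tests = ['raidz_001_neg', 'raidz_002_pos']
tags = ['functional', 'resilience']
```

Keeping the groups to three or four coarse tags, as suggested, keeps the PR status list short while still parallelizing the long-running sections.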

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 27, 2022
@mcmilk
Contributor Author

mcmilk commented Oct 27, 2022

we can exclude sanity tests group from full runners after all

I agree, since the sanity tests are just a subset of the full test run there's not much value in running them, unless they're used to gate a full test run.

Then I will remove it.

If we parallelize the tests like this we could also split them up in some meaningful way. Perhaps "cli tests" (cli_root, cli_user), "feature tests" (initialize, trim, compression, fallocate, quota, send/recv, snapshots, checkpoints, etc), and "resilience tests" (raidz, redundancy, removal, replacement, slog, online_offline, scrub_mirror, etc). Or some other relatively equal split. We'd just need to settle on the group names and add them to the tags sections of the run file.

Okay, I will try that ;)

My inclination would be to try and split it only 3 or 4 ways to cut down a bit on the number of status reports and try and keep it as readable as possible.

We could also collect the summaries of each tag group and create one big summary in the last step.

@behlendorf
Contributor

We could also collect the summaries of each tag group and create one big summary in the last step.

That would be pretty cool. Personally, what I've always found to be the most useful starting point when a test run fails is the buildbot summary page. It attempts to grab the meaningful log sections for FAILED and KILLED tests so they're easier to pull up and spot check. That's handled by just a little bit of shell here.

https://github.com/openzfs/zfs-buildbot/blob/master/scripts/bb-test-zfstests.sh#L33-L38
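In the same spirit as that buildbot snippet, the extraction can be sketched in a few lines of shell. The log format below is an assumption modeled on typical test-runner output, not the exact zfstests format:

```shell
# Create a small sample results log (format assumed for illustration).
cat > /tmp/results.txt <<'EOF'
Test: functional/slog/slog_014_pos (run as root) [00:21] [PASS]
Test: functional/slog/slog_015_neg (run as root) [06:00] [KILLED]
Test: functional/trim/trim_config (run as root) [00:10] [FAIL]
EOF

# Print only the tests that failed or were killed, as the summary does.
awk '/\[(FAIL|KILLED)\]/ {print $2}' /tmp/results.txt
```

Run against a real results file, the same filter makes failed and killed tests easy to pull up and spot check.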

@mcmilk
Contributor Author

mcmilk commented Oct 30, 2022

If we parallelize the tests like this we could also split them up in some meaningful way. Perhaps "cli tests" (cli_root, cli_user), "feature tests" (initialize, trim, compression, fallocate, quota, send/recv, snapshots, checkpoints, etc), and "resilience tests" (raidz, redundancy, removal, replacement, slog, online_offline, scrub_mirror, etc). Or some other relatively equal split. We'd just need to settle on the group names and add them to the tags sections of the run file.

I created a helper script for modifying the common.run file: https://gist.github.com/mcmilk/acc54a671ad2164ce98a0996a2ad6562

@behlendorf - maybe you want some changes to this initial grouping?

@gmelikov
Member

If we go with hardcoded tags, then we may want to at least find non-described tags and run them in one of the parts.

@mcmilk mcmilk force-pushed the split-actions branch 6 times, most recently from 2850c9b to 51775d7 Compare March 3, 2023 20:20
@mcmilk mcmilk force-pushed the split-actions branch 3 times, most recently from 055230c to 4e88c94 Compare March 9, 2023 17:00
Review threads:
  • .github/workflows/README.md (resolved)
  • .github/workflows/scripts/generate-summary.sh (outdated, resolved)
  • .github/workflows/build-dependencies.txt (outdated, resolved)
@mcmilk mcmilk force-pushed the split-actions branch 7 times, most recently from 476fe47 to 54b7072 Compare March 13, 2023 10:55
@mcmilk
Contributor Author

mcmilk commented Mar 13, 2023

Okay, the main thing is done now.

Sometimes the functional tests (part 3) will fail. This seems unrelated to this splitting of the functional tests.
Also, the old workflows sometimes seem to hang on the slog tests. The test slog_014_pos (run as root) [00:21] seems fine, and the next one is timing out ....

I split the "Cleanup and summary" into 5 calls, because we have a limit of 1 MiB per step for what can be put into the summary. The first call does the summary things, and the next 4 calls print debugging details (dmesg, dbgmsg) of the failed tests.
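The 1 MiB-per-step limit can be respected by capping each chunk before it is appended to the step summary; a minimal sketch, with illustrative file paths:

```shell
# GitHub truncates step summary content at about 1 MiB per step,
# so trim each debug log before appending it (paths are illustrative).
LIMIT=$((1024 * 1024))

# Fake a 2 MiB debug log for demonstration.
dd if=/dev/zero bs=1024 count=2048 2>/dev/null | tr '\0' 'x' > /tmp/dbg.log

# Keep only the first 1 MiB.
head -c "$LIMIT" /tmp/dbg.log > /tmp/dbg.trimmed
wc -c < /tmp/dbg.trimmed
```

Splitting the debug output across several such calls, one per step, keeps every chunk under the limit while preserving as much of each failed test's log as possible.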

This commit changes the workflow of the github actions.

We split the workflow into different parts:

1) build zfs modules for Ubuntu 20.04 and 22.04 (~25m)
2) 2x zloop test (~10m) + 2x sanity test (~25m)
3) functional testings in parts 1..5 (each ~1h)
   - these could be triggered, when sanity tests are ok
   - currently I just start them all in the same time
4) cleanup and create summary

When everything is fine, the full run with all testings
should be done in around 2 hours.

The codeql.yml and checkstyle.yml are not part in this circle.

The testings are also modified a bit:
- report info about CPU and checksum benchmarks
- reset the debugging logs for each test
  - when some error occurred, we call dmesg with -c to get
    only the log output for the last failed test
  - we empty also the dbgsys

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
@mcmilk
Contributor Author

mcmilk commented Mar 14, 2023

This is ready for review now.

We went down from >4h to 1h 45m ;-)
We also have a summary at the end, and some first debugging can be done with the debug files. These files are truncated to ~1MB ... but this should be okay for most cases.

Contributor

@behlendorf behlendorf left a comment

Having a summary page where we can easily see the logs from tests which failed is a real quality-of-life improvement! As are the faster test run times. I'm good with this; let's go ahead and merge this and refine as needed. Thanks.

@behlendorf behlendorf requested a review from gmelikov March 14, 2023 21:25
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 14, 2023
Member

@gmelikov gmelikov left a comment

LGTM, let's try and use it! Great work @mcmilk , thank you!

/var/tmp/zloop/*/vdev/
if-no-files-found: ignore

sanity:
Member

IIRC we discussed that we may not even need the sanity tests now, but this PR is already too good to be true (thanks @mcmilk ), so let it be.

Contributor Author

@mcmilk mcmilk Mar 15, 2023

Yes, I know, I should remove them. But it's a first indicator whether most things work or not.
When some test within sanity fails, the errors are also listed like in the functional tests.
But in good pull requests, this should never happen. So when it happens, it's a first "something is not okay here" hint.

with:
name: Summary Files
path: Summary/
- uses: geekyeggo/delete-artifact@v2
Member

We may leave the artifacts; it may be convenient for someone to download an already built debug/release build. But these are just loose thoughts, never mind for now.

Contributor Author

Yes, the generated modules will be deleted, but all logs are saved now.
Currently the logs are only saved when something failed ... this is changed so that we always log.

@behlendorf behlendorf merged commit b7bc334 into openzfs:master Mar 15, 2023
@behlendorf
Contributor

Merged. Let's see how it goes and refine as needed. Thanks!

@mcmilk
Contributor Author

mcmilk commented Mar 15, 2023

There is an error when the images rotate: https://github.com/actions/runner-images#available-images

So currently some runners have 20230313 and some have 20230305 .... I checked this via the file tests/ImageOS.txt ... and the module should compile if this applies. But of course, we didn't check out the current release with git ... I need to add this.

I will create a pull request for it ... sorry, but this issue could not be tested, because all runners had the same version ...

@gmelikov
Member

@mcmilk that was fast to run into this corner case! Glad we can just rerun the whole pipeline.

@mcmilk
Contributor Author

mcmilk commented Mar 15, 2023

The simple fix for this is to add the repo checkout to each functional testing job ... this could be replaced by one single line within the script ... but this needs to be tested a bit more first.
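The fix described, checking out the repo in each functional job instead of relying on the runner image, might look like this (job layout hypothetical; actions/checkout is the standard action):

```yaml
# Hypothetical sketch: pin every functional matrix job to the same
# commit that was built, so rotating runner images cannot diverge.
functional:
  strategy:
    matrix:
      part: [1, 2, 3, 4, 5]
  runs-on: ubuntu-22.04
  steps:
    - uses: actions/checkout@v3
      with:
        ref: ${{ github.sha }}   # the tested commit, not the image's state
```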

lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 17, 2023
This commit changes the workflow of the github actions.

We split the workflow into different parts:

1) build zfs modules for Ubuntu 20.04 and 22.04 (~25m)
2) 2x zloop test (~10m) + 2x sanity test (~25m)
3) functional testings in parts 1..5 (each ~1h)
   - these could be triggered, when sanity tests are ok
   - currently I just start them all in the same time
4) cleanup and create summary

When everything is fine, the full run with all testings
should be done in around 2 hours.

The codeql.yml and checkstyle.yml are not part in this circle.

The testings are also modified a bit:
- report info about CPU and checksum benchmarks
- reset the debugging logs for each test
  - when some error occurred, we call dmesg with -c to get
    only the log output for the last failed test
  - we empty also the dbgsys

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes openzfs#14078