Feature epic metrics #2536

Open
wants to merge 11 commits into
base: develop
from

Conversation

BruceKropp-Raytheon
Collaborator

@BruceKropp-Raytheon BruceKropp-Raytheon commented Dec 12, 2024

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

Adding CI/CD scripts to support collection of build and test stage metrics during both Nightly builds and PR builds.
Includes a new Jenkinsfile that can be tried as a replacement for ./tests/ci/Jenkinsfile.combined.

Commit Message:

CI/CD Automation tools to support UFS WM Infrastructure Metrics Dashboard

* UFSWM - 
  * use RT labels to trigger Jenkins builds 

Priority:

  • Normal

Git Tracking

UFSWM:

  • None

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes.

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@BruceKropp-Raytheon
Collaborator Author

I have tested this on Hera, Hercules, and Orion.
Others TBD: Jet, Gaea, Derecho.
Stretch goal: PW hosts.

@BruceKropp-Raytheon BruceKropp-Raytheon added the hera-RT Run Hera regression testing label Dec 12, 2024
 on-behalf-of @ufs-community <ecc.platform@noaa.gov>
@epic-cicd-jenkins epic-cicd-jenkins removed the hera-RT Run Hera regression testing label Dec 12, 2024
@BruceKropp-Raytheon BruceKropp-Raytheon added the hercules-RT Run Hera regression testing label Dec 13, 2024
 on-behalf-of @ufs-community <ecc.platform@noaa.gov>
@epic-cicd-jenkins epic-cicd-jenkins removed the hercules-RT Run Hera regression testing label Dec 13, 2024
@BruceKropp-Raytheon
Collaborator Author

BruceKropp-Raytheon commented Dec 13, 2024

This is an old issue for Orion, unrelated to this PR:

Running regression tests on Orion
+ cd ..
+ module load git/2.28.0
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_sh_dbg=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for Lmod's output
Lmod has detected the following error: The following module(s) are unknown:
"git/2.28.0"

Lines 73 and 222 of ./tests/ci/Jenkinsfile.combined can probably be removed, so that the latest git on the system is used. Both lines read:
module load git/2.28.0
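
One alternative to deleting the pinned load outright is to attempt it and fall back to whatever git is already on the PATH. The sketch below is only an illustration: `load_git` and `GIT_SOURCE` are hypothetical names that do not appear in the PR, and the fallback relies on the assumption that Lmod's `module load` returns a non-zero status for an unknown module.

```shell
#!/bin/bash
# Hypothetical helper, not part of the PR: try a pinned git module and
# otherwise keep the git that is already on PATH.
# Sets GIT_SOURCE to "git/<version>", "system", or "none".
load_git() {
  local pinned="${1:-}"
  GIT_SOURCE="none"
  # `module` is normally a shell function supplied by Lmod; it may be absent.
  # Assumption: `module load` fails (non-zero status) for an unknown module.
  if [ -n "${pinned}" ] && type module >/dev/null 2>&1 \
      && module load "git/${pinned}" >/dev/null 2>&1; then
    GIT_SOURCE="git/${pinned}"
  elif command -v git >/dev/null 2>&1; then
    GIT_SOURCE="system"
  else
    return 1
  fi
}
```

Call it directly, for example `load_git 2.28.0`, rather than inside `$(...)`, because a module loaded in a subshell would not affect the calling shell.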

@BruceKropp-Raytheon BruceKropp-Raytheon added the derecho-RT Run regression tests on Derecho label Dec 13, 2024
 on-behalf-of @ufs-community <ecc.platform@noaa.gov>
@epic-cicd-jenkins epic-cicd-jenkins removed the derecho-RT Run regression tests on Derecho label Dec 13, 2024
Signed-off-by: Bruce Kropp <bruce.kropp@raytheon.com>
…/ufs-weather-model into feature/epic_metrics

sync with PR build results
@BruceKropp-Raytheon BruceKropp-Raytheon added the hera-RT Run Hera regression testing label Dec 13, 2024
 on-behalf-of @ufs-community <ecc.platform@noaa.gov>
@epic-cicd-jenkins epic-cicd-jenkins removed the hera-RT Run Hera regression testing label Dec 13, 2024
kbooker79
kbooker79 previously approved these changes Dec 18, 2024
Signed-off-by: Bruce Kropp <bruce.kropp@raytheon.com>
@DusanJovic-NOAA
Collaborator

This PR adds almost 2500 lines of non-trivial scripts. I find it impossible to review by just looking at the code (diff). So, I'm approving this PR on the assumption that it works for EPIC and that it does whatever it is supposed to do. The PR description does not explain what the expected results are, or whether and how the scripts can be used on supported platforms without the CI/CD Jenkins scripts.

@DeniseWorthen
Collaborator

@BruceKropp-Raytheon Please explain exactly what this PR is doing, how, and why. I don't really see how static PNGs (on the dashboard) address the original issue at all, which was a) tracking performance over time and b) scraping existing information from commit logs. Does this PR scrape information, or produce its own (i.e., somehow run its own RT)?

@DeniseWorthen
Collaborator

So it looks like this PR is a combination of features: one that updates the Jenkins CI and one that supposedly addresses the metric tracking issue. If it is a combination, it should be split into the relevant parts for basic good CM practice.

For the metrics, Dusan was able to spend about an hour implementing something which actually addresses #2527.

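$ cat rt_logs_grep_time.sh
#!/bin/bash
# Print the logged time of one regression test for every commit on BRANCH.
set -eu

TEST=$1          # test name as it appears in the log, e.g. cpld_control_p8_intel
BRANCH=develop

for COMMIT in $(git log --since="2024-06-01" --format="%H" "${BRANCH}"); do
  # Take field 6 of the matching log line and drop everything from the first ']' on.
  TIME=$(git show "${COMMIT}:tests/logs/RegressionTests_hera.log" | grep " '${TEST}' " | awk '{print $6}' | awk -F']' '{print $1}')
  echo "${COMMIT} ${TIME}"
done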

This produces the following:

$ ./rt_logs_grep_time.sh cpld_control_p8_intel
5324d642e2257ee659ab75c0fc404d3127b2d9f7 15:46
76471dc6b7bfc3342416d1a3402f360724f7c0fa 15:45
241dd8e3b9feae29f1925806bdb05816ae49f427 15:40
295008915d1ad09fb5d4e24624d0c19627273af4 15:58
e1193704767800bfaece56eb2a4b058bd4d0afbc 15:24
6ec6b458b6dc09af48658146d3908502b18272cf 15:28
409bc85b64b2ced642b0024cef2cd9c78ce46fd9 15:26
63ace62a36a263f03b914a92fc5536509e862dbc 13:58
a3c3bb587cdb6905a3d3635a4ef502547ff60598 13:58
144ccb03e6c82edae73cf12a496ebf060fed65f7 14:10
33b3c18774a994b3b05da4489b08f34115adbf48 14:02
c0367fdf0885493af6a5446b38eb77405a6230e1 13:59
6b0f516557811eac82c17b852efb82a35892b022 14:35
29c2703c715ebdb47bbd4bcc811db340eae530e5 13:00
058f07361b7f53a76e4cbb057aaebbbefffd34e5 13:00
f9c91d3df80a8536cf2a226fac5d826889e55c17 13:02
547be6d379f5b213b47eb3eacc9c5211fb95b6ab 13:29
be4544ee28f8fad7bc2cdb207dc62f89c4aa2bb2 07:30
db1781a05dce1125cfe17f8324650674640f0a9e 07:59
f3ce1698b00bc1039f73f662e9e107f9c424201f 07:51
e3750c28119deec4b133cacca81d49ba62b2670c 07:28
bad50ef5023860c992b75cb72722cba9bb428ceb 07:38
2ccc549348da37aac51ab44482174dff2bb2912d 07:40
38a29a62461cb1f9bf530420d5bc2f73a4650724 07:48
25ee7f6ca087ee19991e684a3c83e451921d5770 07:35
706219146401bec7a29e7384eb1a642392ca47fe 07:44
6a4e09e94773ffa39ce7ab6a54a885efada91f21 06:15
9ae4f54282e00df8c8ec68c883905f49b8d5d826 06:02
1c4fcf1ca75fa24326bd2af857dafa2f51347506 05:57
94a3cd7f6afa1091bad6b8f57cdc5b7712849dfb 06:07
fcc9f8461db5eafbfd1f080da61ea79156ca0145 05:56

This generates actually useful information, although no pretty pictures. Commit hash 547be6d clearly almost doubled the run time (from 07:30 to 13:29). If we had seen that at the time, we would have held the PR to determine the cause.
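
Spotting such a jump does not have to be done by eye. As a rough sketch, the listing can be piped through a small filter that compares each commit with the next-older one. `flag_slowdowns` and its percentage threshold are hypothetical and not part of Dusan's script or of this PR; the filter assumes input lines of the form "<commit> <MM:SS>" in git log order (newest first) and silently skips commits with no logged time.

```shell
#!/bin/bash
# Hypothetical helper (not in the PR): read "<commit> <MM:SS>" lines on stdin,
# newest commit first, and print every commit whose time exceeds the
# next-older commit's time by more than the given percentage (default 50).
flag_slowdowns() {
  local threshold="${1:-50}"
  awk -v pct="${threshold}" '
    # Convert MM:SS to seconds.
    function secs(t,   p) { split(t, p, ":"); return p[1] * 60 + p[2] }
    # Keep only well-formed lines; commits without a time are skipped.
    NF == 2 && $2 ~ /^[0-9]+:[0-9]+$/ {
      n++; hash[n] = $1; raw[n] = $2; time[n] = secs($2)
    }
    END {
      # Entry i is newer than entry i+1, so compare each entry with its successor.
      for (i = 1; i < n; i++) {
        if (time[i + 1] > 0 && time[i] > time[i + 1] * (1 + pct / 100)) {
          printf "%s %s -> %s\n", hash[i], raw[i + 1], raw[i]
        }
      }
    }
  '
}
```

Fed the listing above with a threshold of 50 (./rt_logs_grep_time.sh cpld_control_p8_intel | flag_slowdowns 50), this prints a single line: 547be6d379f5b213b47eb3eacc9c5211fb95b6ab 07:30 -> 13:29. The threshold is a judgment call; smaller values would also surface the more gradual increases later in the history.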

@BruceKropp-Raytheon
Collaborator Author

@DeniseWorthen
Thank you for the clarification. I suspect the metrics requested from #2527 differ from the metrics EPIC is trying to report, so maybe this PR doesn't add value for #2527.
Put simply, this PR provides a new Jenkinsfile that collects metrics from Regression Tests (RT) and saves them as JSON-formatted data, so that they can be post-processed and presented in a chart on the EPIC dashboard.

Relevant JSON will look something like:

{
  "name": "Test",
  "type": "CI",
  "run": {
    "dateTime": "1734512640548",
    "builtOn": "all",
    "platform": "hercules",
    "compiler": "intel",
    "branch": "develop",
    "steps": [
      {
        "id": "512",
        "name": "scripts/wm_test.sh",
        "type": "STEP",
        "startTime": "2024-12-18T09:14:41.524+0000",
        "result": "SUCCESS",
        "durationInMillis": 1107792,
        "displayName": "Shell Script",
        "displayDescription": null,
        "tests": [
          {
            "ExperimentName": "control_p8 intel",
            "Status": "COMPLETE",
            "tasks": [
              {
                "type": "COMPILE",
                "task": "atm_dyn32_intel",
                "WallTime": "11:11",
                "Duration": "09:17",
                "mbytes": 0,
                "Status": "PASS",
                "Reason": ""
              },
              {
                "type": "TEST",
                "task": "control_p8_intel",
                "WallTime": "05:53",
                "Duration": "03:16",
                "mbytes": 1889,
                "Status": "PASS",
                "Reason": ""
              }
            ]
          }
        ]
      }
    ],
    "result": "SUCCESS"
  }
}

We wouldn't expect to produce any images from this here, as those would be derived from the JSON as part of EPIC web dashboard effort.

@DeniseWorthen
Collaborator

@BruceKropp-Raytheon Thanks. So as far as I understand, this PR will not close or address issue #2527. Could you please remove the reference to that issue, so that it doesn't accidentally get closed? Thanks.

@BruceKropp-Raytheon
Collaborator Author

Okay, the reference has been removed from the description.

5 participants