Extend SimpleMemoryCheck service to report jemalloc and smaps information, and on early termination signal #46859

Merged
merged 4 commits into from
Dec 16, 2024

@makortel
Contributor

makortel commented Dec 3, 2024

PR description:

This PR extends the SimpleMemoryCheck reporting with

  • A printout on early termination (SIGUSR2 signal)
    • In case WM kills the job for using too much memory, we'd get the numbers closest in time to that moment
  • VSIZE and RSS at endJob (mostly to compare to the other numbers added to the endJob printout)
  • Overall allocator statistics from jemalloc (only if jemalloc is being used)
    • These are collected only for the "significant events" (i.e. when VSIZE or RSS is among the largest 3 VSIZE or RSS measurements), and shown at the endJob summary (when enabled)
    • These are also shown in the early termination message, and at endJob
  • Read AnonHugePages, and the RSS and VSIZE of mmapped files, from /proc/<pid>/smaps (if the file can be opened)
    • The RSS/VSIZE of mmapped files should give a measure of how much our code size contributes to the process's RSS/VSIZE
      • I decided to track the contribution of .pcm files (generated by ROOT) separately, because I noticed the .pcm contribution was quite large (hundreds of MB)

I hope these numbers will help to figure out whether the application itself is allocating a lot of memory, whether the allocator is using a lot more memory than the application asks for, or whether the operating system ends up using a lot of memory, e.g. because of fragmentation of transparent huge pages (see #42387).
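
One possible way to read the numbers along those three lines is sketched below in Python. The split is an assumption of this sketch, not something the service computes: jemalloc's `allocated` is taken as what the application asked for, `resident - allocated` as what the allocator holds on top of that, and whatever is left of RSS after subtracting file-backed pages and jemalloc's `resident` as a remainder. The function and key names are hypothetical; the input values are copied from the endJob printout quoted later in this thread.

```python
def attribute_rss(rss, file_backed_rss, allocated, resident):
    """Split RSS into rough buckets; all inputs and outputs are in MB."""
    return {
        # Bytes the application requested from the allocator.
        "application": allocated,
        # Memory the allocator keeps resident beyond those requests.
        "allocator_overhead": resident - allocated,
        # Shared libraries, .pcm files and other mmapped files.
        "file_backed": file_backed_rss,
        # What neither the allocator nor mmapped files explain.
        "remainder": rss - file_backed_rss - resident,
    }


# Values copied from the endJob printout quoted in this thread.
breakdown = attribute_rss(
    rss=5418.66, file_backed_rss=873.438, allocated=3769.92, resident=4645.34
)
for name, value in breakdown.items():
    print(f"{name:>20}: {value:9.2f} MB")
```

The remainder can come out negative. jemalloc documents `stats.resident` as a maximum rather than an exact count, so the buckets should be treated as approximate.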

This PR was motivated by #46040, especially by the steep rise of RSS.

Resolves cms-sw/framework-team#1082

PR validation:

Tested privately with an example job in #46040 that the jemalloc and smaps information gets added, including when the job is terminated early with a SIGUSR2 signal. Also tested that SimpleMemoryCheck works when run through cmsRunGlibC and cmsRunTC (with the jemalloc information missing, of course).
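
To illustrate the early-termination printout in isolation, here is a small Python stand-in. It does not show how the service or the framework handles SIGUSR2; it only shows the general pattern of taking a memory snapshot when that signal arrives. Reading `/proc/self/status` and the names `reports`, `current_memory_mb`, `on_early_termination` and `install_handler` are assumptions of this sketch.

```python
import signal

# Collected (label, memory) tuples; a hypothetical stand-in for the job log.
reports = []


def current_memory_mb(status_path="/proc/self/status"):
    """Return {"vsize", "rss"} in MB, or None if the numbers are unavailable."""
    try:
        with open(status_path) as status:
            fields = dict(line.split(":", 1) for line in status if ":" in line)
        # VmSize and VmRSS are reported in kB.
        return {
            "vsize": int(fields["VmSize"].split()[0]) / 1024.0,
            "rss": int(fields["VmRSS"].split()[0]) / 1024.0,
        }
    except (OSError, KeyError, ValueError):
        return None


def on_early_termination(signum, frame):
    """Record the memory numbers closest in time to the termination request."""
    reports.append(("early termination", current_memory_mb()))


def install_handler():
    """Register the handler for SIGUSR2 (must be called from the main thread)."""
    signal.signal(signal.SIGUSR2, on_early_termination)
```

After `install_handler()` has been called, sending the signal from a shell with `kill -USR2 <pid>` appends one snapshot to `reports`.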

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2024

cms-bot internal usage

@makortel
Contributor Author

makortel commented Dec 3, 2024

@cmsbuild, please test

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2024

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • FWCore/Services (core)

@Dr15Jones, @makortel, @smuzaffar can you please review it and eventually sign? Thanks.
@fwyzard, @missirol, @wddgit this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-adcc31/43234/summary.html
COMMIT: 6f4436e
CMSSW: CMSSW_15_0_X_2024-12-03-2300/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46859/43234/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3484682
  • DQMHistoTests: Total failures: 375
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3484287
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 45 files compared)
  • Checked 202 log files, 172 edm output root files, 46 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor Author

makortel commented Dec 4, 2024

test parameters:

  • relval_options = --customise Validation/Performance/TimeMemoryInfo.customise

@makortel
Contributor Author

makortel commented Dec 4, 2024

@cmsbuild, please test

@makortel
Contributor Author

makortel commented Dec 4, 2024

@Dr15Jones please review

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

+1

Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-adcc31/43257/summary.html
COMMIT: 6f4436e
CMSSW: CMSSW_15_0_X_2024-12-04-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46859/43257/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
2024.202001 step 1
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB has errors in the relvals. The error does NOT come from this pull request

Summary:

@makortel
Contributor Author

makortel commented Dec 4, 2024

test parameters:

  • relval_options = --command "--customise Validation/Performance/TimeMemoryInfo.customise"

@makortel
Contributor Author

makortel commented Dec 4, 2024

@cmsbuild, please test

@cmsbuild
Contributor

Pull request #46859 was updated. @Dr15Jones, @cmsbuild, @makortel, @smuzaffar can you please check and sign again.

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-adcc31/43420/summary.html
COMMIT: 826f5e5
CMSSW: CMSSW_15_0_X_2024-12-12-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46859/43420/install.sh to create a dev area with all the needed externals and cmssw changes.

  • DAS Queries: The DAS query tests failed, see the summary page for details.

Comparison Summary

Summary:

  • You potentially added 6387874 lines to the logs
  • Reco comparison results: 148 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3510017
  • DQMHistoTests: Total failures: 430
  • DQMHistoTests: Total nulls: 9
  • DQMHistoTests: Total successes: 3509558
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 9.379000000000003 KiB( 45 files compared)
  • DQMHistoSizes: changed ( 1000.0 ): 0.312 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 10224.0,... ): 0.289 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 11634.0,... ): 0.285 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 136.731 ): 0.199 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 136.793 ): 0.207 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 136.874 ): 0.242 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 139.001 ): 0.164 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 140.56 ): 0.047 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 16834.0 ): 0.359 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 2022.000001,... ): 0.113 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 24834.911 ): ...
  • Checked 202 log files, 172 edm output root files, 46 DQM output files
  • TriggerResults: found differences in 2 / 44 workflows

@makortel
Contributor Author

Here is the new endJob output from 13034.0 (2024 TTBar+PU) step 3 (RECO)

MemoryReport> EndJob: virtual size 8671.84 Mbytes, RSS 5418.66 Mbytes, PSS 4781.01 MBytes, Private 4597.03
 AnonHugePages 2974 Mbytes
 mmapped memory pages 7228.6 Mbytes (VSize), 4543 MBytes (RSS)
 mmapped file pages 1439.38 Mbytes (VSize), 873.438 MBytes (RSS)
  of which .so's 1113.87 Mbytes (VSize), 590.016 MBytes (RSS)
  of which PCM's 324.945 Mbytes (VSize), 282.859 MBytes (RSS)
  of which other 0.5625 Mbytes (VSize), 0.5625 MBytes (RSS)
 Jemalloc allocated 3769.92 MBytes, active 3872.93 MBytes
  resident 4645.34 Mbytes, mapped 4674.61 Mbytes
  metadata 51.0277 Mbytes
MemoryReport> Peak virtual size 8671.59 Mbytes (RSS 6447.84)
 Jemalloc allocated 5510.01 active 5622.18 resident 5694.11 mapped 5723.38 metadata 51.0276
...
MemoryReport> Peak rss size 6447.84 Mbytes (VSIZE 8671.59)
 Jemalloc allocated 5510.01 active 5622.18 resident 5694.11 mapped 5723.38 metadata 51.0276
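
The peak entries above carry the jemalloc numbers recorded at the same measurement. According to the PR description, those statistics are collected only for "significant events", i.e. when VSIZE or RSS is among the largest 3 measurements. A minimal Python sketch of that bookkeeping follows; the class name, the method names and the `allocator_stats_fn` callback are hypothetical, and the actual service is written in C++.

```python
class SignificantEvents:
    """Keep the largest few VSIZE and RSS measurements seen so far."""

    def __init__(self, keep=3):
        self.keep = keep
        self.by_vsize = []  # sorted, largest VSIZE first
        self.by_rss = []    # sorted, largest RSS first

    def _update(self, ranking, key, event):
        """Insert event into ranking; return True if it stays in the top list."""
        ranking.append(event)
        ranking.sort(key=lambda e: e[key], reverse=True)
        del ranking[self.keep:]
        return any(e is event for e in ranking)

    def record(self, vsize, rss, allocator_stats_fn):
        """Record one measurement; query the allocator only if it is significant."""
        event = {"vsize": vsize, "rss": rss, "jemalloc": None}
        in_vsize = self._update(self.by_vsize, "vsize", event)
        in_rss = self._update(self.by_rss, "rss", event)
        significant = in_vsize or in_rss
        if significant:
            event["jemalloc"] = allocator_stats_fn()
        return significant
```

Passing the allocator query in as a callback keeps the potentially costly statistics lookup off the common path, where a measurement is not among the largest ones.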

@makortel
Contributor Author

test parameters:

@makortel
Contributor Author

@cmsbuild, please test

Final round without the customization

@cmsbuild
Contributor

+1

Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-adcc31/43441/summary.html
COMMIT: 826f5e5
CMSSW: CMSSW_15_0_X_2024-12-13-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46859/43441/install.sh to create a dev area with all the needed externals and cmssw changes.

  • DAS Queries: The DAS query tests failed, see the summary page for details.

Comparison Summary

Summary:

@makortel
Contributor Author

Comparison differences are related to #46416

@makortel
Contributor Author

+core

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @mandrenguyen, @rappoccio, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit 0ce2619 into cms-sw:master Dec 16, 2024
12 checks passed
@makortel makortel deleted the simpleMemoryCheck branch December 16, 2024 14:27