Feature request to keep track of memory, wall time and CPU associated with output files. #110

Open
knoepfel opened this issue Oct 28, 2021 · 4 comments
Labels
idea Just thinking it might be nice to have

Comments

@knoepfel
Contributor

This issue has been migrated from https://cdcvs.fnal.gov/redmine/issues/26068 (FNAL account required)
Originally created by @hschellman on 2021-07-23 22:58:22


Is it possible to get the memory, wall time, and CPU utilization for a job written to the SAM (or successor) metadata for an output file? It sounds simple at first: just dump the numbers at the end of the job. But if you are writing multiple files to multiple streams it gets complicated, because one would need to maintain a separate stats struct for each file, initialized at file open and written to the metadata at file close. Some of this obviously exists already, since art does produce metadata for files.
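The per-file bookkeeping described above could look roughly like the following. This is only a sketch of the idea in Python, not art code: `FileStats`, `OutputStatsTracker`, `on_file_open`, and `on_file_close` are hypothetical names, and the real hooks would be whatever file open/close callbacks the framework provides. Note that the peak memory reported by the operating system is for the whole process, so it cannot be attributed to a single file.

```python
import resource
import time
from dataclasses import dataclass, field


@dataclass
class FileStats:
    """Counters for one output file (hypothetical helper, not an art API)."""
    # Both clocks are sampled when the object is created, i.e. at file open.
    wall_start: float = field(default_factory=time.monotonic)
    cpu_start: float = field(default_factory=time.process_time)

    def snapshot(self) -> dict:
        # ru_maxrss is the peak resident set size of the whole process so far
        # (kilobytes on Linux, bytes on macOS); it is not specific to this file.
        peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return {
            "wall_seconds": time.monotonic() - self.wall_start,
            "cpu_seconds": time.process_time() - self.cpu_start,
            "peak_rss": peak_rss,
        }


class OutputStatsTracker:
    """Keeps one FileStats per currently open output file."""

    def __init__(self) -> None:
        self._open: dict[str, FileStats] = {}

    def on_file_open(self, filename: str) -> None:
        # Start a fresh set of counters for this file.
        self._open[filename] = FileStats()

    def on_file_close(self, filename: str) -> dict:
        # Stop tracking the file and return the values to store in its metadata.
        return self._open.pop(filename).snapshot()


if __name__ == "__main__":
    tracker = OutputStatsTracker()
    tracker.on_file_open("stream_a.root")
    tracker.on_file_open("stream_b.root")
    print(tracker.on_file_close("stream_a.root"))
    print(tracker.on_file_close("stream_b.root"))
```

Because each file gets its own start values, files on different streams can be opened and closed in any order without affecting one another.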

(I wrote the D0 SAM output interface back in the days of the ancients, so I know you can do this if you can find the file open/close hooks.) I may have used FORTRAN 2 for all I know.

DUNE is hoping to really instrument our jobs and this would be a great help.

@knoepfel knoepfel added the idea Just thinking it might be nice to have label Oct 28, 2021
@knoepfel knoepfel self-assigned this Oct 28, 2021
@knoepfel
Contributor Author

Comment by @knoepfel on 2021-07-27 21:23:58


Heidi, we should probably have a meeting to discuss this idea. Some of the metrics are already captured by art, but it's not clear to us what exactly you're after. I'll set up a meeting.

@knoepfel
Contributor Author

Comment by @knoepfel on 2021-08-17 15:51:52


Tom and I met this morning to discuss what this proposal is asking for. After some discussion, it seemed that the request is for just enough information, persisted to the on-disk SAM metadata, to identify a workflow/job that is problematic with respect to timing and memory usage. After identifying a problematic job using the SAM metadata, a user can run the job interactively to debug or profile it further. At this point, only the overall wall-clock time and the maximum memory usage would need to be persisted to the metadata.

Does that sound sensible?

@knoepfel
Contributor Author

Comment by @tomjunk on 2021-08-17 16:28:35


Yes, sounds good, though the original request was for three numbers: memory, wall time, and CPU time. This doesn't capture all bottlenecks (for example, some jobs spend a lot of wall time waiting for files before art even starts), but it is a big help, and we cannot ask art to solve that problem. It may be possible to get the art wall time by subtracting the start_time and end_time values in sam_metadata_dumper's output, but a separate, pre-subtracted field may be even more convenient. Thanks!
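For illustration, the subtraction could be done along these lines once the two values have been extracted from the dumper output. This is a sketch that assumes the timestamps are ISO 8601 strings and are available in a dictionary under the keys `start_time` and `end_time`; the actual output format should be checked first. `wall_seconds` is a hypothetical helper name, and the timestamps in the example are made up.

```python
from datetime import datetime


def wall_seconds(metadata: dict) -> float:
    """Return end_time minus start_time in seconds.

    Assumes both values are ISO 8601 strings (an assumption about the format).
    """
    start = datetime.fromisoformat(metadata["start_time"])
    end = datetime.fromisoformat(metadata["end_time"])
    return (end - start).total_seconds()


if __name__ == "__main__":
    # Made-up values for illustration only.
    example = {
        "start_time": "2021-01-01T00:00:00",
        "end_time": "2021-01-01T01:30:00",
    }
    print(wall_seconds(example))  # 5400.0
```

This only measures the span between the two recorded timestamps, so any time spent waiting for input files before art starts is not included.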

@hschellman

hschellman commented Oct 28, 2021 via email

@knoepfel knoepfel added this to Issues Oct 29, 2021
2 participants