CIME5: is short term archiving supposed to work? #1305

Closed
golaz opened this issue Mar 11, 2017 · 69 comments

@golaz
Contributor

golaz commented Mar 11, 2017

I had the impression that short term archiving was now supposed to work with CIME5, so I decided to turn it on in my latest coupled simulation. Turns out that was a bad idea.

The short term archiving ran when the next job segment had already started, which I understand should be perfectly safe. Here are some obvious issues I encountered.

  1. Short term archiving moved the log files of the currently running job. As a result, no more log file information, although the job appears to be continuing.
  2. MPAS files are still not handled by short term archiving.

So, as a result, I have a bit of a mess now. Log files that have missing information, some components files that have been moved to their short term archiving location, and other component files still in their original location.

@rljacob
Member

rljacob commented Mar 11, 2017

What was the compset and resolution? Try an ERR test with it on your platform.

@rljacob
Member

rljacob commented Mar 11, 2017

I assume this was edison with the A_WCYCL case?

@golaz
Contributor Author

golaz commented Mar 12, 2017

Yes, this was an A_WCYCL case.

@gold2718

@golaz,
Can you post the script you used (or some pointer to it) to make it easier to study/reproduce what happened to you?

@golaz
Contributor Author

golaz commented Mar 13, 2017

Here is my edison script.

run_acme.20170302.beta1.A_WCYCL1850S.ne30_oEC_ICG.edison

@ndkeen
Contributor

ndkeen commented Mar 13, 2017

We are not saying this is only an issue on edison, right?

Can I lower the number of days to reproduce the problem?

set stop_units       = ndays
set stop_num         = 5
set restart_units    = $stop_units
set restart_num      = $stop_num
set num_submits      = 1
set do_short_term_archiving      = true
set do_long_term_archiving       = false

@rljacob
Member

rljacob commented Mar 13, 2017

I would also like a shorter reproducer. But we would need to tell all the models, including MPAS, to output daily and I'm not sure how to do that.

@golaz
Contributor Author

golaz commented Mar 13, 2017

The best way to cheaply reproduce the problem would be to run the ultra low-res coupled model (ne4_oQU240). I read somewhere that it gets around 38 SYPD. You probably just need to run it for a few years to get enough output files.

@milenaveneziani
Contributor

@rljacob: is there a place where it is explained how short-term archiving for MPAS will be handled? I suppose the history files will go in the hist/ subdirectory, but I also wonder about the namelist and streams files. Will they go in a rest/ subdirectory or log/?
All of this will be important for post-processing and MPAS-/ACME-coupled analysis, of course. Thanks.

@rljacob
Member

rljacob commented Mar 14, 2017

There's some developing documentation here: http://esmci.github.io/cime/doc/build/html/users_guide/running-a-case.html#archiving-model-output-data
What exactly gets copied is controlled by settings in the config_archive.xml file:
https://github.com/ACME-Climate/ACME/blob/master/cime/cime_config/acme/config_archive.xml

For mpas-o, only files with "mpaso" and "hist" in them will be copied to the ocn/hist dir in the archive. But I think we can add an entry for streams.
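
As a rough illustration of the rule just described (not the actual CIME implementation), the Python sketch below selects files whose names contain both "mpaso" and "hist" and maps them to ocn/hist. The helper `select_mpaso_hist` and the file names are hypothetical; the real matching is controlled by config_archive.xml.

```python
from pathlib import PurePosixPath


def select_mpaso_hist(filenames):
    """Return (source, destination) pairs for files matching the rule.

    Assumption: a file qualifies when its name contains both "mpaso"
    and "hist"; the real matching is driven by config_archive.xml.
    """
    dest_dir = PurePosixPath("ocn") / "hist"
    pairs = []
    for name in filenames:
        if "mpaso" in name and "hist" in name:
            pairs.append((name, str(dest_dir / name)))
    return pairs


# Hypothetical run-directory listing, for illustration only.
run_dir_files = [
    "mpaso.hist.am.globalStats.0001-01-01.nc",
    "mpaso.rst.0002-01-01_00000.nc",
    "streams.ocean",
    "mpas-o_in",
]

for src, dst in select_mpaso_hist(run_dir_files):
    print(f"{src} -> {dst}")
```

Under this rule the streams and namelist files fall through untouched, which is why a separate entry would be needed for them.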

@milenaveneziani
Contributor

I see. Thanks, that's helpful.
The criterion for storing mpas history files makes sense to me.
As far as namelists (mpas-*_in) and streams go, I am not sure where it would be best to keep them. For example, I cannot remember what CESM did with namelists, whether it archived them in logs/ or not at all. I could check that easily.

@rljacob
Member

rljacob commented Mar 14, 2017

A low res case is a good test. I confirmed that master is not copying the mpas files. I'll try using next which has CIME5.2

@rljacob
Member

rljacob commented Mar 14, 2017

namelists are copied to CaseDocs in your case directory.

@rljacob
Member

rljacob commented Mar 16, 2017

I tried CIME5.2 and got the same result with archiving. I'll open a new issue for that.

@golaz for archiving interfering with a running job: are you using the run_script's re-submit or auto-chaining? I can see how those might interfere with what CIME is trying to do. I think if you want to use CIME's archiving while also having jobs automatically re-submit, you'll have to use CIME's resubmit feature.

@golaz
Contributor Author

golaz commented Mar 16, 2017

@rljacob: thanks for clarifying. I thought that the interference between runs and short-term archiving might be due to the fact that I'm not using CIME's resubmit feature. That's disappointing because it means that even if short term archiving actually worked, I probably would not be able to use it.

@rljacob
Member

rljacob commented Mar 16, 2017

Why not?

@acme-y9s
Contributor

acme-y9s commented Mar 16, 2017 via email

@rljacob
Member

rljacob commented Mar 16, 2017

I opened another issue for this in ESMCI. ESMCI/cime#1252

@golaz
Contributor Author

golaz commented Mar 16, 2017

@rljacob in response to your question above. CIME5 takes a "one-size-fits-all" approach to running ACME jobs. While that approach works well for anvil, it is unfortunately poorly suited for other machines. CIME5 tools would be much more useful to me if they could be assembled to create a custom workflow tailored for a specific need and environment (which is sometimes a moving target), rather than tools that come pre-assembled to only work in one specific fashion, such as

  • case.run should only be submitted via case.submit [the workaround for that one is easy]
  • job chaining should only be done with the built-in resubmit feature [many alternatives here]
  • case.st_archive should only be invoked via CIME5 and within case.run [probably don't have time to find a workaround for this one]

But that's off-topic and probably a philosophical difference that we will not reconcile here 😄

@rljacob
Member

rljacob commented Mar 16, 2017

You can call case.st_archive yourself after a run is done. Just leave DOUT_S as false so CIME doesn't try to call it for you. That way you can control when exactly it happens in conjunction with your calling of case.run.

I forgot about the chaining. We can work on supporting that.

@huiwanpnnl
Contributor

huiwanpnnl commented Mar 16, 2017

Chiming in...
As @golaz mentioned, case.run is now only allowed to be submitted via case.submit. This means the PBS/SLURM-based job bundling we used in the past (see here, here, and here) does not work anymore. If the workaround is not complicated, could you @rljacob help us on this? Thanks.

@gold2718

@golaz et al.,
The problem that spawned this issue is due to a regression in an off-script usage of the system scripts. While we are always free to use the available tools for any purpose (and many of us have been 'rolling our own' for decades), I would like to move toward having fewer regressions for production users.

The main advantage to using the CIME scripts as documented is that those use patterns are tested before any updates make their way to ACME master. Therefore, I would like to get your use cases adopted as CIME standard usage so that it is always tested.

In particular, the current need seems to be job bundling (as mentioned by @huiwanpnnl above). Rather than solving this problem for each machine and update (either CIME update or system update), I would like to make this a supported feature.

Please consider opening a new issue on ESMCI/CIME to specify the requirements of this feature.

Flagging @mfdeakin-sandia to help with this process.

@worleyph
Contributor

@gold2718 , so how are supported use patterns identified typically? It is not as if job bundling is something new. It is definitely not peculiar to ACME. "we" don't know that something is going away until the next version comes out and it is determined that a capability has disappeared. The perception, correct or not, is that each version of CIME is less flexible than the previous one, perhaps because the required use patterns are not yet sufficiently broad? While adding this as a request may get the capability added back in in the future, the capability is needed now.

@gold2718

@worleyph,

'How are use patterns identified typically?'

By users speaking up as @golaz has done.

'It is not as if job bundling is something new. It is definitely not peculiar to ACME.'

If it is not new, at least to ACME, then perhaps something is wrong with the ACME development processes since ACME team members provided approximately half the CIME development efforts. Do you have any suggestions for process improvement or are you just venting?

'The perception, correct or not, is that each version of CIME is less flexible than the previous one, perhaps because the required use patterns are not yet sufficiently broad?'

Perhaps, you meant 'My perception ...' or if not, please name the cohort which has 'The' perception so that we can poll them.

'While adding this as a request may get the capability added back in in the future, the capability is needed now.'

I am all for finding short-term workarounds but in order to ensure stability, we do need to identify required use patterns and make sure they are tested against regressions. This is software engineering 101, I assume it is not news to you?

@rljacob
Member

rljacob commented Mar 19, 2017

@huiwanpnnl, use "./case.submit --no-batch" if you have your own batch script controlling things.

@worleyph
Contributor

worleyph commented Mar 19, 2017

> If it is not new, at least to ACME, then perhaps something is wrong with the ACME development processes since ACME team members provided approximately half the CIME development efforts. Do you have any suggestions for process improvement or are you just venting?

This is an ACME github page, so of course my audience is ACME CIME developers :-). Partially venting since capabilities that I care about seem to disappear with each release, and I have to spend time figuring out how to put them back in. And not being a CIME developer that has been increasingly difficult.

I have always had the hope that each generation of CIME should at least have the capabilities of the previous version. Documenting what those are appears to be a bottleneck?

Bundling jobs has been a use case preceding ACME - I expect that CESM users would appreciate this capability as well.

> Perhaps, you meant 'My perception ...' or if not, please name the cohort which has 'The' perception so that we can poll them.

I will only speak for myself. Others will have to self-identify.

> I am all for finding short-term workarounds but in order to ensure stability, ...

Short-term workarounds can be difficult to design, but I am probably being overly pessimistic here. I'll let the CIME wizards determine how to put this back in.

(Update: corrected misattribution to Confluence, as pointed out by @gold2718 .)

@rljacob
Member

rljacob commented Mar 19, 2017

Repeating in case it gets lost: job scripts that "bundle" like @huiwanpnnl pointed to should still work with CIME5. You just need to use "./case.submit --no-batch" in place of "./case.run".

It turns out you can invoke case.st_archive outside of case.submit/run. You can run it at command line inside your case directory. The bigger problem is that it doesn't process mpas files and we're working on that.

@golaz
Contributor Author

golaz commented May 2, 2017

I'm trying to understand where we stand. This issue was originally opened because of two problems:

  1. Short term archiving moved the log files of the currently running job. As a result, no more log file information, although the job appears to be continuing.
  2. MPAS files are still not handled by short term archiving.

Looking at https://acme-climate.atlassian.net/browse/S2-130, it appears that (2) has now been fixed, but (1) has not. This would imply that it is currently not safe to invoke short term archiving while the model is running?

@rljacob : can you confirm?

@rljacob
Member

rljacob commented May 2, 2017

Yes that's right. It's not safe to invoke while the model is running.

@rljacob
Member

rljacob commented May 2, 2017

The log file request is a new feature so it's been opened as an issue in JIRA.

@golaz
Contributor Author

golaz commented May 2, 2017

My understanding was that pre-CIME, it was safe to invoke short term archiving while the model was running. The log file request was a compromise to make it easier for SE to modify short term archiving to be safe. So I view everything here as a bug fix.

@golaz golaz reopened this May 2, 2017
@PeterCaldwell
Contributor

I agree with Chris' understanding. In the past when you turned on short-term archiving, each time a job submission completed a 1-node job would be launched which would move all the files from that job into the short-term archiving location while the next job was running. The ability to have short-term archiving work as part of the normal job submission process is important because otherwise the user has to stop doing model runs and invoke short-term archiving by hand. This is a lot of unnecessary work which slows things down and provides ample opportunity for screwing things up.

@rljacob
Member

rljacob commented May 2, 2017

Short term archiving with job submission is working. If DOUT_S is TRUE, when you run "case.submit", 2 jobs will be submitted, one for the run and one for the archiving. The archiving will start as soon as the run finishes. A new feature lets you put the archiving job in a faster queue. If RESUBMIT > 0, then when the first pair finishes, CIME will submit another pair of jobs to continue the run, decrement RESUBMIT by 1 and keep doing that until RESUBMIT is 0. That should all work.
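
To make the sequence concrete, here is a minimal Python sketch that only simulates the order of submissions described above; it does not call CIME. The function `simulate_submissions` and the job labels are made up for illustration, and the loop assumes each new pair is queued only after the previous pair completes.

```python
def simulate_submissions(resubmit, dout_s=True):
    """List the batch jobs submitted over the life of a case.

    Assumptions: each submission queues a run job and, when dout_s is
    true, an archive job that depends on it; the next pair is queued
    only after the current pair completes, and RESUBMIT is decremented
    by 1 each time until it reaches 0.
    """
    jobs = []
    segment = 1
    while True:
        jobs.append(f"run[{segment}]")
        if dout_s:
            jobs.append(f"st_archive[{segment}] (after run[{segment}])")
        if resubmit == 0:
            break
        resubmit -= 1
        segment += 1
    return jobs


for job in simulate_submissions(resubmit=2):
    print(job)
```

With RESUBMIT = 2 this lists three run/archive pairs, one after the other.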

@PeterCaldwell
Contributor

Ok, so if our executables are 'acme' and 'archiving', I think you are saying that job submission works like this:

submit acme and {archiving with dependency on acme}
once acme finishes, archiving operates
once archiving is done, submit a new round of acme + {archiving with dependency on acme}

I think this will work, but it is less efficient than what used to happen, which is this:

submit acme
once acme finishes, it submits a new round of acme and an archiving script

Having archiving ruin logs of currently-running acme simulations would clearly not be acceptable in this old workflow and is, I think, what Chris was concerned about. Can you confirm that my assumptions about how jobs are submitted now are correct?

@rljacob
Member

rljacob commented May 2, 2017

Yes with the CIME script system, the second pair is not submitted until the first pair is complete. If there's an error, the whole process stops and the run directory is left as-is.

I don't understand your example of what used to happen. Only an acme job would be submitted and when that finished then the archiver (for that run) would be submitted as well as the next job? That sounds less efficient.

@PeterCaldwell
Contributor

I'll try to explain the old version in more detail: start by imagining a job without short-term archiving. As soon as it completes one submission, it starts another. Now add short term archiving by having acme launch an archiving script at the end of each of its submissions which cleans up all the files created by the job that just finished. Don't impose any dependency on this short-term archiving job because you know that the job it is meant to clean up has already finished. This old way is more efficient because archiving and simulation can occur in parallel, while the way you've set things up now serializes archiving and simulation.

Is this serialization a big deal? Not if the archiving job gets through the queue and runs quickly (say, in less than 1/2 hr). But on machines where you spend more time waiting in the queue than you do running, this serialization could slow down time-to-solution by a huge amount!
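
The size of that effect can be estimated with a back-of-the-envelope model. The Python sketch below is only an illustration under simplifying assumptions: every job waits a fixed time in the queue, archive jobs can overlap freely with later runs in the non-serialized case, and all numbers are invented rather than measured on any machine.

```python
def time_to_solution(n_segments, run_hours, run_queue_hours,
                     archive_hours, archive_queue_hours, serialized):
    """Estimate wall-clock hours for a chain of run segments.

    Simplified model (assumption): every job waits a fixed time in the
    queue. When serialized, each segment waits for its archive job
    before the next run is submitted; otherwise only the final archive
    job adds to the total, because earlier ones overlap with later runs.
    """
    run_cost = run_queue_hours + run_hours
    archive_cost = archive_queue_hours + archive_hours
    if serialized:
        return n_segments * (run_cost + archive_cost)
    return n_segments * run_cost + archive_cost


# Invented numbers: 10 segments, 6 h runs, 12 h queue waits, 0.5 h archiving.
slow_queue = dict(n_segments=10, run_hours=6, run_queue_hours=12,
                  archive_hours=0.5, archive_queue_hours=12)
print("serialized:", time_to_solution(serialized=True, **slow_queue))
print("overlapped:", time_to_solution(serialized=False, **slow_queue))

# Same case, but the archive job goes through a fast queue (0.25 h wait).
fast_queue = dict(slow_queue, archive_queue_hours=0.25)
print("serialized, fast archive queue:",
      time_to_solution(serialized=True, **fast_queue))
```

With these made-up inputs the serialized chain takes 305 hours against 192.5 hours when archiving overlaps the next run; shortening the archive job's queue wait closes most of that gap.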

@rljacob
Member

rljacob commented May 2, 2017

Ok I think I understand. But this "old version" was not a workflow implemented by an earlier version of CIME or by the pre-CIME CESM scripts. It's always been pairs of run-and-archive, submitted in sequence. The archiver has always moved all the log and history files and has never been documented as safe to use while another run is going.

@rljacob
Member

rljacob commented May 2, 2017

As mentioned above in an older comment, the archiver can be submitted to the "xfer" queue on edison which allows it to run quickly and reduce the time-cost of the serialization.

@PeterCaldwell
Contributor

It's definitely true that short term archiving has been broken ever since ACME branched from CESM, but I could have sworn that it operated as I described before that. At the time I was using it, I wasn't involved in the code-level details so perhaps I'm misunderstanding how it worked. In any case, my goal here is to alleviate what I see as a breakdown in communication about 'how things should work'. I'm fine with your 'pairs of jobs' approach as long as serialization cost doesn't kill productivity.

@rljacob
Member

rljacob commented May 2, 2017

Sounds good. If you can recall the version of CESM where it worked that way we can check that out, look at the short term archiver code and confirm if something was lost.

@mfdeakin-sandia
Contributor

The CIME issues for these implementations are here: ESMCI/cime#1503 ESMCI/cime#1485

@rljacob
Member

rljacob commented May 9, 2017

jgfouca added a commit that referenced this issue May 15, 2017
Short Term Archiving Features

This implements features to the short term archiver to enable running
it while the model is running without obviously breaking things (see
ESMCI/cime#1503 for potential issues with the --last-date option).
Other options added include --copy-only, which copies the files to be
archived instead of moving them; and --no-incomplete-logs, which
ignores logs which are not gzipped, and thus not complete

Fixes #1305
Passes scripts_regression_tests
BFB

* origin/mfdeakin-sandia/in_run_archive:
  Adds the --force-move option and implies --copy-only when --last-date is specified without --force-move
  Adds a warning when using the --last-date option and to its help
  Implement the copy_only option for short term archiving. This copies files rather than moving them
  Implemented most of the machinery for testing with "incomplete" log files
  Fix code format issue - replace unused variable with _
  Update template.st_archive
  Adds options to the st_archive to specify the last date (--last-date) to archive, and whether to disable archiving incomplete log files (--no-incomplete-logs)
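
The behaviour that the commit message attributes to --copy-only and --no-incomplete-logs can be pictured with the following Python sketch. It is not the CIME code: the function `archive_logs`, the directory layout, and the log file names are assumptions for illustration, and the --last-date and --force-move options are not modelled.

```python
import shutil
import tempfile
from pathlib import Path


def archive_logs(run_dir, archive_dir, copy_only=False, no_incomplete_logs=False):
    """Archive *.log.* files from run_dir into archive_dir/logs.

    Assumptions: a log counts as complete once it is gzipped (ends in
    ".gz"), as the commit message states; the directory layout and
    file-name pattern here are made up for illustration.
    """
    dest = Path(archive_dir) / "logs"
    dest.mkdir(parents=True, exist_ok=True)
    archived = []
    for log in sorted(Path(run_dir).glob("*.log.*")):
        if no_incomplete_logs and log.suffix != ".gz":
            continue  # not gzipped, so possibly still being written
        if copy_only:
            shutil.copy2(log, dest / log.name)
        else:
            shutil.move(str(log), str(dest / log.name))
        archived.append(log.name)
    return archived


# Demonstration in a throwaway directory with hypothetical log names.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    run_dir.mkdir()
    (run_dir / "acme.log.170311-120000.gz").write_text("finished segment")
    (run_dir / "acme.log.170312-090000").write_text("still running")
    done = archive_logs(run_dir, Path(tmp) / "archive",
                        copy_only=True, no_incomplete_logs=True)
    print(done)
```

In this sketch the log that is not gzipped stays in the run directory, and with copy_only the gzipped one is left in place as well.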
jgfouca pushed a commit that referenced this issue Jun 2, 2017
Allow the case.st_archive script to work with mpaso and mpascice history and restart files.

Also should work with mpasli but not tested.

From the case directory, executing ./case.st_archive should move all history and restart files to the short term archive for all ACME components.

Fixes #1305
S2-131 #close
[BFB]

* rljacob/cime/fix-mpas-starchive:
  fix mpas pattern matching so only interim restart files are deleted
  Add ability to archive MPAS land ice files
  Add ability to handle mpas files
  Change regex for mpaso and mpascice files
jgfouca added a commit that referenced this issue Jun 2, 2017
Short Term Archiving Features
jgfouca pushed a commit that referenced this issue Feb 27, 2018
Allow the case.st_archive script to work with mpaso and mpascice history and restart files.
jgfouca added a commit that referenced this issue Feb 27, 2018
Short Term Archiving Features
jgfouca pushed a commit that referenced this issue Mar 14, 2018
Allow the case.st_archive script to work with mpaso and mpascice history and restart files.
jgfouca added a commit that referenced this issue Mar 14, 2018
Short Term Archiving Features
rljacob added a commit that referenced this issue Apr 16, 2021
Allow the case.st_archive script to work with mpaso and mpascice history and restart files.
rljacob added a commit that referenced this issue May 6, 2021
Allow the case.st_archive script to work with mpaso and mpascice history and restart files.