slow regional runs with nuopc vs mct #1907
Comments
Thanks for starting this, @jkshuman. I would skip the NTASKS_PER_INST setting... that may be relevant for a multi-instance / multi-driver case, but otherwise I think it will just confuse this investigation. (Unless others know something that I'm missing.)
@billsacks I added a run which removes the change to NTASKS_PER_INST.
Adding @billsacks's comments on the potential culprit here:

(1) Jackie's runs use a resolution of USRDAT.
(2) The USRDAT resolution sets the default PE layout to use a single task.
(3) This block of code sets PIO_TYPENAME to netcdf when using a single task.

One workaround for this, besides explicitly setting PIO_TYPENAME via an xmlchange, is to specify --pecount on the create_newcase line so that the number of tasks is set to something greater than 1 from the start, which in turn keeps PIO_TYPENAME as pnetcdf.

Off-hand, I'm struggling a bit to come up with a robust way to get the correct settings out of the box for various situations. Would it make sense to have separate USRDAT_1pt and USRDAT_regional (or something like that) just for the sake of setting the default PE layouts differently? I imagine that might be messy, though. Another option might be to move the above CIME code to case.setup time, so that as long as you change the PE layout before your first call to case.setup, it wouldn't be invoked – but changing that setting at that point might come with its own issues (e.g., confusion if the user has already tried to manually set it), so I'm not sure whether that's a good idea. This might take some more brainstorming.
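For concreteness, a rough sketch of those two workarounds (the case path, compset placeholder, and task count below are illustrative assumptions, not values taken from this thread; CLM_USRDAT is assumed as the grid alias for USRDAT):

```sh
# Workaround 1: after the case exists, switch PIO back to pnetcdf explicitly.
cd /path/to/usrdat_regional_case      # placeholder path
./xmlchange PIO_TYPENAME=pnetcdf

# Workaround 2: request more than one task at creation time so the
# single-task default (and the netcdf fallback) is never triggered.
./create_newcase --case /path/to/usrdat_regional_case \
  --res CLM_USRDAT --compset YOUR_COMPSET \
  --pecount 36 --run-unsupported
```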
@slevisconsulting asked:
That's a good question. It occurred to me too, and I assumed there was some reason it would be a problem that motivated the change in CIME, but I'll check with Jim Edwards to see.
I found this note from Jim from 6 years ago:
which is probably what inspired the use of netcdf for single-processor cases.
@jkshuman - I think there was something wrong with your case setup scripts. The original case has NTASKS=8 for all components - which maybe was what you intended, though I seem to remember that your real original case had NTASKS_ATM=36 (i.e., 1 full node). But the other cases all have NTASKS=1... maybe you didn't change that setting? If you want to avoid redoing multiple tests, you could just do one additional test that is exactly like the original (i.e., NTASKS=8) but with PIO_TYPENAME=pnetcdf. Or you could do two additional tests: one with NTASKS=36 (or equivalently NTASKS=-1) and one with that setting plus PIO_TYPENAME=pnetcdf.
I did take a quick look through your timing file from the original case, though, and I think this might just be a matter of needing to throw more processors at it: almost all of the time is spent in CTSM (not DATM or the coupler), and of the lnd run time, 58% is spent in canflux, which I don't think involves any I/O. So I don't think that tweaking I/O settings will make much difference, and the main thing you can do to speed it up is to throw more processors at it – at least one full node (36 processors), and maybe more given the size of this region. If you do have an apples-to-apples comparison of a nuopc and mct case that shows that the nuopc case is running significantly slower, that would be interesting to see. I'd be surprised based on what I can see in the timing file, but sometimes surprises can be fun :-)
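If it helps, a rough sketch of the xmlchange calls for those suggested tests (the case paths are placeholders; this assumes copies of the original case):

```sh
# Test 1: identical to the original 8-task case, but with pnetcdf I/O.
cd /path/to/copy_of_original_case     # placeholder path
./xmlchange PIO_TYPENAME=pnetcdf

# Test 2: one full cheyenne node (NTASKS=-1 is shorthand for 36 tasks there);
# a third test would add PIO_TYPENAME=pnetcdf on top of this layout.
cd /path/to/second_copy               # placeholder path
./xmlchange NTASKS=-1
./case.setup --reset                  # re-run setup after changing the PE layout
```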
@billsacks I was using a MOSART compset, so that may have contributed. The MCT case is getting underway; I will update the paths when the apples-to-apples comparisons are done.
@billsacks I updated the nuopc runs, but I am having a hard time getting a successful MCT run. I am going to abandon MCT for the moment, but I'm happy to try again if there are any ideas. I tried CLM5.1 and CLM5.0 with MCT but keep getting a variation of this error: 256:MPT ERROR: Rank 256(g:256) is aborting with error code 2.
That looks like a problem with mapping files being inconsistent with the domain. 55296 is the size of a 0.9x1.25 domain, I think, so you probably have one or more files (e.g., mapping files) that are from that rather than your regional domain. If you have an old MCT case sitting around that showed the kind of speed you're expecting, you can point me to that and no need to try to reproduce this exactly apples-to-apples.
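A quick, hedged way to check what grid a suspect file actually describes (the file path is a placeholder; dimension names differ between domain and mapping files):

```sh
# Print just the header: a 0.9x1.25 domain file has ni=288, nj=192
# (288 * 192 = 55296), while a regional file should show much smaller dimensions.
ncdump -h /path/to/suspect_domain_or_mapping_file.nc | head -n 20
```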
Indeed, part of the problem may be that mosart is still on, as evidenced by an rof log file in your run directory, despite the NULL setting. I've had problems turning off mosart using NULL after creating a case. This might work, I don't remember:
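The exact command is not captured in this excerpt; a hedged guess at the usual approach, assuming the MOSART_MODE XML variable exists in this model version:

```sh
cd /path/to/case_with_mosart   # placeholder path
./xmlchange MOSART_MODE=NULL   # assumed variable name; confirm with ./xmlquery MOSART_MODE
./case.setup --reset
```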
@olyson I had tried running with SROF and got a fail, so I just went back to MOSART. I will try again with SROF using a long-name compset.
Thanks for the discussion on this. (@billsacks I must have had a typo in my path for the MCT domain directory, but I fixed the domain path and have a somewhat comparable MCT run listed above.) To complete this set of tests I ran these with SROF using the long-name compset. I had failures with the alias I2000Clm51FatesRs, but have not explored that failure further. The test paths are updated at the top of the issue. @billsacks, is this the minimum recommendation for these region subsets?
The importance of using the regular JOB_QUEUE is to prevent the use of the SHARED queue, which can hurt performance: since you are sharing the node with others, what they do can have a greater impact on your program's run. Technically the only way to get no interference from others is to have a dedicated machine, which isn't really possible, but at least staying out of the shared queue is important. As for NTASKS, that will depend on the user's case, and you can go up to the number of grid cells in your regional domain. NTASKS=-1 means use a full node, which on cheyenne is 36 tasks. You normally do want to use full nodes, so the minus syntax is helpful for that.
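A hedged illustration of those two settings (the case path is a placeholder; queue names and node size are cheyenne-specific):

```sh
cd /path/to/regional_case              # placeholder path
./xmlchange JOB_QUEUE=regular --force  # stay out of the shared queue
./xmlchange NTASKS=-1                  # -1 = one full node (36 tasks on cheyenne)
./case.setup --reset
```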
Sorry for my delay in getting back to this. I looked at the new cases (though not the ones where you used MOSART with NULL mode), and this really looks like it's mainly (or entirely) a matter of needing to throw more processors at the problem. The switch from netcdf to pnetcdf doesn't help for your original 8-task case and in fact makes things a little slower – though that could just be machine variability, or it could be that pnetcdf really only helps for larger processor counts. The main contributor to the run time is CTSM (not DATM or the coupler); of this, the main contributor is can_iter, but fates_wrap_update_hifrq_hist, surfalb and surfrad also take significant time. This looks to me like it's probably a computational or memory bottleneck in CTSM, not anything to do with using CMEPS or CDEPS. When you increased from 8 to 36 processors, the runtime improved considerably, though not quite linearly with the number of processors. The MCT case you ran uses a very different processor layout:

It looks like what's going on here is that, when you set up the MCT case, you started with an f09 case and then changed things from there, whereas with the NUOPC case, you started with USRDAT resolution and then changed things from there. So I guess the out-of-the-box PE layout for the MCT case was similar to that of an f09 case, though I'm confused as to why you got this particular PE layout, because our standard f09 layout on cheyenne uses a lot more processors than that. Whatever the explanation, I have a feeling that the differences you're seeing are due to the differences in processor counts more than anything else. I'm not sure if this totally explains the differences, because the MCT case gives a 13x improvement in land run time for an 8x increase in processors. But I have seen better-than-linear scaling like this before when there are memory bottlenecks, so I wouldn't be surprised if most / all of this can be explained by the different processor layout. So can you try the following in a NUOPC case?
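The specific list of settings was not captured in this excerpt; judging from the name of the follow-up case below (ntasks_rootpe_piotype), it presumably involved NTASKS, ROOTPE, and PIO_TYPENAME. A rough reconstruction, not the verbatim recommendation (the case path is a placeholder):

```sh
cd /path/to/nuopc_regional_case    # placeholder path
./xmlchange NTASKS=36,ROOTPE=0     # one full cheyenne node, every component rooted at task 0
./xmlchange PIO_TYPENAME=pnetcdf   # avoid the single-task netcdf fallback
./case.setup --reset
./case.build
```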
@ekluzek partly replied to this, but yes, I think this is right, though as @ekluzek says the ntasks recommendation would depend on the number of grid cells.
@billsacks updated this with a nuopc case using your recommendations.
case: /glade/work/jkshuman/FATES_cases/test/setup_bsacks_ntasks_rootpe_piotype_SROF_SAmer_543e4243a_d63b8d21
Thanks a lot @jkshuman. That NUOPC case gets closer to the timing of the MCT case, but is still slower – lnd run time of 0.67 sec/day instead of 0.40 sec/day. But I think I see why: It looks like your MCT case uses a domain file with nearly 1/2 of the grid cells masked out as ocean, whereas the regional mesh file used in your NUOPC case has a mask that is 1 everywhere. This leads to the following: For the MCT case:
For the NUOPC case:
I'm not sure what the recommended way (if any) is for setting the mask on the NUOPC mesh file. But for now, a quick test to confirm that this explains most / all of the timing difference would be to rerun the MCT case changing fatmlndfrc (via the xml variables LND_DOMAIN_PATH and LND_DOMAIN_FILE) to point to a modified version of /glade/work/jkshuman/sfcdata/domain.lnd.fv0.9x1.25_gx1v6.SA.nc, where you change the "mask" variable to be 1 everywhere. Then the MCT case would be using a mask consistent with the NUOPC case; I know this isn't the mask you want to be using, but it would give us more of an apples-to-apples comparison.
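A hedged way to produce that modified copy with NCO (the output file name is a placeholder; this assumes the variable is literally named "mask" as described above):

```sh
# Make a copy of the domain file with mask = 1 at every grid cell.
ncap2 -O -s 'mask=0*mask+1' \
  /glade/work/jkshuman/sfcdata/domain.lnd.fv0.9x1.25_gx1v6.SA.nc \
  /glade/work/jkshuman/sfcdata/domain.lnd.fv0.9x1.25_gx1v6.SA_maskall1.nc

# Point the MCT case at the modified file (run from the case directory).
./xmlchange LND_DOMAIN_PATH=/glade/work/jkshuman/sfcdata
./xmlchange LND_DOMAIN_FILE=domain.lnd.fv0.9x1.25_gx1v6.SA_maskall1.nc
```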
@billsacks thanks for looking at this. It makes sense. I will update here when I get around to modifying the MCT domain file as suggested for that comparison. It seems that a longer-term solution would be to modify the subset script to mask out the ocean.
@jkshuman Yes, the subset script could modify the mask based on the global mask from the mesh file. The other way to do this now would be to use the mesh_modifier script to get the right mask onto the regional mesh file. That's something you could do with the existing tool, although there's a bit to figuring out how to use it. But @slevisconsulting or I could help if you need some guidance...
@ekluzek Thanks for that guidance. I will look into using the existing mesh_modifier script, and get in touch for help.
That may have been part of the original motivation for subsetting the global mask rather than creating a new mask from scratch based on the coordinates. |
Is there a recommended setting if you only run CLM-FATES with MCT? Here's my setup, but it's still slow. |
@niuhanlin we recommend that you switch over to NUOPC and then optimize based on your specific needs; MCT is no longer supported. On that note, I should close this issue, as it is likely the additional ocean tiles that cause the slower performance. Any disagreement on closing this, @ekluzek @billsacks?
The recommendation was to test this again with the ocean tiles masked out. I have not had time to perform that test, but based on the discussion and review by @billsacks, that is the key difference.
@niuhanlin - in most cases, we recommend using a single thread (NTHRDS_* = 1). As @jkshuman says, this issue is essentially resolved, and I agree that it can be closed. @niuhanlin if you want further support, I recommend opening a new Discussion topic (https://github.com/ESCOMP/CTSM/discussions) or forum post (https://bb.cgd.ucar.edu/cesm/). There, please give details on your configuration so we can give you better guidance.
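For completeness, a hedged example of the single-thread setting mentioned above (the case path is a placeholder):

```sh
cd /path/to/your_case    # placeholder path
./xmlchange NTHRDS=1     # applies NTHRDS_* = 1 to every component
./case.setup --reset
```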
In testing #1892 the regional runs are significantly slower than similar runs with MCT.
A series of 10-day runs will allow us to investigate changes in setup and timing.
case directories:
Mosart compset (nullMosart that did not set to null...)
tagging @billsacks @ekluzek @slevisconsulting for discussion
definition of done: recommendations for regional case setup