Memory "management" issue with intel #1322

Open
guillaumevernieres opened this issue Oct 11, 2024 · 11 comments

@guillaumevernieres
Contributor

The soca variational application takes an insane amount of memory on Hercules and Gaea (~8TB for the simple 3DVAR); both use intel 2021.9.0. The same application on Hera requires ~0.8TB of memory; the intel compiler version on Hera is 2021.5.0.

I have no idea if the compiler is the issue.

@guillaumevernieres
Contributor Author

I'm labeling this as soca, but I wonder if it's an issue for the fv3-jedi application as well. @RussTreadon-NOAA, @CoryMartin-NOAA or others, have you tried running the variational application on Hercules lately?

@travissluka

fyi @fmahebert

@guillaumevernieres
Contributor Author

[Image: gnu-vs-intel]

path to logs:

/work2/noaa/da/gvernier/runs/profiling-3dvar

@fmahebert

Yes, we've seen the intel compiler produce executables that take up way more memory for a little while now. We haven't been able to pinpoint the cause yet. Thanks for opening an issue to track this and to share your measurements.

Related issue (presumably):

@guillaumevernieres
Contributor Author

@jswhit

jswhit commented Dec 12, 2024

@fmahebert has there been any progress on reducing the memory footprint using intel? This has been blocking us for months now; we can't run even a low-resolution coupled DA experiment.

@fmahebert

@jswhit There's been no progress towards understanding this issue from the JCSDA core team, largely for lack of resources (not a lack of concern).

@dkokron

dkokron commented Dec 30, 2024

If you set me up with a small (<= 32 nodes) reproducer on wcoss2, then I can take a look and work with Intel on a solution. I have a complete build of the global-workflow on dogwood.
dogwood:/lfs/h2/hpc/support/daniel.kokron/Projects/GlobalWorkflow/global-workflow

@jswhit

jswhit commented Jan 2, 2025

Would using gcc for GDASapp (instead of intel) be a potential workaround for this in the short term?

@shlyaeva
Collaborator

shlyaeva commented Jan 2, 2025

@jswhit yes, certain compiler/platform combinations work significantly better. E.g. intel + hera doesn't have this issue, as Guillaume pointed out. Using gnu on orion/hercules is another option; I think Guillaume had a lot more success with that than with intel.
Bo did a great summary of the results with gnu vs intel on hera and orion in https://github.com/JCSDA-internal/fv3-jedi/issues/1256. It's profiling fv3-jedi + LETKF, but I think it's representative of runs with soca, and of runs with Var as well.

@jswhit

jswhit commented Jan 3, 2025

@shlyaeva I'm mainly interested in running on gaea, but I think the issue there is the same as on hercules/orion (they all use a newer intel compiler than hera).

6 participants