
very slow parallel ArrayMesh? #14

Open
pmcdonal opened this issue May 25, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@pmcdonal
Member

Computing IC power for Abacus cubes, the time for mpi.size > 1 is dominated by the line:
mesh_init = pypower.ArrayMesh(mesh_init, L, mpiroot=0)
which jumps from ~7 s for one process to ~90 s for 2 or 32 processes on cori (and worse, from ~4 s to ~150 s on my laptop, so it is not cori-specific).

Of course I can just plow through it, but might as well cut down on friction where possible...

It seems to come from pmesh/pm.py, around line 445,

mesh_init being:

if not mpi.rank():
    with asdf.open(ic_file, lazy_load=False) as af:
        mesh_init = af['data']['density']
else:
    mesh_init = None

(Is there a way to read asdf in parallel?)
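
For reference, a minimal timing sketch of the pattern above. This is only a sketch: the file path, box size, and mpi4py communicator are placeholders, and pypower.ArrayMesh is called with the same signature as in the snippet.

import time

import asdf
import pypower
from mpi4py import MPI

comm = MPI.COMM_WORLD
ic_file = 'ic_dens_N576.asdf'  # placeholder path
L = 2000.                      # placeholder box size

# Root reads the full density mesh; the other ranks hold None.
if comm.rank == 0:
    with asdf.open(ic_file, lazy_load=False) as af:
        mesh_init = af['data']['density']
else:
    mesh_init = None

# Time the root-to-all redistribution performed by ArrayMesh.
comm.Barrier()
t0 = time.time()
mesh_init = pypower.ArrayMesh(mesh_init, L, mpiroot=0)
comm.Barrier()
if comm.rank == 0:
    print('ArrayMesh took {:.1f} s'.format(time.time() - t0))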

@pmcdonal added the bug label on May 25, 2022
@pmcdonal
Member Author

Somehow I left out what that pm.py line is:
mpsort.permute(flatiter, argindex=ind.flat, comm=self.pm.comm, out=self.flat)

@adematti
Member

adematti commented May 25, 2022

Just a quick comment before going to sleep:
I'm not sure I'll be very useful here, as these are Yu Feng's routines, but I can try to help.
I got the unravel() trick from nbodykit, https://github.com/bccp/nbodykit/blob/4aec168f176939be43f5f751c90363b39ec6cf3a/nbodykit/source/mesh/array.py#L62, which enforces that all ranks except root have a zero-size array. I'd guess we could avoid that, as long as the flattened mesh is split (in natural order) across all ranks.

What is the mesh shape?

About asdf, I usually just read the rows of interest on each process, e.g. https://github.com/cosmodesi/mpytools/blob/6f2766ea00b5f316f70e221672cf8d41ac6166f4/mpytools/io.py#L969. Since only slices (start, stop, step) are supported in asdf slicing (if I remember correctly), this should do the right thing, i.e. only read the relevant rows on each process. I'm not sure this is faster than non-parallel I/O, though (I haven't tried much).
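
A minimal sketch of that per-rank read, assuming the density mesh is stored as a single array under af['data']['density'] and that asdf's lazy arrays accept plain start:stop slices as described above; the path is a placeholder.

import asdf
from mpi4py import MPI

comm = MPI.COMM_WORLD
ic_file = 'ic_dens_N576.asdf'  # placeholder path

with asdf.open(ic_file) as af:  # rely on asdf's lazy loading of array blocks
    density = af['data']['density']
    n0 = density.shape[0]
    # Split the first (slowest-varying, C-order) axis across ranks, so the
    # flattened mesh ends up distributed in increasing C index.
    start = n0 * comm.rank // comm.size
    stop = n0 * (comm.rank + 1) // comm.size
    # If asdf honours the slice lazily, only these rows are read from disk.
    local_slab = density[start:stop]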

@pmcdonal
Member Author

Yes, the pmesh/pm.py line I mentioned is coming from this unravel() line. It is several times faster to just run the whole thing in 1 process (while standard CatalogFFTPower is much faster with MPI, i.e., my MPI is working). In the grand scheme of things this particular case is unimportant, so I will try this asdf read just out of curiosity and then forget about it for now. (Actually, though, it looks like ArrayMesh assumes the data is not distributed, i.e., mpiroot=None doesn't look valid? Which I guess makes sense, a mesh being different to distribute than a list of objects. I hadn't really thought about it, having gotten used to the catalogs.)

@adematti
Member

(actually though, it looks like ArrayMesh assumes the data is not distributed, i.e., mpiroot=None doesn't look valid?)
=> Yes, that's the part I got from nbodykit; it may be relaxed to accept distributed arrays, as long as they are distributed with increasing C index. I may try to allow for the distributed version at some point (if you do not try first!).

@adematti
Member

Commit acba368 should allow passing a distributed array to ArrayMesh, e.g.
mesh = ArrayMesh(distributed_array, boxsize=boxsize, nmesh=shape, mpiroot=None)
(nmesh must be provided in this case).
This may still not help with the slowness issue; there may be room for improvement in the specific case of the full mesh held by a single rank, but I would need more details for testing purposes: the mesh shape, or better, the path to ic_file.
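
A possible usage sketch combining the per-rank read from the earlier comment with this distributed interface; whether ArrayMesh expects the 3D slab or its flattened version is an assumption to check against the commit, and nmesh and boxsize are example values.

import pypower

nmesh = (576, 576, 576)  # full mesh shape, e.g. density.shape from the read above
boxsize = 2000.          # placeholder box size

# local_slab: this rank's block of consecutive first-axis rows (increasing C index).
mesh = pypower.ArrayMesh(local_slab, boxsize=boxsize, nmesh=nmesh, mpiroot=None)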

@pmcdonal
Member Author

The file is
/global/cfs/cdirs/desi/public/cosmosim/AbacusSummit/ic/AbacusSummit_base_c000_ph000/ic_dens_N576.asdf

This parallel read does work (producing the same results). mpi.size > 1 is faster than before, but still overall not as fast as mpi.size = 1. I'm happy to leave this until it comes up somewhere as a real obstacle.
