
Slow data retrieval for some YTRay objects #2383

Closed · zhafen opened this issue Dec 8, 2019 · 20 comments

Assignees: matthewturk
Labels: demeshening (Removing the global mesh for particles), performance

Comments

zhafen commented Dec 8, 2019

Bug report

Bug summary

yt appears to be very slow at reading data from HDF5 files (via h5py) for use in YTRay objects, especially for particle-based simulations with snapshot sizes >~ 7 GB.

Code for reproduction

A notebook for reproducing the issue is available here, including the .ewah file, cProfile results, and data download links.
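
For orientation, the access pattern being timed looks roughly like the following sketch; the snapshot name, ray endpoints, and field are placeholders, not the notebook's actual values:

import yt

# Placeholder snapshot name; the real data is a ~8 GB particle snapshot
# split across four files.
ds = yt.load("snapshot_600.0.hdf5")

# Placeholder endpoints in code units; the notebook sends the ray through
# a halo center.
ray = ds.ray(ds.domain_left_edge, ds.domain_right_edge)

# Each field access like this goes through _read_particle_selection.
density = ray[("gas", "density")]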

Actual outcome
Ray generation is ~30 s per field, for a total of ~10 min when enough fields are retrieved.

Expected outcome
Using h5py to open the full data directly and perform simple calculations takes <5 seconds, so data retrieval should be at minimum a factor of ~6 faster. If caching is possible, the speed-up could be much greater.
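
For comparison, the direct h5py read is along these lines (a sketch assuming the Gadget/GIZMO HDF5 layout with a PartType0/Density dataset; the file pattern is a placeholder):

import glob
import h5py
import numpy as np

# Read the gas density from all four snapshot files directly.
densities = []
for fname in sorted(glob.glob("snapshot_600.*.hdf5")):
    with h5py.File(fname, "r") as f:
        densities.append(f["PartType0/Density"][:])
density = np.concatenate(densities)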

Notes

  • Most of the time is spent in yt/utilities/io_handler.py:158(_read_particle_selection).
  • The simulation capable of creating this problem has high dynamic range: Particle mass resolution ~7000 Msun, for a total gas mass of a few times 1e12 Msun (i.e. out to a few times the virial radius of a L* galaxy).
  • The raw data is a Lagrangian cosmological zoom-in simulation with data separated into four files. Total file size ~8GB.
  • Intersecting the box edge doesn't significantly change the computation time. (Note that this is true for the yt ray, but the Trident ray will take twice as long, because Trident breaks rays that go beyond the box into smaller rays, each of which is equally expensive.)

Version Information

  • Operating System: Linux
  • Python Version: 3.7.1
  • yt version: 4.0.dev0 changeset 947ebdc
  • Other Libraries (if applicable): h5py 2.8.0, Trident 1.3.dev1

Installation Method

$ # install yt
$ git clone https://github.com/yt-project/yt --branch=yt-4.0
$ cd yt
$ pip install -e .
$ cd ..
$ # install trident
$ git clone https://github.com/trident-project/trident
$ cd trident
$ pip install -e .

welcome bot commented Dec 8, 2019

Hi, and welcome to yt! Thanks for opening your first issue. We have an issue template that helps us gather the relevant information needed to diagnose and fix the issue.

triage-new-issues bot added the triage label Dec 8, 2019

zhafen (Author) commented Dec 8, 2019

@matthewturk, I got a chance to open the issue, continuing from our conversation on Slack. Thanks for your help so far!

matthewturk (Member) commented:

Hi @zhafen, I looked at this, and it looked to me like yt was reading the entirety of the dataset every single time -- which definitely does not make sense! So I dug in a bit more and I think it's a bug, which I took a shot at fixing in #2391. But that did not speed it up as much as I'd like; there are two additional sets of parameters that might help. The first is to set index_order to a tuple of values that better divides up the domain when you instantiate ds (the default is (5, 7)). The other is to check the CHUNKSIZE value in particle_geometry_handler.py.
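
A sketch of how those two knobs could be tried (the specific values are illustrative, and treating CHUNKSIZE as a module-level constant that can be patched before the index is built is an assumption):

import yt
import yt.geometry.particle_geometry_handler as pgh

# Assumed location of the constant mentioned above; try a smaller value
# before the particle index is built.
pgh.CHUNKSIZE = 32**3

# index_order is a (coarse, refined) pair; the default is (5, 7), and
# (3, 5) here is just an illustrative alternative.
ds = yt.load("snapshot_600.0.hdf5", index_order=(3, 5))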

I am looking at other possible speedups as well.

zhafen (Author) commented Dec 13, 2019

Thanks @matthewturk! PR #2391 provided a ~7.5x speedup.

I played around with index_order but didn't find any choices that provided a further speed-up. Where's the documentation for index_order? I unfortunately wasn't able to find it.

Decreasing CHUNKSIZE to 32**3 can result in a speed-up of up to ~4x. However, the speed-up depends on what other actions have already been done with yt, on when the CHUNKSIZE variable is modified, and on the .ewah file. In addition, it can cause a slowdown for other actions, e.g. centering the data.

Between PR #2391 and modifying CHUNKSIZE, YTRays can take ~2 seconds to retrieve data and Trident rays can take ~20 seconds, which is excellent! Thanks so much!

zhafen (Author) commented Dec 17, 2019

Hi @matthewturk, the tests I applied did indeed indicate a huge speed-up for the sightline I used. Unfortunately it appears that for other sightlines PR #2391 only gives a ~2x speed-up (e.g. this is the case for sightlines that pass through the center of the zoom-in region).

matthewturk (Member) commented Dec 18, 2019 via email

zhafen (Author) commented Dec 18, 2019

I'm finding that, under the default settings in the uploaded example notebook (which sends a ray straight through the center of a halo), generating a Trident ray can take ~4-6 minutes, depending on CHUNKSIZE.

YTRay itself only takes 10-15 seconds to retrieve data for a given field. However, it repeats this process ~20 times to get all the data needed for a Trident ray, so Trident takes 200-300 seconds in total.

brittonsmith (Member) commented:

Could this be because the ray you're making extends outside of the domain boundaries? Ray objects in yt do not work with periodic boundary conditions, so Trident gets around this by making a series of non-periodic rays and appending their data. Each separate ray object will do those data queries all over again. If this is the case, then perhaps the ray selection can be rewritten to take periodicity into account and do the selection for all sub-segments at once.
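
For reference, a Trident ray is created roughly like this (a sketch; the endpoints, line list, and output filename are placeholders):

import trident

ray = trident.make_simple_ray(
    ds,
    # If start/end extend past the domain, Trident builds several
    # non-periodic yt rays and appends their data, as described above.
    start_position=ds.domain_left_edge,
    end_position=ds.domain_right_edge,
    lines=["H", "O VI"],      # species whose ion fields get added
    data_filename="ray.h5",   # placeholder output file
)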

zhafen (Author) commented Jan 5, 2020

The length of the above ray is of order the halo's virial radius, so extending beyond the domain boundaries should not be an issue. I did experience the expected slowness you're describing when I tested a periodic ray previously.

matthewturk (Member) commented:

I'm going through the profiling results, and what I'm seeing is two specific things that are taking lots of time:

  1. count_particles_chunks which is used to pre-allocate arrays
  2. dataset.py:__getitem__ which is in h5py, and used to read data

I believe the problem is the first -- the second is a side-effect, and just showing that we are reading data too many times. In fact, it seems we are reading data ... way, way too many times. (I get 838 calls to __getitem__ in h5py.)

Many of the decisions about how to read data came from the idea that we should minimize over-allocation, when in reality that is not the problem if we have subchunking. I am going to spend a little bit of time looking over how we could potentially eliminate that decision.
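
For anyone re-running these numbers, the profile can be regenerated along these lines (a sketch; the field is a placeholder and ray is assumed to be defined at the top level of the session, as in the notebook):

import cProfile
import pstats

# Profile a single field retrieval on an existing YTRay object.
cProfile.run('ray[("gas", "temperature")]', "ray_field.prof")

# Show the 20 most expensive calls by cumulative time; this is where
# _read_particle_selection and h5py's __getitem__ show up.
pstats.Stats("ray_field.prof").sort_stats("cumulative").print_stats(20)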

matthewturk added the demeshening and performance labels Jan 6, 2020
matthewturk self-assigned this Jan 6, 2020
triage-new-issues bot removed the triage label Jan 6, 2020

matthewturk (Member) commented:

Actually, I just issued #2410, which speeds it up by a fair bit on my machine.

matthewturk (Member) commented:

Looks like that wasn't quite it, although I do have an idea about how to apply the ideas more broadly. This change did get me a 2x speedup, though; I will issue it in another PR (and likely retract #2410). If you could check it, that would be great.

diff --git a/yt/frontends/gadget/io.py b/yt/frontends/gadget/io.py
index 7c03f5cf0..336935f13 100644                          
--- a/yt/frontends/gadget/io.py                             
+++ b/yt/frontends/gadget/io.py                    
@@ -46,12 +46,13 @@ class IOHandlerGadgetHDF5(IOHandlerSPH):    
             for ptype, field_list in sorted(ptf.items()):
                 if data_file.total_particles[ptype] == 0:
                     continue
-                x = f["/%s/Coordinates" % ptype][si:ei, 0].astype("float64")
-                y = f["/%s/Coordinates" % ptype][si:ei, 1].astype("float64")
-                z = f["/%s/Coordinates" % ptype][si:ei, 2].astype("float64")
+                c = f["/%s/Coordinates" % ptype][si:ei, :].astype("float64")
+                x = c[:, 0]
+                y = c[:, 1]
+                z = c[:, 2]
                 if ptype == self.ds._sph_ptypes[0]:
-                    pdtype = f["/%s/Coordinates" % ptype].dtype
-                    pshape = f["/%s/Coordinates" % ptype].shape
+                    pdtype = c.dtype
+                    pshape = c.shape
                     hsml = self._get_smoothing_length(data_file, pdtype, pshape)
                 else:
                     hsml = 0.0

zhafen (Author) commented Jan 6, 2020

Looks like a nice, simple change to reduce the number of times the data is accessed. I can confirm that I got a 2x speed-up, and that the results are consistent before and after the change.

matthewturk (Member) commented:

I implemented a simple version of this, where we skip most (if not all -- this is just a first pass, so I may have missed something somewhere) of the pre-allocation. I get further speedups. Can you try it out? There may be errors in it, especially with vector fields, so it's not ready for a PR yet.

diff --git a/yt/utilities/io_handler.py b/yt/utilities/io_handler.py
index c79b5cf4b..d919dfd44 100644
--- a/yt/utilities/io_handler.py
+++ b/yt/utilities/io_handler.py
@@ -181,16 +181,16 @@ class BaseIOHandler(metaclass = RegisteredIOHandler):
 
         # psize maps the names of particle types to the number of
         # particles of each type
-        self._count_particles_chunks(psize, chunks, ptf, selector)
+        #self._count_particles_chunks(psize, chunks, ptf, selector)
 
         # Now we allocate
-        for field in fields:
+        for field in []:#fields:
             if field[0] in unions:
                 for pt in unions[field[0]]:
                     fsize[field] += psize.get(pt, 0)
             else:
                 fsize[field] += psize.get(field[0], 0)
-        for field in fields:
+        for field in []:#fields:
             if field[1] in self._vector_fields:
                 shape = (fsize[field], self._vector_fields[field[1]])
             elif field[1] in self._array_fields:
@@ -199,6 +199,8 @@ class BaseIOHandler(metaclass = RegisteredIOHandler):
                 shape = (fsize[field], )
             rv[field] = np.empty(shape, dtype="float64")
             ind[field] = 0
+        rv[field] = []
+        ind[field] = 0
         # Now we read.
         for field_r, vals in self._read_particle_fields(chunks, ptf, selector):
             # Note that we now need to check the mappings
@@ -206,12 +208,17 @@ class BaseIOHandler(metaclass = RegisteredIOHandler):
                 my_ind = ind[field_f]
                 #mylog.debug("Filling %s from %s to %s with %s",
                 #    field_f, my_ind, my_ind+vals.shape[0], field_r)
-                rv[field_f][my_ind:my_ind + vals.shape[0],...] = vals
+                #rv[field_f][my_ind:my_ind + vals.shape[0],...] = vals
+                rv[field_f].append(vals)
                 ind[field_f] += vals.shape[0]
         # Now we need to truncate all our fields, since we allow for
         # over-estimating.
         for field_f in ind:
-            rv[field_f] = rv[field_f][:ind[field_f]]
+            #rv[field_f] = rv[field_f][:ind[field_f]]
+            if ind[field_f] == 0:
+                rv[field_f] = np.empty(0, dtype="float64")
+            else:
+                rv[field_f] = np.concatenate(rv[field_f], axis=0)
         return rv
 
 class IOHandlerExtracted(BaseIOHandler):

zhafen (Author) commented Jan 7, 2020

Nice! This produced another solid speed-up. With this change and the previous one combined I'm seeing speed-ups of ~2.5-4x.

I did find a minor bug, with the following suggested fix:

@@ -199,6 +199,8 @@ class BaseIOHandler(metaclass = RegisteredIOHandler):
                 shape = (fsize[field], )
             rv[field] = np.empty(shape, dtype="float64")
             ind[field] = 0
-        rv[field] = []
-        ind[field] = 0
+        for field in fields:
+          rv[field] = []
+          ind[field] = 0
         # Now we read.
         for field_r, vals in self._read_particle_fields(chunks, ptf, selector):
             # Note that we now need to check the mappings

Removing the pre-allocation makes sense: for most simulation datasets, reading the data will take longer than even typically slow functions like np.concatenate. That said, are there datasets where pre-allocation would still be important? Maybe datasets that have a large amount of data divided into only a few categories?
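
To make the trade-off concrete, the two accumulation strategies look roughly like this (toy sizes, not the real chunk counts):

import numpy as np

chunks = [np.random.random(100_000) for _ in range(50)]

# (a) Count first, pre-allocate, then fill -- needs an extra counting pass
#     over the selection (what _count_particles_chunks provided).
total = sum(c.size for c in chunks)
out = np.empty(total, dtype="float64")
i = 0
for c in chunks:
    out[i:i + c.size] = c
    i += c.size

# (b) Collect into a list and concatenate once at the end -- no counting
#     pass, at the cost of briefly holding the chunks and the result at once.
out_b = np.concatenate(chunks)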

matthewturk (Member) commented:

Oh, thanks for the fix!

Yes, I think there are likely cases where pre-allocation could help -- specifically, cases where you are accessing a large collection of data dictionary-style (i.e., a big sim, then doing ds.r[:]["something"]) -- but I think most operations will not need it now. One thing we could do, I suppose, is add a heuristic and switch behavior, but that would be another tunable and I'm not sure how productive it would be.

I'm looking into other potential issues with this before proceeding on a PR or proposal, but I am really glad that it too provided a good speedup.

chrishavlin (Contributor) commented:

With #2416 merged, this issue may be ready to close. I think it'd be good to re-run the initial profiling just to verify that the version on main does improve performance.

zhafen (Author) commented Jun 2, 2022

Thanks so much for your work on this! I'm very excited for the related PR. I'll try re-running the initial profiling.

zhafen (Author) commented Jun 2, 2022

Profiling just finished. With #2416 merged this is a 3x speed-up! Thank you @chrishavlin, @neutrinoceros, @matthewturk, @jzuhone!

zhafen closed this as completed Jun 2, 2022

chrishavlin (Contributor) commented:

awesome! thanks for re-running!!
