Towards reductions along a subset of axes #85

davidhassell · 2023-06-02T10:19:33Z

@bnlawrence and I have just been chatting about issues around using active storage to create reductions along a subset of axes (e.g. calculating the temporal mean at each point in X-Y space, which creates a logically 2-d array).

This might not make too much sense (:)), but is an attempt to capture what we said before we forget it.

The gist was that reducing over a subset of axes is a problem for PyActiveStorage, not StackHPC. We would tell PyActiveStorage (PAS) which subset of axes we wanted to reduce over, PAS would translate that into whatever server-side storage-chunk slices are needed to deliver it, and PAS would then combine the results into the appropriate N-d array, ready to be passed back to Dask/cf-python.

Potential pitfalls that occur to me:

Performance: when taking the T average of the (T, Y, X) array, there will be X times Y requests, rather than one request per storage chunk.
StackHPC would have to pass back the result's location, as well as the data, sample size, etc.
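
To put rough numbers on the performance concern, here is a back-of-envelope sketch. It assumes each active storage request returns a single scalar reduction over one selection within one storage chunk; the function name `request_counts` and the example shape and chunking are made up for illustration and are not taken from PyActiveStorage.

```python
import math


def request_counts(shape, chunks):
    """Compare request counts for a mean over T of a (T, Y, X) array.

    Assumption: one request yields one scalar reduction, and a request
    cannot span more than one storage chunk.
    """
    t, y, x = shape
    ct, cy, cx = chunks
    t_chunks = math.ceil(t / ct)
    # One request per storage chunk (what a full reduction needs).
    per_chunk = t_chunks * math.ceil(y / cy) * math.ceil(x / cx)
    # One request per (y, x) point for every chunk stacked along T.
    per_point = y * x * t_chunks
    return per_chunk, per_point


# Hypothetical monthly 1-degree field, chunked as (12, 90, 180).
per_chunk, per_point = request_counts((120, 180, 360), (12, 90, 180))
print(per_chunk, per_point)
```

With a single chunk along T the second number is exactly X times Y; with several chunks along T it grows by that factor as well.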

Brain dump over. Hopefully this will all make sense whenever we read this next (probably not before September 2023!).

markgoddard · 2023-06-05T08:23:35Z

Is this a case of it being too complicated to offload to S3 active storage initially, or do you see this as being an operation where S3 active storage cannot add value?

I'm just wondering whether we could accept a list of selections upon which to take a sum/count, and return a list of results to be combined as necessary by PyActiveStorage.
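
A minimal sketch of how that might look from the PyActiveStorage side, assuming the array is chunked along T only. The server is faked with NumPy here: `fake_active_sum_count` is a hypothetical stand-in for whatever the storage would return for one selection, not a real API. Each selection doubles as the result's location in the output array.

```python
import numpy as np


def fake_active_sum_count(chunk, selection):
    """Hypothetical stand-in for a server-side reduction on one selection.

    Returns the (sum, count) pair for the selected part of the chunk.
    """
    data = chunk[selection]
    return float(data.sum()), int(data.size)


def mean_over_t(array, t_chunk):
    """Mean over T of a (T, Y, X) array from per-selection sums and counts."""
    t, y, x = array.shape
    total = np.zeros((y, x))
    count = np.zeros((y, x), dtype=int)
    for t0 in range(0, t, t_chunk):
        chunk = array[t0:t0 + t_chunk]  # one storage chunk along T
        # The list of selections sent with the request: one column per point.
        selections = [(slice(None), j, i) for j in range(y) for i in range(x)]
        # The list of results returned, in the same order as the selections.
        results = [fake_active_sum_count(chunk, s) for s in selections]
        for (_, j, i), (s, n) in zip(selections, results):
            total[j, i] += s
            count[j, i] += n
    # Client-side combination of the partial sums and counts.
    return total / count


arr = np.arange(24.0).reshape(4, 2, 3)
print(np.allclose(mean_over_t(arr, t_chunk=3), arr.mean(axis=0)))
```

Because sums and counts add across chunks, the combination step stays trivial however the T axis is chunked.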

sd109 · 2023-07-05T13:57:38Z

Just while I remember, in the wider Excalidata meeting last Friday the DDN team mentioned a technical difficulty in their active storage implementation that would make it hard to return the result of a single reduction if that result was over something like 4kB in size (I may be misremembering the exact number). This is obviously not a problem for a simple reduction but, as @bnlawrence and I discussed in that meeting, it might end up being an issue if we are interested in returning reductions along subsets of axes as described here, since reduction results could then become arbitrarily large. This means that it might only be possible to handle such functionality on the PyActiveStorage side if we are to have matching functionality between the S3 and POSIX implementations.

I guess more generally, we should probably have some shared list of discussed feature enhancement ideas that we can run past the DDN team too before we stray too far from what's possible on their end.

markgoddard · 2023-07-05T15:03:13Z

Perhaps where these limits apply for a storage backend we could slice up the request into sufficiently small batches then aggregate in PyActiveStorage?
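
Something along these lines, as a rough sketch only. It assumes each result is a fixed-size (sum, count) pair of 16 bytes and uses 4096 bytes as a placeholder for the backend limit; both numbers and the name `batch_selections` are illustrative, not known properties of any backend.

```python
def batch_selections(selections, max_bytes, bytes_per_result=16):
    """Split selections into batches whose responses fit within max_bytes.

    Assumption: every selection yields one fixed-size result
    (e.g. an 8-byte sum plus an 8-byte count).
    """
    per_batch = max(1, max_bytes // bytes_per_result)
    return [
        selections[k:k + per_batch]
        for k in range(0, len(selections), per_batch)
    ]


# 1000 hypothetical selections against a placeholder 4096-byte limit.
batches = batch_selections(list(range(1000)), max_bytes=4096)
print(len(batches), max(len(b) for b in batches))
```

Each batch would then be one request, with PyActiveStorage aggregating the returned lists in order.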

bnlawrence added the enhancement New feature or request label Jul 5, 2023
