Comparison to Xarray-Beam #117
Hi @shoyer, thanks for the great questions! I haven't done any performance comparisons, but a lot of effort has been made to make Cubed scale horizontally, with conservative modelling/prediction of memory usage. So the emphasis so far has been on reliability over raw speed, with the goal that a user can leave a job running and have high confidence that it will complete successfully. (Hence the inclusion of runtime features like retries, and backup tasks for straggler mitigation, #14.) The two main runtimes I have been working with, Modal and Lithops, are "pure serverless" and don't have a shuffle - and I think that is a use case Cubed should support. But it may be interesting to explore a part of the solution space where intermediate data is passed through a shuffle (like Xarray-Beam does). Early on I did some experiments to get Cubed running on Cloud Dataflow (and the unit tests are run on Apache Beam), but there is a lot more to do to translate a Cubed DAG to a Beam DAG that takes advantage of the worker resources efficiently. I think Cloud Dataflow would be a great runtime target for Cubed, even if the shuffle isn't used (or is available as an option).
Cubed delegates rechunking to rechunker's algorithm, and I think the pertinent point here is that the target Zarr array is copied from the source (or intermediate) according to the target's chunk sizes, so reads from the source may cross (the source's) chunk boundaries. In the relatively prime case, a chunk of size 999 is read from the source (chunk size 1000) and written to the target (chunk size 999), and some source reads will cross chunk boundaries. (Typically more than one target chunk is processed in one go, for efficiency, but the effect is the same.) Zarr is capable of reading parts of an array that cross chunk boundaries, so I'm not sure this is a problem. I don't see where it needs to rechunk to size 1 - though maybe I'm missing something! BTW, while looking at this question I stumbled across your rechunker proposal for multi-stage rechunking (pangeo-data/rechunker#89). That looks like a very useful addition - perhaps we should revive it? Finally, I should add that Zarr's regular chunk-size limitation does have implications for implementing some array API operations, such as boolean indexing (see #73), so we'll need a solution eventually anyway.
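To make that concrete, here is a minimal, self-contained sketch (mine, not rechunker's actual algorithm, which also bounds memory and batches target chunks) showing that Zarr happily serves reads that straddle source chunk boundaries when copying at the target's chunk size:

```python
# Copy a source chunked at 1000 into a target chunked at 999, one target
# chunk at a time. Some of the reads below span two source chunks, which
# Zarr resolves internally - no rechunking to size 1 is involved.
import numpy as np
import zarr

source = zarr.array(np.arange(10_000), chunks=(1000,))
target = zarr.zeros(10_000, chunks=(999,), dtype=source.dtype)

for start in range(0, source.shape[0], 999):
    stop = min(start + 999, source.shape[0])
    # e.g. start=999, stop=1998 reads from both source chunk 0 and chunk 1
    target[start:stop] = source[start:stop]

assert (target[:] == source[:]).all()
```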
Ah, good point. In that case you should be OK.
Yes, would love to do this! That said, my attempts at formulating the multi-stage algorithm all needed irregular chunking.
Love to see this conversation happening. In my latest work on Pangeo Forge, I have implemented a new single-stage rechunking algorithm in Beam that does not rely on any intermediate storage. It takes advantage of Beam's ability to emit multiple elements from each task, together with a group-by. It is tested and seems to work, but I have not tried it at scale.
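A toy 1-D illustration of that split-and-regroup idea (not the actual Pangeo Forge implementation; the data model and chunk sizes here are made up): each source chunk emits the pieces that land in each target chunk, and a GroupByKey reassembles the target chunks with no intermediate storage.

```python
import apache_beam as beam
import numpy as np

SOURCE_CHUNK, TARGET_CHUNK, N_CHUNKS = 1000, 999, 9  # made-up sizes

def split_for_target(element, target_size):
    """Emit (target_index, (offset_within_target, piece)) for each overlap."""
    start, data = element                      # global start index, 1-D array
    stop = start + len(data)
    for t in range(start // target_size, (stop - 1) // target_size + 1):
        t_start, t_stop = t * target_size, (t + 1) * target_size
        lo, hi = max(start, t_start), min(stop, t_stop)
        yield t, (lo - t_start, data[lo - start:hi - start])

def assemble(element):
    """Stitch the grouped pieces of one target chunk back together."""
    t, pieces = element
    ordered = sorted(pieces)                   # sort by offset within the chunk
    return t, np.concatenate([piece for _, piece in ordered])

with beam.Pipeline() as p:
    source = p | beam.Create(
        [(i * SOURCE_CHUNK, np.arange(i * SOURCE_CHUNK, (i + 1) * SOURCE_CHUNK))
         for i in range(N_CHUNKS)]
    )
    rechunked = (
        source
        | beam.FlatMap(split_for_target, target_size=TARGET_CHUNK)
        | beam.GroupByKey()                    # the only shuffle; no Zarr intermediate
        | beam.Map(assemble)
    )
```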
This looks almost exactly like the […]. I tested it at moderately large scale (ERA5 surface variables, ~1.5 TB each) and it works well, though multi-stage rechunking makes it 3x faster/cheaper (see pangeo-data/rechunker#89 (comment) for details). At some point I'll probably test this on the ERA5 model-level variables, which are ~200 TB each.
Yes, I realize I am rewriting lots of xarray-beam within Pangeo Forge 😝. I am justifying this as: […]
The long-term idea is to align as much functionality as possible and then start to deprecate overlapping features. Tom, sorry to hijack your issue tracker to discuss other projects! 🙃
That's fine! This is all very interesting - I feel it should be possible to combine efforts somehow...
Thinking about this more, it should be possible for Cubed to delegate to Xarray-Beam for its two "primitive ops" (https://github.com/tomwhite/cubed#design): blockwise and rechunk. Cubed implements the whole of the array API using these two ops, so you can think of Cubed as focusing on that side of things, and Xarray-Beam as providing a massively scalable implementation of these primitives. Of course, Xarray-Beam doesn't provide a blockwise operation (yet), but I have started prototyping what it might look like in Cubed, and it seems promising. This works because a Cubed array is basically just a "chunked array", and so is Xarray-Beam's core data model. The idea here is to just use a single variable in Xarray-Beam to represent the array. I think that the Xarray interface for users would be layered on top of Cubed, but that needs more thought and can come later (see pydata/xarray#6807). BTW the memory management that flows from Cubed is very relevant too, since it provides bounds on each operation, which means that we know that the computation it passes to Beam will fit in worker memory. Thoughts?
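For a sense of what "a single variable in Xarray-Beam to represent the array" could mean, here is a rough sketch (the variable name "data" and the dimension names are placeholders, not anything Cubed or Xarray-Beam prescribes): each chunk of a plain array becomes one (xbeam.Key, xarray.Dataset) element.

```python
import numpy as np
import xarray
import xarray_beam as xbeam

def array_to_xbeam_chunks(arr, chunks):
    """Yield one (Key, Dataset) pair per chunk of a 2-D numpy array."""
    cy, cx = chunks
    for i in range(0, arr.shape[0], cy):
        for j in range(0, arr.shape[1], cx):
            key = xbeam.Key({"dim_0": i, "dim_1": j})   # chunk offsets
            block = arr[i:i + cy, j:j + cx]
            ds = xarray.Dataset({"data": (("dim_0", "dim_1"), block)})
            yield key, ds

# These (Key, Dataset) elements match Xarray-Beam's core data model.
elements = list(array_to_xbeam_chunks(np.arange(16.0).reshape(4, 4), (2, 2)))
```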
I've created a prototype that does this here: https://github.com/tomwhite/cubed/tree/xarray-beam. The basic idea for blockwise is to use Dask's […] and Beam's […]. This works inasmuch as it passes a large proportion of the unit tests. However, when run on Dataflow with 20GB inputs (each chunk being 200MB) I'm getting out of memory errors in the […].
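The prototype's code isn't reproduced here, but as a point of reference for the discussion below, a shuffle-based blockwise for two inputs might look roughly like this (a sketch, not the prototype): key each input chunk by its block index, bring matching chunks together with CoGroupByKey, then apply the per-chunk function.

```python
import apache_beam as beam
import numpy as np

def make_chunks(base):
    # Four 1-D chunks of five elements each, keyed by block index (made-up data).
    return [(i, base + np.arange(i * 5, (i + 1) * 5)) for i in range(4)]

def add_blocks(element):
    block_index, grouped = element
    (x,) = list(grouped["x"])          # exactly one chunk per key from each input
    (y,) = list(grouped["y"])
    return block_index, x + y          # the per-chunk function, here just addition

with beam.Pipeline() as p:
    xs = p | "x" >> beam.Create(make_chunks(0))
    ys = p | "y" >> beam.Create(make_chunks(100))
    result = (
        {"x": xs, "y": ys}
        | beam.CoGroupByKey()          # this grouping is the shuffle under discussion
        | beam.Map(add_blocks)
    )
```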
Wow, very cool! I do wonder whether the overhead of building separate PCollections for each array will turn out to be problematic (vs. putting all the arrays in an xarray.Dataset into a single PCollection). This could potentially make for relatively large Beam graphs for cases like Datasets of ~20 arrays (although still small relative to what Dask deals with). I also worry about the overhead of doing a shuffle due to calling […].
Do you know what sort of workers you're running on? The default memory limit on Cloud Dataflow may be pretty small. You might try using Dataflow Prime: https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime (I can also ask some of my colleagues who use Dataflow more regularly.)
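For reference, the usual knobs here are the worker machine type or the Dataflow Prime service option, set through standard Beam pipeline options. This is only a configuration sketch, not anything from the thread, and the project/region/bucket values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",                 # placeholder
    temp_location="gs://my-bucket/tmp",   # placeholder
    machine_type="n1-highmem-8",          # more memory per worker than the default
    # Alternatively, let Dataflow Prime right-size workers instead of picking a type:
    # dataflow_service_options=["enable_prime"],
)
```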
I hadn't thought of this. It should be possible to put everything into one PCollection and extend the key to include the array's name.
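A minimal sketch of that keying scheme (an illustration only; Xarray-Beam's Key also has a vars field that could play this role): tag each chunk with an (array name, offset) key so one PCollection can carry every array.

```python
import apache_beam as beam
import numpy as np

def chunk_elements(name, arr, chunk_size):
    """Yield ((array_name, offset), chunk) pairs for a 1-D array."""
    for i in range(0, len(arr), chunk_size):
        yield (name, i), arr[i:i + chunk_size]

with beam.Pipeline() as p:
    chunks = p | beam.Create(
        list(chunk_elements("a", np.arange(10), 5))
        + list(chunk_elements("b", np.ones(10), 5))
    )
    # Downstream transforms can route on key[0] (which array) while still
    # grouping or rechunking on key[1] (the offset within that array).
```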
I share your concerns. On the optimization point, Cubed does have a fuse optimization for maps; however, it's not wired in for the Xarray-Beam prototype I did here. In the prototype, for simplicity I opted to translate array API calls directly to Beam calls, so that the Beam DAG is constructed as the computation is built up. To take advantage of the Cubed optimizations, it would be possible to change this to build up a Cubed DAG first (just like the other executors), which is then optimized, then converted to a Beam DAG. (I will open issues for the optimizations you mention that have not been implemented yet.) On the broader point though, I do worry about the overhead of a shuffle for a blockwise operation with two or more inputs. (A single input can be efficiently implemented as a […].) In some sense, the approach of the other executors in Cubed, such as the Beam executor (as opposed to the Xarray-Beam prototype), is to provide a zip-like operation on Zarr chunks, so that approach may well be fine. It's the kind of thing that could be benchmarked.
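One reading of the "zip-like operation on Zarr chunks" idea, sketched with made-up paths and a trivial per-chunk function (this is not Cubed's executor code): each task receives only a block index, reads the matching chunk of every input Zarr array itself, and writes the result, so no shuffle is needed to align inputs.

```python
import apache_beam as beam
import numpy as np
import zarr

CHUNK, N_CHUNKS = 1000, 4

def process_block(i, x_path, y_path, out_path):
    """Read block i from each input, apply the chunk function, write block i."""
    x = zarr.open(x_path, mode="r")
    y = zarr.open(y_path, mode="r")
    out = zarr.open(out_path, mode="r+")
    sl = slice(i * CHUNK, (i + 1) * CHUNK)
    out[sl] = x[sl] + y[sl]              # the per-chunk function, here addition
    return i

# Set up example inputs and an empty output (paths are placeholders).
zarr.save("x.zarr", np.arange(CHUNK * N_CHUNKS))
zarr.save("y.zarr", np.ones(CHUNK * N_CHUNKS, dtype=np.int64))
zarr.open("out.zarr", mode="w", shape=(CHUNK * N_CHUNKS,), chunks=(CHUNK,), dtype="i8")

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(range(N_CHUNKS))   # one element per block index - no shuffle
        | beam.Map(process_block, "x.zarr", "y.zarr", "out.zarr")
    )
```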
Thanks. I am using the default worker, which has 3.75GB of memory. With memory issues it's quite hard to see how much memory each operation is using. I've had some success with Fil and […].
I would be curious how the Cubed approach compares in performance to my Xarray-Beam library, beyond the superficial differences (NumPy vs Xarray data): https://github.com/google/xarray-beam
One issue that comes to mind with storing all data in Zarr is the regular chunk-size limitation. For example, can you efficiently rechunk between arrays with relatively prime chunk sizes (e.g., from 1000 to 999)? I think doing this efficiently requires irregular chunk sizes, or you end up rechunking everything to size 1, which can be super slow.
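A quick way to see where the "size 1" worry comes from: if only regular (uniform) chunks are allowed, any intermediate chunk grid that aligns with both the source and the target must use a chunk size dividing both, and for relatively prime sizes that is 1.

```python
import math

# The only regular chunk size aligned with both a 1000-chunked source and a
# 999-chunked target is a common divisor of the two, i.e. gcd(1000, 999) = 1.
print(math.gcd(1000, 999))  # -> 1
```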