Generating descriptive statistics composited over ENSO phase for very large ocean reanalysis data using xarray, dask, and flox. #17
Keeping in mind these helpful suggestions from @dcherian here: #16 (comment):

> Some general comments (applicable to latest flox/xarray):
> - What chunking [...]
> - Let's assume that the benefit of working from a re-loaded [...]. Given that the native [...]
> - The number of chunks, number of tasks, and size of chunks for this open with [...]. Next step: write "analysis ready data" (ARD) [...]
> - (1) I would just call [...]
Thanks heaps for all the advice above, @dcherian. Re: (4) We're blessed with a fair bit of /scratch space, and my current thinking for all these kinds of time-reduction operations is that it's always best to first convert the many [`netcdf` files into a single rechunked `zarr` collection]. It sounds like my next step is figuring out the best possible way to get that very large [`zarr` collection written].
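A minimal sketch of that conversion step, under this thread's assumptions (the path pattern, store name, and chunk sizes below are illustrative, not from the original discussion; for an 11TB rechunk a dedicated tool like `rechunker` may be needed to bound memory):

```python
import xarray as xr

# Lazily combine the ~9000 monthly netcdf files into one dataset.
ds = xr.open_mfdataset(
    "/path/to/BRAN2020/ocean_temp_*.nc",  # hypothetical path pattern
    combine="by_coords",
    parallel=True,
)

# Rechunk so all 11138 time steps share one chunk per spatial block
# (~4GB per chunk in float32 at these sizes; tune to your memory),
# then write the "analysis ready data" (ARD) zarr store to /scratch.
ard = ds.chunk({"Time": -1, "st_ocean": 1, "yt_ocean": 300, "xt_ocean": 300})
ard.to_zarr("/scratch/BRAN2020_temp_alltime.zarr", mode="w")
```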
Sorry, I meant [...]
Well yes, but then you have a different problem. But yeah, if you get this [rechunked collection written], then even medians and quantiles will work, and will work quite well.
Thanks again @dcherian - have been using [...]. So the approach sounds like: [...]

Lastly, by [...] I assume you are saying that a [rechunked array won't work well for this]?
Not at all. It will work very well indeed. Just saying that the need for a rechunked array isn't in your "Objective" listed above.
All-time base stats completed for the 11TB 3D @ daily variables. Cluster: MegaMem, ncpus=48, mem=2990GB. ("Base" stats doesn't include [`median` or `quantile`].)
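For reference, a minimal sketch of what such a "base" reduction could look like with flox-backed xarray groupby (the store path and the exact set of statistics are assumptions, not from the original comment):

```python
import xarray as xr  # flox accelerates these groupby reductions when installed

# Open the all-time-chunked ARD store (hypothetical name from the sketch above).
ds = xr.open_zarr("/scratch/BRAN2020_temp_alltime.zarr")

# Build the groupby once and reuse it for each reduction.
gb = ds["temp"].groupby("Time.month")

# "Base" statistics: reductions that don't require sorting the data.
monthly_base = xr.Dataset(
    {
        "mean": gb.mean("Time"),
        "std": gb.std("Time"),
        "min": gb.min("Time"),
        "max": gb.max("Time"),
    }
)
```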
With that chunking, quantile with flox will work quite well too. I recommend just adding [...]
Thanks for those recommendations! I have a 3D @ daily variable chunked "for all-time" in an ARD [`zarr` collection]. If I specify [...]
@dcherian - further, when I try to run (with [...]) I get this error: [...]

I print out the chunking right before I call the above: [...]
o_O nice find. But yes, you just need it installed and [...]
Another aside, it's nice to save the result of the groupby:

```python
gb = ds.groupby(time_dim + '.month')
quant = gb.quantile(...)
```

This will save some repeated effort in setting up the groupby problem.
Hmmm.. potentially a bug.
Oh right, this won't work. Quantiles work by sorting (really, partitioning the data) so you need all time points for each month in a single chunk. Since you have a time-rechunked dataset, use that as an input.
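A short sketch of that suggestion, combined with the groupby-reuse tip above (store path, variable name, and quantile levels are illustrative assumptions):

```python
import xarray as xr

# Open the time-rechunked store: every month's daily values sit in a
# single chunk along Time, which is what quantile's sort requires.
ds = xr.open_zarr("/scratch/BRAN2020_temp_alltime.zarr")

# Build the groupby once and reuse it for the sorting-based reductions.
gb = ds["temp"].groupby("Time.month")
monthly_median = gb.median("Time")
monthly_quantiles = gb.quantile([0.05, 0.5, 0.95], dim="Time")
```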
Nice tip, thank you. Always keen to shave off walltime.
Thanks for thinking this through. So, it looks like I need to fix the problems with my [...]. It has been really useful to learn through this discussion that [these groupby reductions can run even without the rechunked data]?
Yes, everything but quantiles. We could do an [...]. See https://github.com/xarray-contrib/flox/pull/284/files if you want to test that out.
What we learned here has been factored into new code.
Title: Generating descriptive statistics composited over ENSO phase for very large ocean reanalysis data using xarray, dask, and flox.
[The dataset is] `float32` data over nearly 9000 netcdf file assets in total. For this problem we'll focus on only the daily 3D `temp` variable. [The data lives on the] `Gadi` supercomputer (national science resources - not public), [accessed] using a custom beta catalog I wrote using `intake-esm` and `ecgtools` - https://github.com/Thomas-Moore-Creative/BRAN2020-intake-catalog.

The `temp` variable is stored in a `short` data type with native chunking like: `temp:_ChunkSizes = 1, 1, 300, 300`. Each `netcdf` file has one month of daily data, and so the length of `Time` varies from 28 to 31 for any one of the 366 files concatenated together. The compressed 5GB single `netcdf` file becomes 34GB when loaded as `float32`. [The full] `temp` variable xarray object is **Time**: 11138, **st_ocean**: 51, **yt_ocean**: 1500, **xt_ocean**: 3600 - 11.16TB in `float32`.
[The lazy-load chunks stay at the native size] unless `xarray_open_kwargs` are employed to increase that; however, I believe that `Time` chunks on lazy load can't be larger than the size of each individual `netcdf` file? With `xarray_open_kwargs` of `Time: 31` we'll get actual chunks like [31, 28, 31, 30, 31, 30, ...] and some years will be [..., 31, 29, 31, 30, ...].

Objective: generate descriptive statistics, including (`median` and `quantile`), as both monthly and daily climatologies from daily data organised as 366 x one-month-per-netcdf file.

Questions:

1. What `xarray_open_kwargs` do we use on the initial loading with `open_mfdataset`? (See the sketch after this list.)
2. What `ChunkSizes` [work best for] `groupby` operations via `flox`?
3. Would [writing an intermediate] `zarr` collection be useful? If so, what `Time` chunking to choose: all-time (`{'Time': -1}`) or a monthly frequency?
4. Which `flox` methods are best to choose?
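A sketch of the lazy load in question 1, under stated assumptions (the path pattern is hypothetical; the requested `Time: 31` chunk is capped at each file's month length, giving the ragged [31, 28, 31, 30, ...] chunks described above):

```python
import xarray as xr

ds = xr.open_mfdataset(
    "/path/to/BRAN2020/ocean_temp_*.nc",  # hypothetical path pattern
    combine="by_coords",
    parallel=True,
    # Requested chunking; per-file Time chunks can't exceed each
    # file's own length, so months yield 28-31 sized chunks.
    chunks={"Time": 31, "st_ocean": 51, "yt_ocean": 300, "xt_ocean": 300},
)
print(ds["temp"].chunksizes["Time"][:6])  # e.g. (31, 28, 31, 30, 31, 30)
```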