Chunking schema #135

Closed
dnadeau4 opened this issue Feb 23, 2017 · 6 comments

Comments

@dnadeau4
Collaborator

@taylor13 @durack1 @doutriaux1

I have been working on the "chunking" attributes and pushed a new "chunking" schema in cmor-3.2.2.

It is very difficult to choose one schema that will satisfy everybody, as you can see in this document:
chunking_data_why_it_matters

I decided to chunk following the Spatial Access cross section (see the link above), since we usually access one level of global data over time. So the chunking will be (1, 1, Ysize, Xsize) for (time, level, lat, lon).

@ehogan Let me know how this impacts your file size, and if this is acceptable.

@durack1
Contributor

durack1 commented Feb 23, 2017

@dnadeau4 I personally would be accessing the time axis first, since for most applications you want the temporal history rather than a single time slice.

@dnadeau4
Collaborator Author

Well, if I chunk the data using the time series access schema, I create core samples of data for each lat/lon. This is not what most people want when they run a model.

Chunking with (TimeSize, 1, 1, 1) will take a huge amount of disk space, and accessing (1, 1, Ysize, Xsize) afterwards will be very slow.
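To make the trade-off between the two layouts concrete, here is a minimal sketch that counts how many chunks each kind of read has to touch. The variable shape and the helper name `chunks_touched` are assumptions for illustration only; nothing here comes from CMOR or netCDF4, and it says nothing about disk usage.

```python
from math import ceil

def chunks_touched(request, chunks):
    """Count chunks intersected by a read starting at index 0 on every axis.

    request: number of elements read along each axis
    chunks:  chunk length along each axis
    """
    count = 1
    for n_read, n_chunk in zip(request, chunks):
        count *= ceil(n_read / n_chunk)
    return count

# Hypothetical variable shape (time, level, lat, lon); sizes are illustrative.
nt, nz, ny, nx = 1200, 17, 180, 360

spatial_chunks = (1, 1, ny, nx)  # layout proposed in this issue
series_chunks = (nt, 1, 1, 1)    # time-series ("core sample") layout

one_map = (1, 1, ny, nx)         # one level, one time step, whole globe
one_series = (nt, 1, 1, 1)       # full history at a single grid point

for name, chunks in [("spatial", spatial_chunks), ("time series", series_chunks)]:
    print(name,
          "map read:", chunks_touched(one_map, chunks),
          "series read:", chunks_touched(one_series, chunks))
```

With these assumed sizes, a global map is a single chunk under the spatial layout but 64800 chunks under the time-series layout, while a single-point time series is one chunk under the time-series layout and 1200 full-size chunks under the spatial layout. Each layout is cheap for one access pattern and expensive for the other.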

@durack1
Contributor

durack1 commented Feb 23, 2017

@dnadeau4 I really wonder whether this is a good idea; the performance hit for making assumptions about how users are going to access the data is huge. The defaults look pretty good to me.

| Storage layout, chunk shapes | Read time series (sec) | Read spatial slice (sec) | Performance bias (slowest/fastest) |
|---|---|---|---|
| Contiguous favoring time range | 0.013 | 180 | 14000 |
| Contiguous favoring spatial slice | 200 | 0.012 | 17000 |
| Default (all axes equal) chunks, 4673 x 12 x 16 | 1.4 | 34 | 24 |
| 36 KB chunks, 92 x 9 x 11 | 2.4 | 1.7 | 1.4 |
| 8 KB chunks, 46 x 6 x 8 | 1.4 | 1.1 | 1.2 |

@dnadeau4
Collaborator Author

dnadeau4 commented Feb 23, 2017

I could let netCDF4 choose the default chunk size and expose a variable attribute for flexibility.

  • example:

    • "tas": { "chunksize_dimension": [512,512,1,1] }

CMOR will let the user override this attribute as needed. The user will need to know that each value sets the chunk size along the corresponding entry of the variable's dimension attribute, i.e. "dimension": "longitude latitude plevs17 time".
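As a rough illustration of how the values line up with the dimension list, the sketch below pairs each entry of "chunksize_dimension" with the matching name in "dimension". It assumes both keys sit in the same table entry, and the helper `chunk_sizes_by_dimension` is hypothetical; this is not how CMOR itself reads the table.

```python
import json

# Hypothetical table entry modeled on the example above. Placing both keys
# in one JSON object is an assumption made for this sketch.
entry = json.loads("""
{
  "tas": {
    "dimension": "longitude latitude plevs17 time",
    "chunksize_dimension": [512, 512, 1, 1]
  }
}
""")

def chunk_sizes_by_dimension(var_entry):
    """Pair each chunk size with the dimension name at the same position."""
    names = var_entry["dimension"].split()
    sizes = var_entry["chunksize_dimension"]
    if len(names) != len(sizes):
        raise ValueError("need exactly one chunk size per dimension")
    return dict(zip(names, sizes))

mapping = chunk_sizes_by_dimension(entry["tas"])
print(mapping)
```

Here longitude and latitude get chunks of 512 elements each, while plevs17 and time get chunks of length 1, which corresponds to the one-level, one-time-step spatial layout.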

@dnadeau4
Collaborator Author

I should expose the "set cache size" as well...
HDF5 white paper

@dnadeau4
Collaborator Author

dnadeau4 commented Mar 3, 2017

I have let the user set the chunking; otherwise the default chunking is used.
