File output size and chunking #743
Comments
Hi @TonyB9000, if you have a look at the CMOR API there is a call, https://cmor.llnl.gov/mydoc_cmor3_api/#cmor_set_deflate, that allows you to set the deflation and bit-shuffle* options for a variable before you start writing data. Note that deflation comes at a cost for users of the data; deflation above level 2 is not recommended, and I have a vague recollection that compression of coordinate variables can cause issues for users too. I thought that this recommendation was documented somewhere, but I can only find passing references to it in a few places online. We must do better for CMIP7.

*Bit-shuffle improves the deflation performance.
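For concreteness, here is a minimal sketch of how such a call might look with the CMOR Python bindings; the tables path, metadata JSON, grid, and data values are placeholders rather than anything from this thread.

```python
# Minimal sketch (CMOR Python bindings); tables path, metadata JSON, grid and
# data values below are placeholders, not this thread's actual setup.
import numpy as np
import cmor

cmor.setup(inpath="Tables", netcdf_file_action=cmor.CMOR_REPLACE)
cmor.dataset_json("cmor_input.json")        # hypothetical dataset/experiment metadata
cmor.load_table("CMIP6_Omon.json")

lat = cmor.axis("latitude", units="degrees_north",
                coord_vals=np.arange(-89.5, 90.0, 1.0),
                cell_bounds=np.arange(-90.0, 91.0, 1.0))
lon = cmor.axis("longitude", units="degrees_east",
                coord_vals=np.arange(0.5, 360.0, 1.0),
                cell_bounds=np.arange(0.0, 361.0, 1.0))
time = cmor.axis("time", units="days since 1850-01-01")

var_id = cmor.variable("tos", units="degC", axis_ids=[time, lat, lon])

# cmor_set_deflate(var_id, shuffle, deflate, deflate_level):
# enable shuffle and level-1 deflate, before any data is written.
cmor.set_deflate(var_id, 1, 1, 1)

data = np.full((1, 180, 360), 18.2, dtype="f4")
cmor.write(var_id, data, time_vals=[15.5], time_bnds=[[0.0, 31.0]])
cmor.close()
```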
Thanks for the info! I see what you mean about "documented": the API mentions only the values you can give, and nothing about the effects they might have. Perhaps I'll experiment with the time-space tradeoffs. Perhaps Charlie Zender will have some insight. I'll also poke the interwebs. Cheers!
@TonyB9000, there are plans afoot to support more of the netCDF API for reducing precision within CMOR3 as part of a 3.9.x release. If used appropriately this will help with storage, but there are risks of losing science value in data if too much precision is removed (consider residuals in energy/water budget calculations, for example). Please close this issue if you are happy that this discussion has covered your query.
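Since the CMOR interface for this isn't shown in the thread, here is an illustration of the underlying netCDF quantization feature using plain netCDF4-python; it assumes a recent netCDF4/netcdf-c build, and the variable name and digit count are arbitrary.

```python
# Illustration of netCDF lossy quantization via netCDF4-python (assumes
# netCDF4 >= 1.6 built against netcdf-c >= 4.9); not the CMOR interface.
import numpy as np
from netCDF4 import Dataset

with Dataset("quantized_demo.nc", "w") as ds:
    ds.createDimension("x", 100000)
    # Keep ~4 significant digits; the zeroed mantissa bits then deflate very
    # well, but anything beyond that precision is irreversibly lost.
    v = ds.createVariable("thetao", "f4", ("x",),
                          zlib=True, complevel=1, shuffle=True,
                          significant_digits=4)
    v[:] = 273.15 + 10.0 * np.random.random(100000)
```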
Knowing I can access the CMOR deflate and shuffle API through the e3sm_to_cmip code is sufficient! Closing issue, and thanks!
@matthew-mizielinski thanks for raising this. FYI, we have a request for coordinate compression in #674, so if there are problems we need to consider, it would be great to bring them up so we don't create two problems by solving one.
This is really hard to document effectively. If you have land-only data, and have your mask assigned correctly, you should get extremely good deflation stats, as ~70% of your grid is missing. The same applies to sea ice (siconc) and other variables where a huge percentage of the data is missing. To document this correctly you'd need to capture 1) mask differences, and 2) data with very large and very small ranges (e.g. ocean salinity is mostly between 30.0 and 40.0 PSS-78, whereas some other variables span orders of magnitude more values), amongst numerous other data specifics. If I remember correctly, the shift in units from CMIP5 […]

There are also interweb resources that give you some tidbits to consider, e.g. the Unidata per-variable compression example, Unidata's generic compression advice, @czender's E3SM NCO lossy compression post, and DKRZ guidance with some xarray examples of lossy and lossless compression options.
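A quick, hedged illustration of the mask effect described above, using plain netCDF4-python: at the same deflate level, a field that is ~70% fill values compresses far better than a full-range field. File names and the synthetic data are placeholders, not results from this thread.

```python
# Compare compressed sizes of a full-range field vs. a ~70% "land"-filled field.
import os
import numpy as np
from netCDF4 import Dataset

def write_and_size(name, data, fill=1.0e20):
    with Dataset(name, "w") as ds:
        ds.createDimension("y", data.shape[0])
        ds.createDimension("x", data.shape[1])
        v = ds.createVariable("var", "f4", ("y", "x"),
                              zlib=True, complevel=1, shuffle=True,
                              fill_value=fill)
        v[:] = data
    return os.path.getsize(name)

rng = np.random.default_rng(0)
full = rng.normal(300.0, 20.0, size=(1000, 1000)).astype("f4")  # full-range field
masked = full.copy()
masked[:700, :] = 1.0e20                                        # ~70% filled with _FillValue

print("full-range field :", write_and_size("full.nc", full), "bytes")
print("70% missing field:", write_and_size("masked.nc", masked), "bytes")
```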
If you're talking about lossless compression, you'll get better compression in K (rather than °C) because one of the leading significant figures is always either "2" (as in 290 K) or "3" (as in 302 K). In effect, the precision of your number in K is lower than the precision in °C.
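A toy check of this argument, purely illustrative: the same temperatures, byte-shuffled and losslessly compressed, once in K and once in °C. Results will vary with the data; the shuffle mimic and synthetic field are assumptions, not anything tested in this thread.

```python
# Compare shuffle + zlib sizes for the same temperatures in K and in degC.
import zlib
import numpy as np

def shuffled_bytes(arr):
    # Mimic the netCDF shuffle filter: group same-position bytes together.
    return arr.view(np.uint8).reshape(-1, arr.itemsize).T.copy().tobytes()

rng = np.random.default_rng(0)
temps_k = rng.normal(288.0, 15.0, size=1_000_000).astype("f4")  # Kelvin
temps_c = (temps_k - 273.15).astype("f4")                        # same field in degC

for label, arr in (("K   ", temps_k), ("degC", temps_c)):
    size = len(zlib.compress(shuffled_bytes(arr), 1))
    print(label, size, "bytes after shuffle + zlib level 1")
```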
OK, so my memory was the inverse of reality... which, as I was writing it, I had wondered about; unless of course some weird mask quirk did actually lead to my inferred results. I could of course test this out for myself, but might defer that adventure to another day.
Addendum to my earlier comment: the default CMOR output deflate level is 1, and it is left unchanged by set_deflate(varid, True, True, 1). The first True applies shuffle, which gave the 14% file-size reduction, but early tests indicate that this incurs a 50+% performance hit. I intend to make this an E2C option, but not the default.
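For anyone wanting to reproduce that time/size tradeoff outside of CMOR/E2C, here is a rough sketch using plain netCDF4-python; the array shape, values, and file names are placeholders.

```python
# Measure write time and file size with shuffle off vs. on at deflate level 1.
import os
import time
import numpy as np
from netCDF4 import Dataset

def timed_write(name, shuffle):
    data = np.random.default_rng(0).normal(18.0, 5.0, size=(60, 360, 720)).astype("f4")
    t0 = time.perf_counter()
    with Dataset(name, "w") as ds:
        for dim, n in zip(("time", "lat", "lon"), data.shape):
            ds.createDimension(dim, n)
        v = ds.createVariable("tos", "f4", ("time", "lat", "lon"),
                              zlib=True, complevel=1, shuffle=shuffle)
        v[:] = data
    return time.perf_counter() - t0, os.path.getsize(name)

for flag in (False, True):
    secs, size = timed_write(f"tos_shuffle_{flag}.nc", flag)
    print(f"shuffle={flag}: {secs:.1f} s, {size} bytes")
```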
It appears that CMORized output can be made 14% smaller while retaining bit-for-bit (BFB) identical data values. If this is true, then (modulo performance issues) CMOR output should accommodate this by default.
BACKGROUND:
I accidentally created several CMIP6 datasets for Omon variables, where 150 years of data were written to a single file. Example:
The size (12.5 GB on disk) was considered excessive, so I sought the advice of NCO developer Charlie Zender to learn whether there was a means of breaking this output file into smaller "20-year" segments. This was accomplished with multiple calls to ncrcat:
ncrcat -O -d time,<start_month_offset>,<end_month_offset> <inputfile> <outputname>
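A hypothetical helper sketching how the per-segment month offsets and ncrcat calls might be generated for monthly data; the paths, output naming, and the 12-records-per-year assumption are all placeholders.

```python
# Build and run one ncrcat call per 20-year segment of a monthly time series.
import subprocess

def split_by_years(infile, prefix, n_years=150, seg_years=20, months_per_year=12):
    for seg_start in range(0, n_years, seg_years):
        seg_end = min(seg_start + seg_years, n_years)
        first = seg_start * months_per_year          # first monthly record (0-based)
        last = seg_end * months_per_year - 1         # last record; NCO indices are inclusive
        outfile = f"{prefix}_{seg_start:04d}-{seg_end - 1:04d}.nc"
        subprocess.run(["ncrcat", "-O", "-d", f"time,{first},{last}", infile, outfile],
                       check=True)

# split_by_years("tos_Omon_150yr.nc", "tos_Omon_segment")   # hypothetical usage
```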
By this, I obtained:
However, the sum of these sizes is only 10,741,987,385 bytes (86% of original size).
Alarmed that something was amiss, I set about revamping my E2C control script to segment input by years, and cycle over calls to E2C accordingly. The result:
and their sum is back to 12,535,711,546 bytes, close to the original single-file size of 12.53 GB.
I was pleasantly surprised when Charlie tested both with ncdiff and concluded that the files were, data-wise, BFB identical. (But are they both "legitimate" CMIP6 files?)
Using "ncks -D 2 --hdn -M -m" to expose some "hidden" metadata, the difference is revealed.

Post-CMOR:
lat:_Storage = "contiguous" ; // char
Post-ncrcat “splitter”:
I don't honestly know the technical differences between the storage formats, whether one allows greater compression, or whether there is a performance hit here. The post-generation "splitter" requires about 1 minute/GB (with 1 worker). But if it were possible to save ~15% in disk storage and network transfer size, this seems like something one would want to pursue, all else being equal. I just wonder whether it is possible to use "chunked" storage natively in CMOR output, rather than post-processing the files.
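For completeness, a netCDF4-python equivalent of the ncks "--hdn" check above, reporting each variable's storage layout and compression filters; the file names are placeholders for the pre- and post-split outputs.

```python
# Print storage layout (contiguous vs. chunked) and filters for every variable.
from netCDF4 import Dataset

for path in ("post_cmor.nc", "post_ncrcat.nc"):
    with Dataset(path) as ds:
        print(path)
        for name, var in ds.variables.items():
            layout = var.chunking()      # "contiguous" or a list of chunk sizes
            print(f"  {name}: storage={layout}, filters={var.filters()}")
```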