Corrupted files on Betzy #253
I have had corrupted files for two experiments on Betzy. However, in both cases it might also have been related to the compression/archiving script. I noticed that the files were corrupt once they were in the archive folder on Betzy.
From support @Sigma2: Hello all, again, many thanks for all the information on short notice, that was very helpful. Also very sorry for the inconvenience this has caused in the past ... it just seemed not so easy to reproduce. As the ops log entry states, if you experience data corruption, don't hesitate to contact support. It might be good to reference this case.
Hi, I am having a related issue with four simulations on Betzy, where several of the cam.h0 files are corrupted in the archive folder. They have file sizes somewhere between 2 MB and 290 MB, whereas the normal file size is roughly 350 MB. When I try to open these files with ncdump I just get: ncdump: NHIST_PeffASIA_x2_f19_20211221.cam.h0.1960-11.nc: NetCDF: HDF error

Some of the corrupted files are the _tmp files, but some are also the already-compressed files (where the _tmp file doesn't exist). In one case I have about 10 corrupt .h0. files (excluding the corrupt _tmp files) out of a 30-year simulation, for which I think I need to rerun the respective years to generate the output again. I'm running NorESM2-LM with the most recent model tag (2.0.5), and these are my first simulations on Betzy; the ones I did on Fram were completely fine.

Is it best to email Sigma2 about this, or is this an issue related to the archiving script? Is there a way to get the output from the corrupted files back somehow, or do I need to rerun those years? Any help would be greatly appreciated!
Hi Marianne, @mp586 @monsieuralok I have also experienced this over the last two weeks. I think you cannot retrieve the data from the reduced files; the only option is to rerun parts of the experiments. It is the compression in the archiving script that causes the problem. It can often be avoided by setting nthreads=2 (instead of 4) in cime/scripts/Tools/noresm2netcdf4.sh. However, even with nthreads=2, I have still had corrupted files recently. I hope this helps a bit. Best regards,
Hi all, since late last year I've turned off the archive compression flag in env_run for all my experiments due to unpredictable data corruption. Sorry for not bringing this up with everyone, mostly due to the holiday season and the fact that it's a bit difficult to pinpoint the exact cause here (other than memory somehow not being available). This is related to the specific compression script and the compression commands used there, because my own Python-based conversion script still works.
Hi all and thanks for reporting!
This is one of those long-overdue things, but I have now created a repo, https://github.com/AleksiNummelin/BLOM_utils, that for now just includes a Python file and a shell script that can be used for netCDF4 conversion on Betzy (or one can just run the Python script on NIRD). I've mainly used the compression on atmosphere, ocean, and ice output, but I think it should work on land and runoff output files as well (and it's not a big loss if it doesn't). I will guarantee that this method will never corrupt files, but it might crash due to memory issues, so depending a bit on the setup and the load on Betzy, one might need to adjust the CPU vs. memory settings. Usually the most efficient usage is to submit one conversion job per component (atm, ice, ocean). There is also a slight chance that the CMORization will not work; we had some issues with @YanchunHe last year regarding the 0.25 deg ocean output, but it was never clear what the problem really was. I think it might be related to the fact that after the conversion the time variable is not 'unlimited'. @mp586 my suggestion would be to set the COMPRESS_ARCHIVE_FILES flag to false in env_run.xml and then use this script for netCDF4 compression (if you are just testing things on Betzy and not moving data over to NIRD, you don't need the compression at all).
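For reference, turning the archive compression off amounts to an env_run.xml entry along these lines (a sketch only; the exact XML layout of the entry varies between CIME versions, so check your own env_run.xml or use the xmlchange tool rather than copying this verbatim):

```xml
<!-- env_run.xml fragment (assumed layout): disable the built-in
     archive compression so the archiving script never rewrites files -->
<entry id="COMPRESS_ARCHIVE_FILES" value="FALSE"/>
```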
Yes, I can confirm that the missing 'unlimited' time dimension can cause this. A quick fix is converting the time dimension from fixed to unlimited with NCO: ncks --mk_rec_dmn time input.nc output.nc
Hi all, thank you so much for sharing those helpful tips on how to avoid the file corruption! I will give it a go!
I mean that after you have compressed the file, e.g. with Aleksi's script, and find that the 'time' dimension is now a normal fixed dimension, you can convert it back to a real unlimited time dimension with the above ncks command, and then transfer to NIRD. But if you don't need an 'unlimited' time dimension, you don't need to convert.
Just wanted to report here that I had the same problem on Betzy with my runs; really quite annoying. The same compression setup worked a few months ago and now, completely unchanged, corrupts some random files. I will also try @AleksiNummelin's approach.
Hi, thank you very much for all the help so far! I ran @AleksiNummelin's script and it compressed the data, but then I stumbled into a similar issue with the concatenation as @YanchunHe described, only for the cam data. So I ran the ncks command, but I still get this error:
Does anyone happen to be familiar with this issue and know how to fix it? Thank you once more for your help!
Hi, there are a lot of issues with corrupted files on Betzy at the moment. We are working on it, but right now it is probably best to set COMPRESS_ARCHIVE_FILES = FALSE in env_run.xml and use Aleksi's script. @mp586 did you try to run the command on each single file, or on all at once? I think you need to loop through and set time to UNLIMITED on each file separately...
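The per-file loop could look something like this (a sketch; the glob pattern and output naming are hypothetical, and it only runs when NCO is actually installed):

```python
import glob
import shutil
import subprocess

def mk_rec_dmn_cmd(in_path, out_path):
    # Build the NCO command that rewrites the fixed 'time' dimension
    # as a record (UNLIMITED) dimension, one file at a time.
    return ["ncks", "--mk_rec_dmn", "time", in_path, out_path]

# Only attempt the conversion when ncks is on PATH; the file
# pattern below is an example, not the thread's actual layout.
if shutil.which("ncks"):
    for f in sorted(glob.glob("*.cam.h0.*.nc")):
        subprocess.run(mk_rec_dmn_cmd(f, f + ".unlim.nc"), check=True)
```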
Hi, @mp586 can you confirm: (1) do all your files have an 'unlimited' time axis? (2) can you check whether your extraction works with existing data that we know is fine (e.g. see the paths to the piControl experiments at https://noresmhub.github.io/noresm-exp/noresm2_deck/noresm2_mm_piC.html) but not with your data, even when the time axis is 'unlimited'?
Also @mp586, as a quick fix, I can provide some Python code to do the concatenation/extraction that will work even if NCL fails...
Hi again, I've now pushed a modification to the compression script at https://github.com/AleksiNummelin/BLOM_utils that sets time to be unlimited (not sure why I didn't do this before; my guess is that it was not an option in xarray at the time). I also added a Python script (and a shell script to submit it on Betzy) to fix files converted with the old version - basically just reading in the file and setting the time dimension to be unlimited.
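For those without access to the script, the xarray side of such a conversion boils down to writing NETCDF4 with per-variable compression and 'time' declared as a record dimension. A minimal sketch (this is not the actual netcdf3to4.py; file names and the compression level are assumptions):

```python
def netcdf4_encoding(ds, complevel=4):
    # Per-variable zlib compression settings for xarray's to_netcdf
    # (the real script may choose chunking/levels differently).
    return {v: {"zlib": True, "complevel": complevel} for v in ds.data_vars}

# Hypothetical usage, assuming xarray is installed:
# import xarray as xr
# ds = xr.open_dataset("input.nc")
# ds.to_netcdf("output.nc", format="NETCDF4",
#              unlimited_dims=["time"],   # keep 'time' a record dimension
#              encoding=netcdf4_encoding(ds))
```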
Great! Thanks @AleksiNummelin
Hi all, thanks again for the quick responses and the new script! I will try that now. Thank you again for your help :)
I see, @mp586, then it sounds like something else is going on (and the new script fixing the time dimension probably doesn't help). Can you post the command you are using here again (or send it by email)?
@AleksiNummelin the problem seems to be with using xr.open_mfdataset('*.nc'); weirdly, it is not an issue with all files, but only some! I am now wondering whether it's actually not an issue with the compression script you sent, but whether those files might have been corrupted beforehand. I only reran the single years of my simulations where I could not open the files, but then I ran the compression script over all years. The ones that I reran seem to be fine as far as I can tell, so maybe the best thing to do is just to rerun the entire simulation and then run your compression script. Sorry, this is a bit confusing.
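Since only some files fail, it can help to test each file individually before feeding them all to open_mfdataset. A small sketch (the helper name and the commented paths are hypothetical):

```python
def find_unreadable(paths, opener):
    # Try opening each file and collect the ones that fail. 'opener'
    # is any callable that raises on a corrupt file, e.g. xr.open_dataset.
    bad = []
    for p in paths:
        try:
            ds = opener(p)
            getattr(ds, "close", lambda: None)()  # close if it has a close()
        except Exception:
            bad.append(p)
    return bad

# Hypothetical usage, assuming xarray is installed:
# import glob, xarray as xr
# bad = find_unreadable(sorted(glob.glob("archive/atm/hist/*.nc")),
#                       xr.open_dataset)
# print(bad)  # candidates for rerunning those years
```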
Hmm, okay, that sounds a bit peculiar. It is possible that although the files look fine, they are still missing something. Could you share the data location here and point to one file that you think works and one that doesn't? I can try to have a quick look and see if I spot something obvious.
Hi all, there is one more issue, which is that the diagnostics currently fail with the converted files, and @YanchunHe pointed out that it is because of the use of NaNs as a fill value (NCO can't handle NaNs). I will try to have a look at this soon and commit an updated conversion script.
I have now updated netcdf3to4.py to take care of the FillValues as well. Essentially, if the original data has a FillValue, the converted file will inherit it; otherwise it will be kept without a FillValue (the previous behavior was to always add NaN as the FillValue). I also added a fix_FillValue.py script to fix both the FillValue and the unlimited time dimension if one ran the previous version of netcdf3to4.py. This works, but it doesn't do exactly the same as netcdf3to4.py; rather, it adds one FillValue depending on the component (sometimes it depends a bit on the variable in the original data, but that doesn't really matter).
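The described inherit-or-omit behavior can be sketched as a small encoding helper (this is not the actual netcdf3to4.py; note also that xarray may keep _FillValue in a variable's encoding rather than its attrs, so where you read it from is an assumption here):

```python
def fill_value_encoding(attrs):
    # Inherit _FillValue when the source variable defines one;
    # otherwise write no _FillValue at all, instead of defaulting
    # to NaN (which NCO cannot handle downstream).
    if "_FillValue" in attrs:
        return {"_FillValue": attrs["_FillValue"]}
    return {"_FillValue": None}  # xarray's way of saying "no fill value"

# Hypothetical usage:
# encoding = {name: {**fill_value_encoding(ds[name].attrs), "zlib": True}
#             for name in ds.data_vars}
# ds.to_netcdf("out.nc", encoding=encoding)
```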
Hi all,
@tto061 @oyvindseland @DirkOlivie @AleksiNummelin @monsieuralok @jgriesfeller @j34ni +++
we are having a discussion with Sigma2 support about corrupted files on Betzy. I recall that some of you also encountered problems with corrupted files when copying from Betzy to NIRD. Is that correct? If so, can you please comment on the questions from Sigma2 so we can gather all the information.
Thanks!
Best regards,
Ada