Replies: 2 comments 1 reply
-
Hi, I'd suggest opening this over at https://github.com/Unidata/netcdf4-python. This repo is the core C library; if the issue ends up being something in the core library, and we can demonstrate that with a C program (or a simple enough Python program that I can translate into C), I will be happy to help out. As it stands, I'm afraid my unfamiliarity with Python is a roadblock to providing any help.
-
Actually, it probably makes more sense to ask over at xarray (https://github.com/pydata/xarray), since you are using xarray and not netcdf4-python directly.
-
My original file (greece_dataset.nc) was ~21.9 GB. I have limited computing power, so I decided to split the file into geographical regions (climate classifications). In split_data.py, I read the data in with xarray, drop static variables, drop values over the ocean, add a region variable, and export subfiles by region using .to_netcdf. Writing takes a very long time and the resulting files are much bigger, up to 300 GB. I then process each subfile (process_class.py), creating two new data variables (spei, heatwave).
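In outline, the split step does something like the sketch below; the variable names, dimension names, chunk sizes, and region IDs are placeholders rather than the ones in my actual script.

```python
# Rough outline of split_data.py; all names and sizes below are placeholders.
import xarray as xr

# Open lazily with dask chunks so the ~22 GB file is never loaded in one piece
ds = xr.open_dataset("greece_dataset.nc", chunks={"time": 365})

# Drop values over the ocean (assumes a land/sea mask variable, here "lsm")
ds = ds.where(ds["lsm"] > 0)

# Drop static variables that are not needed downstream (placeholder names)
ds = ds.drop_vars(["lsm", "orography"], errors="ignore")

# Add a region variable from the climate classification (placeholder 2-D field)
ds = ds.assign(region=ds["climate_class"])

# Export one subfile per region
for region_id in (1, 2, 3):  # placeholder region IDs
    subset = ds.where(ds["region"] == region_id, drop=True)
    subset.to_netcdf(f"region_{region_id}.nc")
```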
When executing both Python scripts, I run into memory overload. I submit them via bash scripts to run on a compute node of a supercomputer with 48 CPUs and ~180 GB of memory. I've implemented chunking, among other things, but am still dealing with inflating file sizes and OOM errors. I've tried logging memory usage and deleting unneeded objects as I go, but I suspect the problem is inefficient chunking or how I'm exporting the files.
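To make that concrete, the chunked open/write pattern I've been experimenting with has roughly the shape below; the dimension names, chunk sizes, and the compression encoding are placeholders, not my exact settings.

```python
# Chunked read and write, in outline; names, chunk sizes, and encoding are placeholders.
import xarray as xr

ds = xr.open_dataset(
    "greece_dataset.nc",
    chunks={"time": 365, "latitude": 50, "longitude": 50},  # placeholder chunk sizes
)

# Per-variable compression settings for the output file
encoding = {var: {"zlib": True, "complevel": 4} for var in ds.data_vars}

# With dask-backed data, to_netcdf writes chunk by chunk instead of
# materialising the whole dataset in memory first
ds.to_netcdf("subset.nc", encoding=encoding)
```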
Code for split_data.py is below.
Description of one of my subfiles: