use HDF5 files for large arrays in spa and irradiance #236

Closed
mikofski opened this issue Aug 11, 2016 · 2 comments

Comments

@mikofski
Member

The very large arrays in the SPA and irradiance modules take a lot of space, which IMO makes those modules hard to navigate. See #235. Having the arrays in code also presents other issues, for example if those coefficients ever need to be changed or expanded, e.g. if new sets of Perez coefficients are released.

Some proposals:

  1. Move the data to the bottom of the module as constants. Python resolves module-level names inside a function or method body at call time, not at definition time, so as long as MYDATA is referenced at call time (rather than in the class body itself), defining it at the bottom of the module won't raise a NameError for an unresolved reference.
import numpy as np

class ClsUsingData(object):

    def __init__(self, *args):
        # MYDATA is looked up here at call time, after the whole
        # module has been executed, so this is safe
        self.mydata = MYDATA
        # do stuff with data

# other stuff

# all constants with very large arrays at the bottom of the module
MYDATA = np.array([
    # lots of data
])
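A tiny, self-contained sketch of the late-binding behavior this proposal relies on (the names are illustrative, not from pvlib): the function is defined before the constant, but since the lookup happens when the function is called, everything works once the module has finished executing.

```python
import numpy as np

def mean_of_data():
    # MYDATA is looked up when this function is called, not when it
    # is defined, so the constant can live at the bottom of the module
    return MYDATA.mean()

# large array constant defined at the bottom of the module
MYDATA = np.array([1.0, 2.0, 3.0])

print(mean_of_data())  # 2.0
```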
  2. Use HDF5 files via h5py. These files are highly optimized for speed, and h5py datasets act much like NumPy arrays. It's okay to keep the file open; it will be closed when Python exits. HDF5 loads data lazily, only when sliced (and in parallel if h5py is built against MPI/parallel HDF5), so memory usage is lower and access is more efficient. Alternately, copying all of the data out of the file into a NumPy array lets you close the file immediately, at the cost of loading everything up front.
import os

import h5py
import numpy as np

DIRNAME = os.path.dirname(__file__)
MYDATA = os.path.join(DIRNAME, 'mydata.h5')

class ClsUsingData(object):
    mydata = h5py.File(MYDATA, 'r')  # leave it open

    def __init__(self, *args):
        # do stuff with data
        pass

# alternately copy the data to a numpy array and close the file:
# h5_data_path = '/group/dataset'
# with h5py.File(MYDATA, 'r') as f:
#     mydata = np.array(f[h5_data_path])
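For completeness, a hedged sketch of how a file like mydata.h5 could be generated in the first place; COEFFS and the '/group/dataset' path are made-up placeholders standing in for the real SPA/Perez arrays, and the file is written to a temporary directory:

```python
import os
import tempfile

import h5py
import numpy as np

# hypothetical coefficient table standing in for the real module arrays
COEFFS = np.arange(48.0).reshape(8, 6)

path = os.path.join(tempfile.mkdtemp(), 'mydata.h5')

# write the array out once, with compression
with h5py.File(path, 'w') as f:
    f.create_dataset('/group/dataset', data=COEFFS, compression='gzip')

# read it back the way the module would
with h5py.File(path, 'r') as f:
    coeffs = np.array(f['/group/dataset'])

print(coeffs.shape)  # (8, 6)
```

This keeps the data out of the source file entirely, at the cost of a runtime dependency on h5py and a data file that has to ship with the package.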
@wholmgren
Member

My vote is for moving the spa data to the end of the file. The h5 file sounds like overkill and may cause problems for the numba solarposition code or anything else that people do to multithread/process things. The coefficients are closely related to the code, so I don't see a problem having them in the module so long as they're not too long. You and I may differ on our definition of too long, though.

@KonstantinTr
Contributor

I used to store data in h5 files, and every time something went wrong while writing changes, the file became unreadable. The more data you store in the file, the more space you need for a backup copy before opening it.
