
New storage system with auto-magical variable handling #135

Merged: 45 commits into master, Aug 4, 2017
Conversation

@Lnaden (Contributor) commented Mar 7, 2017

This is the new storage system module, which handles hard-drive IO for arbitrary data sets with minimal user input. It currently handles NetCDF storage, but can be extended with new storage modules by subclassing the abstract class.

Handles the following types:

  • String
  • Int
  • Float
  • List (any size)
  • Tuple (any size)
  • Quantity (of any of the previous types)
  • Dict (entries can be any of the previous types with string keywords)
  • numpy.ndarray (and Quantity(ndarray))

The StorageInterface module allows auto-magical data IO: you define your directories and variables on the fly with commands such as

StorageInterfaceInstance.my_directory.my_variable.write(data)

where a folder called my_directory and a variable called my_variable are created for you, with data automatically encoded to the storage medium and saved. Accessing it is done with

StorageInterfaceInstance.my_directory.my_variable.read()

and the data at that same location is fetched and automatically decoded for you. No template or pre-determined structure is required.
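The mechanism behind this on-the-fly path building can be illustrated with a toy sketch. This is a hypothetical stand-in, not the actual openmmtools API: a `__getattr__` hook turns unknown attribute accesses into child nodes, and a plain dict stands in for the file on disk.

```python
class FakeNode:
    """Toy stand-in for a StorageInterface node (hypothetical, not the real API)."""

    def __init__(self, store, path=""):
        self._store = store  # shared dict standing in for the storage file
        self._path = path

    def __getattr__(self, name):
        # Any unknown attribute becomes a child node, extending the path lazily
        child_path = self._path + "/" + name if self._path else name
        return FakeNode(self._store, child_path)

    def write(self, data):
        self._store[self._path] = data

    def read(self):
        return self._store[self._path]


store = {}
si = FakeNode(store)
si.my_directory.my_variable.write(42)
print(si.my_directory.my_variable.read())  # 42
```

No directory or variable exists until it is touched; the real module additionally encodes the data for the storage backend.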

@andrrizzi Main review points are the changes to the math_eval function, the main __init__, and the new quantity_from_string function. Functionality of the storage code is unchanged from the YANK PR, unless there are other issues you see with it.


This is a migration of the code from choderalab/yank#617 to make this module stand separate from YANK as others may want to use it.

Tagging those that have expressed interest in this @bas-rustenburg @gregoryross @ChayaSt, I would appreciate any feedback you have on the code.

Lnaden added 16 commits January 10, 2017 15:47
Refactored the storage system into its own submodule.

Still need to refactor the NCTypeHandlers
Fixed issue where infinite dimension was shared

TODOS:
- Automate Chunk size selection
- Generalize compound type handler
- Ensure handler works for floats, ints as well (they are not mapped)
* updated setup.py to include the yank.storage module
* Reduced complexity of r/w/a checks in StorageInterface
* Finished new scalar variable insertion
* Added basic auto-chunk size selection
…torageIODriver

Simplified some code and redundant properties.
Finished migration from YANK to OpenMMTools for storage module
Added in the quantity from string utility as a special math_eval
Updated readme
Updated __init__ import statements
bumped the version
@Lnaden (Contributor, Author) commented Mar 7, 2017

One problem with this is that NetCDF4 on the main channel is only available against numpy 1.11, not 1.12 like the build on conda-forge. So this package now sometimes causes numpy downgrades.

@andrrizzi (Contributor) left a comment

Sweet! I dug deeper into the mechanisms and added a bunch of minor points. The only important comments are about the order of functions.update in math_eval and the class attributes that may get out of date. The rest is just feedback on code readability.

Feel free to merge as soon as you are satisfied.

Name of the file on the disk

"""
return self._file_name
@andrrizzi (Contributor):

You could think of returning self._storage_system.file_name here and ditch self._file_name to avoid keeping the same information in two places.

string at initialization.

"""
return self._storage_driver
@andrrizzi (Contributor):

You could think of returning self._storage_interface.storage_system here and ditch self._storage_driver to remove redundancy.

@Lnaden (Contributor, Author):

I had originally created that variable as a protected name so you could not assign it as one of the variables on disk. You are right, I can just derive this; all it did was expose a hidden object's property. I did a small refactor to remove it and just derive the object.

self._variable = None
# initially undetermined type
self._directory = None
self._variable = None
@andrrizzi (Contributor):

Duplicate line.

setup.py Outdated
@@ -14,7 +14,7 @@
DOCLINES = __doc__.split("\n")

########################
VERSION = "0.9.0"
VERSION = "0.9.1"
@andrrizzi (Contributor):

What policy do we have about versions? We haven't released 0.9.0 yet, and we don't store every single development version on conda (or do we?), so I think we could bump only after release. This way we won't have "missing" versions in the conda channel.

# Add self to the end
path.extend([self.name]) # Wrap in list or it iterates over the name chars
# Reduce to a path-like string
return '/'.join(path)
@andrrizzi (Contributor):

Why split and join? Could this work instead?

if self.predecessor is not None:
    return self.predecessor.path + ('/' + self.name)  # the parentheses just make it a little faster
else:
    return self.name
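The recursive form suggested above can be checked with a small self-contained sketch. The `Node` class here is hypothetical, standing in for the codec class that carries `name` and `predecessor`:

```python
class Node:
    """Hypothetical stand-in for a storage node with a parent chain."""

    def __init__(self, name, predecessor=None):
        self.name = name
        self.predecessor = predecessor

    @property
    def path(self):
        # Recurse up the predecessor chain instead of splitting and joining
        if self.predecessor is not None:
            return self.predecessor.path + ('/' + self.name)
        return self.name


root = Node('top')
leaf = Node('var', predecessor=Node('sub', predecessor=root))
print(leaf.path)  # top/sub/var
```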

# =============================================================================


class NCVariableTypeHandler(ABC):
@andrrizzi (Contributor):

Maybe call these NCVariableCodec to adopt a consistent nomenclature? Or maybe you could rename the function set_codec to set_type_handler?

"""
Pointer class which provides instructions on how to handle a given nc_variable
"""
def __init__(self, parent_handler, target, storage_object=None):
@andrrizzi (Contributor):

I may be confused here, but could you rename parent_handler to netcdf_driver? And is there a use case in which we want to have a different storage_object than netcdf_driver.ncfile?

@Lnaden (Contributor, Author):

Renamed them to parent_driver to remove confusion.

Yes! The storage object can either be the top level ncfile OR a group/subgroup within that file.

return data


def nc_float_decoder(nc_variable):
@andrrizzi (Contributor):

One way you could remove code redundancy here is to use something like:

# This works for everything but strings.
def dummy_decoder(data):
    return data

# This works for ints and floats.
def scalar_decoder(casting_type):
    def _scalar_decoder(nc_variable):
        data = nc_variable[:]
        if data.shape == (1,):
            data = casting_type(data[0])
        else:
            data = data.astype(casting_type)
        return data
    return _scalar_decoder
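The closure-factory idea can be sketched without netCDF4 at all. In this hypothetical version a plain list stands in for the data read back from an nc_variable, but the structure is the same: one factory produces a decoder per cast type.

```python
def scalar_decoder(casting_type):
    """Factory returning a decoder that casts stored data to casting_type.

    `raw` stands in for data read back from an nc_variable: a one-element
    sequence is unwrapped to a scalar, anything longer is cast element-wise.
    """
    def _decoder(raw):
        if len(raw) == 1:
            return casting_type(raw[0])
        return [casting_type(x) for x in raw]
    return _decoder


decode_float = scalar_decoder(float)
decode_int = scalar_decoder(int)
print(decode_float([3]))       # 3.0
print(decode_int([1.0, 2.0]))  # [1, 2]
```

Each codec then only has to pick its `casting_type` rather than duplicating the unwrap-and-cast logic.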

'sign': lambda x: np.sign(x)}
if functions is None:
functions = {}
functions.update({'step': lambda x: 1 * (x >= 0),
@andrrizzi (Contributor):

Did you want to do the opposite?

functions = {'step': ..., 'step_hm': ...}
if user_functions is not None:
    functions.update(user_functions)

or did you want the default functions to overwrite a custom step function?
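The ordering question matters because whichever dict is passed to `update` last wins. A minimal sketch of the reviewer's suggested order (defaults first, then user overrides), with a hypothetical `build_functions` helper:

```python
def build_functions(user_functions=None):
    """Defaults first, then user-supplied functions override them."""
    functions = {'step': lambda x: 1 * (x >= 0)}
    if user_functions is not None:
        functions.update(user_functions)
    return functions


default = build_functions()
custom = build_functions({'step': lambda x: 0})
print(default['step'](5))  # 1
print(custom['step'](5))   # 0  (user-supplied step wins)
```

Reversing the two `update` operands would silently discard any custom `step` the caller passed in.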

# Get the built-in units
_VALID_UNITS = {method: getattr(unit, method) for method in dir(unit) if type(getattr(unit, method)) is unit.Unit}
# Get the built in unit functions and make sure they are not just types
_VALID_UNIT_FUNCTIONS = {method: getattr(unit, method) for method in dir(unit)
@andrrizzi (Contributor):

If it can simplify your life, you could also make use of inspect.getmembers(unit, predicate=inspect.isfunction).
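The `inspect.getmembers` pattern works on any module; here is a small demonstration using the stdlib `math` module as a stand-in for `simtk.unit` (the predicate differs per module: `math`'s functions are C builtins, so `inspect.isbuiltin` is the matching predicate):

```python
import inspect
import math

# Collect callable members of a module via a predicate, the same idea as
# harvesting unit functions with inspect.getmembers(unit, ...)
builtins = dict(inspect.getmembers(math, inspect.isbuiltin))
print('sqrt' in builtins)      # True
print(builtins['sqrt'](9.0))   # 3.0
```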

@Lnaden (Contributor, Author):

Eh, this yields the same result and avoids another import. It's also not too hard to read, so I think I'll keep it.

Lnaden added 6 commits March 10, 2017 10:37
Now uses YAML parsing to store all known values as strings; greatly simplifies and extends the codec at the cost of some disk efficiency.
Dict now handles nested compound types, including more dicts.
Now supports appending dicts (although not really the best data type to store lots of, if you can help it)
Massive overhaul of the NetCDF Dict Codec.
@andrrizzi (Contributor) left a comment

The new dictionary codec looks good! Added three more comments, but feel free to merge when ready.

def quantity_constructor(loader, node):
loaded_mapping = loader.construct_mapping(node)
data_unit = quantity_from_string(loaded_mapping['NCUnit'])
data_value = loaded_mapping['NCValue']
@andrrizzi (Contributor):

Maybe we should call these two just unit and value? This way we'll be able to use the same serialization in other formats without tying it to netcdf.

@Lnaden (Contributor, Author):

Are you suggesting generalizing the dict YAML Loaders/Dumper for future drivers?

@andrrizzi (Contributor):

No no, this is fine for now, but if in the future we want to do this somewhere else, we'll need a dict-to-str serialization that doesn't have NC in the name, and we'd have to duplicate the code.

Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
This is the property to overwrite the cast dtype
"""
return None
@andrrizzi (Contributor):

If you just return self.dtype instead of None, you can remove the hybrid property/getter _get_on_disk_dtype and you would still be able to overwrite the behavior.

@Lnaden (Contributor, Author):

Can't do that. The self.dtype property is how the IODriver identifies what codec to use based purely on the input data: it processes type(data) and reads the codec from the internal database. The problem with the dict type is that it's a subset of NCScalar, which used to pass self.dtype to the NetCDF variable creation routine, but dict is not a type NetCDF can handle, so I had to come up with a way to define a mappable type in the codec without breaking the IODriver's auto-data-detection methods.

@andrrizzi (Contributor):

Sorry, I think I wasn't clear. I'm just suggesting to substitute this:

    @property
    def _get_on_disk_dtype(self):
        """Function to process None for _on_disk_dtype"""
        if self._on_disk_dtype is None:
            return_type = self.dtype
        else:
            return_type = self._on_disk_dtype
        return return_type

    @property
    def _on_disk_dtype(self):
        """
        Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
        This is the property to overwrite the cast dtype
        """
        return None

with a single

    @property
    def _on_disk_dtype(self):
        """
        Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
        This is the property to overwrite the cast dtype
        """
        return self.dtype

i.e. get rid of the hybrid _get_on_disk_dtype property. Unless there's something I'm missing.
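The simplification works because a subclass can override a property wholesale. A minimal sketch with hypothetical class names (`Base` standing in for NCScalar, `DictCodec` for the dict codec):

```python
class Base:
    """Stand-in for the scalar codec base class."""
    dtype = float  # the type the IODriver uses to pick a codec

    @property
    def _on_disk_dtype(self):
        # Default behavior: store using the codec's own dtype,
        # so no separate hybrid helper property is needed
        return self.dtype


class DictCodec(Base):
    @property
    def _on_disk_dtype(self):
        # Override the on-disk cast type without touching self.dtype
        return str


print(Base()._on_disk_dtype)       # <class 'float'>
print(DictCodec()._on_disk_dtype)  # <class 'str'>
```

`self.dtype` stays intact for codec lookup, while the on-disk cast type is overridden only where a subclass needs it.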

@Lnaden (Contributor, Author):

I see now; yes, that I can do.

'name': 'four',
'repeated': [4, 4, 4],
'temperature': 4 * unit.kelvin,
'box_vectors': (np.eye(3) * 4.0) * unit.nanometer
@andrrizzi (Contributor):

Can you add a Quantity-wrapped numpy array as a test case here, if it's not covered somewhere else, to make sure it works?

@Lnaden (Contributor, Author):

Line 207 already does it!

@andrrizzi (Contributor):

Doh! Didn't realize, sorry.

Lnaden added 4 commits March 17, 2017 15:59
Fix duplication, generalize dict yaml to not be NetCDF named
…ata.

Reduced redundant code by adding specialized simple abstract methods
* Read, write, append are no longer abstract methods
* All codecs now use proper _encoder and _decoder properties
@andrrizzi (Contributor) left a comment

Awesome! I love all the code simplification in the NCVariable classes.

if self._bound_target is None:
self._bind_read()
# Set the output mode by calling the variable
self._output_mode
@andrrizzi (Contributor):

You may not need this line (the property doesn't seem to set a member).

self.add_metadata('IODriver_Appendable', 0)
self._dump_metadata_buffer()
# Set the output mode by calling the variable
self._output_mode
@andrrizzi (Contributor):

Here too (and in few other places).

Refactored more code so _bind_write and _append have common actions grouped
Refactored _bind_read to a common function
Spelling typos
@Lnaden (Contributor, Author) commented May 2, 2017

I've refactored this a bit more to make _bind_read a non-abstract method, and moved parts of the _bind_append and _bind_write functions to a common call.

@andrrizzi (Contributor) commented:

Looks great! I just realized I forgot to import the changes to the dictionary codecs from yank. I'll do it asap.

@jchodera (Member) commented May 2, 2017

Feel free to review and merge this without my involvement!

@Lnaden (Contributor, Author) commented May 2, 2017

There are some bugs I have been working out, and @andrrizzi got the review in; once I finish up the last few conflicts (and Travis stops fighting me), I'll be able to merge this.

Resolve conflict in meta.yaml
@Lnaden (Contributor, Author) commented May 2, 2017

So the Windows builds now fail because tqdm and OpenMM conflict. There is no Windows Python 3.4 build of tqdm, and I don't think there was one for OpenMM either; from the looks of it, the Windows tests have been using OpenMM 6.3. This also strikes me as odd.

@jchodera (Member) commented May 2, 2017

This will be fixed as soon as we add conda-forge as a dependency.

@Lnaden (Contributor, Author) commented May 2, 2017

I'm assuming that will also fix the Linux Python 3.4 build not finishing either, since I assume it's tqdm missing again (not on the anaconda channel). A few points though:

  1. The Python 3.4 tests are garbage since they don't build and instead pull packages from the net
  2. Do we really need that progress bar on one test, thus requiring a whole new package?
  3. We should drop Python 3.4 support anyway once we get a stable OpenMM on all platforms for Python 3.6

@jchodera (Member) commented May 2, 2017

Do we really need that progress bar on one test thus requiring a whole new package?

We can ditch that.

We should drop Python 3.4 support anyways once we get a stable OpenMM on all platforms for python 3.6

I can't stress this enough: We MUST do whatever conda-forge does regarding package support for now. Let's not drop support for anything until conda-forge does.

@Lnaden (Contributor, Author) commented Aug 4, 2017

Okay, this is back in a state where it can be merged. I don't have it handle the compressed, fixed-length dictionaries like we do in Yank's dicts when writing to NetCDF, but I don't think we quite need that yet. Did you want to take a look at this to make sure it does all you need it to, @jchodera?

@jchodera (Member) commented Aug 4, 2017

Awesome! Thanks for adding docs!

Let's go ahead and merge this in and we can improve it from here.

@Lnaden (Contributor, Author) commented Aug 4, 2017

Will do if this last test does not time out again!

@Lnaden Lnaden merged commit 84d7140 into master Aug 4, 2017
@jchodera jchodera deleted the storage branch March 7, 2018 21:46