New storage system with auto-magical variable handling #135
Conversation
Refactored the storage system into its own submodule. Still need to refactor the NCTypeHandler's
Fixed issue where infinite dimension was shared.
TODOs:
- Automate chunk size selection
- Generalize compound type handler
- Ensure handler works for floats and ints as well (they are not mapped)
…ng old version for testing
* Updated setup.py to include the yank.storage module
* Reduced complexity of r/w/a checks in StorageInterface
* Finished new scalar variable insertion
* Added basic auto-chunk size selection
…torageIODriver Simplified some code and redundant properties.
Finished migration from YANK to OpenMMTools for storage module:
* Added in the quantity-from-string utility as a special math_eval
* Updated readme
* Updated __init__ import statements
* Bumped the version
One problem with this is that NetCDF4 on the main channel is only available on numpy 1.11 and not 1.12 like the one on …
Sweet! I dug deeper into the mechanisms and added a bunch of minor points. The only important comments are the order of `functions.update` in `math_eval`, and the class attributes that may get out-of-date. The rest is just feedback on code readability.

Feel free to merge as soon as you are satisfied.
```python
        Name of the file on the disk

        """
        return self._file_name
```

You could think of returning `self._storage_system.file_name` here and ditch `self._file_name` to avoid keeping the same information in two places.
```python
        string at initialization.

        """
        return self._storage_driver
```

You could think of returning `self._storage_interface.storage_system` here and ditch `self._storage_driver` to remove redundancy.
I had originally created that variable as a protected name so you could not assign it as one of the variables on disk. You are right, I can just derive this; all it did was expose a hidden object's property. I did a small refactor to remove this and just derive the object.
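The refactor discussed above can be sketched in miniature. This is a minimal illustration of deriving a property from an owned object instead of caching it twice; the class and attribute names here are hypothetical stand-ins, not the actual openmmtools API.

```python
# Sketch of "derive, don't duplicate": the interface exposes the owned
# object's file_name through a property, so there is a single source of
# truth and no second copy that can fall out of sync.
# (Class names are hypothetical, not the real openmmtools classes.)
class StorageSystem:
    def __init__(self, file_name):
        self.file_name = file_name


class StorageInterface:
    def __init__(self, file_name):
        self._storage_system = StorageSystem(file_name)

    @property
    def file_name(self):
        # Derived on demand; no separate self._file_name attribute.
        return self._storage_system.file_name


iface = StorageInterface("data.nc")
print(iface.file_name)  # -> data.nc
```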
```python
        self._variable = None
        # initially undetermined type
        self._directory = None
        self._variable = None
```

Duplicate line.
setup.py (Outdated)

```diff
@@ -14,7 +14,7 @@
 DOCLINES = __doc__.split("\n")

 ########################
-VERSION = "0.9.0"
+VERSION = "0.9.1"
```

What policy do we have about versions? We haven't released 0.9.0 yet, and we don't store every single development version on conda (or do we?), so I think we could bump only after release. This way we won't have "missing" versions in the conda channel.
```python
        # Add self to the end
        path.extend([self.name])  # Wrap in list or it iterates over the name chars
        # Reduce to a path-like string
        return '/'.join(path)
```

Why splitting and joining? Could this work instead?

```python
if self.predecessor is not None:
    return self.predecessor.path + ('/' + self.name)  # the parentheses just make it a little faster
else:
    return self.name
```
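The recursive alternative suggested above can be demonstrated with a self-contained toy; the `Node` class below is a hypothetical stand-in for the actual storage variable/directory class, not the real implementation.

```python
# Toy illustration of recursive path building: each node asks its
# predecessor for its path and appends its own name, so no list is
# accumulated and joined. (Node is a hypothetical stand-in class.)
class Node:
    def __init__(self, name, predecessor=None):
        self.name = name
        self.predecessor = predecessor

    @property
    def path(self):
        if self.predecessor is not None:
            return self.predecessor.path + '/' + self.name
        else:
            return self.name


root = Node('my_directory')
leaf = Node('my_variable', predecessor=root)
print(leaf.path)  # -> my_directory/my_variable
```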
openmmtools/storage/iodrivers.py (Outdated)

```python
# =============================================================================


class NCVariableTypeHandler(ABC):
```

Maybe calling these `NCVariableCodec` to adopt a consistent nomenclature? Or maybe you could rename the function `set_codec` to `set_type_handler`?
openmmtools/storage/iodrivers.py (Outdated)

```python
    """
    Pointer class which provides instructions on how to handle a given nc_variable
    """
    def __init__(self, parent_handler, target, storage_object=None):
```

I may be confused here, but could you rename `parent_handler` to `netcdf_driver`? And is there a use case in which we want to have a different `storage_object` than `netcdf_driver.ncfile`?
Renamed them to `parent_driver` to remove confusion.

Yes! The storage object can either be the top-level `ncfile` OR a group/subgroup within that file.
openmmtools/storage/iodrivers.py (Outdated)

```python
    return data


def nc_float_decoder(nc_variable):
```

One way you could remove code redundancy here is to use something like:

```python
# This works for everything but strings.
def dummy_encoder(data):
    return data

# This works for ints and floats.
def scalar_encoder(casting_type):
    def _scalar_encoder(nc_variable):
        data = nc_variable[:]
        if data.shape == (1,):
            data = casting_type(data[0])
        else:
            data = data.astype(casting_type)
        return data
    return _scalar_encoder
```
openmmtools/utils.py (Outdated)

```python
             'sign': lambda x: np.sign(x)}
    if functions is None:
        functions = {}
    functions.update({'step': lambda x: 1 * (x >= 0),
```

Did you want to do the opposite?

```python
functions = {'step': ..., 'step_hm': ...}
if user_functions is not None:
    functions.update(user_functions)
```

or did you want the default functions to overwrite a custom `step` function?
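The crux of the comment above is `dict.update` ordering: whichever dict is applied last wins on key collisions. A quick runnable illustration (string values stand in for the actual lambda functions):

```python
# Whichever dict is passed to update() last wins on key collisions.
defaults = {'step': 'default_step', 'step_hm': 'default_step_hm'}
user_functions = {'step': 'custom_step'}

# Order A (the suggested fix): user functions override the defaults.
functions = dict(defaults)
functions.update(user_functions)
print(functions['step'])  # -> custom_step

# Order B (the original code's order): defaults clobber a custom 'step'.
functions_b = dict(user_functions)
functions_b.update(defaults)
print(functions_b['step'])  # -> default_step
```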
```python
# Get the built-in units
_VALID_UNITS = {method: getattr(unit, method) for method in dir(unit) if type(getattr(unit, method)) is unit.Unit}
# Get the built-in unit functions and make sure they are not just types
_VALID_UNIT_FUNCTIONS = {method: getattr(unit, method) for method in dir(unit)
```

If it can simplify your life, you could also make use of `inspect.getmembers(unit, predicate=inspect.isfunction)`.
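For reference, `inspect.getmembers` with a predicate is a compact alternative to the `dir()`/`getattr()` comprehension. The sketch below uses the `inspect` module itself as the target, since `simtk.unit` may not be installed in every environment:

```python
import inspect

# inspect.getmembers(module, predicate) returns (name, value) pairs for
# every attribute that satisfies the predicate.
functions = dict(inspect.getmembers(inspect, predicate=inspect.isfunction))
print('isfunction' in functions)  # -> True

# Roughly equivalent dir()/getattr() spelling, as in the original snippet:
functions_dir = {name: getattr(inspect, name) for name in dir(inspect)
                 if inspect.isfunction(getattr(inspect, name))}
print('isfunction' in functions_dir)  # -> True
```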
Eh, this yields the same result and avoids another import. It's also not too hard to read, so I think I'll keep it.
Now uses YAML parsing to store all known values as strings; greatly simplifies and extends the codec at the cost of some disk efficiency. Dict now handles nested compound types, including more dicts. Now supports appending dicts (although not really the best data type to store lots of if you can help it).
Massive overhaul of the NetCDF Dict Codec.
The new dictionary codec looks good! Added three more comments, but feel free to merge when ready.
openmmtools/storage/iodrivers.py (Outdated)

```python
def quantity_constructor(loader, node):
    loaded_mapping = loader.construct_mapping(node)
    data_unit = quantity_from_string(loaded_mapping['NCUnit'])
    data_value = loaded_mapping['NCValue']
```

Maybe we should call these two just `unit` and `value`? This way we'll be able to use the same serialization in other formats without tying it to NetCDF.
Are you suggesting generalizing the dict YAML Loaders/Dumper for future drivers?
No no, this is fine for now, but if in the future we want to do this somewhere else, we'll need a dict-to-str serialization that doesn't have `NC` in the name, and we'll have to duplicate the code.
openmmtools/storage/iodrivers.py (Outdated)

```python
        Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
        This is the property to overwrite the cast dtype
        """
        return None
```

If you just return `self.dtype` instead of `None`, you can remove the hybrid property/getter `_get_on_disk_dtype` and you would still be able to overwrite the behavior.
Can't do that. The `self.dtype` property is how the IODriver identifies what codec to use based purely on the input data. It processes `type(data)` and reads the codec from the internal database. The problem with the `dict` type is that it's a subset of `NCScalar`, which used to pass `self.dtype` to the NetCDF variable creation routine, but `dict` is not a type NetCDF can handle, so I had to come up with a way to define a mappable type in the codec without breaking the IODriver's auto-data-detection methods.
Sorry, I think I wasn't clear. I'm just suggesting to substitute this:

```python
@property
def _get_on_disk_dtype(self):
    """Function to process None for _on_disk_dtype"""
    if self._on_disk_dtype is None:
        return_type = self.dtype
    else:
        return_type = self._on_disk_dtype
    return return_type

@property
def _on_disk_dtype(self):
    """
    Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
    This is the property to overwrite the cast dtype
    """
    return None
```

with a single

```python
@property
def _on_disk_dtype(self):
    """
    Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
    This is the property to overwrite the cast dtype
    """
    return self.dtype
```

i.e. get rid of the hybrid `_get_on_disk_dtype` property. Unless there's something I'm missing.
I see now; yes, that I can do.
```python
    'name': 'four',
    'repeated': [4, 4, 4],
    'temperature': 4 * unit.kelvin,
    'box_vectors': (np.eye(3) * 4.0) * unit.nanometer
```

Can you add a `Quantity`-wrapped numpy array as a test case here to make sure it works, if it's not somewhere else?
Line 207 already does it!
Doh! Didn't realize, sorry.
Fix duplication, generalize dict YAML to not be NetCDF-named
…ata. Reduced redundant code by adding specialized simple abstract methods
* Read, write, append are no longer abstract methods
* All codecs now use proper _encoder and _decoder properties
Fix tests against dicts. Fix typo in tests.
Awesome! I love all the code simplification in the `NCVariable` classes.
openmmtools/storage/iodrivers.py (Outdated)

```python
        if self._bound_target is None:
            self._bind_read()
        # Set the output mode by calling the variable
        self._output_mode
```

You may not need this line (the property doesn't seem to set a member).
openmmtools/storage/iodrivers.py (Outdated)

```python
        self.add_metadata('IODriver_Appendable', 0)
        self._dump_metadata_buffer()
        # Set the output mode by calling the variable
        self._output_mode
```

Here too (and in a few other places).
Refactored more code so _bind_write and _append have common actions grouped. Refactored _bind_read to a common function. Fixed spelling typos.
I've refactored this a bit more to make …
Looks great! I just realized I forgot to import the changes to the dictionary codecs from yank. I'll do it ASAP.
Feel free to review and merge this without my involvement!
There are some bugs I have been working out, and @andrrizzi got the review in; once I finish up the last few conflicts (and Travis stops fighting me), I'll be able to merge this.
Resolve conflict in meta.yaml
So the Windows builds now fail because …
This will be fixed as soon as we add conda-forge as a dependency.
I'm assuming that will also fix the Linux Python 3.4 build not finishing either, since I assume it's …
We can ditch that.
I can't stress this enough: we MUST do whatever conda-forge does regarding package support for now. Let's not drop support for anything until conda-forge does.
DictYamlLoader/Dumper explicitly handle ndarrays
Resolved conflicts thereof
Redid some names. Added doc page (untested).
…me very glaring ones and the storage ones.
Okay, this is back in a state where it can be merged. I don't have it handle the compressed, fixed-length dictionaries like we do in YANK's dicts when writing to NetCDF, but I don't think we quite need that yet. Did you want to take a look at this to make sure it does all you need it to, @jchodera?
Awesome! Thanks for adding docs! Let's go ahead and merge this in, and we can improve it from here.
Will do, if this last test does not time out again!
This is the new storage system module which handles hard-drive IO for arbitrary data sets with minimal user input. It currently handles NetCDF storage, but can be extended to add new storage modules from the abstract class.
Handles the following types:
The `StorageInterface` module allows auto-magical data IO such that you define your directories and variables on the fly with commands such as `StorgeInterfaceInstance.my_directory.my_variable.write(data)`, where a folder called `my_directory` and a variable `my_variable` are created for you, with `data` being automatically encoded to the storage medium and saved. Accessing it is done with `StorgeInterfaceInstance.my_directory.my_variable.read()`, and the data at that same location is fetched and automatically decoded for you. No template nor pre-determined structure required.
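The "define on the fly" access pattern described above can be sketched with Python's `__getattr__` hook, which creates nested handles on demand. This is a hedged in-memory toy, not the real NetCDF-backed implementation; all class names and the dict-backed store here are hypothetical stand-ins.

```python
# Toy sketch of auto-magical attribute access: any undefined attribute
# lookup creates a child handle, and write()/read() round-trip data
# through a shared dict that stands in for the file on disk.
class _Handle:
    def __init__(self, store, path):
        self._store = store  # shared dict standing in for the storage file
        self._path = path    # tuple of names from the root to this handle

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: create and cache
        # a child handle for the new directory/variable name.
        child = _Handle(self._store, self._path + (name,))
        object.__setattr__(self, name, child)
        return child

    def write(self, data):
        self._store['/'.join(self._path)] = data

    def read(self):
        return self._store['/'.join(self._path)]


class ToyStorageInterface(_Handle):
    def __init__(self):
        super().__init__({}, ())


si = ToyStorageInterface()
si.my_directory.my_variable.write([1, 2, 3])
print(si.my_directory.my_variable.read())  # -> [1, 2, 3]
```

In the real module, encoding/decoding to the storage medium would happen inside `write`/`read`; here the data is stored as-is to keep the sketch self-contained.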
@andrrizzi Main review points are the changes to the `math_eval` function, the main `__init__`, and the new `quantity_from_string` function. Functionality of the storage code is unchanged from the YANK PR, unless there are other issues you see with it. This is a migration of the code from choderalab/yank#617 to make this module stand separate from YANK, as others may want to use it.
Tagging those that have expressed interest in this: @bas-rustenburg @gregoryross @ChayaSt. I would appreciate any feedback you have on the code.