
New storage system with auto-magical variable handling #135

Merged: 45 commits into master, Aug 4, 2017
Conversation

@Lnaden (Contributor) commented Mar 7, 2017

This is the new storage system module, which handles hard-drive IO for arbitrary data sets with minimal user input. It currently handles NetCDF storage, but can be extended with new storage modules by subclassing the abstract class.

Handles the following types:

  • String
  • Int
  • Float
  • List (any size)
  • Tuple (any size)
  • Quantity (of any of the previous types)
  • Dict (entries can be any of the previous types with string keywords)
  • numpy.ndarray (and Quantity(ndarray))

The StorageInterface module allows auto-magical data IO: you define your directories and variables on the fly with commands such as

StorageInterfaceInstance.my_directory.my_variable.write(data)

where a folder called my_directory and a variable called my_variable are created for you, with data automatically encoded to the storage medium and saved. Accessing it is done with

StorageInterfaceInstance.my_directory.my_variable.read()

and the data at that same location is fetched and automatically decoded for you. No template or pre-determined structure is required.
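The mechanism behind this on-the-fly path building can be illustrated with a toy sketch. This is a hypothetical stand-in, not the actual openmmtools API: a `__getattr__` hook turns unknown attribute accesses into child nodes, and a plain dict stands in for the file on disk.

```python
class FakeNode:
    """Toy stand-in for a StorageInterface node (hypothetical, not the real API)."""

    def __init__(self, store, path=""):
        self._store = store  # shared dict standing in for the storage file
        self._path = path

    def __getattr__(self, name):
        # Any unknown attribute becomes a child node, extending the path lazily
        child_path = self._path + "/" + name if self._path else name
        return FakeNode(self._store, child_path)

    def write(self, data):
        self._store[self._path] = data

    def read(self):
        return self._store[self._path]


store = {}
si = FakeNode(store)
si.my_directory.my_variable.write(42)
print(si.my_directory.my_variable.read())  # 42
```

No directory or variable exists until it is touched; the real module additionally encodes the data for the storage backend.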

@andrrizzi Main review points are the changes to the math_eval function, the main __init__, and the new quantity_from_string function. Functionality of the storage code is unchanged from the YANK PR, unless there are other issues you see with it.


This is a migration of the code from choderalab/yank#617 to make this module stand separate from YANK as others may want to use it.

Tagging those that have expressed interest in this @bas-rustenburg @gregoryross @ChayaSt, I would appreciate any feedback you have on the code.

Lnaden added 16 commits January 10, 2017 15:47
Refactored the storage system into its own submodule.

Still need to refactor the NCTypeHandlers
Fixed issue where infinite dimension was shared

TODOS:
- Automate Chunk size selection
- Generalize compound type handler
- Ensure handler works for floats, ints as well (they are not mapped)
* updated setup.py to include the yank.storage module
* Reduced complexity of r/w/a checks in StorageInterface
* Finished new scalar variable insertion
* Added basic auto-chunk size selection
…torageIODriver

Simplified some code and redundant properties.
Finished migration from YANK to OpenMMTools for storage module
Added in the quantity from string utility as a special math_eval
Updated readme
Updated __init__ import statements
bumped the version
@Lnaden (Contributor, Author) commented Mar 7, 2017

One problem with this is that NetCDF4 on the main channel is only available against numpy 1.11, not 1.12 like the build on conda-forge. So this package now sometimes causes numpy downgrades.

@andrrizzi (Contributor) left a comment

Sweet! I dug deeper into the mechanisms and added a bunch of minor points. The only important comments are about the order of functions.update in math_eval and the class attributes that may get out of date. The rest is just feedback on code readability.

Feel free to merge as soon as you are satisfied.

Name of the file on the disk

"""
return self._file_name
@andrrizzi (Contributor):

You could think of returning self._storage_system.file_name here and ditch self._file_name to avoid keeping the same information in two places.

string at initialization.

"""
return self._storage_driver
@andrrizzi (Contributor):

You could think of returning self._storage_interface.storage_system here and ditch self._storage_driver to remove redundancy.

@Lnaden (Contributor, Author):

I had originally created that variable as a protected name so you could not assign it as one of the variables on disk. You are right, I can just derive this; all it did was expose a hidden object's property. I did a small refactor to remove it and just derive the object.

self._variable = None
# initially undetermined type
self._directory = None
self._variable = None
@andrrizzi (Contributor):

Duplicate line.

setup.py Outdated
@@ -14,7 +14,7 @@
DOCLINES = __doc__.split("\n")

########################
VERSION = "0.9.0"
VERSION = "0.9.1"
@andrrizzi (Contributor):

What policy do we have about versions? We haven't released 0.9.0 yet, and we don't store every single development version on conda (or do we?), so I think we could bump only after release. This way we won't have "missing" versions in the conda channel.

# Add self to the end
path.extend([self.name]) # Wrap in list or it iterates over the name chars
# Reduce to a path-like string
return '/'.join(path)
@andrrizzi (Contributor):

Why split and join? Could this work instead?

if self.predecessor is not None:
    return self.predecessor.path + ('/' + self.name)  # the parentheses just make it a little faster
else:
    return self.name
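The recursive form suggested above can be checked with a small self-contained sketch. The `Node` class here is hypothetical, standing in for the codec class that carries `name` and `predecessor`:

```python
class Node:
    """Hypothetical stand-in for a storage node with a parent chain."""

    def __init__(self, name, predecessor=None):
        self.name = name
        self.predecessor = predecessor

    @property
    def path(self):
        # Recurse up the predecessor chain instead of splitting and joining
        if self.predecessor is not None:
            return self.predecessor.path + ('/' + self.name)
        return self.name


root = Node('top')
leaf = Node('var', predecessor=Node('sub', predecessor=root))
print(leaf.path)  # top/sub/var
```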

# =============================================================================


class NCVariableTypeHandler(ABC):
@andrrizzi (Contributor):

Maybe call these NCVariableCodec to adopt a consistent nomenclature? Or maybe you could rename the function set_codec to set_type_handler?

"""
Pointer class which provides instructions on how to handle a given nc_variable
"""
def __init__(self, parent_handler, target, storage_object=None):
@andrrizzi (Contributor):

I may be confused here, but could you rename parent_handler to netcdf_driver? And is there a use case in which we want to have a different storage_object than netcdf_driver.ncfile?

@Lnaden (Contributor, Author):

Renamed them to parent_driver to remove confusion.

Yes! The storage object can either be the top level ncfile OR a group/subgroup within that file.

return data


def nc_float_decoder(nc_variable):
@andrrizzi (Contributor):

One way you could remove code redundancy here is to use something like:

# This works for everything but strings.
def dummy_decoder(data):
    return data

# This works for ints and floats.
def scalar_decoder(casting_type):
    def _scalar_decoder(nc_variable):
        data = nc_variable[:]
        if data.shape == (1,):
            data = casting_type(data[0])
        else:
            data = data.astype(casting_type)
        return data
    return _scalar_decoder
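The closure-factory idea can be sketched without netCDF4 at all. In this hypothetical version a plain list stands in for the data read back from an nc_variable, but the structure is the same: one factory produces a decoder per cast type.

```python
def scalar_decoder(casting_type):
    """Factory returning a decoder that casts stored data to casting_type.

    `raw` stands in for data read back from an nc_variable: a one-element
    sequence is unwrapped to a scalar, anything longer is cast element-wise.
    """
    def _decoder(raw):
        if len(raw) == 1:
            return casting_type(raw[0])
        return [casting_type(x) for x in raw]
    return _decoder


decode_float = scalar_decoder(float)
decode_int = scalar_decoder(int)
print(decode_float([3]))       # 3.0
print(decode_int([1.0, 2.0]))  # [1, 2]
```

Each codec then only has to pick its `casting_type` rather than duplicating the unwrap-and-cast logic.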

'sign': lambda x: np.sign(x)}
if functions is None:
functions = {}
functions.update({'step': lambda x: 1 * (x >= 0),
@andrrizzi (Contributor):

Did you want to do the opposite?

functions = {'step': ..., 'step_hm': ...}
if user_functions is not None:
    functions.update(user_functions)

or did you want the default functions to overwrite a custom step function?
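The ordering question matters because whichever dict is passed to `update` last wins. A minimal sketch of the reviewer's suggested order (defaults first, then user overrides), with a hypothetical `build_functions` helper:

```python
def build_functions(user_functions=None):
    """Defaults first, then user-supplied functions override them."""
    functions = {'step': lambda x: 1 * (x >= 0)}
    if user_functions is not None:
        functions.update(user_functions)
    return functions


default = build_functions()
custom = build_functions({'step': lambda x: 0})
print(default['step'](5))  # 1
print(custom['step'](5))   # 0  (user-supplied step wins)
```

Reversing the two `update` operands would silently discard any custom `step` the caller passed in.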

# Get the built-in units
_VALID_UNITS = {method: getattr(unit, method) for method in dir(unit) if type(getattr(unit, method)) is unit.Unit}
# Get the built in unit functions and make sure they are not just types
_VALID_UNIT_FUNCTIONS = {method: getattr(unit, method) for method in dir(unit)
@andrrizzi (Contributor):

If it can simplify your life, you could also make use of inspect.getmembers(unit, predicate=inspect.isfunction).
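The `inspect.getmembers` pattern works on any module; here is a small demonstration using the stdlib `math` module as a stand-in for `simtk.unit` (the predicate differs per module: `math`'s functions are C builtins, so `inspect.isbuiltin` is the matching predicate):

```python
import inspect
import math

# Collect callable members of a module via a predicate, the same idea as
# harvesting unit functions with inspect.getmembers(unit, ...)
builtins = dict(inspect.getmembers(math, inspect.isbuiltin))
print('sqrt' in builtins)      # True
print(builtins['sqrt'](9.0))   # 3.0
```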

@Lnaden (Contributor, Author):

Eh, this yields the same result and avoids another import. It's also not too hard to read, so I think I'll keep it.

Lnaden added 6 commits March 10, 2017 10:37
Now uses YAML parsing to store all known values as strings; greatly simplifies and extends the codec at the cost of some disk efficiency.
Dict now handles nested compound types, including more dicts.
Now supports appending dicts (although not really the best data type to store lots of, if you can help it)
Massive overhaul of the NetCDF Dict Codec.
@andrrizzi (Contributor) left a comment

The new dictionary codec looks good! Added three more comments, but feel free to merge when ready.

def quantity_constructor(loader, node):
loaded_mapping = loader.construct_mapping(node)
data_unit = quantity_from_string(loaded_mapping['NCUnit'])
data_value = loaded_mapping['NCValue']
@andrrizzi (Contributor):

Maybe we should call these two just unit and value? This way we'll be able to use the same serialization in other formats without tying it to netcdf.

@Lnaden (Contributor, Author):

Are you suggesting generalizing the dict YAML Loaders/Dumper for future drivers?

@andrrizzi (Contributor):

No no, this is fine for now, but if in the future we want to do this somewhere else, we'll need a dict-to-str serialization that doesn't have NC in the name, and we'd have to duplicate the code.

Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
This is the property to overwrite the cast dtype
"""
return None
@andrrizzi (Contributor):

If you just return self.dtype instead of None, you can remove the hybrid property/getter _get_on_disk_dtype and you would still be able to overwrite the behavior.

@Lnaden (Contributor, Author):

Can't do that. The self.dtype property is how the IODriver identifies what codec to use based purely on the input data: it processes type(data) and reads the codec from the internal database. The problem with the dict type is that it's a subset of NCScalar, which used to pass self.dtype to the NetCDF variable creation routine, but dict is not a type NetCDF can handle, so I had to come up with a way to define a mappable type in the codec without breaking the IODriver's auto-data-detection methods.

@andrrizzi (Contributor):

Sorry, I think I wasn't clear. I'm just suggesting to substitute this:

    @property
    def _get_on_disk_dtype(self):
        """Function to process None for _on_disk_dtype"""
        if self._on_disk_dtype is None:
            return_type = self.dtype
        else:
            return_type = self._on_disk_dtype
        return return_type

    @property
    def _on_disk_dtype(self):
        """
        Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
        This is the property to overwrite the cast dtype
        """
        return None

with a single

    @property
    def _on_disk_dtype(self):
        """
        Allow overwriting the dtype for storage for extending this method to cast data as a different type on disk
        This is the property to overwrite the cast dtype
        """
        return self.dtype

i.e. get rid of the hybrid _get_on_disk_dtype property. Unless there's something I'm missing.
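The simplification works because a subclass can override a property wholesale. A minimal sketch with hypothetical class names (`Base` standing in for NCScalar, `DictCodec` for the dict codec):

```python
class Base:
    """Stand-in for the scalar codec base class."""
    dtype = float  # the type the IODriver uses to pick a codec

    @property
    def _on_disk_dtype(self):
        # Default behavior: store using the codec's own dtype,
        # so no separate hybrid helper property is needed
        return self.dtype


class DictCodec(Base):
    @property
    def _on_disk_dtype(self):
        # Override the on-disk cast type without touching self.dtype
        return str


print(Base()._on_disk_dtype)       # <class 'float'>
print(DictCodec()._on_disk_dtype)  # <class 'str'>
```

`self.dtype` stays intact for codec lookup, while the on-disk cast type is overridden only where a subclass needs it.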

@Lnaden (Contributor, Author):

I see now; yes, that I can do.

'name': 'four',
'repeated': [4, 4, 4],
'temperature': 4 * unit.kelvin,
'box_vectors': (np.eye(3) * 4.0) * unit.nanometer
@andrrizzi (Contributor):

Can you add a Quantity-wrapped numpy array as a test case here, if it's not covered somewhere else, to make sure it works?

@Lnaden (Contributor, Author):

Line 207 already does it!

@andrrizzi (Contributor):

Doh! Didn't realize, sorry.

Lnaden added 4 commits March 17, 2017 15:59
Fix duplication, generalize dict yaml to not be NetCDF named
…ata.

Reduced redundant code by adding specialized simple abstract methods
* Read, write, append are no longer abstract methods
* All codecs now use proper _encoder and _decoder properties
@andrrizzi (Contributor) left a comment

Awesome! I love all the code simplification in the NCVariable classes.

if self._bound_target is None:
self._bind_read()
# Set the output mode by calling the variable
self._output_mode
@andrrizzi (Contributor):

You may not need this line (the property doesn't seem to set a member).

self.add_metadata('IODriver_Appendable', 0)
self._dump_metadata_buffer()
# Set the output mode by calling the variable
self._output_mode
@andrrizzi (Contributor):

Here too (and in few other places).

Refactored more code so _bind_write and _append have common actions grouped
Refactored _bind_read to a common function
Spelling typos
@Lnaden (Contributor, Author) commented May 2, 2017

I've refactored this a bit more to make _bind_read a non-abstract method, and moved parts of the _bind_append and _bind_write functions to a common call.

@andrrizzi (Contributor) commented:

Looks great! I just realized I forgot to import the changes to the dictionary codecs from yank. I'll do it asap.

@jchodera (Member) commented May 2, 2017

Feel free to review and merge this without my involvement!

@Lnaden (Contributor, Author) commented May 2, 2017

There are some bugs I have been working out, and @andrrizzi got the review in; once I finish up the last few conflicts (and Travis stops fighting me), I'll be able to merge this.

Resolve conflict in meta.yaml
@Lnaden (Contributor, Author) commented May 2, 2017

So the Windows builds now fail because tqdm and OpenMM conflict. There is no Windows Python 3.4 build of tqdm, and I don't think there was one for OpenMM either; from the looks of it, the Windows tests have been using OpenMM 6.3. This also strikes me as odd.

@jchodera (Member) commented May 2, 2017

This will be fixed as soon as we add conda-forge as a dependency.

@Lnaden (Contributor, Author) commented May 2, 2017

I'm assuming that will also fix the Linux Python 3.4 build not finishing either, since I assume it's tqdm missing again (not on the anaconda channel). A few points though:

  1. The Python 3.4 tests are garbage since they don't build and instead pull packages from the net
  2. Do we really need that progress bar on one test, thus requiring a whole new package?
  3. We should drop Python 3.4 support anyway once we get a stable OpenMM on all platforms for Python 3.6

@jchodera (Member) commented May 2, 2017

Do we really need that progress bar on one test thus requiring a whole new package?

We can ditch that.

We should drop Python 3.4 support anyways once we get a stable OpenMM on all platforms for python 3.6

I can't stress this enough: We MUST do whatever conda-forge does regarding package support for now. Let's not drop support for anything until conda-forge does.

@Lnaden (Contributor, Author) commented Aug 4, 2017

Okay, this is back in a state where it can be merged. I don't have it handle the compressed, fixed-length dictionaries like we do in Yank's dicts when writing to NetCDF, but I don't think we quite need that yet. Did you want to take a look at this to make sure it does all you need it to, @jchodera?

@jchodera (Member) commented Aug 4, 2017

Awesome! Thanks for adding docs!

Let's go ahead and merge this in and we can improve it from here.

@Lnaden (Contributor, Author) commented Aug 4, 2017

Will do if this last test does not time out again!

@Lnaden Lnaden merged commit 84d7140 into master Aug 4, 2017
@jchodera jchodera deleted the storage branch March 7, 2018 21:46