Abs store #293
Conversation
Again, this is for example only, not intended final structure
zarr/storage.py (Outdated)
    buffer = io.BytesIO()
    buffer.write(value)
    buffer.seek(0)
    self.client.create_blob_from_bytes(self.container_name, blob_name, buffer.read())
It might not make much difference in practice given the overheads of network communication etc., but FWIW this could probably be written to avoid some data copies. The best approach would depend on whether create_blob_from_bytes() can only accept a Python bytes object or whether it can accept anything bytes-like.
So create_blob_from_bytes does not accept strings; it accepts an array of bytes, unlike the GCS equivalent being used in GCSStore. Although I might be able to make this work by using different blob create functions, depending on whether the input is a string or bytes-like.
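For illustration, a hedged sketch of that kind of dispatch, assuming the legacy azure-storage BlockBlobService API (create_blob_from_text for strings, create_blob_from_bytes for bytes-like values); the helper name _put_blob is hypothetical, not code from this PR:

    # Hypothetical helper, assuming the legacy azure-storage BlockBlobService
    # API; dispatches on the value type rather than always coercing to bytes.
    def _put_blob(self, blob_name, value):
        if isinstance(value, str):
            # Strings go through the text helper, which handles encoding.
            self.client.create_blob_from_text(
                self.container_name, blob_name, value)
        else:
            # Anything bytes-like (bytes, bytearray, memoryview) is coerced
            # to bytes, which may copy, before upload.
            self.client.create_blob_from_bytes(
                self.container_name, blob_name, bytes(value))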
FWIW, looking at the source code for blockblobservice, I think this would minimise memory copies:
    # PY2 array.array does not expose the buffer interface the way we
    # need, so convert it to a byte string first.
    if PY2 and isinstance(value, array.array):
        value = value.tostring()
    blob_name = '/'.join([self.prefix, key])
    # Wrap the value in a BytesIO and stream it, avoiding an extra copy.
    buffer = io.BytesIO(value)
    self.client.create_blob_from_stream(self.container_name, blob_name, buffer)
In create_blob_from_bytes() it is just creating a BytesIO and passing through to create_blob_from_stream(), so you might as well call that directly.
Great to see this moving, had a quick look but give me a shout if you'd like a more careful review at any point.
Thanks for that review @alimanfoo, I have implemented your suggestion. I wonder what's left now to implement.
I second the use of Azurite; it is a helpful addition. On Travis you do have the ability to use Docker, as, for example, some of my Intake plugins do. I assume AppVeyor allows it too? It would certainly slow the test run, though, so it should probably be a separate, dedicated run in the matrix.
I just came here to check in. Taking a look at the files changed, it's hard to review the work that is specific to this PR because changes from a lot of other commits also appear. I guess this happens when commits from other branches get merged in here but not in the same order as they were merged into master? Does anyone know the git-fu to tidy up this PR branch so we can see a clean diff against master?
Hi, I made a PR against master, so this is the relevant PR you should be looking at, I think: #345
* Create an SQLite-backed mutable mapping. Implements a key-value store using SQLite. As this is a builtin module in Python and a common database to use in various languages, this should have high utility and be very portable. Not to mention many databases provide an SQLite language on top regardless of the internal representation. So this can be a great template for users wishing to work with Zarr in their preferred database.
* Test SQLiteStore. Try using the `SQLiteStore` everywhere one would use another store and make sure that it behaves correctly. This includes simple key-value store usage, creating hierarchies, and storing arrays.
* Export `SQLiteStore` to the top-level namespace.
* Include some SQLiteStore examples. Provide a few examples of how one might use `SQLiteStore` to store arrays or groups. These examples are taken with minor modifications from the `LMDBStore` examples.
* Demonstrate the `SQLiteStore` in the tutorial. Includes a simple example borrowed from `LMDBStore`'s tutorial example, which shows how to create and use an `SQLiteStore`.
* Provide API documentation for `SQLiteStore`.
* Make a release note for `SQLiteStore`.
* Use a unique extension for `SQLiteStore` files. Otherwise we may end up opening a different database's files and trying to use them with SQLite, only to run into errors. This caused the doctests to fail previously. Changing the extension as we have done should avoid these conflicts.
* Only close the SQLite database when requested. Instead of opening, committing, and closing the SQLite database for every operation, limit these to user-requested operations. Namely, commit only when the user calls `flush`, and close only when the user calls `close`. This should make operations with SQLite much more performant than when we automatically committed and closed after every user operation.
* Update docs to show how to close `SQLiteStore`. As users need to explicitly close the `SQLiteStore` to commit changes and serialize them to the SQLite database, make sure to point this out in the docs.
* Ensure all SQL commands are capitalized. It appears some of these commands work without capitalization. However, as the docs show commands capitalized, ensure that we are doing the same thing as well. That way this won't run into issues with different SQL implementations or older versions of SQLite that are less forgiving. Plus this should match closer to what users familiar with SQL expect.
* Simplify `SQLiteStore`'s `__delitem__` using `in`. Make use of `in` instead of repeating the same logic in `__delitem__`. As we now keep the database open between operations, this is much simpler than duplicating the key check logic. Also makes it a bit easier to understand what is going on.
* Drop no-longer-needed flake8 error suppression. This was needed when the `import` of `sqlite3` was only here to ensure that it existed (even though it wasn't used). Now we make use of `sqlite3` where it is being imported, so there is no need to tell flake8 not to worry about an unused import, as there isn't one.
* Simplify `close` and use `flush`.
* Flush before pickling `SQLiteStore`. Make sure that everything intended to be added to the `SQLiteStore` database has been written to disk before attempting to pickle it. That way we can be sure other processes constructing their own `SQLiteStore` have access to the same data and not some earlier representation.
* Special-case the in-memory SQLite database. No need to normalize the path when there isn't one (e.g. `:memory:`).
* Drop unneeded empty `return` statement.
* Update docs/release.rst. Fix a typo. Co-Authored-By: jakirkham <jakirkham@gmail.com>
* Update docs/release.rst. Include author and original issue in changelog entry. Co-Authored-By: jakirkham <jakirkham@gmail.com>
* Correct default value for `check_same_thread`. The default value for `check_same_thread` was previously set to `False` when in reality we want this check enabled. So set `check_same_thread` to `True`.
* Flush after making any mutation to the database. As users could change the setting of things like `check_same_thread`, or they may try to access the same database from multiple threads or processes, make sure to flush any changes that would mutate the database.
* Skip flushing data when pickling `SQLiteStore`. As we now always commit after an operation that mutates the data, there is no need to commit before pickling the `SQLiteStore` object. After all, the data should already be up-to-date in the database.
* Skip using `flush` in `close`. As everything should already be flushed to the database whenever the state is mutated, there is no need to perform this before closing.
* Implement `update` for `SQLiteStore`. While there is a default implementation of `update` for `MutableMapping`s, it means that we perform multiple `__setitem__` operations. However, it would be better if we could commit all key-value pairs in one operation. Hence we implement `update` for this purpose.
* Rewrite `__setitem__` to use `update`. Simplifies `__setitem__` to an `update` operation with a dictionary that contains only one item. This works the same as before, but cuts out some redundancy, which simplifies the code a bit.
* Disable `check_same_thread` by default again. As we now make sure to commit after every mutating change to the database, disable `check_same_thread` again as it should be safe.
* Force some parameters to defaults. As some of these parameters no longer make sense to be user-customizable, go ahead and just set their values as part of the `sqlite3.connect` call. This ensures that they are set the way we expect. Also it ensures that if users try to mess with them, an error will be raised due to duplicate keyword arguments. To elaborate on why these parameters are not user-configurable any more: `detect_types` only makes sense if one is building their own table with specific types; instead we build the table for users and use very generic types (i.e. text and blobs), which are not worth checking. As we commit after every modification to the database to make it more friendly for other threads and processes, the `isolation_level` might as well be set to auto-commit; setting it to anything else really has no effect. Finally, there is no need for `check_same_thread` to be anything other than `False`, as we are guaranteeing everything is committed after mutation, which ensures the database is thread- and process-safe.
* Drop `flush` calls from `SQLiteStore`. As we already enable auto-committing, any mutation is automatically written after it is performed. So there is no need for us to commit afterwards. Besides, `commit` is turned into a no-op if auto-committing is enabled.
* Drop the `flush` function from `SQLiteStore`. As we auto-commit all changes, there is no need for a `flush` operation for the `SQLiteStore`. So go ahead and drop the `flush` function and its documentation.
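For orientation, here is a minimal sketch of the kind of SQLite-backed MutableMapping these commits describe, assuming a single zarr (k TEXT PRIMARY KEY, v BLOB) table with auto-commit enabled (isolation_level=None); the class name MiniSQLiteStore is hypothetical, and this is not zarr's actual SQLiteStore implementation:

    import sqlite3
    from collections.abc import MutableMapping

    class MiniSQLiteStore(MutableMapping):
        """Minimal sketch of an SQLite-backed key-value store."""

        def __init__(self, path):
            # isolation_level=None turns on auto-commit, so every mutation
            # is written out immediately and no explicit flush() is needed.
            self.db = sqlite3.connect(path, isolation_level=None,
                                      check_same_thread=False)
            self.cursor = self.db.cursor()
            self.cursor.execute(
                'CREATE TABLE IF NOT EXISTS zarr (k TEXT PRIMARY KEY, v BLOB)')

        def __getitem__(self, key):
            row = self.cursor.execute(
                'SELECT v FROM zarr WHERE (k = ?)', (key,)).fetchone()
            if row is None:
                raise KeyError(key)
            return row[0]

        def __setitem__(self, key, value):
            # Route single assignments through update() so both paths
            # share one INSERT statement.
            self.update({key: value})

        def update(self, *args, **kwargs):
            # executemany writes all pairs in a single operation rather
            # than performing one __setitem__ per pair.
            kv_list = list(dict(*args, **kwargs).items())
            self.cursor.executemany('REPLACE INTO zarr VALUES (?, ?)', kv_list)

        def __delitem__(self, key):
            # Keys are PRIMARY, so a successful delete removes exactly one
            # row; rowcount < 1 means the key was absent.
            self.cursor.execute('DELETE FROM zarr WHERE (k = ?)', (key,))
            if self.cursor.rowcount < 1:
                raise KeyError(key)

        def __contains__(self, key):
            # Keys are unique, so COUNT(*) is either 0 or 1; cast the
            # integer SQLite returns to bool explicitly.
            has, = self.cursor.execute(
                'SELECT COUNT(*) FROM zarr WHERE (k = ?)', (key,)).fetchone()
            return bool(has)

        def __iter__(self):
            for k, in self.cursor.execute('SELECT k FROM zarr'):
                yield k

        def __len__(self):
            n, = self.cursor.execute('SELECT COUNT(*) FROM zarr').fetchone()
            return n

        def close(self):
            self.db.close()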
* Implement optimized `clear` for `SQLiteStore`. As the default implementation of `clear` deletes each key-value pair, this will be considerably slower than an operation that can remove all of them at once. Here we do exactly that by using SQL's `DROP TABLE`. Unfortunately there is no truncate-table command, but doing a drop followed by a create has the same effect. We combine these two operations using `executescript`. Thus auto-commit won't run until after both have run, which will commit the table with all key-value pairs removed.
* Implement optimized `rmdir` for `SQLiteStore`. Provides an SQL implementation of `rmdir` that is a bit better optimized for removing anything that matches the specified path, as opposed to doing multiple removals. If it is detected that the root directory is being removed, simply fall back to `clear`, which is optimized for that use case as it uses `DROP TABLE` instead of deleting rows. Otherwise remove any path that begins with the normalized user-provided path, as long as it contains at least one more character after. This stops `rmdir` from removing a key-value pair where the key exactly matches the normalized user-provided path (i.e. not a "directory", as it contains data).
* Implement optimized `getsize` for `SQLiteStore`. Take advantage of SQLite's ability to query and filter tables quickly to implement `getsize` entirely in SQL (with the exception of path normalization to sanitize user input). Measures the `LENGTH` of all blobs in the column and calls `SUM` to get their aggregate size. In the event that there are no matches, use `COALESCE` to replace the `NULL` value returned by `SUM` with `0` instead.
* Implement optimized `listdir` for `SQLiteStore`. Take advantage of SQLite's ability to query and filter tables quickly to implement `listdir` entirely in SQL (with the exception of path normalization to sanitize user input). Makes use of a nested `SELECT`/`AS` to build a set of partial keys below the top-level key. These are then further split to get only the portion directly under the top-level key and not any of their children.
* Implement `rename` for `SQLiteStore`. Creates an SQL implementation of `rename` for `SQLiteStore`. As there isn't a way to safely change the keys in `SQLiteStore` (since they are `PRIMARY`), simply create a temporary table that copies over the key-value pairs with keys renamed using a nested `SELECT` statement. Then delete all key-value pairs that match the keys to move. Finally copy all key-value pairs from the temporary table into our table and delete the temporary table. Perform all of this as a transaction so only the final result of the rename is visible to others.
* Allow users to specify the SQLite table name. Instead of just picking an arbitrary table name for users, allow them to pick a name for the table. Let it default to `zarr` though, to make it easy to discover where it got stored if someone inspects the SQLite database.
* Randomize the temporary table name. Use a UUID to generate a unique table name for the temporary table, to hopefully avoid collisions even if multiple such operations are occurring and/or remnants of older operations stuck around.
* Merge `SELECT`s in `rename`. Fuses the two `SELECT`s in `SQLiteStore`'s `rename` function into one.
* Tidy `rename` SQL code a bit.
* Fuse away one `SELECT` in `listdir`. In `SQLiteStore`'s `listdir`, fuse the `SELECT` performing the ordering with the `SELECT` applying the `DISTINCT` criteria. As these can be combined, and often `DISTINCT` already performs ordering, this may be a bit easier to optimize for different SQL engines.
* Only use `k` in `SQLiteStore`'s `__contains__`. We don't make use of the values, only the keys, when checking for existence. So drop the `v` column from the `SELECT` statement, as it is unused, and only grab the `k` column.
* Fuse `SELECT`s in `SQLiteStore`'s `__contains__`. Simplifies the SQL used in `SQLiteStore`'s `__contains__` method by fusing the two `SELECT` statements into one. Does this by using `COUNT(*)` to determine how many rows are left after the `SELECT`. As the selection checks for an exact match with the key (and keys are `PRIMARY`), the count is either exactly `1` or `0`. So this works the same as `SELECT EXISTS`, but with a single `SELECT` instead.
* Cast `has` to `bool` in `SQLiteStore.__contains__`. SQLite does not have a boolean type and merely represents booleans with integers like `0` and `1` (much like C, which it is written in). While Python will perform the conversion of the `__contains__` result to `bool` for us, go ahead and perform the conversion explicitly for clarity.
* Prefer using single quotes in more places.
* Wrap SQL table-creation text.
* Adjust wrapping of `SQLiteStore.clear`'s code.
* Use parameters for SQL in `listdir`. Make sure to use parameters to pass in the `path` used by `listdir`'s SQL code, to avoid problems caused by injection.
* Use parameters for SQL in `getsize`. Make sure to use parameters to pass in the `path` used by `getsize`'s SQL code, to avoid problems caused by injection.
* Use parameters for SQL in `rmdir`. Make sure to use parameters to pass in the `path` used by `rmdir`'s SQL code, to avoid problems caused by injection.
* Adjust formatting of `SQLiteStore.__contains__`. Make sure the command is run in the first line and the result stored. Then unpack and return what it finds.
* Drop `SQLiteStore`'s implementation of `rename`. It's difficult to protect against injections, avoid copying, use a single transaction, etc. in an SQL implementation of `rename`. So instead just drop this implementation and allow the default `rename` implementation to be used.
* Just name the SQL table "zarr". Instead of allowing the user to customize where the table is stored, just set it to "zarr". This avoids issues with the table name potentially enabling injection attacks. Besides, it's unclear this level of flexibility is really needed, given Zarr supports Groups and thus can store many Arrays in the same key-value store.
* Unwrap some lines to compact the code a bit.
* Simplify `SQLiteStore.__contains__` code wrapping.
* Check the SQLite Cursor's rowcount for deletion. Instead of checking if a particular key exists and then either raising a `KeyError` or deleting it in `SQLiteStore`, go ahead with the deletion and check the value of `rowcount`. As the keys are primary, they must be unique, and thus each one only occurs once. Thus if deletion worked, the `rowcount` will be exactly `1` (it cannot be larger). Alternatively, if deletion failed, the `rowcount` would be `0`. Thus we can simply check if the `rowcount` is not `1` and raise the `KeyError`. This should improve the performance a bit.
* Parenthesize operations on `?` in SQL. To make sure that SQL prioritizes the right things, parenthesize some operations with `?` to clarify to the reader and the parser what should be prioritized. This is done particularly when concatenating special string-match symbols to user parameters.
* Check `rowcount` for values less than `1`.
* Parenthesize a few other SQL commands with `?`.
* Use one line for `SQLiteStore.rmdir`'s SQL.
* Use one line for `SQLiteStore.rmdir`'s SQL and params.
* Update docs/release.rst. Co-Authored-By: jakirkham <jakirkham@gmail.com>
* `TestSQLiteStore` -> `TestGroupWithSQLiteStore`.
* Drop `else` in `for`/`else` for clarity.
* Ensure SQLite is new enough to enable threading. Adds a simple check to ensure SQLite is new enough to enable thread-safe sharing of connections before setting `check_same_thread=True`. If SQLite is not new enough, set `check_same_thread=False`.
* Add spacing around `=`.
* Hold a lock for any DML operations in SQLiteStore. As there are some concerns about keeping operations on the SQLite database sequential for thread-safety, acquire an internal lock when a DML operation occurs. This should ensure that only one modification can occur at a time, regardless of whether the connection uses the serialized threading mode or not.
* Raise when pickling an in-memory SQLite database.
* Test in-memory SQLiteStores separately. Uses all the same tests we use for SQLiteStores on disk, except it special-cases the pickling test to ensure the `SQLiteStore` cannot be pickled if it is in-memory.
* Drop explicit setting of `sqlite3` defaults. Simply use the `Connection`'s default arguments implicitly instead of explicitly setting them in the constructor.
* Adjust inheritance of `TestSQLiteStoreInMemory`. Make sure to inherit directly from `unittest.TestCase` as well.
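And hedged sketches of the optimized whole-table operations the later commits describe, written as extensions of the MiniSQLiteStore sketch above; the exact SQL in zarr's SQLiteStore may differ:

    class MiniSQLiteStoreOps(MiniSQLiteStore):

        def clear(self):
            # SQLite has no TRUNCATE, so DROP then re-CREATE the table;
            # executescript runs both statements together, so the emptied
            # table is committed as one visible change.
            self.cursor.executescript(
                "BEGIN TRANSACTION;"
                "DROP TABLE zarr;"
                "CREATE TABLE zarr (k TEXT PRIMARY KEY, v BLOB);"
                "COMMIT TRANSACTION;")

        def getsize(self, path=''):
            # SUM(LENGTH(v)) aggregates the blob sizes under path, and
            # COALESCE maps the NULL that SUM returns on an empty selection
            # to 0; path is passed as a parameter, never interpolated, to
            # avoid injection.
            size, = self.cursor.execute(
                "SELECT COALESCE(SUM(LENGTH(v)), 0) FROM zarr "
                "WHERE k LIKE (? || '%')", (path,)).fetchone()
            return size

        def rmdir(self, path=''):
            if not path:
                # Removing the root: fall back to clear(), which drops the
                # table instead of deleting rows one by one.
                self.clear()
            else:
                # Require at least one character after the path (the '/%'
                # pattern) so a key that exactly equals the path, i.e. real
                # data rather than a "directory", is left alone.
                self.cursor.execute(
                    "DELETE FROM zarr WHERE k LIKE (? || '/%')", (path,))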
Awesome, thank you, I added a few comments over there.
Co-Authored-By: shikharsg <shikhar_sg@hotmail.com>
* Chunkwise iteration over arrays. Closes zarr-developers#398.
* Fixed lint error from new flake8 version.
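As a rough illustration of what chunkwise iteration buys, a minimal sketch assuming only the public slicing API; this is illustrative, not the implementation from zarr-developers#398:

    import numpy as np
    import zarr

    def iter_chunkwise(arr):
        # Yield elements of a 1-D zarr array while decompressing each chunk
        # only once, instead of once per element as naive indexing would.
        chunk_len = arr.chunks[0]
        for start in range(0, arr.shape[0], chunk_len):
            # Read one whole chunk into memory as a NumPy array...
            chunk = arr[start:start + chunk_len]
            # ...then yield items from that in-memory buffer.
            for item in chunk:
                yield item

    z = zarr.array(np.arange(1000), chunks=100)
    assert sum(iter_chunkwise(z)) == sum(range(1000))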
* add mongodb and redis stores; still needs docs and some CI setup
* top-level doc strings
* fix host kwarg
* pickle support
* different way of launching dbs on travis
* back to default travis configs
* fixes to mapping classes for both redis and mongodb stores
* default redis port
* pep8
* decode for py2?
* no decode for py2
* address comments
* cast to binary type in mongo getitem
* more doc strings
* more docs
* split release note into two bullets
* whitespace fix in .travis.yml
* lint after merge
* pin mongo/redis versions and a few doc changes
* use redis client.delete and check for deleted keys
* fix typo in requirements
* Update docs/release.rst. Co-Authored-By: jhamman <jhamman@ucar.edu>
* Update docs/release.rst. Co-Authored-By: jhamman <jhamman@ucar.edu>
* skip redis/mongodb tests when unable to connect
* fix pep8
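For flavor, a minimal sketch of the Redis side, assuming the redis-py client and a simple key prefix; MiniRedisStore is hypothetical and not the RedisStore added by these commits:

    import redis
    from collections.abc import MutableMapping

    class MiniRedisStore(MutableMapping):
        """Minimal sketch of a Redis-backed key-value store."""

        def __init__(self, prefix='zarr', **kwargs):
            self.prefix = prefix
            # kwargs go straight to the redis-py client,
            # e.g. host='localhost', port=6379.
            self.client = redis.Redis(**kwargs)

        def _key(self, key):
            return '{}:{}'.format(self.prefix, key)

        def __getitem__(self, key):
            value = self.client.get(self._key(key))
            if value is None:
                raise KeyError(key)
            return value

        def __setitem__(self, key, value):
            self.client.set(self._key(key), value)

        def __delitem__(self, key):
            # DEL returns the number of keys removed; 0 means it was absent,
            # which lets us raise KeyError without a separate existence check.
            if not self.client.delete(self._key(key)):
                raise KeyError(key)

        def _keys(self):
            offset = len(self._key(''))  # length of prefix plus separator
            return [k[offset:].decode('utf-8')
                    for k in self.client.keys(self._key('*'))]

        def __iter__(self):
            return iter(self._keys())

        def __len__(self):
            return len(self._keys())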
* basic implementation working
* docs and cleanup
* fixed client_kwargs bug
* Add ABSStore mutable mapping
* Fix import syntax
* Get open_zarr() working
* Change account variable names
* Fix client.exists() logging issue
* Minor comment changes
* Get to_zarr() working
* Remove state['container'] delete
* Implement rmdir
* Add docstring for ABSStore
* Remove GCSStore from this branch
* Fixed missing argument in getsize of ABSStore. It was missing self.container_name as an argument.
* Specified prefix argument in rmdir for ABSStore
* Fixed join string error in dir_path in ABSStore. join only accepts one argument; using os.path.join(x, y) formats the string as a valid file path for us.
* Remove logging work-around, as the issue was fixed in azure-storage 1.3.0
* Clean up docstring
* Remove more GCSStore code
* Move utility functions into ABSStore class
* implemented the rest of the mutable mapping functions; tests pass with Python 3.5
* using local blob emulator for storage.ABSStore testing
* fixed PY2 array.array error in storage.ABSStore
* create test container if not exists in ABSStore test
* added more tests for ABSStore
* reverted blob client creation to inside of ABSStore
* added group test for ABSStore
* emulator connection string not needed
* fixed import statement location and put azure-storage-blob in requirements
* fixed pickle tests
* fixed listdir in ABSStore
* fixed getsize
* Fixed PY2 pickle test; Python 2 pickle can't pickle instance methods
* implemented the suggestion from here: #293 (comment)
* flake8 fixes
* added azure-storage-blob
* first attempt at docker build with azurite
* azure storage emulator in appveyor
* syntax correction
* checking if emulator is preinstalled
* syntax fix
* syntax fix
* syntax fix
* removed wrong syntax
* storage emulator with docker
* trying different appveyor image
* flake8 fixes
* full coverage
* verbose logs for pip install to see appveyor error
* trying to run locally installed emulator
* single-double quote yaml fix
* cmd prefix
* double quotes around exe file path
* double quotes within single quotes with environment variable substitution
* trying appveyor build with VS2015 image
* added comment and removed verbosity option for pip install
* list_abs_directory to list only directory blobs, using the delimiter option in the azure blob client
* fixed ABSStore docs
* fixed windows path listdir error
* ABSStore refactoring
* moved py2 array.array checking to numcodecs ensure_bytes
* syntax fix
* flake8 fix
* fixed ABSStore parameter name container
* removed context manager from ABSStore
* ABSStore.__delitem__ now takes only 1 azure storage API call
* docs
* Update zarr/storage.py. Co-Authored-By: shikharsg <shikhar_sg@hotmail.com>
* removed global import of azure storage library
* added ABSStore to zarr root import
* added ABSStore to tutorial.rst
* fixed docs
* trying to fix tutorial.rst
* flake8 fix
* fixing tutorial.rst
* fixed ABSStore in tutorial
* docs
* small change to docs
* cleaned create blob code
* flake8 fix
* Update docs/release.rst. Co-Authored-By: shikharsg <shikhar_sg@hotmail.com>
* Apply suggestions from code review. Co-Authored-By: shikharsg <shikhar_sg@hotmail.com>
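As an aside on the list_abs_directory item in the commit list above: a hedged sketch of delimiter-based listing against the legacy azure-storage BlockBlobService API; the function name and details here are illustrative, and the actual ABSStore code may differ:

    from azure.storage.blob import BlockBlobService

    def list_abs_directory(client, container_name, prefix):
        # With delimiter='/', list_blobs returns BlobPrefix entries for
        # virtual directories instead of recursing into them, so only the
        # immediate children under prefix come back.
        if prefix and not prefix.endswith('/'):
            prefix += '/'
        children = set()
        for blob in client.list_blobs(container_name, prefix=prefix,
                                      delimiter='/'):
            # blob.name is the full key for blobs, or the directory path
            # (ending in '/') for BlobPrefix entries.
            name = blob.name[len(prefix):].rstrip('/')
            if name:
                children.add(name)
        return sorted(children)

    # e.g. client = BlockBlobService(account_name='...', account_key='...')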
[Description of PR]
TODO:
* Unit tests and doctests pass locally under Python 3.6 (e.g., run `tox -e py36` or `pytest -v --doctest-modules zarr`)
* Unit tests pass locally under Python 2.7 (e.g., run `tox -e py27` or `pytest -v zarr`)
* PEP8 checks pass (e.g., run `tox -e py36` or `flake8 --max-line-length=100 zarr`)
* Doctests in tutorial pass (e.g., run `tox -e py36` or `python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst`)
* Docs build locally (e.g., run `tox -e docs`)