
New Datastore abstraction for Metaflow #580

Merged (123 commits) on Sep 24, 2021
Conversation

romain-intel
Contributor

@romain-intel commented Jun 22, 2021

Testing in progress...

Detailed notes to follow.

This commit contains only the new datastore code and none of the backend implementations.
The datastore is now split into different files (a sketch of the layering follows this list):
  - flow_datastore.py contains the top-level FlowDataStore implementation
  - task_datastore.py contains the task-level datastore implementation
  - content_addressed_store.py contains the underlying content-addressed store used by
    both of the above datastores.
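
As a rough illustration of how these layers could fit together, here is a minimal sketch; only the file names and FlowDataStore above come from this PR, and every method name and signature below is a hypothetical assumption rather than the actual Metaflow API:

```python
import hashlib


class ContentAddressedStore:
    """content_addressed_store.py: blobs keyed by a hash of their content."""

    def __init__(self, backend):
        self._backend = backend  # storage backend (local filesystem, S3, ...)

    def save_blob(self, blob: bytes) -> str:
        key = hashlib.sha1(blob).hexdigest()  # content-addressed key
        self._backend.save(key, blob)         # hypothetical backend call
        return key


class TaskDataStore:
    """task_datastore.py: persists the artifacts of a single task."""

    def __init__(self, ca_store, run_id, step_name, task_id):
        self._ca_store = ca_store
        self._path = f"{run_id}/{step_name}/{task_id}"

    def save_artifact(self, blob: bytes) -> str:
        return self._ca_store.save_blob(blob)


class FlowDataStore:
    """flow_datastore.py: top-level, flow-scoped entry point."""

    def __init__(self, flow_name, backend):
        self.flow_name = flow_name
        self._ca_store = ContentAddressedStore(backend)

    def get_task_datastore(self, run_id, step_name, task_id):
        return TaskDataStore(self._ca_store, run_id, step_name, task_id)
```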
The local backend is used to save and load local files.
Datatools now caches the S3 client it uses for single operations, resulting in
faster operation times.
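
The underlying idea, sketched below with a module-level cache (the helper is a hypothetical illustration, not Metaflow's actual code): constructing a boto3 client is comparatively expensive, so single-shot operations reuse one cached client instead of building a fresh one per call.

```python
import boto3

_s3_client = None  # cached across single operations


def _get_s3_client():
    """Return a cached boto3 S3 client, creating it on first use."""
    global _s3_client
    if _s3_client is None:
        _s3_client = boto3.client("s3")  # client construction is the slow part
    return _s3_client
```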

Another (unadvertised) optimization is that datatools can now take
an IOBase object directly, avoiding an additional copy.
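
For example (a usage sketch; the bucket, prefix, and key are placeholders), a file-like object can be handed to put directly:

```python
from io import BytesIO

from metaflow import S3

buf = BytesIO(b"payload bytes")  # any readable IOBase works here
with S3(s3root="s3://my-bucket/prefix") as s3:  # placeholder s3root
    # Passing the IOBase directly avoids materializing an extra
    # in-memory copy of the payload before upload.
    s3.put("my-key", buf)
```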

Finally, __del__ now performs a close, so the S3 datatool can be used
as a regular object rather than only within a context manager.
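
In practice that allows usage like the following sketch, where the object is never entered as a context manager and cleanup happens on garbage collection instead:

```python
from metaflow import S3

s3 = S3(s3root="s3://my-bucket/prefix")  # placeholder s3root; no `with` needed
obj = s3.get("my-key")
print(obj.path)  # local path of the downloaded object
# When `s3` goes out of scope, __del__ calls close() and releases the
# temporary files; an explicit s3.close() still works too.
```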
This backend allows the datastore to interface with S3.
One tiny semantic change in the way a tar file is read: the recommended
tarfile.open method is now used instead of constructing a TarFile object directly.
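
Concretely (a minimal sketch), the read path moves to the documented tarfile.open factory, which transparently detects compression:

```python
import io
import tarfile


def extract_tar(blob: bytes, dest_dir: str) -> None:
    # Before: tarfile.TarFile(fileobj=io.BytesIO(blob)) -- no transparent
    # compression handling.
    # After: the recommended open() factory; mode "r:*" auto-detects
    # gzip/bz2/xz compression.
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tf:
        tf.extractall(path=dest_dir)
```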
The datatools also gain (see the usage sketch after this list):
  - support for range queries (in get and get_many)
  - support for content-type and user metadata in put, put_many, and put_files
    (metadata can also be retrieved using any of the get calls)
  - support for info and info_many to retrieve information about a file without
    fetching it.
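
A hypothetical usage sketch of those calls (the keyword names and the shape of the range argument are assumptions; the datatools source is the authority):

```python
from metaflow import S3

with S3(s3root="s3://my-bucket/prefix") as s3:  # placeholder s3root
    # put with a content type and user metadata (keyword names assumed)
    s3.put(
        "report.json",
        b'{"ok": true}',
        content_type="application/json",
        metadata={"pipeline": "demo"},
    )

    # info retrieves size, content type, and metadata without fetching
    # the object body
    inf = s3.info("report.json")
    print(inf.size, inf.content_type, inf.metadata)

    # range queries in get/get_many fetch only a slice of the object; the
    # exact way the offset/length pair is passed is an assumption here
    head = s3.get("report.json", range_info=(0, 9))
```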
Instead of encoding the needed information directly in the file, we now
encode it as file metadata, leveraging the metadata support in the S3
datatools and implementing equivalent support in the local filesystem
(via a separate file).
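
For the local filesystem, the idea is roughly the following (a sketch; the sidecar naming convention is an assumption):

```python
import json


def save_with_metadata(path: str, blob: bytes, metadata: dict) -> None:
    # The object's bytes stay free of any bookkeeping...
    with open(path, "wb") as f:
        f.write(blob)
    # ...while the metadata lives in a separate sidecar file (suffix is
    # hypothetical), mirroring what S3 stores natively as object metadata.
    with open(path + "_meta", "w") as f:
        json.dump(metadata, f)
```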
savingoyal and others added 14 commits September 22, 2021 07:53
* Added preliminary documentation on the datastore

* Update datastore.md

Update with new name for the backend.
* convergence fixes

* fixes

* gzip ts changes

* fix logs subcommand

* typo

* Address comments; add support for var_transform in bash_capture_logs

* Fix local metadata sync

* Forgot to remove duplicate sync_metadata

Co-authored-by: Romain Cledat <rcledat@netflix.com>
Co-authored-by: Romain <romain-intel@users.noreply.github.com>