New Datastore abstraction for Metaflow #580

Merged · 123 commits · Sep 24, 2021
Commits
b118685
Add external_field to parameter
romain-intel Jul 28, 2020
dd9cff5
Fix timedate conversion from service to client
romain-intel Jul 28, 2020
7fb4d81
New datastore implementation
romain-intel Jul 20, 2020
1992d07
Local backend for the datastore.
romain-intel Jul 20, 2020
7b24768
Datatools performance improvements
romain-intel Jul 20, 2020
37845f0
Added S3 datastore backend.
romain-intel Jul 21, 2020
610c07f
Pure non-semantic changes (name changes and comment changes).
romain-intel Jul 21, 2020
620d793
Integrate the new datastore implementation into the current code
romain-intel Sep 3, 2020
e6283e0
Remove no longer needed files for the datastore.
romain-intel Jul 21, 2020
644c394
New features for the S3 datatools:
romain-intel Oct 1, 2020
89dd822
Fix metadata to allow for non-string arguments
romain-intel Oct 20, 2020
1a844a2
[WIP] Modify new datastore to use metadata in backend storage
romain-intel Oct 20, 2020
fc59d0b
Properly use download_fileobj when no range is specified
romain-intel Oct 21, 2020
41b28b5
Merge remote-tracking branch 'origin/datatools-improvements' into new…
romain-intel Oct 21, 2020
91aca8d
Re-add intermediate compatibility mode for loading blobs from the CAS
romain-intel Oct 22, 2020
92534f4
Remove print message; fix local_backend in some cases
romain-intel Oct 23, 2020
6d8bc19
[WIP] Datatools: See if moving info to worker makes a performance dif…
romain-intel Oct 23, 2020
4334f3a
Fix issue with s3 datastore backend for multiple files
romain-intel Oct 26, 2020
1a3678c
Fix issue with s3 datastore backend for multiple files
romain-intel Oct 26, 2020
f863203
Addressed comments
romain-intel Dec 1, 2020
06d6e47
Merge tag '2.2.5' into convergence
romain-intel Dec 1, 2020
fd9a90c
Merge branch 'convergence' into new-datastore
romain-intel Dec 1, 2020
e92e242
Merge branch 'new-datastore' into datatools-improvements
romain-intel Dec 1, 2020
ee86d43
Merge branch 'datatools-improvements' into new-datastore-with-meta
romain-intel Dec 1, 2020
64e65bc
Merge branch 'new-datastore-with-meta' into datatools-s3-move-info-to…
romain-intel Dec 1, 2020
46a9ab7
Improve import of metaflow_custom to allow for top-level imports
romain-intel Dec 5, 2020
9e50df6
Fix small issue with sidecar message types
romain-intel Dec 5, 2020
e602b68
Fix tags for attempt-done in task-datastore
romain-intel Dec 5, 2020
83f0a24
Merge branch 'convergence' into new-datastore
romain-intel Dec 5, 2020
c601690
Merge branch 'new-datastore' into datatools-improvements
romain-intel Dec 5, 2020
40ac0b1
Merge branch 'datatools-improvements' into new-datastore-with-meta
romain-intel Dec 5, 2020
f80d8c5
Merge branch 'new-datastore-with-meta' into datatools-s3-move-info-to…
romain-intel Dec 5, 2020
b063cde
Merge remote-tracking branch 'origin/master' into convergence
romain-intel Jan 9, 2021
0e83661
Fix small issue with sidecar message types
romain-intel Dec 5, 2020
83df535
Merge remote-tracking branch 'origin/convergence' into improve-top-le…
romain-intel Jan 9, 2021
b4c9338
Merge branch 'improve-top-level-import' into new-datastore
romain-intel Jan 9, 2021
71b3f06
Merge branch 'new-datastore' into datatools-improvements
romain-intel Jan 9, 2021
10fe74b
Merge branch 'datatools-improvements' into new-datastore-with-meta
romain-intel Jan 9, 2021
b3f9a0e
Merge branch 'new-datastore-with-meta' into datatools-s3-move-info-to…
romain-intel Jan 9, 2021
b06c401
Addressed comments -- minor tweaks and file refactoring
romain-intel Jan 23, 2021
9037ff3
Merge remote-tracking branch 'origin/new-datastore' into datatools-im…
romain-intel Jan 23, 2021
4eef7ca
Merge remote-tracking branch 'origin/datatools-improvements' into new…
romain-intel Jan 23, 2021
9006d25
Addressed comment: removed handling of ephemeral CAS encoding directl…
romain-intel Jan 23, 2021
e415875
Merge remote-tracking branch 'origin/new-datastore-with-meta' into da…
romain-intel Jan 23, 2021
af43f54
Addressed comments; refactored use of pipes
romain-intel Jan 23, 2021
93f04c9
Forgot file in previous commit
romain-intel Jan 25, 2021
af49b96
Merge remote-tracking branch 'origin/new-datastore' into datatools-im…
romain-intel Jan 26, 2021
4598301
Typo in parentheses causing issues with put of multiple files
romain-intel Jan 26, 2021
bcd12c8
Merge remote-tracking branch 'origin/datatools-improvements' into new…
romain-intel Jan 26, 2021
8bf3ac8
Fix issue with S3PutObject
romain-intel Jan 26, 2021
c8588ca
Merge remote-tracking branch 'origin/new-datastore-with-meta' into da…
romain-intel Jan 26, 2021
70b6f71
Fix bugs due to int/string conversion issues with new file mechanism
romain-intel Jan 26, 2021
1a74d8e
Merge pull request #368 from Netflix/datatools-s3-move-info-to-worker
romain-intel Jan 27, 2021
cdcdcaa
Merge pull request #367 from Netflix/new-datastore-with-meta
romain-intel Jan 27, 2021
0b675e5
Merge pull request #351 from Netflix/datatools-improvements
romain-intel Jan 27, 2021
c866cba
Merge pull request #413 from Netflix/new-datastore
romain-intel Jan 27, 2021
3440e79
Merge pull request #412 from Netflix/improve-top-level-import
romain-intel Jan 27, 2021
7df9b7f
Update to 2.2.7
romain-intel Mar 10, 2021
7dc68ee
Fix issue with is_none and py2 vs py3
romain-intel Mar 10, 2021
ff79544
Fix issue with metaflow listing all flows
romain-intel Mar 10, 2021
fce59c4
Handle non-integer task/run IDs in local metadata
romain-intel Mar 10, 2021
218edb7
Merge remote-tracking branch 'origin/master' into convergence
romain-intel Apr 1, 2021
ee32a59
WIP: MFLog on convergence branch
romain-intel Apr 20, 2021
12a5ce6
Fixups to track master MFLog and specific convergence fixups
romain-intel Apr 20, 2021
6368022
Fix to Batch logging
romain-intel Apr 20, 2021
7c78d45
Add support for getting all artifacts from the filecache
romain-intel Apr 27, 2021
9d46029
Improve MFLog:
romain-intel Apr 28, 2021
72eda18
Merge pull request #496 from Netflix/convergence-all-artifacts
romain-intel Apr 30, 2021
9d8e2fd
Fix a bug when missing data is requested from the datastore
romain-intel May 1, 2021
764bc6f
Merge pull request #514 from Netflix/convergence-fix-missing-data
romain-intel May 11, 2021
a4b79b0
Trigger test on convergence branch
romain-intel Jun 5, 2021
8429d32
Merge pull request #479 from Netflix/convergence-mflog
romain-intel Jun 5, 2021
9cf7015
Merge remote-tracking branch 'origin/master' into convergence-master-…
romain-intel Jun 5, 2021
0bc2e97
Merge remote-tracking branch 'origin/master' into convergence-master-…
romain-intel Jun 5, 2021
e2185ef
Fix failing tests
romain-intel Jun 7, 2021
59c533f
Merge pull request #558 from Netflix/convergence-master-merge
romain-intel Jun 7, 2021
868bc8f
Merge remote-tracking branch 'origin/master' into convergence
romain-intel Jun 17, 2021
dd6a747
Fix issue caused by merge_artifacts now having artifacts read from da…
romain-intel Jun 22, 2021
4964173
Convergence master merge (#578)
romain-intel Jun 23, 2021
dfaf408
Revert "Add external_field to parameter"
romain-intel Jun 6, 2021
99e2752
Small MFlog refactor
oavdeev Jun 25, 2021
9b4f8bb
Merge branch 'master' into convergence
romain-intel Jun 25, 2021
0824158
Add a way to pass an external client to S3()
romain-intel Jun 16, 2021
f1a344a
Cache the S3 boto connection in the datastore (#570)
romain-intel Jul 10, 2021
168aa02
Merge remote-tracking branch 'origin/master' into convergence
romain-intel Jul 14, 2021
eccac96
Merge remote-tracking branch 'origin/master' into convergence
romain-intel Jul 16, 2021
8231475
Add an option to define your own AWS client provider (#611)
romain-intel Jul 19, 2021
7a4dc85
Add inputs to task_pre_step (#616)
romain-intel Jul 19, 2021
d29b28a
Tweaks and optimizations to datastore
tuulos Aug 16, 2021
7adf69c
Merge master into convergence
romain-intel Aug 17, 2021
d95d8fc
Merge remote-tracking branch 'origin/master' into convergence
romain-intel Aug 17, 2021
bab3eb8
Fix Py2 compatibility
romain-intel Aug 17, 2021
acc15b5
Fix typo in s3op.py
romain-intel Aug 17, 2021
af729de
Merge branch 'master' into convergence
savingoyal Aug 24, 2021
77c073d
Pin test executions to master branch
savingoyal Aug 24, 2021
6b2141f
Fix names of metadata file to include ending
romain-intel Sep 2, 2021
a4823a5
Renamed metaflow_custom to metaflow_extensions
romain-intel Sep 2, 2021
2ae82a4
Merge branch 'master' into convergence
savingoyal Sep 9, 2021
a811715
Remove duplicate code
savingoyal Sep 9, 2021
d3333a6
remove .json suffix from metadata
savingoyal Sep 9, 2021
3d2ede7
Merge branch 'master' into convergence
savingoyal Sep 9, 2021
5a5358c
test commit
savingoyal Sep 9, 2021
5aaec2d
remove spurious commits
savingoyal Sep 9, 2021
92a73c4
Merge branch 'master' into convergence
savingoyal Sep 10, 2021
154cd90
Fix merge
savingoyal Sep 10, 2021
4640279
remove new line
savingoyal Sep 10, 2021
c460efd
Fixed batch related bug in convergence. (#710)
valayDave Sep 21, 2021
6c8804f
Merge branch 'master' into convergence
savingoyal Sep 21, 2021
667ea84
fix merge conflicts
savingoyal Sep 21, 2021
74566c8
Batch fixes for convergence branch
savingoyal Sep 22, 2021
0ef2c2b
fixup aws batch
savingoyal Sep 22, 2021
ce57b48
fix whitespace
savingoyal Sep 22, 2021
ddd8de3
fix imports
savingoyal Sep 22, 2021
23eba3b
fix newline
savingoyal Sep 22, 2021
7937fde
fix merge
savingoyal Sep 22, 2021
79a972c
fix local metadata for aws batch
savingoyal Sep 22, 2021
34472b2
merge mflog
savingoyal Sep 22, 2021
0532596
stylistic fixes
savingoyal Sep 22, 2021
a75dc47
Merge remote-tracking branch 'origin/master' into convergence
romain-intel Sep 22, 2021
3de05b0
Added preliminary documentation on the datastore (#593)
romain-intel Sep 24, 2021
5985f37
Final set of patches to convergence (#714)
savingoyal Sep 24, 2021
e13fe92
Merge remote-tracking branch 'origin/master' into convergence
romain-intel Sep 24, 2021
67f23d6
Typo in error message
romain-intel Sep 24, 2021
216 changes: 216 additions & 0 deletions docs/datastore.md
@@ -0,0 +1,216 @@
# Datastore design
## Motivation

The datastore is a crucial part of the Metaflow architecture and deals with
storing and retrieving data, be it artifacts (data produced or consumed within
user steps), logs, metadata used by Metaflow itself to track execution, or
other data such as code packages.

One of the key benefits of Metaflow is the ease with which users can access the
data; it is made available to steps of a flow that need it and users can access
it using the Metaflow client API.

This documentation provides a brief overview of Metaflow's datastore implementation
and points out ways in which it can be extended to support, for example, other
storage systems (like GCS instead of S3).

## High-level design

### Design principles
A few principles were followed in designing this datastore. They are listed here
for reference and to help explain some of the choices made.

#### Backward compatibility
The new datastore should be able to read and interact with data stored using
an older implementation of the datastore. While we do not guarantee forward
compatibility, older datastores should currently be able to read most of the
data stored using the newer datastore.

#### Batch operations
Where possible, APIs are batch-friendly and should be used that way. In other
words, it is typically more efficient to call an API once, passing it all the
items to operate on (for example, all the keys to fetch), than to call the same
API multiple times with a single key at a time. All APIs are designed with
batch processing in mind where it makes sense.
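
As an illustration, here is a hypothetical sketch of the difference (the
`storage` object and the exact return shape of `load_bytes` are assumptions;
the `with`-based usage is described in the `DataStoreStorage` section below):

```python
keys = ["a/key1", "a/key2", "a/key3"]

# Preferred: a single batched call that fetches all keys at once.
with storage.load_bytes(keys) as loaded:
    for item in loaded:
        pass  # consume each loaded item while the `with` block is open

# Discouraged: one call per key, which multiplies round trips to the backend.
for key in keys:
    with storage.load_bytes([key]) as loaded:
        pass
```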

#### Separation of responsibilities
Each class implements a small, well-defined set of functionality, and we attempted
to maximize reuse. The idea is that this will make it easier to develop newer
implementations going forward, surgically changing a few things while keeping
most of the code the same.

### Storage structure
Before going into the design of the datastore itself, it is worth considering
**where** Metaflow stores its information. Note that, in this section, the term
`directory` can also refer to a `prefix` in S3 for example.

Metaflow considers a datastore to have a `datastore_root` which is the base
directory of the datastore. Within that directory, Metaflow will create multiple
sub-directories, one per flow (identified by the name of the flow). Within each
of those directories, Metaflow will create one directory per run as well as
a `data` directory which will contain all the artifacts ever produced by that
flow.
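
As a rough illustration, the resulting layout looks like the sketch below (the
structure below the run level is indicative only and follows the task `pathspec`):

```
<datastore_root>/
    MyFlow/                 # one directory per flow
        data/               # content-addressed artifacts for the whole flow
        <run_id>/           # one directory per run
            <step_name>/
                <task_id>/  # task-level data: metadata, logs, ...
```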

The datastore has several components (starting at the lowest level):
- a `DataStoreStorage` which abstracts away a storage system (like S3 or
the local filesystem). This provides very simple methods to read and write
bytes, obtain metadata about a file, list a directory as well as minor path
manipulation routines. Metaflow provides sample S3 and local filesystem
implementations. When implementing a new backend, you should only need to
implement the methods defined in `DataStoreStorage` to integrate with the
rest of the Metaflow datastore implementation.
- a `ContentAddressedStore` which implements a thin layer on top of a
`DataStoreStorage` to allow the storing of byte blobs in a content-addressable
manner. In other words, for each `ContentAddressedStore`, identical objects are
stored once and only once, thereby providing some measure of de-duplication.
This class includes the determination of what content is the same or not as well
as any additional encoding/compressing prior to storing the blob in the
`DataStoreStorage`. You can extend this class by providing alternate methods of
packing and unpacking the blob into bytes to be saved.
- a `TaskDataStore`, which is the main interface through which the rest of Metaflow
interacts with the datastore. It includes functions for persisting (saving) and
loading (getting) artifacts, as well as logs and metadata.
- a `FlowDataStore`, which ties everything together. A `FlowDataStore` will include
a `ContentAddressedStore` and all the `TaskDataStore`s for all the tasks that
are part of the flow. The `FlowDataStore` includes functions to find the
`TaskDataStore` for a given task as well as to save and load data directly
(this is used primarily for data that is not tied to a single task, for example
code packages, which are more tied to runs).

From the above description, you can see that there is one `ContentAddressedStore`
per flow so artifacts are de-duplicated *per flow* but not across all flows.

## Implementation details

In this section, we describe each individual class mentioned above in more
detail.

### `DataStoreStorage` class

This class implements low-level operations directly interacting with the
file system (or other storage system such as S3). It exposes a file- and
directory-like abstraction (with functions such as `path_join`, `path_split`,
`basename`, `dirname` and `is_file`).

Files manipulated at this level are byte objects; the two main functions `save_bytes`
and `load_bytes` operate at the byte level. Additional metadata to save alongside
the file can also be provided as a dictionary. The backend does not parse or
interpret this metadata in any way and simply stores and retrieves it.

`load_bytes` has the particularity that it returns a `CloseAfterUse` object
which must be used in a `with` statement. Any bytes loaded will not be
accessible after the `with` statement terminates, so they must be used or
copied elsewhere before the `with` scope exits.
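
A minimal usage sketch, assuming `storage` is a `DataStoreStorage` implementation
and that each loaded item exposes a key, a local path (or `None` for a missing
key) and its metadata; the exact item shape may differ:

```python
paths_to_load = ["MyFlow/data/some_key"]

# `load_bytes` returns a `CloseAfterUse` object; everything must be consumed
# (or copied) inside the `with` block.
with storage.load_bytes(paths_to_load) as loaded:
    for key, local_path, metadata in loaded:
        if local_path is None:
            continue  # the key did not exist in the backend
        with open(local_path, "rb") as f:
            data = f.read()  # copy the bytes out before the block exits
# Past this point, temporary files backing the loaded bytes may be gone.
```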

### `ContentAddressedStore` class

The content addressed store also handles content as bytes but performs two
additional operations:
- de-duplicates data based on the content of the data (in other words, two
identical blobs of data will only be stored once);
- transforms the data prior to storing; we currently only compress the data, but
other operations are possible.

Data is always de-duplicated, but you can choose to skip the transformation step
by telling the content addressed store that the data should be stored `raw` (i.e.,
with no transformation). Note that the de-duplication logic happens *prior* to
any transformation (so the transformation itself does not impact the
de-duplication logic).

Content stored by the content addressed store is addressable using a `key` which is
returned when `save_blobs` is called. `raw` objects can also directly be accessed
using a `uri` (also returned by `save_blobs`); the `uri` will point to the location
of the `raw` bytes in the underlying `DataStoreStorage` (for example, a local
filesystem path or an S3 path). Objects that are not `raw` do not return a `uri`
as they should only be accessed through the content addressed store.

The symmetrical function to `save_blobs` is `load_blobs` which takes a list of
keys (returned by `save_blobs`) and loads all the objects requested. Note that
at this level of abstraction, there is no `metadata` for the blobs; other
mechanisms exist to store, for example, task metadata or information about
artifacts.
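
A sketch of the round trip, assuming `ca_store` is a `ContentAddressedStore`,
that each result of `save_blobs` carries a `key` (plus a `uri` for `raw` objects),
and that `load_blobs` yields `(key, bytes)` pairs; the exact signatures may differ:

```python
blobs = [b"first payload", b"second payload", b"first payload"]  # duplicate on purpose

results = ca_store.save_blobs(blobs)
keys = [result.key for result in results]
# Identical payloads map to the same key, so the first and third blobs are
# stored only once in the backend.

for key, blob in ca_store.load_blobs(keys):
    assert isinstance(blob, bytes)
```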

#### Implementation detail

The content addressed store contains several pairs (well, currently only one pair)
of functions named `_pack_vX` and `_unpack_vX`. They correspond to the
transformations the data undergoes: the transformation applied before storing and
the reverse transformation applied when loading. The `X` corresponds to the
version of the transformation, allowing new transformations to be added easily.
A `_unpack_backward_compatible` method also allows this datastore to read any
data that was stored with a previous version of the datastore. Note that, going
forward, if a new datastore implements `_pack_v2` and `_unpack_v2`, this datastore
would not be able to unpack data packed with `_pack_v2`, but it would throw a
clear error explaining what is happening.
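
A conceptual sketch of what a versioned pack/unpack pair does (the real methods
live on `ContentAddressedStore` and their exact signatures may differ; gzip
compression mirrors the "we currently only compress" statement above):

```python
import gzip
import io

def _pack_v1(blob):
    # Transformation applied before storing: gzip compression.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(blob)
    return buf.getvalue()

def _unpack_v1(packed):
    # Reverse transformation applied when loading.
    with gzip.GzipFile(fileobj=io.BytesIO(packed), mode="rb") as f:
        return f.read()

assert _unpack_v1(_pack_v1(b"hello")) == b"hello"
```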

### `TaskDataStore` class

This is the meatiest class and contains most of the functionality that an executing
task will use. The `TaskDataStore` is also used when accessing information and
artifacts through the Metaflow Client.

#### Overview

At a high level, the `TaskDataStore` is responsible for:
- storing artifacts (functions like `save_artifacts`, `persist` help with this)
- storing other metadata about the task execution; this can include logs,
general information about the task, user-level metadata and any other information
the user wishes to persist about the task. Functions for this include
`save_logs` and `save_metadata`. Internally, functions like `done` will
also store information about the task.

Artifacts are stored using the `ContentAddressedStore` that is common to all
tasks in a flow; all other data and metadata is stored using the `DataStoreStorage`
directly at a location indicated by the `pathspec` of the task.

#### Saving artifacts

To save artifacts, the `TaskDataStore` will first pickle the artifacts, thereby
transforming a Python object into bytes. Those bytes will then be passed down
to the `ContentAddressedStore`. In other words, in terms of data transformation:
- Initially you have a pickle-able Python object
- `TaskDataStore` pickles it and transforms it to `bytes`
- Those `bytes` are then de-duped by the `ContentAddressedStore`
- The `ContentAddressedStore` will also gzip the `bytes` and store them
in the storage backend.

Crucially, the `TaskDataStore` takes (and returns when loading artifacts)
Python objects, whereas the `ContentAddressedStore` only operates on bytes.
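
The chain can be illustrated with a self-contained sketch for a single artifact
(the real code paths batch artifacts and handle errors, and the exact hashing
and compression details may differ):

```python
import gzip
import hashlib
import pickle

artifact = {"model": [1, 2, 3], "epoch": 7}

as_bytes = pickle.dumps(artifact)          # TaskDataStore: Python object -> bytes
key = hashlib.sha1(as_bytes).hexdigest()   # ContentAddressedStore: content-based key (illustrative)
stored = gzip.compress(as_bytes)           # ContentAddressedStore: transform, then store in the backend

restored = pickle.loads(gzip.decompress(stored))  # loading reverses the chain
assert restored == artifact
```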

#### Saving metadata and logs

Metadata and logs are stored directly as files created and written through the
`DataStoreStorage`. The name of each file is determined internally by the
`TaskDataStore`.

### `FlowDataStore` class

The `FlowDataStore` class doesn't do much except give access to `TaskDataStore`
(in effect, it creates the `TaskDataStore` objects to use) and also allows
files to be stored in the `ContentAddressedStore` directly. This is used to
store, for example, code packages. Files stored using the `save_data` method
are stored in `raw` format (as in, they are not further compressed). They will,
however, still be de-duped.

### Caching

The datastore allows caching at the `ContentAddressedStore` level for blobs
(the objects returned by `load_blobs` in the `ContentAddressedStore`). Objects
in this cache have already been read from the backend storage system and have
gone through the data transformations in `ContentAddressedStore`.

The datastore does not determine how or where to cache the data; it simply
calls the functions `load_key` and `store_key` on a cache configured by the user
using `set_blob_cache`.
`load_key` is expected to return the cached object for a key (if present) or
`None` otherwise. `store_key` takes a key (the one passed to `load`) and the
object to store. The external cache is free to implement its own policies and
behavior for the `load_key` and `store_key` functions.

As an example, the `FileCache` uses the `blob_cache` construct to write to a
file anything passed to `store_key` and returns it by reading from that file
when `load_key` is called. The persistence of the file is controlled by the
`FileCache`, so an artifact stored with `store_key` may vanish from the cache
and would simply be re-downloaded by the datastore when needed (and then added
to the cache again).
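
A minimal in-memory sketch of such a cache, assuming the `load_key`/`store_key`
protocol described above (the wiring via `set_blob_cache` follows this document,
but the exact call site may differ):

```python
class InMemoryBlobCache:
    def __init__(self):
        self._cache = {}

    def load_key(self, key):
        # Return the cached object if present, None otherwise.
        return self._cache.get(key)

    def store_key(self, key, blob):
        # The cache is free to evict entries later; the datastore will simply
        # re-download and re-store anything that has vanished.
        self._cache[key] = blob

# Wiring point (per this document): ca_store.set_blob_cache(InMemoryBlobCache())
```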