Cache and remote structure in the next major version #6702
iesahin
started this conversation in
New Features & Ideas
This is a more detailed discussion of the cache and remote structure that I started in #6653. Please look for technical and UI-related downsides and implementation difficulties in this proposal.
Requirements
dvc init
and tracked in Git.
Preliminary design
This is preliminary and will possibly undergo drastic changes after reviews.
Content directories
Content directories are those that contain the file contents as a whole, in parts or as merged in TAR files.
They are named after the hash value of content or TAR file they contain.
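As a rough illustration of this naming rule, the sketch below derives a content directory path from the bytes it would hold. The proposal does not name a hash function; MD5 is used here only because DVC's current cache uses it. The function name `content_dir_for` and the `root` parameter are hypothetical.

```python
import hashlib
from pathlib import Path


def content_dir_for(data: bytes, root: Path, algorithm: str = "md5") -> Path:
    """Return the directory path named after the hash of `data`.

    Assumption: the directory name is the hex digest of the content.
    The hash algorithm is not specified in the proposal.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    return root / digest
```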
When the file is large, it looks like
When the file is small, the directory looks like
When the file is an archive file, the directory looks like
Content directory hierarchies
These content directories can all reside in hierarchies of various depth:
to
are possible.
When a DVC command looks for a file in the contents, it can look for either the exact directory or the 2-digit directory that may contain it.
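The lookup described above could be sketched as follows. This is only a minimal illustration under assumptions: the two layouts checked (a flat `<root>/<hash>` directory and a nested `<root>/<first two characters>/<hash>` directory) and the function name `find_content_dir` are hypothetical, and deeper hierarchies are not handled.

```python
from pathlib import Path
from typing import Optional


def find_content_dir(root: Path, content_hash: str) -> Optional[Path]:
    """Return the content directory for `content_hash`, or None if absent.

    Assumed layouts (hypothetical, for illustration only):
      <root>/<hash>              flat
      <root>/<hash[:2]>/<hash>   nested under a 2-character prefix directory
    """
    # First try the exact directory at the top level.
    exact = root / content_hash
    if exact.is_dir():
        return exact

    # Otherwise look inside the 2-character prefix directory.
    nested = root / content_hash[:2] / content_hash
    if nested.is_dir():
        return nested

    return None
```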
Operation Logs
An `oplog` is similar to Git's reflog, but it contains the operations performed by a DVC repository in a remote.

These logs contain `GET`/`PUT`/`DELETE` instructions, the path of the file in the DVC remote, and the hash value that identifies the content directory. They could be in JSON for easier parsing; here I used a flat structure.

These logs are named per-DVC-project-GUID + per-access-id + timestamp. So, when I access a remote from `192.168.1.1`, with a project GUID of `ZZ90-FF11`, it will create two files in `/META/logs/`.

`dvc-begin==ZZ90-FF11==192.168.1.1==189798739837.log` may contain some information about the transaction, like the command that initiated it. This is the first file created by DVC in `dvc push` and similar operations. It signals the beginning of a transaction. It may also contain a plan for the transaction if it's known.

`dvc-end==ZZ90-FF11==192.168.1.1==189798729837.log` contains all the operations successfully performed by DVC at the end of the transaction.

When a `dvc-begin...` file is present in `/META/logs/` but not `dvc-end...`, it tells us the transaction wasn't properly executed. DVC can try to reproduce the command, or delete this `dvc-begin-...` file if the operation was cancelled intentionally.

Another kind of file, named `dvc-base==...`, can be created from time to time to list all the previous operations and to list the files. A command like `dvc remote fsck` that merges all `dvc-begin...` and `dvc-end...` files into `dvc-base=` files can allow DVC to load the most recent `dvc-base=...` and the later `dvc-begin=...` and `dvc-end=...` files to get all the files available in a remote, with their hash values and paths in the repository.

These oplogs are duplicated in all caches and remotes, given the unique filenames. If there are N users of a repository, with M remotes, it's possible to get a list of each of the N users' cache status and the files in the M remotes without making a network request. So if a certain file is known to be available in remote A, but not remote B, DVC can automatically request the file from remote A.
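To make the naming scheme and the unfinished-transaction check more concrete, here is a minimal sketch. It assumes that the `dvc-begin` and `dvc-end` files of one transaction share the same project GUID, access id and timestamp, which the proposal does not state explicitly. The names `OplogName`, `parse_oplog_name` and `unfinished_transactions` are hypothetical, and the contents of the log files are not modelled.

```python
from dataclasses import dataclass
from typing import Iterable, List

SEP = "=="


@dataclass(frozen=True)
class OplogName:
    kind: str        # e.g. "begin" or "end"
    guid: str        # DVC project GUID
    access_id: str   # e.g. the IP address the remote was accessed from
    timestamp: str

    def filename(self) -> str:
        return (f"dvc-{self.kind}{SEP}{self.guid}{SEP}"
                f"{self.access_id}{SEP}{self.timestamp}.log")


def parse_oplog_name(filename: str) -> OplogName:
    """Split 'dvc-<kind>==<guid>==<access-id>==<timestamp>.log' into fields."""
    stem = filename[:-len(".log")] if filename.endswith(".log") else filename
    prefix, guid, access_id, timestamp = stem.split(SEP)
    if not prefix.startswith("dvc-"):
        raise ValueError(f"not an oplog filename: {filename!r}")
    return OplogName(prefix[len("dvc-"):], guid, access_id, timestamp)


def unfinished_transactions(filenames: Iterable[str]) -> List[OplogName]:
    """Return the 'begin' entries that have no matching 'end' entry.

    Assumption: begin and end files of a transaction carry the same
    (guid, access_id, timestamp) triple.
    """
    names = [parse_oplog_name(f) for f in filenames]
    ended = {(n.guid, n.access_id, n.timestamp)
             for n in names if n.kind == "end"}
    return [n for n in names
            if n.kind == "begin"
            and (n.guid, n.access_id, n.timestamp) not in ended]
```

Given a listing of `/META/logs/`, `unfinished_transactions` returns the transactions that DVC would either retry or clean up.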
TBC