config: set cache per project at a global/system level #5947
Replies: 13 comments
-
@wdixon thanks! Few questions here:
-
We are pretty new to DVC, so perhaps there is a better way to do things. Currently, each project saves its remote definition in .dvc/config. The way we envision using these is to define a curated dataset that is effectively a "data registry". A separate git/dvc project would define a particular network architecture and import one or more subsets of data registry projects required to do the training and track the model provenance. We would like to avoid naming conventions, etc., that provide the mapping, work on one system, and break when those mappings/links don't exist on another system.
I suppose multiple remotes complicate things; that is something I hadn't considered. Maybe you would have to adopt a naming convention for the projects themselves, and allow the global configuration to use project/pattern matching to assign things like cache.dir.
git uses a similar approach for selecting proxies: you can configure different URL patterns to use different settings. That is the basis of how I proposed the remote matching, but maybe swapping the remote URL for a project name would be a clean solution?
-
Thanks, @wdixon! It's totally reasonable to have multiple remotes. We even have a prioritized ticket to support multiple remotes per project; it's a common ask because artifacts differ in nature (e.g. sensitive data vs. models). Having separate caches also seems reasonable to me. Even though
Agreed! I was suggesting this as a workaround only.
It can be a good idea ... like have
Could you point me to the docs?
-
@wdixon @shcheklein I think I still don't quite get the situation.
-
Got it, but then, even if we were supporting Ahh... OK, I think I get it now:
-
https://git-scm.com/docs/git-config, and also with their credential interface: https://git-scm.com/docs/gitcredentials
-
I was thinking that this would be in the user's global config or the system config. If it's not defined, users would end up with their own private cache under .dvc/cache, so a plain checkout/pull would still work (not broken), but pointing to a shared cache would require configuration.
Yes, that was the thought, but project name could also work as long as we kept the name unique.
-
If this were a configuration based on project name, that would imply you need to define a name (perhaps in the core namespace?).
.dvc/config:
[core]
name = "some_name"
global or system dvc config:
[cache "some_n*"]
dir = /path/to/cache
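To make the proposal concrete, here is a minimal sketch of how such a lookup could behave. It is not DVC's implementation: `core.name` and the `[cache "<pattern>"]` sections are the hypothetical syntax proposed in this thread, `resolve_cache_dir` is an illustrative helper name, and the sketch parses with Python's `configparser` and matches with shell-style globs purely for convenience. When nothing matches, it falls back to the per-project `.dvc/cache`.

```python
import configparser
import fnmatch

DEFAULT_CACHE_DIR = ".dvc/cache"  # per-project default when no override applies


def _parse(text):
    """Parse INI-style text (an assumption; used here only for illustration)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def resolve_cache_dir(project_text, global_text):
    """Return the cache dir for a project, honoring name-pattern overrides.

    Hypothetical behavior: the first [cache "<pattern>"] section whose glob
    pattern matches the project's core.name wins.
    """
    name = _parse(project_text).get("core", "name", fallback="").strip('"')
    global_cfg = _parse(global_text)
    for section in global_cfg.sections():
        # A section header looks like: cache "some_n*"
        kind, _, pattern = section.partition(" ")
        pattern = pattern.strip().strip('"')
        if kind == "cache" and name and fnmatch.fnmatchcase(name, pattern):
            return global_cfg.get(section, "dir", fallback=DEFAULT_CACHE_DIR)
    # Unnamed project or no matching pattern: keep the private cache.
    return DEFAULT_CACHE_DIR


project = '[core]\nname = "some_name"\n'
global_or_system = '[cache "some_n*"]\ndir = /path/to/cache\n'
print(resolve_cache_dir(project, global_or_system))
```

Because the name is optional in this sketch, a project without `core.name` simply keeps its private cache, which matches the "not broken, just not shared" behavior described above.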
-
@wdixon yes, that's pretty much what I had in mind! The name can be optional, so people would set it only when it's needed.
-
Hello! Would #2095 address this (with a slightly different solution)? If so, maybe we can merge the issues.
-
#2095 seems to be about defining a configuration to override the bucket. #4519 is about being able to define which shared cache, with its machine-specific path, is associated with a given repo or set of repos. My initial thought was to match on the bucket or a bucket pattern, but based on this thread, associating a name with the DVC project and allowing global/system overrides of its settings is a much more generic solution than one for just the cache dir.
.dvc/config:
[core]
name = "some_name"
['remote "origin"']
url = s3://dvc-bucket/dev-repo
endpointurl = https://custom-endpoint.com
profile = your-profile
global or system dvc config:
[cache "some_n*"]
dir = /path/to/cache
url = s3://dvc-bucket/prod-repo
-
Right, so this is more about configuring the cache, and the remote was being proposed as a way to do that, but now we're thinking about adding a
-
Our team is starting to use DVC, working on different projects and different machines, including some shared servers (large number of CPUs/GPUs, lots of memory, etc.).
The shared cache config is somewhat cumbersome, as it's specific to a group of repos. One set of projects and remotes will use one cache, and another set will use a different cache, because there are different restrictions on the data and on who has access to it. We have specifically used the local config (.dvc/config.local) to assign cache.dir and keep path/machine-specific attributes out of git; however, this means that folks must remember to set the local cache.dir whenever they clone or set up a new training project with dvc import (otherwise they end up creating a copy of the data). Users could set it globally, but that is cumbersome when working with different caches and different projects (set, unset, and so on, with plenty of chances to get it wrong).
What might be nice is if the cache could be configured based on the remote. That way a system and/or global configuration could associate the correct shared cache dir.
A somewhat better global config:
['remote "project_a"']
url = s3://dvc-bucket/datasets/project_a_dataset
endpointurl = https://some.endpoint.com
profile = some-profile
cache_dir = /project_a/shared_dvc/cache
Even better would be to allow configuration based on URL matching, avoiding the need for any remote or profile naming convention. This would enable a system or global config with bucket-specific settings, since the bucket itself defines access control and already implies a grouping of what data might be shared in a cache.
An even better system or global config:
['remote "s3://dvc-bucket-a/datasets/*"']
cache_dir = /project_a/shared_dvc/cache
profile = some-profile
['remote "s3://dvc-bucket-b/*"']
cache_dir = /project_b/shared_dvc/cache
profile = some-profile
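As a rough illustration of the URL-matching idea, the sketch below maps a project's remote URL to the settings of the matching pattern. Everything here is an assumption made for illustration: DVC does not provide a `cache_dir` option on remotes or pattern-keyed remote sections, `settings_for_remote` is a hypothetical helper, the patterns are treated as shell-style globs, and "longest matching pattern wins" is just one possible precedence rule.

```python
import fnmatch

# Hypothetical global/system settings keyed by remote URL pattern,
# mirroring the example config above.
URL_SETTINGS = {
    "s3://dvc-bucket-a/datasets/*": {
        "cache_dir": "/project_a/shared_dvc/cache",
        "profile": "some-profile",
    },
    "s3://dvc-bucket-b/*": {
        "cache_dir": "/project_b/shared_dvc/cache",
        "profile": "some-profile",
    },
}


def settings_for_remote(url, url_settings):
    """Return the settings of the most specific pattern matching `url`.

    Specificity is approximated by pattern length (an assumption of this
    sketch). An empty dict means no pattern matched, so the project would
    keep its own defaults.
    """
    matches = [p for p in url_settings if fnmatch.fnmatchcase(url, p)]
    if not matches:
        return {}
    return dict(url_settings[max(matches, key=len)])


remote_url = "s3://dvc-bucket-a/datasets/project_a_dataset"
print(settings_for_remote(remote_url, URL_SETTINGS).get("cache_dir"))
```

With this shape, the project's own config only needs to carry the remote URL; the machine-specific cache path lives entirely in the system or global config.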