Support for multiple .dvc roots in a single git repo #2349
Comments
Hi @guysmoilov ! Great suggestion! Are you talking about |
Btw, I suppose you are not talking about a repo consisting of git submodules, right? We are able to handle those right now. In that case to support this scenario, we'll have to
Also, for both parts, we would be able to detect misuse and print some nice hints for users. Overall, seems pretty straightforward. What are your thoughts, guys? |
@efiop sounds very reasonable to me. |
@efiop Hi Ruslan, I was referring to a normal |
@guysmoilov Thanks for clarifying! I agree that there are not that many people using git submodules, but we've got that feature contributed by a user, so that says something 🙂 |
@efiop Our team maintains multiple related projects in the same git repository and would love to get this feature running. I would like to start contributing to this feature. I am new to dvc's code base and will start looking into it this week. Any pointers on what to look for and where would be great, as would a list of the features this change would affect; e.g. would creating multiple roots in the same repository require the GC code to be modified? |
@sai-prasanna thank you for your interest in this issue! To give you some starting point:
Some useful methods should be under I hope that will give you some grasp on the issue. Please ping us with any further questions. |
@pared Thanks, will take a look. Unfortunately I can't start working on it properly right away. If anyone else wants to work on this immediately, feel free. |
Ok, at the request of one of the users, I will try to summarize what is needed to complete this task:
|
Point from @shcheklein to think about: |
@pared I think we should handle it as git in that case by just ignoring it when collecting stages 🙂 Basically the same as having |
@efiop or, for starters we can simply forbid that :) |
@pared but that will break all of my tests, that I run in the dvc root 😄 I might be missing some issues here, but at least to me it feels like git-like behaviour is reasonable. Maybe you'll find some arguments against that. |
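The git-like behaviour discussed above can be illustrated with a short sketch. It is not DVC's implementation: it assumes the case in question is a .dvc root nested inside another one, that stage files are recognisable by a .dvc suffix, and the function name collect_stage_files is hypothetical.

```python
import os
from pathlib import Path
from typing import List

# Directory names that are never descended into (control directories).
CONTROL_DIRS = {".dvc", ".git"}


def collect_stage_files(root: Path) -> List[Path]:
    """Collect stage files under `root`, skipping nested roots.

    Assumption: a subdirectory that contains its own `.dvc` directory is a
    separate root and is ignored entirely, the way git ignores the contents
    of a nested repository.
    """
    found: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        here = Path(dirpath)
        # Prune in place so os.walk does not descend into pruned directories.
        dirnames[:] = sorted(
            d
            for d in dirnames
            if d not in CONTROL_DIRS and not (here / d / ".dvc").is_dir()
        )
        # Assumption: stage files are the ones ending in ".dvc".
        found.extend(here / f for f in sorted(filenames) if f.endswith(".dvc"))
    return found
```

With this approach, running a command in the outer root never sees the stages of an inner root, so tests that create a fresh root inside an existing one keep working.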
We discussed this issue during planning and haven't come to any conclusion. My personal opinion is that implementing this will be costly in the long run:
So I expect this to slow us down permanently or for a long period of time. This leads to a question: how valuable is this? Previously this was mentioned in the context of configurable/partial remotes in #2095 and #2825; maybe that would be enough for many cases? |
@Suor Well I don't know about prioritization, but I've talked to several companies who wanted to use DVC but this was a 100% blocker for them. They won't change their whole organization's setup to use DVC. So if you want them to ever be users you'll probably need to support it at some point. |
@guysmoilov but if it were possible to use a single dvc repo per git repo but configure remotes by folder, wouldn't that be good enough? |
@Suor I wouldn't think so. What makes remotes so special? I might want to use different versions of DVC for different parts of the tree for example, have separate caches, etc. Think completely different teams in the same organization, they might not know each other or not be on the same continent. |
Fwiw my use case should be solvable using configurable / partial remotes. We're building a centralized repo for all of our datasets, and some of them live in Box, others live in s3 buckets on different AWS accounts which can't be allowed to cross-contaminate etc. So if there is a robust solution to configurable / partial remotes that can convince our infosec people, I'm happy. I think supporting separate caches might make the separation even cleaner |
@pokey do you have a clear separation, with one remote per project? Or do you want to keep different data types in different remotes (e.g. models in one, datasets in another)? It seems to me that configurable remotes are a good feature in their own right; I'm just trying to clarify the best way to achieve it:
I would say this feature is more related to #2095 and a bunch of other related things around push/pull granularity. It's not directly related to the "multiple roots" support, but is itself a very common issue our users hit. @pokey please chime in on the ^^ ticket. |
This request comes from a large company that has used DVC in the past but moved away mostly due to this issue.
Some companies (e.g. Google) store all of their different projects' and teams' code in a single big git repo.
In this scenario, each project has its own subdirectory in the repo, and they are expected to only make changes to that subdirectory, unless they are contributing code to another project in the organization.
Unfortunately, DVC can only create its .dvc folder in the git repo root. This is problematic for a couple of reasons:
I could probably think of a couple other reasons why this might be a problem.
It seems to me that requiring the .dvc folder to be in the root folder is pretty arbitrary, and giving the option to put it in other places in the tree would open the way for wider adoption.
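To make the request concrete, here is a minimal sketch of how a tool could locate the .dvc folder that applies to a given working directory if roots were allowed anywhere in the tree. It only illustrates the idea under that assumption and is not how DVC resolves its root; find_dvc_root is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

DVC_DIR = ".dvc"  # folder name taken from the issue text


def find_dvc_root(start: Path) -> Optional[Path]:
    """Return the closest directory, from `start` upwards, that holds `.dvc`.

    Assumption: the nearest enclosing root wins, so each project
    subdirectory in a monorepo can carry its own `.dvc` folder.
    """
    current = start.resolve()
    for candidate in (current, *current.parents):
        if (candidate / DVC_DIR).is_dir():
            return candidate
    return None  # no root found anywhere above `start`
```

In a monorepo laid out as one subdirectory per project, a command run anywhere inside a project would pick up that project's own root rather than one at the top of the git repo.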