Support for multiple .dvc roots in a single git repo #2349

Closed
guysmoilov opened this issue Jul 31, 2019 · 19 comments · Fixed by #3257
Labels: feature request, p1-important, product: VSCode, research

Comments

@guysmoilov
Contributor

This request comes from a large company that has used DVC in the past but moved away mostly due to this issue.

Some companies (e.g. Google) store all of their different projects' and teams' code in a single big git repo.

In this scenario, each project has its own subdirectory in the repo, and they are expected to only make changes to that subdirectory, unless they are contributing code to another project in the organization.

Unfortunately, DVC can only create its .dvc folder in the git repo root. This is problematic for a couple of reasons:

  1. The data science team in question might want to work with DVC, but they are blocked from doing so, either just politically or even with automatically enforced authorizations
  2. DVC can break backwards compatibility, and different projects in the monorepo might use different versions of DVC

I could probably think of a couple other reasons why this might be a problem.

It seems to me that requiring the .dvc folder be in the root folder is pretty arbitrary, and giving the option to put it in other places in the tree would open the way for wider adoption.

@efiop efiop added the feature request Requesting a new feature label Jul 31, 2019
@efiop
Contributor

efiop commented Jul 31, 2019

Hi @guysmoilov !

Great suggestion! Are you talking about dvc init specifically, or about running dvc commands after dvc init --no-scm? The latter can be run anywhere, but subsequent commands from within that subrepo won't be able to use git. We would have to fix both in any case, though.

@efiop
Contributor

efiop commented Jul 31, 2019

Btw, I suppose you are not talking about a repo consisting of git submodules, right? We are able to handle those right now. In that case to support this scenario, we'll have to

  1. allow dvc init in directories that are under git control but are not git repo root. We could consider doing this by default, but I'm afraid that some people will do that accidentally and then will be confused as to why they are not able to dvc add something in the parent dir. A safer choice is to introduce something like dvc init --sub.
  2. make the Git class (and scm in general) search for the git repo root up the tree, instead of just trying to use the dvc repo root. Enabling this behavior by default might also be confusing, so we might introduce a config option, set by dvc init --sub, that would tell dvc to behave this way.
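
To illustrate point 2, here is a minimal sketch of an upward search for the git root. The helper name `find_git_root` is hypothetical and is not part of DVC's actual code; it only shows the general idea of walking up from the dvc root until a `.git` entry is found.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `.git`.

    Hypothetical helper for illustration only, not DVC's API.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        # `.git` is a directory in a normal clone and a file in submodules
        # or worktrees, so test for existence rather than is_dir().
        if (candidate / ".git").exists():
            return candidate
    return None
```

With something like this, a dvc project initialized in `repo/teams/ds` could still locate `repo/.git` for its SCM operations.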

Also, for both parts, we would be able to detect misuse and print some nice hints for users.

Overall, seems pretty straightforward. What are your thoughts, guys?

@shcheklein
Member

@efiop sounds very reasonable to me.

@guysmoilov
Contributor Author

@efiop Hi Ruslan, I was referring to a normal dvc init, in a normal git repo without submodules.
Just adding the ability to treat separate subdirectories in a normal git repo as separate dvc projects.
(And this is just my experience/opinion, but almost no one uses git submodules, it's really awkward)

@efiop
Contributor

efiop commented Aug 1, 2019

@guysmoilov Thanks for clarifying! I agree that there are not that many people using git submodules, but that feature was contributed by a user, so that says something 🙂

@efiop efiop added p2-medium Medium priority, should be done, but less important p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Aug 1, 2019
@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Aug 19, 2019
@sai-prasanna

sai-prasanna commented Oct 13, 2019

@efiop Our team maintains multiple related projects in the same git repository and would love to get this feature running. I would like to get started contributing to this feature. I am new to dvc's code base, will start looking into it this week. Any pointers to what to look for and where would be great.

It would also be great to know which features this change would affect, e.g. would creating multiple roots in the same repository require the GC code to be modified?

@pared
Contributor

pared commented Oct 14, 2019

@sai-prasanna thank you for your interest in this issue!

To give you some starting point:

  1. You will surely need to comment-out this check
  2. when initializing DvcRepo we are assuming that git root is the same as dvc root here. That should be reconsidered.

Some useful methods should be in the dvc/scm package; that is where the main logic related to code version control lives.

I hope that will give you some grasp on the issue. Please ping us with any further questions.

@sai-prasanna

@pared Thanks, will take a look.

Unfortunately I can't start working on it properly right away. If anyone else wants to work on this immediately, feel free.

@pared
Contributor

pared commented Dec 4, 2019

OK, at the request of one of the users, I will try to summarize what is needed to complete this task:

  1. Introduce a --sub flag for dvc init: when this flag is used, we should set the proper config value,
    so one would need to modify the config and provide a new config option, most likely a boolean.

  2. In the few places where we initialize SCM (e.g. dvc init, or when initializing the git repo) we should be aware of the mentioned config option and provide some logic to search for the Git root up the tree if it's not available in the same dir as the one we are trying to initialize. SCM is also used in the analytics module, so that has to be checked too.

  3. We should probably detect an attempt to use dvc init --sub in the same dir where .git is, as it indicates misuse of the flag, and the user should be given a hint that they are probably not doing what they want to do.
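
The misuse check in point 3 could look roughly like the sketch below. The function name `check_sub_init`, its return strings, and the rule that a plain init outside the git root is rejected are all assumptions made for illustration; none of this is DVC's real implementation.

```python
from pathlib import Path


def check_sub_init(target: Path, sub: bool) -> str:
    """Classify a `dvc init` request for `target` (hypothetical helper)."""
    has_git_here = (target / ".git").exists()
    if sub and has_git_here:
        # --sub in the git root itself is most likely a mistake.
        return "misuse: --sub is unnecessary in the git root"
    if not sub and not has_git_here:
        # Assumed rule: a plain init is only allowed in the git root.
        return "error: not a git root; use --sub for a subdirectory"
    return "ok"
```

A real implementation would presumably raise an error or print a hint instead of returning a string; the string result just keeps the sketch easy to test.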

@efiop efiop added p1-important Important, aka current backlog of things to do product: VSCode Integration with VSCode extension and removed p2-medium Medium priority, should be done, but less important labels Jan 23, 2020
@efiop efiop added the research label Jan 28, 2020
@pared
Contributor

pared commented Jan 29, 2020

Point from @shcheklein to think about:
What about the use case when there is already a git & dvc root and someone wants to initialize a sub dvc repo somewhere down the tree?

@efiop
Contributor

efiop commented Jan 29, 2020

@pared I think we should handle it the way git does in that case, by just ignoring it when collecting stages 🙂 Basically the same as having the sub-repo in the .dvcignore of the host repo
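
A rough sketch of that git-like behaviour: walk the host repo and prune any subdirectory that has its own `.dvc` folder. The name `iter_stage_files` is hypothetical, and treating every file ending in `.dvc` as a stage file is a simplifying assumption, not a description of DVC's actual stage collection.

```python
import os
from pathlib import Path
from typing import Iterator


def iter_stage_files(root: Path) -> Iterator[Path]:
    """Yield `*.dvc` files under `root`, skipping nested dvc projects.

    Hypothetical helper for illustration only.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        current = Path(dirpath)
        # Prune in place so os.walk never descends into internal folders
        # or into subdirectories that are dvc roots themselves.
        dirnames[:] = [
            d for d in dirnames
            if d not in {".dvc", ".git"} and not (current / d / ".dvc").is_dir()
        ]
        for name in filenames:
            if name.endswith(".dvc"):
                yield current / name
```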

@pared
Contributor

pared commented Jan 29, 2020

@efiop or, for starters we can simply forbid that :)

@pared pared self-assigned this Jan 29, 2020
@efiop
Contributor

efiop commented Jan 29, 2020

@pared but that will break all of my tests, that I run in the dvc root 😄 I might be missing some issues here, but at least to me it feels like git-like behaviour is reasonable. Maybe you'll find some arguments against that.

@Suor
Contributor

Suor commented Feb 3, 2020

We discussed this issue during planning and haven't come to any conclusion. My personal opinion is that implementing this will be costly in the long run:

  • most probably lots of code in dvc implicitly presupposes dvc and git having same roots
  • most future features will need to be aware of this possibility
  • existing features will need to be adapted; we'll miss most of this in the initial implementation and will be fixing related bugs for months
  • this will add another factor to the combinatorial explosion of our tests

So I expect this to slow us down permanently or for a long period of time. This leads to a question: how valuable is this? Previously this was mentioned in the context of configurable/partial remotes in #2095 and #2825, maybe that would be enough for many cases?

@guysmoilov
Contributor Author

@Suor Well I don't know about prioritization, but I've talked to several companies who wanted to use DVC but this was a 100% blocker for them. They won't change their whole organization's setup to use DVC. So if you want them to ever be users you'll probably need to support it at some point.

@Suor
Contributor

Suor commented Feb 3, 2020

@guysmoilov but if it were possible to use a single dvc repo per git repo and configure remotes by folder, wouldn't that be good enough?

@guysmoilov
Contributor Author

@Suor I wouldn't think so. What makes remotes so special? I might want to use different versions of DVC for different parts of the tree for example, have separate caches, etc. Think completely different teams in the same organization, they might not know each other or not be in the same continent.

@pokey

pokey commented Feb 4, 2020

Fwiw my use case should be solvable using configurable / partial remotes. We're building a centralized repo for all of our datasets, and some of them live in Box, others live in s3 buckets on different AWS accounts which can't be allowed to cross-contaminate etc. So if there is a robust solution to configurable / partial remotes that can convince our infosec people, I'm happy.

I think supporting separate caches might make the separation even cleaner

@shcheklein
Member

@pokey do you have a clear separation, i.e. one remote per project? Or do you want to keep different data types in different remotes (e.g. models in one, datasets in another)? It seems to me that configurable remotes are a good feature in their own right; I'm just trying to clarify the best way to achieve it:

  1. being able to specify certain options per "output" (aka data artifact): whether we need to push/pull it by default, which remote should be used by default, etc. This means you would just change it in the .dvc files or specify it when you run dvc add.
  2. set in the .dvc/config or somewhere nearby certain rules - mapping path (glob) -> remote.
  3. probably "specifying remotes per folder" is similar to the 2. cc @Suor
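
Option 2 (a path glob -> remote mapping) can be sketched as a small first-match resolver. The rule format, the function name `resolve_remote`, and the remote names in the usage example are all made up for illustration; DVC's config does not define such rules here.

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence, Tuple

Rule = Tuple[str, str]  # (glob pattern, remote name), a hypothetical format


def resolve_remote(
    path: str, rules: Sequence[Rule], default: Optional[str] = None
) -> Optional[str]:
    """Return the remote of the first rule whose glob matches `path`."""
    for pattern, remote in rules:
        # fnmatchcase's `*` also matches `/`, so `datasets/box/*`
        # covers nested paths as well.
        if fnmatchcase(path, pattern):
            return remote
    return default
```

For example, with rules `[("datasets/box/*", "box"), ("datasets/s3/*", "s3-team-a")]`, the path `datasets/box/raw/a.csv` resolves to `box`, and an unmatched path falls back to the default remote. First-match ordering is one possible design choice; a real implementation might prefer the most specific pattern instead.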

I would say this feature is more related to:

#2095
https://stackoverflow.com/questions/58952962/how-to-use-different-remotes-for-different-folders
#2095 (comment)

and a bunch of other related things with push/pull granularity

It's not directly related to "multiple roots" support, but it is itself a very common issue our users hit.

@pokey please chime in in the ^^ ticket.

7 participants