exp: running experiments is very slow #5638

Closed
courentin opened this issue Mar 16, 2021 · 10 comments · Fixed by #6049
Assignees
Labels
p1-important Important, aka current backlog of things to do performance improvement over resource / time consuming tasks upstream Issues which need to be resolved in an upstream dependency

Comments

@courentin
Contributor

Bug Report

Description

Hello!

I'm trying to test the suggestion made in #5557, but running dvc exp run export_corpus_test_fr --set-param debug=true takes a long time (~8 min), whereas reproducing the stage without an experiment is much faster.
I added as many directories to .dvcignore as I could, but it didn't change anything, so I suspect the problem is in DVC itself.

Environment information

I'm using the dvc version from master (862e18) and dulwich from master too (bbcc4b).

DVC version: 2.0.5+862e18
---------------------------------
Platform: Python 3.7.6 on Darwin-20.2.0-x86_64-i386-64bit
Supports: http, https, s3
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s2s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk1s2s1
Repo: dvc, git

Additional Information:

Screenshot 2021-03-16 at 18 56 27

Profiling information here: exp.prof.zip
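
For readers who want to inspect a dump like the attached exp.prof, Python's standard pstats module can list the slowest calls. The following is a minimal sketch, not the reporter's actual workflow: `busy_work`, `dump_profile`, `top_functions`, and the file name `example.prof` are hypothetical stand-ins used only to produce and read a profile.

```python
import cProfile
import os
import pstats
import tempfile


def busy_work(n):
    """Hypothetical stand-in for a slow call; this is not DVC code."""
    return sum(i * i for i in range(n))


def dump_profile(path, n=100_000):
    """Profile busy_work and write the raw stats to `path`."""
    profiler = cProfile.Profile()
    profiler.enable()
    busy_work(n)
    profiler.disable()
    profiler.dump_stats(path)


def top_functions(path, limit=10):
    """Return (function name, cumulative seconds) pairs, slowest first."""
    stats = pstats.Stats(path)
    rows = [
        # key = (file, line, name); value = (cc, nc, tottime, cumtime, callers)
        (func[2], timings[3])
        for func, timings in stats.stats.items()
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prof_path = os.path.join(tmp, "example.prof")
        dump_profile(prof_path)
        for name, cumtime in top_functions(prof_path):
            print(f"{cumtime:8.4f}s  {name}")
```

Pointing `top_functions` at an unzipped profile file instead of the temporary one should show which calls dominate the run time.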

@pmrowla
Contributor

pmrowla commented Mar 17, 2021

The issue is with computing git status (via dulwich), not DVC status, so using .dvcignore won't address the issue.

@pmrowla
Contributor

pmrowla commented Mar 17, 2021

@courentin @ettadar do you have large untracked directories in your workspaces?

I know @ettadar's issue mentioned .venv in particular. Is your venv directory gitignored? (if not, can you try adding it to your .gitignore and then see if the performance changes)
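
One way to check whether a directory is gitignored is to ask git directly. The sketch below wraps `git check-ignore -q`, which exits with 0 when a path is ignored and 1 when it is not. The helper name `is_git_ignored` is hypothetical, and the example assumes the `git` executable is on the PATH.

```python
import subprocess


def is_git_ignored(repo_dir, path):
    """Return True if git would ignore `path` inside `repo_dir`.

    Relies on the exit status of `git check-ignore -q`:
    0 means ignored, 1 means not ignored, anything else is an error.
    """
    result = subprocess.run(
        ["git", "check-ignore", "-q", path],
        cwd=repo_dir,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode not in (0, 1):
        raise RuntimeError(
            f"git check-ignore failed with exit code {result.returncode}"
        )
    return result.returncode == 0


if __name__ == "__main__":
    # Run from the root of a git repository.
    print(is_git_ignored(".", ".venv"))
```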

@pmrowla pmrowla added awaiting response we are waiting for your reply, please respond! :) performance improvement over resource / time consuming tasks labels Mar 17, 2021
@shcheklein
Member

If it is the case that .venv is not git-ignored, it somewhat reminds me of a discussion we had about DVC collecting stages and data files: should we ignore some common subset of directories (.env, node_modules, etc.)? I wonder if it makes sense in this kind of situation to at least print a warning if we see some of those. Or maybe mention it in the docs/FAQ?

@ettadar

ettadar commented Mar 17, 2021

> @courentin @ettadar do you have large untracked directories in your workspaces?

Nope.

> I know @ettadar's issue mentioned .venv in particular. Is your venv directory gitignored? (if not, can you try adding it to your .gitignore and then see if the performance changes)

Yep, it is already gitignored. This is the output of git status:

On branch develop
Your branch is up to date with 'origin/develop'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   poetry.lock
        modified:   pyproject.toml

no changes added to commit (use "git add" and/or "git commit -a")

@pmrowla pmrowla added the p1-important Important, aka current backlog of things to do label Mar 17, 2021
@pmrowla pmrowla self-assigned this Mar 17, 2021
@pmrowla pmrowla added the upstream Issues which need to be resolved in an upstream dependency label Mar 17, 2021
@pmrowla
Contributor

pmrowla commented Mar 17, 2021

Can confirm with our example-get-started. If a large gitignored directory (e.g. .venv) is present, dvc exp run goes from taking 1.8s to 11.7s under cProfile.

It looks like dulwich status is walking the entire directory tree when checking for untracked files (instead of stopping as soon as a top-level directory is found to be ignored). This is probably also the cause of the performance issue that was reverted in #5596.
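
The difference between the two traversal strategies can be sketched with the standard library alone. This only illustrates the idea and is not dulwich's actual implementation: `count_visited`, `build_tree`, and the `ignored_dirs` set are hypothetical, and real gitignore matching is considerably more involved than comparing directory names.

```python
import os
import tempfile


def count_visited(root, ignored_dirs, prune):
    """Walk `root` and return (files_visited, files_reported).

    With prune=True, ignored directories are removed from `dirnames`
    in place, so os.walk never descends into them. With prune=False,
    every directory is walked and ignored files are filtered afterwards.
    """
    visited = 0
    reported = 0
    for dirpath, dirnames, filenames in os.walk(root):
        if prune:
            dirnames[:] = [d for d in dirnames if d not in ignored_dirs]
        rel = os.path.relpath(dirpath, root)
        inside_ignored = any(part in ignored_dirs for part in rel.split(os.sep))
        for _ in filenames:
            visited += 1
            if not inside_ignored:
                reported += 1
    return visited, reported


def build_tree(root, n_ignored_files=50):
    """Create a tiny workspace with a fake .venv holding many files."""
    os.makedirs(os.path.join(root, ".venv", "lib"))
    for i in range(n_ignored_files):
        with open(os.path.join(root, ".venv", "lib", f"mod{i}.py"), "w"):
            pass
    for name in ("dvc.yaml", "train.py"):
        with open(os.path.join(root, name), "w"):
            pass


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_tree(tmp)
        print(count_visited(tmp, {".venv"}, prune=False))  # (52, 2)
        print(count_visited(tmp, {".venv"}, prune=True))   # (2, 2)
```

Both calls report the same two untracked files, but the unpruned walk touches every file under .venv to get there, which is where the extra time goes when the ignored directory is large.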

@courentin
Contributor Author

> @courentin @ettadar do you have large untracked directories in your workspaces?

I have a .venv, but it is gitignored.

@pmrowla pmrowla removed the awaiting response we are waiting for your reply, please respond! :) label Mar 17, 2021
@pmrowla
Contributor

pmrowla commented Mar 17, 2021

Should be resolved by jelmer/dulwich#853

@jdonzallaz

Should this be resolved by the recent DVC updates?

@pmrowla
Contributor

pmrowla commented Apr 1, 2021

@jdonzallaz the upstream fix for dulwich has not been merged yet, so this issue will still affect the latest DVC release (2.0.15)

@pmrowla
Contributor

pmrowla commented Apr 20, 2021

For reference, while we are waiting on a dulwich release, anyone experiencing this issue can install the latest dulwich from source to get the fix (this only works if DVC was installed via pip):

pip install -U --force-reinstall git+https://github.com/dulwich/dulwich.git
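
After reinstalling, it can be useful to confirm which dulwich the environment actually picks up. A minimal sketch using the standard library's importlib.metadata (Python 3.8 or newer) follows. The helper name `installed_version` is hypothetical, and the version string printed for a source install depends on the dulwich checkout, so no particular value is assumed here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("dulwich", "dvc"):
        print(name, installed_version(name))
```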
