exp run: cannot clean up temp directory runs on Linux + NFS #7458

Closed
karajan1001 opened this issue Mar 14, 2022 · 13 comments · Fixed by iterative/scmrepo#51

Comments

@karajan1001
Contributor

Hello,

I am having this issue still. When I check the pack directories, one is empty and one has files in it:

(almds_dl) [starrgw1@login01 dvctest]$ ls .dvc/tmp/exps/tmpm5ix9f4j/.git/objects/pack/
pack-1186ca2730fcc6628a18da2631e206dc5d09791b.idx  pack-1186ca2730fcc6628a18da2631e206dc5d09791b.pack
(almds_dl) [starrgw1@login01 dvctest]$ ls .dvc/tmp/exps/tmpb02b0lgm/.git/objects/pack/
(almds_dl) [starrgw1@login01 dvctest]$
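A plain `ls` does not show dot-files, so a pack directory that looks empty may still hold hidden entries. The sketch below lists every entry, hidden ones included, and picks out names starting with `.nfs`. It assumes (this is not confirmed in this report) that leftover NFS placeholder files could be what keeps the directory non-empty. The helper names `list_all_entries` and `find_nfs_placeholders` are hypothetical.

```python
import os


def list_all_entries(directory):
    """Return the sorted names of every entry in `directory`, hidden ones included."""
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


def find_nfs_placeholders(directory):
    """Return entries whose names start with '.nfs'.

    Assumption: an NFS client may leave such placeholder files behind when a
    file is deleted while some process still has it open.
    """
    return [name for name in list_all_entries(directory) if name.startswith(".nfs")]
```

For example, `find_nfs_placeholders(".dvc/tmp/exps/tmpb02b0lgm/.git/objects/pack")` could be run right after the failure to see whether anything hidden remains.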

Below is the debug output:

(almds_dl) [starrgw1@login01 dvctest]$ dvc exp run --run-all -j 2 -v
2022-03-10 19:21:22,130 DEBUG: Reproducing experiment revs '96d579e, 03eef69'
2022-03-10 19:21:22,195 DEBUG: Writing experiments local config '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/.dvc/config.local'
2022-03-10 19:21:22,195 DEBUG: Init temp dir executor in '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm'
2022-03-10 19:21:22,227 DEBUG: Writing experiments local config '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j/.dvc/config.local'
2022-03-10 19:21:22,227 DEBUG: Init temp dir executor in '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j'
2022-03-10 19:21:22,337 DEBUG: Running repro in '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j'
2022-03-10 19:21:22,338 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j/.dvc/tmp/repro.dat'
2022-03-10 19:21:22,338 DEBUG: Running repro in '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm'
2022-03-10 19:21:22,338 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/.dvc/tmp/repro.dat'
2022-03-10 19:21:22,502 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,502 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,503 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,503 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,506 DEBUG: Dependency 'params.yaml' of stage: 'stage1' changed because it is '{'input_text': 'modified'}'.
2022-03-10 19:21:22,506 DEBUG: Dependency 'params.yaml' of stage: 'stage1' changed because it is '{'input_text': 'modified'}'.
2022-03-10 19:21:22,506 DEBUG: stage: 'stage1' changed.
2022-03-10 19:21:22,506 DEBUG: stage: 'stage1' changed.
2022-03-10 19:21:22,508 DEBUG: Removing output 'metrics.json' of stage: 'stage1'.
2022-03-10 19:21:22,508 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j/metrics.json'
2022-03-10 19:21:22,508 DEBUG: Removing output 'metrics.json' of stage: 'stage1'.
2022-03-10 19:21:22,509 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/metrics.json'
2022-03-10 19:21:22,511 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,511 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,514 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,514 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,514 DEBUG: {}
2022-03-10 19:21:22,515 DEBUG: {}
2022-03-10 19:21:22,516 DEBUG: defaultdict(<class 'dict'>, {'params.yaml': {'input_text': 'modified'}})
2022-03-10 19:21:22,516 DEBUG: defaultdict(<class 'dict'>, {'params.yaml': {'input_text': 'modified'}})
2022-03-10 19:21:22,517 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,518 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,520 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,520 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,523 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:22,523 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
Running stage 'stage1':
Running stage 'stage1':
> python submit_job.py stage1.py
> python submit_job.py stage1.py
/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/stage1.py
/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j/stage1.py
JobStatus(job_id='120799.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='Q', queue='short')
JobStatus(job_id='120800.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='Q', queue='short')
JobStatus(job_id='120799.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='Q', queue='short')
JobStatus(job_id='120800.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='Q', queue='short')
JobStatus(job_id='120799.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='Q', queue='short')
JobStatus(job_id='120800.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='Q', queue='short')
JobStatus(job_id='120799.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='R', queue='short')
JobStatus(job_id='120800.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='Q', queue='short')
JobStatus(job_id='120799.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='R', queue='short')
JobStatus(job_id='120800.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='R', queue='short')
JobStatus(job_id='120799.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='R', queue='short')
JobStatus(job_id='120800.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='R', queue='short')
JobStatus(job_id='120799.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='R', queue='short')
JobStatus(job_id='120800.vectivus.cm.cluster', name='qsub_script.sh', user='starrgw1', time_use='0', status='R', queue='short')
2022-03-10 19:21:58,158 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:58,162 DEBUG: state save (2286788293, 1646958103056720896, 8) 1166a8fbe4acb9cbfd182cfbb5fd9fdf
2022-03-10 19:21:58,162 DEBUG: state save (2286788293, 1646958103056720896, 8) 1166a8fbe4acb9cbfd182cfbb5fd9fdf
2022-03-10 19:21:58,163 DEBUG: Output 'metrics.json' doesn't use cache. Skipping saving.
2022-03-10 19:21:58,164 DEBUG: Computed stage: 'stage1' md5: '8aa9486314f0f8befb9277b8eb4e8def'
2022-03-10 19:21:58,165 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:58,168 DEBUG: state save (2286788291, 1646958082216025344, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:58,176 DEBUG: state save (2286788293, 1646958103056720896, 8) 1166a8fbe4acb9cbfd182cfbb5fd9fdf
2022-03-10 19:21:58,178 DEBUG: Preparing to transfer data from 'memory://dvc-staging/d701057ee2ce3fdcd5408c3336f74eedf76e8d2655257ef739b3ea22a3904799' to '/home/starrgw1/code/dvctest/.dvc/cache'
2022-03-10 19:21:58,178 DEBUG: Preparing to collect status from '/home/starrgw1/code/dvctest/.dvc/cache'
2022-03-10 19:21:58,178 DEBUG: Collecting status from '/home/starrgw1/code/dvctest/.dvc/cache'
2022-03-10 19:21:58,179 DEBUG: Preparing to collect status from 'memory://dvc-staging/d701057ee2ce3fdcd5408c3336f74eedf76e8d2655257ef739b3ea22a3904799'
2022-03-10 19:21:58,182 DEBUG: state save (2286788293, 1646958103056720896, 8) 1166a8fbe4acb9cbfd182cfbb5fd9fdf
2022-03-10 19:21:58,187 DEBUG: Uploading '/home/starrgw1/code/dvctest/.dvc/cache/.J4vv5NDtBQtjSh8ycSkWhC.tmp' to '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j/.SnbjvSEZ76cQkVrwvWcLN6.tmp'
2022-03-10 19:21:58,189 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j/.SnbjvSEZ76cQkVrwvWcLN6.tmp'
2022-03-10 19:21:58,189 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/cache/.J4vv5NDtBQtjSh8ycSkWhC.tmp'
2022-03-10 19:21:58,197 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j/metrics.json'
2022-03-10 19:21:58,198 DEBUG: Uploading '/home/starrgw1/code/dvctest/.dvc/cache/11/66a8fbe4acb9cbfd182cfbb5fd9fdf' to '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpm5ix9f4j/metrics.json'
2022-03-10 19:21:58,200 DEBUG: state save (2286788296, 1646958118198226432, 8) 1166a8fbe4acb9cbfd182cfbb5fd9fdf
2022-03-10 19:21:58,205 DEBUG: state save (2286788296, 1646958118198226432, 8) 1166a8fbe4acb9cbfd182cfbb5fd9fdf
2022-03-10 19:21:58,210 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:58,211 DEBUG: stage: 'stage1' was reproduced
2022-03-10 19:21:58,213 DEBUG: state save (1211851840, 1646958108677908736, 8) 9b7916dcfbccc49c18581fd80884fa56
2022-03-10 19:21:58,214 DEBUG: state save (1211851840, 1646958108677908736, 8) 9b7916dcfbccc49c18581fd80884fa56
2022-03-10 19:21:58,215 DEBUG: Output 'metrics.json' doesn't use cache. Skipping saving.
2022-03-10 19:21:58,216 DEBUG: Computed stage: 'stage1' md5: '7eb88065797b13721ef3ffa7cf7b13ed'
Updating lock file 'dvc.lock'
2022-03-10 19:21:58,217 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:58,220 DEBUG: state save (1211851838, 1646958082182024192, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-10 19:21:58,224 DEBUG: Staging files: {'stage1.py', 'dvc.yaml', 'params.yaml', 'dvc.lock', 'metrics.json'}
2022-03-10 19:21:58,227 DEBUG: state save (1211851840, 1646958108677908736, 8) 9b7916dcfbccc49c18581fd80884fa56
2022-03-10 19:21:58,228 DEBUG: Preparing to transfer data from 'memory://dvc-staging/d701057ee2ce3fdcd5408c3336f74eedf76e8d2655257ef739b3ea22a3904799' to '/home/starrgw1/code/dvctest/.dvc/cache'
2022-03-10 19:21:58,228 DEBUG: Preparing to collect status from '/home/starrgw1/code/dvctest/.dvc/cache'
2022-03-10 19:21:58,228 DEBUG: Collecting status from '/home/starrgw1/code/dvctest/.dvc/cache'
2022-03-10 19:21:58,229 DEBUG: Preparing to collect status from 'memory://dvc-staging/d701057ee2ce3fdcd5408c3336f74eedf76e8d2655257ef739b3ea22a3904799'
2022-03-10 19:21:58,231 DEBUG: state save (1211851840, 1646958108677908736, 8) 9b7916dcfbccc49c18581fd80884fa56
2022-03-10 19:21:58,236 DEBUG: Uploading '/home/starrgw1/code/dvctest/.dvc/cache/.MBmG3RvyuNGA6zEyzrYayc.tmp' to '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/.nDJoTijpRqfnoaZLmjd2Bk.tmp'
2022-03-10 19:21:58,238 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/.nDJoTijpRqfnoaZLmjd2Bk.tmp'
2022-03-10 19:21:58,238 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/cache/.MBmG3RvyuNGA6zEyzrYayc.tmp'
2022-03-10 19:21:58,238 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/metrics.json'
2022-03-10 19:21:58,239 DEBUG: Uploading '/home/starrgw1/code/dvctest/.dvc/cache/9b/7916dcfbccc49c18581fd80884fa56' to '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/metrics.json'
2022-03-10 19:21:58,240 DEBUG: Commit to new experiment branch 'refs/exps/96/bf9d4272ad61392a913bf8f1e7f77faf69defb/exp-bb312'
2022-03-10 19:21:58,240 DEBUG: state save (1211851845, 1646958118239227648, 8) 9b7916dcfbccc49c18581fd80884fa56
2022-03-10 19:21:58,245 DEBUG: state save (1211851845, 1646958118239227648, 8) 9b7916dcfbccc49c18581fd80884fa56
2022-03-10 19:21:58,252 DEBUG: stage: 'stage1' was reproduced
Updating lock file 'dvc.lock'
2022-03-10 19:21:58,265 DEBUG: Staging files: {'stage1.py', 'dvc.yaml', 'params.yaml', 'dvc.lock', 'metrics.json'}
2022-03-10 19:21:58,268 WARNING: The following untracked files were present in the experiment directory after reproduction but will not be included in experiment commits:
        qsub_script.sh, qsub_script.sh.o120799, qsub_script.sh.e120799
2022-03-10 19:21:58,284 DEBUG: Commit to new experiment branch 'refs/exps/96/bf9d4272ad61392a913bf8f1e7f77faf69defb/exp-b936d'
2022-03-10 19:21:58,308 WARNING: The following untracked files were present in the experiment directory after reproduction but will not be included in experiment commits:
        qsub_script.sh, qsub_script.sh.o120800, qsub_script.sh.e120800
2022-03-10 19:21:58,325 DEBUG: Collected experiment '2ba86e6'.
2022-03-10 19:21:58,326 DEBUG: Removing tmpdir '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm'
2022-03-10 19:21:58,326 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm'
2022-03-10 19:21:58,337 ERROR: unexpected error - [Errno 39] Directory not empty: '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/.git/objects/pack'
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 657, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'pack'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/cli/__init__.py", line 78, in main
    ret = cmd.do_run()
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/commands/experiments/run.py", line 32, in run
    results = self.repo.experiments.run(
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 825, in run
    return run(self.repo, *args, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/run.py", line 28, in run
    return repo.experiments.reproduce_queued(jobs=jobs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 457, in reproduce_queued
    results = self._reproduce_revs(**kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 53, in wrapper
    return f(exp, *args, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 644, in _reproduce_revs
    exec_results.update(self._executors_repro(manager, **kwargs))
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 64, in wrapper
    ret = f(exp, *args, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 675, in _executors_repro
    return manager.exec_queue(self.repo, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 159, in exec_queue
    return self._exec_attached(repo, jobs=jobs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 232, in _exec_attached
    self.cleanup_executor(rev, executor)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 270, in cleanup_executor
    executor.cleanup()
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/executor/local.py", line 110, in cleanup
    remove(self.root_dir)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/utils/fs.py", line 135, in remove
    shutil.rmtree(path, onerror=_chmod)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 718, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 659, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/utils/fs.py", line 120, in _chmod
    func(p)
OSError: [Errno 39] Directory not empty: '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/.git/objects/pack'
------------------------------------------------------------
2022-03-10 19:21:58,503 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 95] Operation not supported
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 657, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
OSError: [Errno 39] Directory not empty: 'pack'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/cli/__init__.py", line 78, in main
    ret = cmd.do_run()
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/commands/experiments/run.py", line 32, in run
    results = self.repo.experiments.run(
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 825, in run
    return run(self.repo, *args, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/run.py", line 28, in run
    return repo.experiments.reproduce_queued(jobs=jobs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 457, in reproduce_queued
    results = self._reproduce_revs(**kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 53, in wrapper
    return f(exp, *args, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 644, in _reproduce_revs
    exec_results.update(self._executors_repro(manager, **kwargs))
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 64, in wrapper
    ret = f(exp, *args, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/__init__.py", line 675, in _executors_repro
    return manager.exec_queue(self.repo, **kwargs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 159, in exec_queue
    return self._exec_attached(repo, jobs=jobs)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 232, in _exec_attached
    self.cleanup_executor(rev, executor)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/executor/manager/base.py", line 270, in cleanup_executor
    executor.cleanup()
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/repo/experiments/executor/local.py", line 110, in cleanup
    remove(self.root_dir)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/utils/fs.py", line 135, in remove
    shutil.rmtree(path, onerror=_chmod)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 718, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/shutil.py", line 659, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/utils/fs.py", line 120, in _chmod
    func(p)
OSError: [Errno 39] Directory not empty: '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpb02b0lgm/.git/objects/pack'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
    func(from_path, to_path)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/fs/local.py", line 144, in reflink
    System.reflink(from_info, to_info)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
    System._reflink_linux(source, link_name)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
    fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 95] Operation not supported

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "/home/starrgw1/.conda/envs/almds_dl/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2022-03-10 19:21:58,504 DEBUG: Removing '/home/starrgw1/code/.PHQHFmRzwbSbJrd69xLnsE.tmp'
2022-03-10 19:21:58,504 DEBUG: Removing '/home/starrgw1/code/.PHQHFmRzwbSbJrd69xLnsE.tmp'
2022-03-10 19:21:58,504 DEBUG: Removing '/home/starrgw1/code/.PHQHFmRzwbSbJrd69xLnsE.tmp'
2022-03-10 19:21:58,505 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/cache/.C7nbLHG6hDepE3gwJDP8Jb.tmp'
2022-03-10 19:21:58,585 DEBUG: Version info for developers:
DVC version: 2.9.4 (conda)
---------------------------------
Platform: Python 3.8.12 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.10
Supports:
        webhdfs (fsspec = 2022.2.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: nfs on master:/home
Caches: local
Remotes: None
Workspace directory: nfs on master:/home
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-03-10 19:21:58,586 DEBUG: Analytics is enabled.
2022-03-10 19:21:58,618 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpq2t5205u']'
2022-03-10 19:21:58,619 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpq2t5205u']'

Originally posted by @gregstarr in #5641 (comment)
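The traceback above ends in `shutil.rmtree` raising `[Errno 39] Directory not empty` for the `pack` directory. As a rough illustration only, the following sketch shows a cleanup helper that retries when that specific error occurs. It is not DVC's or scmrepo's code and not the fix that closed this issue. It assumes the directory becomes removable once whatever holds files open lets go of them, which may not be true in every setup. The name `remove_tree_with_retry` and its parameters are hypothetical.

```python
import errno
import shutil
import time


def remove_tree_with_retry(path, attempts=5, delay=0.5):
    """Remove `path`, retrying when a directory is reported as not empty.

    Returns True if the tree was removed (or was already gone) and False if
    it still could not be removed after `attempts` tries. Any error other
    than ENOTEMPTY is re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            # Nothing left to remove.
            return True
        except OSError as exc:
            if exc.errno != errno.ENOTEMPTY:
                raise
            if attempt < attempts - 1:
                # Assumption: waiting gives other holders time to close files.
                time.sleep(delay)
    return False
```

Returning `False` instead of raising lets a caller log a warning and continue, rather than aborting the whole experiment run because of a leftover temp directory.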

@pmrowla pmrowla self-assigned this Mar 14, 2022
@pmrowla pmrowla added research git Related to git and git backends regression Ohh, we broke something :-( labels Mar 14, 2022
@pmrowla
Contributor

pmrowla commented Mar 14, 2022

This is the same underlying problem as #5641. I looked into this some more, and the existing workaround probably won't work for us now that we use a lot more pygit2 functionality. scmrepo will need some updates to make the behavior safer when we are using mixed git backends.

@gregstarr

Thanks for the response. Can you think of any temporary workaround for me to use?

@karajan1001
Contributor Author

karajan1001 commented Mar 15, 2022

Rolling back to a previous version might work, but #5641 was fixed a long time ago, so I would not recommend rolling back to such an old version.

@pmrowla pmrowla added this to DVC Mar 15, 2022
@pmrowla pmrowla moved this to Backlog in DVC Mar 15, 2022
@pmrowla pmrowla added the p1-important Important, aka current backlog of things to do label Mar 15, 2022
@gregstarr

Is this bug caused by a change in a dependency? Or by DVC development? Do you have an idea of which change caused the bug?

> This is the same underlying problem as #5641. I looked into this some more, and the existing workaround probably won't work for us now that we use a lot more pygit2 functionality. scmrepo will need some updates to make the behavior safer when we are using mixed git backends.

Would not using mixed git backends help?

@pmrowla
Contributor

pmrowla commented Mar 15, 2022

There's not really a quick workaround for this issue right now. It's due to a lot of DVC changes that have been added in the year since #5641 was addressed.

> Would not using mixed git backends help?

The short answer is "yes", but the long answer is "it's more complicated than that", as we need different functionality from each of the backends we use (the backends do not all support the same feature set).

Since this is a regression, we will prioritize this issue and hopefully have a fix within the next couple of weeks.

@gregstarr

Ok gotcha. Is there anything I can do to help?

@gregstarr

I haven't tried any others, but it looks like at least 2.5.4 does not have this issue

@gregstarr

2.9.3 is the first version that has this issue. It also looks like this version updates scmrepo from 0.0.4 to 0.0.7.
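To make that version boundary easy to check in a script, here is a small sketch that compares a DVC version string against 2.9.3, the first release reported here to show the problem. The thread does not state which release contains the fix, so only the lower bound is checked. The names `FIRST_AFFECTED`, `release_tuple`, and `at_or_after_first_affected` are hypothetical.

```python
import re

# First DVC release reported in this thread to show the cleanup failure.
FIRST_AFFECTED = (2, 9, 3)


def release_tuple(version):
    """Parse the leading numeric part, e.g. '2.9.6.dev56+g5ff1d76' -> (2, 9, 6)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def at_or_after_first_affected(version):
    """Return True if `version` is 2.9.3 or newer (no upper bound is applied)."""
    return release_tuple(version) >= FIRST_AFFECTED
```

With this, `at_or_after_first_affected("2.5.4")` is false and `at_or_after_first_affected("2.9.4")` is true, matching the versions mentioned in this thread.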

@pmrowla
Contributor

pmrowla commented Mar 28, 2022

@gregstarr can you please test the linked PR and see if it resolves your issue? To test it you will need to set up a new virtualenv and then install DVC + the PR version of scmrepo via pip:

pip install dvc
pip install git+https://github.com/iterative/scmrepo.git@refs/pull/51/head
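After installing into a fresh virtualenv, it can help to confirm which versions actually ended up installed before testing. The following is a minimal sketch using the standard-library `importlib.metadata` module (Python 3.8+). The default package list is only an assumption about which packages are relevant here, and the function name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages=("dvc", "scmrepo", "pygit2", "dulwich", "GitPython")):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```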

@pmrowla pmrowla added the awaiting response we are waiting for your reply, please respond! :) label Mar 28, 2022
@gregstarr

> @gregstarr can you please test the linked PR and see if it resolves your issue? To test it you will need to set up a new virtualenv and then install DVC + the PR version of scmrepo via pip:
>
> pip install dvc
> pip install git+https://github.com/iterative/scmrepo.git@refs/pull/51/head

Yes I believe I can get to this today.

@pmrowla pmrowla moved this from In Progress to Review In Progress in DVC Mar 29, 2022
@gregstarr

@pmrowla Is there a specific version of dvc and scmrepo I should be using? I notice that dvc main requires scmrepo 0.0.14. When I install scmrepo with the command you sent, it installs scmrepo 0.0.13 and pip complains because my version of dvc (2.9.5) requires 0.0.7. I'm pretty sure it didn't mess up the installation of scmrepo but I just wanted to check.

@gregstarr

pip list:

(dvctest) [starrgw1@vn-059 dvctest]$ pip list
Package            Version
------------------ --------------------
aiohttp            3.8.1
aiohttp-retry      2.4.6
aiosignal          1.2.0
appdirs            1.4.4
async-timeout      4.0.2
asyncssh           2.8.1
atpublic           2.3
attrs              21.4.0
certifi            2021.10.8
cffi               1.14.5
chardet            4.0.0
charset-normalizer 2.0.12
colorama           0.4.4
commonmark         0.9.1
configobj          5.0.6
cryptography       36.0.2
decorator          4.4.2
dictdiffer         0.8.1
diskcache          5.2.1
distro             1.5.0
dpath              2.0.6
dulwich            0.20.35
dvc                2.9.6.dev56+g5ff1d76
dvc-render         0.0.4
flatten-dict       0.4.2
flufl.lock         7.0
frozenlist         1.3.0
fsspec             2022.2.0
ftfy               6.0.3
funcy              1.17
future             0.18.2
gitdb              4.0.7
GitPython          3.1.18
grandalf           0.6
idna               2.10
jsonpath-ng        1.5.2
mailchecker        4.1.14
multidict          6.0.2
nanotime           0.5.2
networkx           2.5.1
packaging          20.9
pathspec           0.9.0
phonenumbers       8.12.25
pip                21.2.4
ply                3.11
psutil             5.9.0
pyasn1             0.4.8
pycparser          2.20
pydot              1.4.2
pygit2             1.9.1
Pygments           2.9.0
pygtrie            2.3.2
pyparsing          2.4.7
python-benedict    0.25.0
python-dateutil    2.8.1
python-fsutil      0.6.0
python-slugify     6.1.1
PyYAML             6.0
requests           2.27.1
rich               12.0.1
ruamel.yaml        0.17.21
ruamel.yaml.clib   0.2.6
scmrepo            0.0.13
setuptools         58.0.4
shortuuid          1.0.1
shtab              1.3.6
six                1.16.0
smmap              4.0.0
tabulate           0.8.9
text-unidecode     1.3
toml               0.10.2
tqdm               4.63.1
typing-extensions  3.10.0.0
urllib3            1.26.5
voluptuous         0.12.1
wcwidth            0.2.5
wheel              0.37.1
xmltodict          0.12.0
yarl               1.7.2
zc.lockfile        2.0

Yes I think this worked:

(dvctest) [starrgw1@vn-059 dvctest]$ dvc exp run --temp -v
2022-03-29 12:23:26,411 DEBUG: Stashed experiment '5f782f8' with baseline 'd89221f' for future execution.
2022-03-29 12:23:26,424 DEBUG: Reproducing experiment revs '5f782f8'
2022-03-29 12:23:26,480 DEBUG: Writing experiments local config '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/.dvc/config.local'
2022-03-29 12:23:26,481 DEBUG: Init temp dir executor in '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1'
2022-03-29 12:23:26,594 DEBUG: Running repro in '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1'
2022-03-29 12:23:26,594 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/.dvc/tmp/repro.dat'
2022-03-29 12:23:26,849 DEBUG: 'cmd' of stage: 'stage1' has changed.
2022-03-29 12:23:26,849 DEBUG: stage: 'stage1' changed.
2022-03-29 12:23:26,851 DEBUG: Removing output 'metrics.json' of stage: 'stage1'.
2022-03-29 12:23:26,851 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/metrics.json'
2022-03-29 12:23:26,860 DEBUG: state save (1211867769, 1648571006464925184, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-29 12:23:26,862 DEBUG: state save (1211867769, 1648571006464925184, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-29 12:23:26,866 DEBUG: 'cmd' of stage: 'stage1' has changed.
2022-03-29 12:23:26,871 DEBUG: state save (1211867769, 1648571006464925184, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-29 12:23:26,876 DEBUG: state save (1211867769, 1648571006464925184, 350) 1079e31771794bac9a75210e2ac3ffda
Stage 'stage1' is cached - skipping run, checking out outputs
2022-03-29 12:23:26,923 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'hardlink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 18] Invalid cross-device link: '/scratch/tmp/starrgw1/almds/dvc_cache/.LhczmoeY6dmyt4o7AMw6s2.tmp' -> '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/.hDR8w9snJRo8kg4ZxTNBfF.tmp'
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/dvctest/lib/python3.9/site-packages/dvc/fs/utils.py", line 28, in _link
    func(from_path, to_path)
  File "/home/starrgw1/.conda/envs/dvctest/lib/python3.9/site-packages/dvc/fs/local.py", line 138, in hardlink
    System.hardlink(from_info, to_info)
  File "/home/starrgw1/.conda/envs/dvctest/lib/python3.9/site-packages/dvc/system.py", line 39, in hardlink
    os.link(src, link_name)
OSError: [Errno 18] Invalid cross-device link: '/scratch/tmp/starrgw1/almds/dvc_cache/.LhczmoeY6dmyt4o7AMw6s2.tmp' -> '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/.hDR8w9snJRo8kg4ZxTNBfF.tmp'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/dvctest/lib/python3.9/site-packages/dvc/fs/utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "/home/starrgw1/.conda/envs/dvctest/lib/python3.9/site-packages/dvc/fs/utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 95] 'hardlink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/starrgw1/.conda/envs/dvctest/lib/python3.9/site-packages/dvc/fs/utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "/home/starrgw1/.conda/envs/dvctest/lib/python3.9/site-packages/dvc/fs/utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2022-03-29 12:23:26,924 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/.hDR8w9snJRo8kg4ZxTNBfF.tmp'
2022-03-29 12:23:26,925 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/.hDR8w9snJRo8kg4ZxTNBfF.tmp'
2022-03-29 12:23:26,925 DEBUG: Removing '/scratch/tmp/starrgw1/almds/dvc_cache/.LhczmoeY6dmyt4o7AMw6s2.tmp'
2022-03-29 12:23:26,927 DEBUG: state save (1211867766, 1648570555000000000, 8) 66931171b8384ae18e7372b768728b49
2022-03-29 12:23:26,936 DEBUG: state save (1211867766, 1648570555000000000, 8) 66931171b8384ae18e7372b768728b49
2022-03-29 12:23:26,943 DEBUG: state save (1211867769, 1648571006464925184, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-29 12:23:26,954 DEBUG: Output 'metrics.json' doesn't use cache. Skipping saving.
2022-03-29 12:23:26,955 DEBUG: Computed stage: 'stage1' md5: 'c57d5879726ce79eb49d8bd16079ed9d'
2022-03-29 12:23:26,957 DEBUG: state save (1211867769, 1648571006464925184, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-29 12:23:26,961 DEBUG: state save (1211867769, 1648571006464925184, 350) 1079e31771794bac9a75210e2ac3ffda
2022-03-29 12:23:26,969 DEBUG: Preparing to transfer data from '/scratch/tmp/starrgw1/almds/dvc_cache' to '/scratch/tmp/starrgw1/almds/dvc_cache'
2022-03-29 12:23:27,024 DEBUG: Uploading '/scratch/tmp/starrgw1/almds/dvc_cache/.dmit5qefAPdLUDd3pTfXcP.tmp' to '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/.nhrpUY4BjfRn9Zcubnc935.tmp'
2022-03-29 12:23:27,034 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/.nhrpUY4BjfRn9Zcubnc935.tmp'
2022-03-29 12:23:27,034 DEBUG: Removing '/scratch/tmp/starrgw1/almds/dvc_cache/.dmit5qefAPdLUDd3pTfXcP.tmp'
2022-03-29 12:23:27,035 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/metrics.json'
2022-03-29 12:23:27,037 DEBUG: Uploading '/scratch/tmp/starrgw1/almds/dvc_cache/66/931171b8384ae18e7372b768728b49' to '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1/metrics.json'
2022-03-29 12:23:27,047 DEBUG: state save (1211867766, 1648571007043945472, 8) 66931171b8384ae18e7372b768728b49
2022-03-29 12:23:27,054 DEBUG: state save (1211867766, 1648571007043945472, 8) 66931171b8384ae18e7372b768728b49
2022-03-29 12:23:27,057 DEBUG: stage: 'stage1' was reproduced
Updating lock file 'dvc.lock'
2022-03-29 12:23:27,072 DEBUG: Staging files: {'params.yaml', 'dvc.lock', 'metrics.json', 'stage1.py', 'dvc.yaml'}
2022-03-29 12:23:27,090 DEBUG: Commit to new experiment branch 'refs/exps/d8/9221fd790b303b793558b0cd64d92cf49501f2/exp-650c6'
2022-03-29 12:23:27,159 DEBUG: Collected experiment '929d5c7'.
2022-03-29 12:23:27,159 DEBUG: Removing tmpdir '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1'
2022-03-29 12:23:27,159 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/tmpl8lr7ph1'
2022-03-29 12:23:27,170 DEBUG: Removing '/home/starrgw1/code/dvctest/.dvc/tmp/exps/run/5f782f808d056fd8e4b46538c526e5a3584e47ee'

Reproduced experiment(s): exp-650c6
To apply the results of an experiment to your workspace run:

        dvc exp apply <exp>

To promote an experiment to a Git branch run:

        dvc exp branch <exp> <branch>

2022-03-29 12:23:27,193 DEBUG: Analytics is enabled.
2022-03-29 12:23:27,303 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/scratch/pbstmp/121772.vectivus.cm.cluster/tmpo8mingy2']'
2022-03-29 12:23:27,304 DEBUG: Spawned '['daemon', '-q', 'analytics', '/scratch/pbstmp/121772.vectivus.cm.cluster/tmpo8mingy2']'
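
For readers puzzled by the traceback in the log above: `[Errno 18] Invalid cross-device link` is what `os.link` raises on Linux when the source and destination are on different filesystems (here the cache under `/scratch` and the temp dir under `/home`). The log then shows the files being copied instead. The snippet below is only a minimal sketch of that general "try a hardlink, fall back to a copy" pattern. It is not DVC's actual code, and `link_or_copy` is a hypothetical helper name used for illustration.

```python
import errno
import os
import shutil


def link_or_copy(src, dst):
    """Hypothetical helper: hardlink src to dst, or copy if linking is impossible."""
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError as exc:
        # EXDEV (errno 18 on Linux) means src and dst live on different
        # filesystems. EPERM / EOPNOTSUPP cover filesystems that refuse links.
        if exc.errno not in (errno.EXDEV, errno.EPERM, errno.EOPNOTSUPP):
            raise
        # Fall back to a plain byte copy, which works across devices.
        shutil.copyfile(src, dst)
        return "copy"
```

Any other `OSError` (for example a missing source file) is re-raised rather than hidden by the fallback.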

@pmrowla
Contributor

pmrowla commented Mar 30, 2022

> @pmrowla Is there a specific version of dvc and scmrepo I should be using? I notice that dvc main requires scmrepo 0.0.14. When I install scmrepo with the command you sent, it installs scmrepo 0.0.13 and pip complains because my version of dvc (2.9.5) requires 0.0.7. I'm pretty sure it didn't mess up the installation of scmrepo but I just wanted to check.

Your version of DVC pins scmrepo to 0.0.7, but as you noted, pip will still let you force-install the more recent scmrepo manually.
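
To double-check which versions actually ended up in the environment after a forced install like this, something along these lines should work. This is a small sketch using the standard-library `importlib.metadata` (Python 3.8+), and `installed_versions` is just an illustrative helper name, not part of DVC or scmrepo.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Report the two packages discussed in this thread.
    for name, ver in installed_versions(["dvc", "scmrepo"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it with the same interpreter (conda env) that runs `dvc`, otherwise it reports on a different environment.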

> Yes I think this worked

Thanks, this fix should be available soon then

@pmrowla removed the "awaiting response" label on Mar 30, 2022
Repository owner moved this from Review In Progress to Done in DVC on Mar 30, 2022