run -d hdfs://: output does not exist #7561

Closed
ykasimov opened this issue Apr 8, 2022 · 3 comments · Fixed by #7563
Labels
bug Did we break something? fs: hdfs Related to the HDFS filesystem p1-important Important, aka current backlog of things to do

Comments

ykasimov commented Apr 8, 2022

Bug Report

Description

I am trying to add an external dependency that is stored on HDFS. I am running DVC inside this Docker image: https://hub.docker.com/r/oneoffcoder/spark-jupyter which has HDFS installed and configured. The command to start the container is listed on that page too.

When I run dvc run -v --force -n download_file -d hdfs://localhost/data.csv -o data.csv hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv I get the following error:

2022-04-08 14:37:28,188 ERROR: dependency 'hdfs://localhost/data.csv' does not exist
------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/run.py", line 154, in run_stage
    stage.repo.stage_cache.restore(stage, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/cache.py", line 182, in restore
    raise RunCacheNotFoundError(stage)
dvc.stage.cache.RunCacheNotFoundError: No run-cache for download_file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/commands/run.py", line 47, in run
    self.repo.run(**kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/scm_context.py", line 152, in run
    return method(repo, *args, **kw)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/repo/run.py", line 33, in run
    stage.run(no_commit=no_commit, run_cache=run_cache)
  File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 535, in run
    self._run_stage(dry, force, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/decorators.py", line 36, in rwlocked
    return call()
  File "/usr/local/conda/lib/python3.7/site-packages/funcy/decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 553, in _run_stage
    return run_stage(self, dry, force, **kwargs)
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/run.py", line 157, in run_stage
    stage.save_deps()
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/stage/__init__.py", line 468, in save_deps
    dep.save()
  File "/usr/local/conda/lib/python3.7/site-packages/dvc/output.py", line 523, in save
    raise self.DoesNotExistError(self)
dvc.dependency.base.DependencyDoesNotExistError: dependency 'hdfs://localhost/data.csv' does not exist
------------------------------------------------------------
2022-04-08 14:37:28,305 DEBUG: Analytics is enabled.
2022-04-08 14:37:29,056 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpulp0j21m']'
2022-04-08 14:37:29,074 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpulp0j21m']'
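
A note on reading the log above: the two tracebacks are joined by "During handling of the above exception, another exception occurred", which is Python's implicit exception chaining. The run-cache miss is only the context in which the second error was raised, so the last exception (the missing dependency) is the one that matters. A minimal illustration follows. The classes and functions are hypothetical stand-ins, not DVC's code.

```python
# Hypothetical stand-ins for the two exception types seen in the log.
class RunCacheNotFound(Exception):
    pass


class DependencyDoesNotExist(Exception):
    pass


def restore_from_cache():
    # First failure: nothing cached for this stage yet.
    raise RunCacheNotFound("No run-cache for download_file")


def save_deps():
    # Second failure, raised while the first one is still being handled.
    raise DependencyDoesNotExist("dependency does not exist")


def run_stage():
    try:
        restore_from_cache()
    except RunCacheNotFound:
        # Raising here attaches the cache miss as __context__ of the new error.
        save_deps()


try:
    run_stage()
except DependencyDoesNotExist as exc:
    final, context = exc, exc.__context__
    print(type(final).__name__, "<-", type(context).__name__)
```

The cache miss is expected on a first run, so it can usually be ignored when triaging a report like this one.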

The file data.csv exists in HDFS:

root@558dae789a66:~/ipynb/dvc-repo# hdfs dfs -ls hdfs://localhost/data.csv
-rw-r--r--   1 root supergroup       2950 2022-04-07 15:42 hdfs://localhost/data.csv
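
To rule out a malformed URL, it can help to see how the dependency string splits into scheme, host, and path. The sketch below uses only Python's standard library. The `describe_dep` helper is hypothetical, and this is not how DVC itself parses the path, which may differ.

```python
from urllib.parse import urlsplit


def describe_dep(url: str) -> dict:
    """Split a dependency URL into the parts a filesystem client would need."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL carries no explicit port
        "path": parts.path,
    }


info = describe_dep("hdfs://localhost/data.csv")
print(info)
```

No port appears in the URL, so whichever client opens it has to fall back to a default or to the cluster configuration.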

Reproduce

  1. run docker
  2. docker exec -it <container_id> /bin/bash
  3. apt-get update && apt-get install "dvc[hdfs]" git
  4. dvc init
  5. touch data.csv
  6. hdfs dfs -put data.csv hdfs://localhost/data.csv
  7. export CLASSPATH=$CLASSPATH:`hdfs classpath --glob`
  8. dvc run -v -n download_file -d hdfs://localhost/data.csv -o data.csv hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv
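
For anyone scripting the reproduction, step 8 can be assembled as an argument list, which avoids shell-quoting mistakes. This is a sketch only. `build_repro_command` and `run_repro` are hypothetical helper names, and `run_repro` assumes both the `dvc` and `hdfs` CLIs are on PATH, as they are inside the container.

```python
import shlex
import subprocess

DEP = "hdfs://localhost/data.csv"
OUT = "data.csv"


def build_repro_command(dep: str = DEP, out: str = OUT) -> list:
    """Return the argv for step 8; everything after `-o OUT` is the stage command."""
    stage_cmd = ["hdfs", "dfs", "-copyToLocal", dep, out]
    return ["dvc", "run", "-v", "-n", "download_file", "-d", dep, "-o", out, *stage_cmd]


def run_repro() -> int:
    """Run the command and return its exit code (not called here)."""
    return subprocess.run(build_repro_command()).returncode


argv = build_repro_command()
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)
```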

Expected

I expect a dvc.yaml file to be created that lists hdfs://localhost/data.csv as an external dependency.
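
As a rough sketch of that expectation, the stage entry would look something like the dictionary below once dvc.yaml is parsed. The exact content DVC writes may differ. `has_external_dep` is a hypothetical helper, and loading the real file would need a YAML parser (for example PyYAML's `yaml.safe_load`, which is an assumption and not used here).

```python
# Approximate shape of the parsed dvc.yaml after a successful run.
expected = {
    "stages": {
        "download_file": {
            "cmd": "hdfs dfs -copyToLocal hdfs://localhost/data.csv data.csv",
            "deps": ["hdfs://localhost/data.csv"],
            "outs": ["data.csv"],
        }
    }
}


def has_external_dep(dvc_yaml: dict, stage: str, dep: str) -> bool:
    """Return True if `stage` exists and lists `dep` among its dependencies."""
    stage_def = dvc_yaml.get("stages", {}).get(stage, {})
    return dep in stage_def.get("deps", [])


print(has_external_dep(expected, "download_file", "hdfs://localhost/data.csv"))
```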

Environment information

Everything runs inside Docker. The only change is:

export CLASSPATH=$CLASSPATH:`hdfs classpath --glob`
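
The same setup can be done from Python before anything touches HDFS, which is handy in notebooks. This is a sketch under assumptions: `extend_classpath` and `hadoop_classpath` are hypothetical helpers, and the `hdfs` CLI is assumed to be on PATH. Unlike the shell one-liner, it does not leave a leading separator when CLASSPATH starts out empty.

```python
import os
import shutil
import subprocess


def extend_classpath(current: str, extra: str) -> str:
    """Join two classpath strings with the platform separator, skipping empty parts."""
    parts = [p for p in (current.strip(), extra.strip()) if p]
    return os.pathsep.join(parts)


def hadoop_classpath() -> str:
    """Ask the hdfs CLI for its expanded classpath; return '' if the CLI is missing."""
    if shutil.which("hdfs") is None:
        return ""
    return subprocess.check_output(["hdfs", "classpath", "--glob"], text=True).strip()


# Must run before the HDFS client library is initialised in this process.
os.environ["CLASSPATH"] = extend_classpath(
    os.environ.get("CLASSPATH", ""), hadoop_classpath()
)
```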

Output of dvc doctor:

$ dvc doctor

DVC version: 2.10.1 (pip)
---------------------------------
Platform: Python 3.7.11 on Linux-5.10.104-linuxkit-x86_64-with-debian-bullseye-sid
Supports:
        hdfs (fsspec = 2022.3.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: fuse.grpcfuse on grpcfuse
Caches: local, hdfs
Remotes: hdfs, hdfs
Workspace directory: fuse.grpcfuse on grpcfuse
Repo: dvc, git

Additional Information (if any):

The issue is not present in DVC version 2.8.3.

pared (Contributor) commented Apr 8, 2022

Might be related to #7288.

@daavoo daavoo added fs: hdfs Related to the HDFS filesystem bug Did we break something? p1-important Important, aka current backlog of things to do labels Apr 8, 2022
daavoo (Contributor) commented Apr 9, 2022

@ykasimov Could you try installing dvc from the fix-hdfs-external-dep branch, replacing:

apt-get install "dvc[hdfs]"

With:

pip install 'git+https://github.com/iterative/dvc.git@fix-hdfs-external-dep#egg=dvc[hdfs]'

ykasimov (Author) commented

@daavoo It works from the branch. Thanks for fixing it!
