dvc pull returns "failed to pull data" when the data exists on remote #6691

Closed
RadionBik opened this issue Sep 27, 2021 · 14 comments · Fixed by #6759
Labels
bug Did we break something? fs: gs Related to the Google Cloud Storage filesystem p0-critical Critical issue. Needs to be fixed ASAP. research

Comments

@RadionBik

Bug Report

Issue name

dvc pull returns "failed to pull data" when the data exists on remote

Description

dvc pull (also tried with the -R option) fails to pull remote data for .dvc files in subdirectories and returns ERROR: failed to pull data from the cloud - Checkout failed for following targets:... However, when I run the pull command on each failed file individually, it succeeds.

(onboarding_models) radion@MacBook-Pro-Radion anna-datascience % dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: document_labelling_utils/annotation_results/1000_recent_documents_20210413.json, md5: 06a0a6ef5b6446a33623a544ede8bbfd
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
document_labelling_utils/annotation_results/1000_recent_documents_20210413.json
Is your cache up to date?
<https://error.dvc.org/missing-files>
(onboarding_models) radion@MacBook-Pro-Radion anna-datascience % dvc pull -R
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: document_labelling_utils/annotation_results/1000_recent_documents_20210413.json, md5: 06a0a6ef5b6446a33623a544ede8bbfd
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
document_labelling_utils/annotation_results/1000_recent_documents_20210413.json
Is your cache up to date?
<https://error.dvc.org/missing-files>
(onboarding_models) radion@MacBook-Pro-Radion anna-datascience % dvc pull document_labelling_utils/annotation_results/1000_recent_documents_20210413.json.dvc
A       document_labelling_utils/annotation_results/1000_recent_documents_20210413.json
1 file added and 1 file fetched
(onboarding_models) radion@MacBook-Pro-Radion anna-datascience % dvc pull
Everything is up to date.

Expected

I expect dvc pull to download missing files from sub-directories without the need to run it on each .dvc file.
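
Until the bulk pull works, the per-file behaviour described above can be scripted. The following is a minimal sketch, not part of DVC: `find_dvc_files`, `build_pull_commands`, and `pull_each` are hypothetical helper names, and the only CLI call assumed is `dvc pull <target>`, which is the form shown succeeding in the log above.

```python
import subprocess
from pathlib import Path
from typing import List


def find_dvc_files(root: Path) -> List[Path]:
    """Return every *.dvc file under root, skipping the .dvc and .git directories."""
    skip = {".dvc", ".git"}
    return sorted(
        p
        for p in root.rglob("*.dvc")
        if p.is_file() and not skip.intersection(p.relative_to(root).parts)
    )


def build_pull_commands(root: Path) -> List[List[str]]:
    """Build one `dvc pull <target>` command per .dvc file, relative to root."""
    return [["dvc", "pull", str(p.relative_to(root))] for p in find_dvc_files(root)]


def pull_each(root: Path) -> List[int]:
    """Run the commands one at a time from root and collect their exit codes."""
    return [subprocess.run(cmd, cwd=root).returncode for cmd in build_pull_commands(root)]
```

Calling `pull_each(Path("."))` from the repository root would run one pull per `.dvc` file; a non-zero entry in the returned list marks a target that still failed.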

Environment information

Output of dvc doctor:

DVC version: 2.7.2 (brew)
---------------------------------
Platform: Python 3.9.7 on macOS-11.2.1-x86_64-i386-64bit
Supports:
        azure (adlfs = 2021.8.2, knack = 0.8.2, azure-identity = 1.6.1),
        gdrive (pydrive2 = 1.9.3),
        gs (gcsfs = 2021.8.1),
        http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
        https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
        s3 (s3fs = 2021.8.1, boto3 = 1.17.106),
        webdav (webdav4 = 0.9.1),
        webdavs (webdav4 = 0.9.1)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s1s1
Caches: local
Remotes: gs
Workspace directory: apfs on /dev/disk1s1s1
Repo: dvc, git
@pmrowla
Contributor

pmrowla commented Sep 29, 2021

@RadionBik are your .dvc files inside .gitignore'd directories?

@pmrowla pmrowla added the awaiting response we are waiting for your reply, please respond! :) label Sep 29, 2021
@RadionBik
Author

No, the directories are not git-ignored.

@clementperon

Hi,

I think I have the same issue.

Pulling with the command
dvc pull -f data/toto.dvc data/tutu.dvc data/tete.dvc

produces a warning:
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:

If I pull the files one by one, it works.

$> dvc doctor

DVC version: 2.7.4 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.13.0-7614-generic-x86_64-with-glibc2.33
Supports:
	gs (gcsfs = 2021.8.1),
	http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5),
	https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.5)

@efiop
Contributor

efiop commented Sep 29, 2021

@clementperon Could you show the contents of those .dvc files, please? Also, a full verbose error log would be really helpful.

@clementperon

clementperon commented Sep 29, 2021

$> cat data/2021_06_29_some_text_here_in_snakecase.pcap.dvc

- md5: 47a9b6d2693147f689ff5ebe12c78a05
  size: 2284260388
  path: 2021_06_29_some_text_here_in_snakecase.pcap

@efiop efiop added the research label Sep 29, 2021
@clementperon

$ dvc pull -f data/file_1.pcap.dvc data/file_2.ply.dvc
Everything is up to date.
$ rm .dvc/cache/47/a9b6d2693147f689ff5ebe12c78a05
rm: remove write-protected regular file '.dvc/cache/47/a9b6d2693147f689ff5ebe12c78a05'? y
$ dvc pull -f data/file_1.pcap.dvc data/file_2.ply.dvc

WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: data/file_1.pcap, md5: 47a9b6d2693147f689ff5ebe12c78a05
ERROR: unexpected error

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

$ dvc pull -f data/file_1.pcap.dvc

  0% Transferring| 
|0/1 [00:00<?,     ?file/s^C14%|█▎        |a9b6d2693147f689ff5ebe12c78a05
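
The `rm` step in the log above relies on the cache layout it shows: the first two hex characters of the md5 name a directory, and the remaining characters name the file. A minimal sketch of that mapping, assuming the `.dvc/cache` location seen in the log; `cache_path` is a hypothetical helper rather than a DVC API, and it only handles plain 32-character file hashes.

```python
from pathlib import Path


def cache_path(md5: str, cache_dir: Path = Path(".dvc/cache")) -> Path:
    """Map an md5 hex digest to its cache file: <cache_dir>/<first 2 chars>/<rest>."""
    if len(md5) != 32 or any(c not in "0123456789abcdef" for c in md5):
        raise ValueError(f"not a lowercase md5 hex digest: {md5!r}")
    return cache_dir / md5[:2] / md5[2:]
```

For the hash in the log, `cache_path("47a9b6d2693147f689ff5ebe12c78a05")` points at the same file that was removed by hand.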

@clementperon

Please let me know if you want more details or commands to run.

@pmrowla
Contributor

pmrowla commented Oct 5, 2021

I'm unable to reproduce this with local or s3 remotes; it may be something specific to gs/gcsfs.

@clementperon

The issue was introduced in 2.5.0.

$> pip install dvc[gs]==2.4.3
$> dvc pull -f toto1.dvc toto2.dvc

This works.

$> pip install dvc[gs]==2.5.0
$> dvc pull -f toto1.dvc toto2.dvc

This fails.

@clementperon

clementperon commented Oct 6, 2021

Running git bisect shows that this commit triggers the issue:
commit e3ff6d5 (refs/bisect/bad)
Author: Batuhan Taskaya <batuhanosmantaskaya@gmail.com>
Date: Thu Jul 1 14:59:35 2021 +0300

fsspec: loosen the prefix check to cover both None and False (#6246)

* fsspec: loosen the prefix check to cover both None and False

* gs: disable prefix-based search

Can confirm:

$> git reset --hard 2.5.0
$> git revert e3ff6d5
$> python3 -m build
$> python3 -m pip install dist/dvc-2.5.0+7d239d.tar.gz
$> dvc pull -f data1.dvc data2.dvc

Works fine!
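
For readers unfamiliar with the bisect step, the search it performs can be sketched as a binary search over an ordered commit list. This is an illustration of the idea only, not how git implements it; `first_bad`, the commit names, and the `is_bad` predicate are all hypothetical, and the sketch assumes every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit, given a good first and a bad last commit."""
    lo, hi = 0, len(commits) - 1
    if is_bad(commits[lo]) or not is_bad(commits[hi]):
        raise ValueError("need a known-good first commit and a known-bad last commit")
    # Invariant: commits[lo] is good and commits[hi] is bad.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

In practice the predicate would be a manual or scripted check such as "does pulling two targets at once fail on this build", which is what the install-and-pull steps above do for a single commit.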

@pmrowla pmrowla added fs: gs Related to the Google Cloud Storage filesystem bug Did we break something? and removed awaiting response we are waiting for your reply, please respond! :) labels Oct 6, 2021
@pmrowla
Contributor

pmrowla commented Oct 6, 2021

cc @isidentical

@clementperon

clementperon commented Oct 7, 2021

It looks like removing TRAVERSE_PREFIX_LEN fixes my issue.

diff --git a/dvc/fs/gs.py b/dvc/fs/gs.py
index 6ee6d735..2d73f063 100644
--- a/dvc/fs/gs.py
+++ b/dvc/fs/gs.py
@@ -16,7 +16,6 @@ class GSFileSystem(CallbackMixin, ObjectFSWrapper):
     REQUIRES = {"gcsfs": "gcsfs"}
     PARAM_CHECKSUM = "etag"
     DETAIL_FIELDS = frozenset(("etag", "size"))
-    TRAVERSE_PREFIX_LEN = 2
 
     def _prepare_credentials(self, **config):
         login_info = {"consistency": None}

Tested on master.

isidentical added a commit that referenced this issue Oct 7, 2021
Resolves #6691. Normally it should have inherited this from the `ObjectFSWrapper`, but this seems like a leftover.
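
The distinction this commit message draws, between a value set on the subclass and one inherited from the parent, can be shown in plain Python. The classes below are hypothetical stand-ins, not DVC's code, and the parent's value is a placeholder; only the `2` comes from the diff above.

```python
class BaseWrapper:
    # Placeholder default; the real parent's value is not shown in this thread.
    TRAVERSE_PREFIX_LEN = 3


class LeftoverOverride(BaseWrapper):
    # A class-level assignment like the line removed in the diff shadows the parent.
    TRAVERSE_PREFIX_LEN = 2


class Inheriting(BaseWrapper):
    # With no assignment here, attribute lookup falls through to the parent.
    pass


def overrides(cls: type, name: str) -> bool:
    """Report whether cls defines name itself rather than inheriting it."""
    return name in vars(cls)
```

A check such as `overrides(SomeFileSystem, "TRAVERSE_PREFIX_LEN")` is one way to spot leftover overrides of this kind across subclasses.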
@isidentical isidentical self-assigned this Oct 7, 2021
@efiop efiop added the p0-critical Critical issue. Needs to be fixed ASAP. label Oct 7, 2021
@efiop
Contributor

efiop commented Oct 7, 2021

@clementperon Thank you for the research! 🙏

@clementperon

@efiop The pleasure is mine 🙂.

pmrowla pushed a commit that referenced this issue Oct 11, 2021
Resolves #6691. Normally it should have inherited this from the `ObjectFSWrapper`, but this seems like a leftover.
5 participants