gdrive remote strange behaviour #3230

Closed
RomanVeretenov opened this issue Jan 24, 2020 · 20 comments

@RomanVeretenov

RomanVeretenov commented Jan 24, 2020

DVC version - 0.80.0, Installed via pip
Ubuntu 18.04.2 LTS

To continue the situation described in this issue:

I've created 2 remotes in .dvc/config

  • local remote in local network called lremote. Set as default.
  • gdrive remote called gremote. All gdrive preparations are done (app created in my account; key and secret generated as described in the 'dvc remote add' reference).

Syncing with local remote works fine.
Syncing with gdrive remote behaves in some strange way:

Situation 1:
I run dvc push -r gremote and wait until all data is uploaded (it really does appear in the gdrive folder). Then I run git clean -dxf followed by dvc pull -r gremote. (Cloning the git repo to another location gives the same behaviour.)
Expected: dvc progress bar appears and pull begins.
Got:

  • I must enter the access key again
  • After I copy the link into a browser and get a new access key, I get a warning
    WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
    name: plates/***/706.zip, md5: ebd148595ce875c5754e95604d05d550
    and so on and here follows list of all files that must be pulled
    ERROR: failed to pull data from the cloud - Checkout failed for following targets:
    plates/***/706.zip
    and so on
    Did you forget to fetch?

dvc fetch says the same
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:

As a result, no files are pulled

Situation 2:
I run dvc push -r gremote and then repeat it.
Expected: an "Everything is up to date" message
Got: push runs again and takes the same time as the first run.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jan 24, 2020
@efiop
Contributor

efiop commented Jan 24, 2020

Hi @RomanVeretenov !

We are still actively working on gdrive, and have recently released some important changes. Please try upgrading to 0.82.2 and check if the issue persists. 🙂

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jan 24, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jan 24, 2020
@shcheklein
Member

@RomanVeretenov could you try to specify dvc remote modify gremote no_traverse true as a workaround for this "missing files" issue, please? I've seen the same problem with the team drives. Do you use a regular Google Drive, btw or something special?

@RomanVeretenov
Author

I've seen the same problem with the team drives. Do you use a regular Google Drive, btw or something special?

It seems to be a regular personal drive (but with unlimited storage). At least, I log in to a regular Google account via the standard Google login page. If you know how I could check this in more detail, please let me know.

@RomanVeretenov
Author

So I have updated dvc to 0.82.3 and added no_traverse true

Now I'm getting the following error:

failed to upload '.dvc/cache/95/3c8aa52bcd20381f00f96bfd891a32' to 'gdrive://root/dvc_root/*myrepo*/95/3c8aa52bcd20381f00f96bfd891a32' - <HttpError 403 when requesting https://www.googleapis.com/drive/v2/files?q=%28%27root%27+in+parents%29+and+trashed%3Dfalse+and+title%3D%27dvc_root%27&maxResults=1&corpus=DEFAULT&supportsTeamDrives=true&includeTeamDriveItems=true&alt=json returned "User Rate Limit Exceeded. Rate of requests for user exceed configured project quota. You may consider re-evaluating expected per-user traffic to the API and adjust project quota limits accordingly. You may monitor aggregate quota usage and adjust limits in the API Console: https://console.developers.google.com/apis/api/drive.googleapis.com/quotas?project=1013921526305">

But if I follow the link, I see that quotas aren't exceeded

[image attached]
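For readers hitting the same 403 "User Rate Limit Exceeded" response: the usual client-side mitigation for rate-limit errors is to retry with exponential backoff. The sketch below is generic and is not how DVC or PyDrive2 handle it; `RateLimitError` and `retry_with_backoff` are hypothetical names introduced only for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 403 'User Rate Limit Exceeded' error."""


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call() and retry on RateLimitError with exponential backoff.

    The wait before retry n (0-based) is base_delay * 2**n seconds plus up to
    one second of random jitter. The last failure is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.random())
```

The `sleep` parameter is injectable so the waiting can be replaced in tests; in real use the default `time.sleep` applies.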

@shcheklein
Member

@RomanVeretenov could you please try the latest version (remove no_traverse setting from the .dvc/config)? I hope we've fixed the missing files issue.

@RomanVeretenov not sure what's up with rate limits. Will try to reproduce. 👀

Btw, I would also recommend removing the files on the Google Drive remote that you don't need anymore (from the time you were pushing them one by one). Just create a clean one and push again, or use dvc gc -c (be careful with this, read the docs). The size (number of files) on the remote storage might affect certain operations (especially with the default no_traverse value).

@RomanVeretenov
Author

I will try to reproduce it on the latest DVC version with a new gdrive folder and without no_traverse this week and report the result.

@RomanVeretenov
Author

So it still does not work
I have recreated the gremote destination folder in gdrive (deleted it and created an empty one)

% dvc --version
0.82.9

% dvc push -r gremote
ERROR: unexpected error - <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files?q=%28%27root%27+in+parents%29+and+trashed%3Dfalse+and+title%3D%27dvc_root%27&maxResults=1&corpora=default&corpus=DEFAULT&supportsTeamDrives=true&includeTeamDriveItems=true&alt=json returned "Corpus and corpora are mutually exclusive options.">


Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
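To see what the 400 response is complaining about, the request URL from the error can be decoded. This is only a diagnostic sketch using the standard library; the URL is copied from the message above, and `query_params` is a helper name made up for this example.

```python
from urllib.parse import parse_qs, urlsplit

# Request URL copied verbatim from the HttpError 400 message.
URL = (
    "https://www.googleapis.com/drive/v2/files"
    "?q=%28%27root%27+in+parents%29+and+trashed%3Dfalse+and+title%3D%27dvc_root%27"
    "&maxResults=1&corpora=default&corpus=DEFAULT"
    "&supportsTeamDrives=true&includeTeamDriveItems=true&alt=json"
)


def query_params(url):
    """Return the decoded query parameters of url as a flat dict."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


params = query_params(URL)
print(params["q"])
# Both of the mutually exclusive options are present in the same request.
print(sorted(key for key in params if key in ("corpus", "corpora")))
```

The request carries both `corpus` and `corpora`, which is exactly what the server rejects.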

@shcheklein
Member

@RomanVeretenov could you please check what version of PyDrive2 is installed? (pip freeze) ... it should be the latest one - 1.4.4
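As an alternative to reading pip freeze by eye, the installed version can be compared against the required one in a few lines. This is a rough sketch: `parse_version`, `is_at_least` and `installed_version` are hypothetical helpers, and the comparison assumes purely numeric dotted versions such as 1.4.4.

```python
from importlib import metadata


def parse_version(text):
    """Turn a numeric dotted version like '1.4.4' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_at_least(installed, required):
    """Return True if the installed version is the required one or newer."""
    return parse_version(installed) >= parse_version(required)


def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

For example, `is_at_least("1.4.0", "1.4.4")` is false, so a 1.4.0 install would need upgrading. `installed_version("PyDrive2")` looks the package up in the current environment.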

@shcheklein
Member

I'm still not sure what is happening with the exceeded limits, btw ... still looking into it and trying to reproduce.

@RomanVeretenov
Author

@RomanVeretenov could you please check what version of PyDrive2 is installed? (pip freeze) ... it should be the latest one - 1.4.4

% pip freeze | grep -i pydrive
PyDrive==1.3.1
PyDrive2==1.4.0

@RomanVeretenov
Author

After updating PyDrive2 to 1.4.4, dvc push -r gremote has successfully started. But it often reports the following error:

failed to upload '.dvc/cache/4a/dd94f47d5a6f3328fdda6ff6d62fc1' to 'gdrive://root/dvc_root/inex_plates/4a/dd94f47d5a6f3328fdda6ff6d62fc1' - [Errno 101] Network is unreachable

and the cache file differs every time. I have limited --jobs to 4, but it does not help.

I'm working on a remote Ubuntu PC via ssh. The PC is located somewhere in the USA, but I have never faced any problems with the network connection on it. The usual gdrive upload/download speed there is 10 MBytes/sec.

@shcheklein
Member

@RomanVeretenov thanks, I think I know what the problem is with the "Network is unreachable" error (I have hit it myself already; it leaks connections), and I already know how to fix it. We will prepare and release a new version this weekend. Thank you for the feedback and your patience.

@shcheklein
Member

@RomanVeretenov I think @efiop has released a new version of DVC (and PyDrive) - it should be more stable. Would you mind giving it another try, please?

The next iteration will be released soon, with some major performance and stability improvements when dealing with a lot of files, dvc gc support, etc.

@RomanVeretenov
Author

RomanVeretenov commented Feb 11, 2020

I have updated dvc and pydrive

 % pip freeze | grep -i -E "dvc|pydrive"
dvc==0.83.0
PyDrive==1.3.1
PyDrive2==1.4.5

But it still doesn't work as expected

I do dvc push -r gremote, then make a clean copy of git repo and do dvc pull -r gremote

What I get is a huge list of missing caches like

WARNING: Cache 'bbbc75ce995855d3e965f958fbd584bb' not found. File 'plates/yolo/499.zip' won't be created.
...
Did you forget to fetch?

dvc fetch -r gremote does not help either. It shows a huge list of "cache and file" pairs

name: plates/yolo/499.zip, md5: bbbc75ce995855d3e965f958fbd584bb
...
Everything is up to date.

Also I can't find the 'bbbc75ce995855d3e965f958fbd584bb' in my google drive.

@shcheklein
Member

@RomanVeretenov what does dvc push output? Could you run it with dvc push -v, please on a clean remote storage?

Also I can't find the 'bbbc75ce995855d3e965f958fbd584bb' in my google drive.

It should be something like <path-to-storage>/bb/bc75ce995855d3e965f958fbd584bb - could you check that please?
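The layout mentioned here (first two hex characters of the md5 as a directory, the rest as the file name) matches the paths seen in the earlier upload errors. A minimal sketch of that mapping, with `cache_relpath` being a hypothetical helper and not a DVC function:

```python
def cache_relpath(md5):
    """Map a 32-character md5 hex digest to its '<2 chars>/<30 chars>' path."""
    if len(md5) != 32 or any(char not in "0123456789abcdef" for char in md5):
        raise ValueError(f"not a lowercase md5 hex digest: {md5!r}")
    return f"{md5[:2]}/{md5[2:]}"


# The hash from the warning above, relative to the remote storage root.
print(cache_relpath("bbbc75ce995855d3e965f958fbd584bb"))
```

So the thing to search for in the Drive folder is a `bb` directory containing a file named with the remaining 30 characters, not the full 32-character hash.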

@shcheklein
Member

@RomanVeretenov also, we've just released a new version with some major changes to the GDrive, could you please try to check it out as well.

@RomanVeretenov
Author

@RomanVeretenov what does dvc push output? Could you run it with dvc push -v, please on a clean remote storage?

Also I can't find the 'bbbc75ce995855d3e965f958fbd584bb' in my google drive.

It should be something like <path-to-storage>/bb/bc75ce995855d3e965f958fbd584bb - could you check that please?

bc75ce995855d3e965f958fbd584bb is present in gdrive

@shcheklein
Member

@RomanVeretenov and at the same moment if you run dvc pull it complains that this file is not found?

@RomanVeretenov
Author

@RomanVeretenov and at the same moment if you run dvc pull it complains that this file is not found?

Yes. I think I mixed up the local remote and gremote. I will try once more with clean gdrive storage, taking care not to forget the -r option.

@shcheklein
Member

Ok, closing this one. We've fixed a bunch of issues here and it should be way more stable. The next one is to try to reproduce and fix #3098

Then optimizations to make it work way faster with large amounts of files - V3 API, etc.

@RomanVeretenov please keep us updated about your experience with it.
