dvc.api.get_url: it doesn't find "md5" tag #7977

Closed
nelsoncardenas opened this issue Jul 5, 2022 · 5 comments · Fixed by #9097
Assignees
Labels
A: api Related to the dvc.api bug Did we break something? p1-important Important, aka current backlog of things to do regression Ohh, we broke something :-(

Comments

@nelsoncardenas

nelsoncardenas commented Jul 5, 2022

Bug Report

Description

I'm trying to use dvc.api.get_url to read a CSV file stored in Google Cloud Storage into a DataFrame. However, the generated .csv.dvc file records an "etag" keyword instead of an "md5" keyword. The Python call api.get_url() looks for an "md5" keyword and raises an error.

Reproduce

These are the steps I followed on macOS with zsh:

  1. python -m venv deploy_env
  2. source deploy_env/bin/activate
  3. python -m pip install --upgrade pip
  4. Create the file requirements/dev.txt with these requirements:
# prod
pandas~=1.4.3
scikit-learn~=1.1.1
matplotlib~=3.5.2
seaborn~=0.11.2
dvc~=2.12.0
dvc[gs]
pyyaml

# dev
jupyter
flake8~=4.0.1
black~=22.6.0
  5. pip install -r requirements/dev.txt
  6. dvc init
  7. dvc remote add dataset-track gs://model-data-tracker-775/dataset
  8. dvc remote add model-track gs://model-data-tracker-775/model
  9. dvc add dataset/finantials.csv --to-remote -r dataset-track
  10. dvc add model/model.pkl --to-remote -r model-track
  11. It generates a .dvc file like this one:
outs:
- etag: 4a265d63db921dcca137b881e47eab6d
  size: 593277
  path: finantials.csv
  12. Create this script in the src/prepare.py path:
from dvc import api
import pandas as pd


finantials_data_path = api.get_url("dataset/finantials.csv", remote="dataset-track")
finantials_df = pd.read_csv(finantials_data_path)
  13. rm dataset/finantials.csv
  14. python src/prepare.py

Expected

Instead of returning the URL, the call raises an error looking for a nonexistent "md5" keyword:

 $ python src/prepare.py
Traceback (most recent call last):
  File "../src/prepare.py", line 5, in <module>
    finantials_data_path = api.get_url("dataset/finantials.csv", remote="dataset-tracker")
  File "../deploy_env/lib/python3.9/site-packages/dvc/api/data.py", line 31, in get_url
    md5 = dvc_info["md5"]
KeyError: 'md5'
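
The failure can be reproduced without DVC or a remote. The following is a minimal sketch, assuming the .dvc output entry has already been parsed into a plain mapping; `find_hash` and `KNOWN_HASH_KEYS` are hypothetical names used only for illustration and are not part of the dvc API. It shows why a direct `["md5"]` lookup fails on the file from this report, and how to check which hash field was actually recorded.

```python
# Hypothetical helper for illustration only; not part of the dvc API.
# Assumption: only the two hash keys mentioned in this report are considered.
KNOWN_HASH_KEYS = ("md5", "etag")


def find_hash(out):
    """Return (key, value) for the first known hash field in a .dvc output entry."""
    for key in KNOWN_HASH_KEYS:
        if key in out:
            return key, out[key]
    raise KeyError(f"no hash field found; expected one of {KNOWN_HASH_KEYS}")


# The output entry from the .dvc file in this report, written as a parsed mapping.
dvc_out = {
    "etag": "4a265d63db921dcca137b881e47eab6d",
    "size": 593277,
    "path": "finantials.csv",
}

# A direct lookup, like the one in the traceback, fails because the key is absent.
try:
    dvc_out["md5"]
    direct_lookup_failed = False
except KeyError:
    direct_lookup_failed = True

hash_key, hash_value = find_hash(dvc_out)
print(direct_lookup_failed, hash_key, hash_value)
```

Running this check against a parsed .dvc file before calling api.get_url() makes it easy to tell whether the file was written with an "etag" instead of an "md5".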

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.12.0 (pip)
---------------------------------
Platform: Python 3.9.12 on macOS-12.4-arm64-arm-64bit
Supports:
        gs (gcsfs = 2022.5.0),
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: gs, gs
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):
When I perform the pull command, it doesn't give me any error:

$ dvc pull dataset/finantials.csv.dvc -r dataset-track            
A       dataset/finantials.csv                                         
1 file added and 1 file fetched  
@daavoo daavoo added A: api Related to the dvc.api bug and removed bug labels Jul 6, 2022
@omesser omesser added bug Did we break something? and removed bug labels Dec 14, 2022
@luialopezg

Hi Nelson, were you able to resolve this bug? I have the same problem.

@luialopezg

I was able to resolve the problem. These versions work for me:
fsspec==2022.7.1
dvc==2.12.0
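
To confirm that the active environment actually has these pins installed, something like the sketch below can help. It uses only the standard library's importlib.metadata; `installed_version` is a hypothetical helper name, and the pins are the two versions listed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# The pins reported to work in this comment.
pins = {"fsspec": "2022.7.1", "dvc": "2.12.0"}

# Map each package to (installed version, pinned version).
report = {name: (installed_version(name), wanted) for name, wanted in pins.items()}

for name, (found, wanted) in report.items():
    status = "ok" if found == wanted else "differs"
    print(f"{name}: installed={found} pinned={wanted} ({status})")
```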

@dberenbaum dberenbaum added this to DVC Feb 23, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Feb 23, 2023
@dberenbaum
Collaborator

I don't even think this is really a dvc.api.get_url problem, but a problem with dvc add --to-remote using a cloud remote.

Here's the example from the docs but with a cloud remote:

$ dvc remote add -d -f cloud s3://dave-sandbox-versioning/test/
$ dvc add https://data.dvc.org/get-started/data.xml --to-remote
$ dvc pull -v
2023-02-28 07:46:27,683 DEBUG: v2.45.2.dev46+g35b648bbc, CPython 3.10.2 on macOS-13.1-arm64-arm-64bit
2023-02-28 07:46:27,683 DEBUG: command: /Users/dave/miniforge3/envs/dvc/bin/dvc pull -v
2023-02-28 07:46:28,162 WARNING: Output 'data.xml'(stage: 'data.xml.dvc') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
2023-02-28 07:46:28,167 WARNING: No file hash info found for '/Users/dave/repo/data.xml'. It won't be created.
1 file failed
2023-02-28 07:46:28,167 ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/Users/dave/repo/data.xml
Is your cache up to date?
<https://error.dvc.org/missing-files>
Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/dave/Code/dvc/dvc/repo/pull.py", line 47, in pull
    stats = self.checkout(
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/dave/Code/dvc/dvc/repo/checkout.py", line 109, in checkout
    raise CheckoutError(stats["failed"], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
/Users/dave/repo/data.xml

Also, there's a question of what -r should do in the original post above. It doesn't save that remote info to the .dvc file, but it probably should.

@pmrowla pmrowla moved this from Backlog to Todo in DVC Feb 28, 2023
@pmrowla pmrowla added the p1-important Important, aka current backlog of things to do label Feb 28, 2023
@efiop efiop self-assigned this Feb 28, 2023
@efiop efiop added the regression Ohh, we broke something :-( label Feb 28, 2023
@efiop
Contributor

efiop commented Feb 28, 2023

For the record: looks like md5 is not recorded during --to-remote, which is a regression. We might've lost it somewhere during cloud versioning work. Will take a closer look.

@efiop
Contributor

efiop commented Mar 1, 2023

Yeah, just missing a hash_name option in Remote. The fix is simple, but need to migrate tests to dvc/testing while we are at it so we test that stuff all the time. Will send a PR soon.

efiop added a commit to efiop/dvc that referenced this issue Mar 1, 2023
efiop added a commit to efiop/dvc that referenced this issue Mar 1, 2023
efiop added a commit that referenced this issue Mar 1, 2023
@github-project-automation github-project-automation bot moved this from Todo to Done in DVC Mar 1, 2023
ginic added a commit to UMassCDS/IHOP-Reddit that referenced this issue Oct 23, 2023