dvc.api.get_url: it doesn't find "md5" tag #7977

Closed
@nelsoncardenas

Bug Report

Description

I'm trying to use dvc.api.get_url to get the URL of a CSV file stored in Google Cloud Storage so that I can read it into a DataFrame. However, the generated .csv.dvc file contains an "etag" key instead of an "md5" key. The Python call api.get_url() looks up the "md5" key and raises a KeyError.

Reproduce

These are the steps I followed on macOS with zsh:

  1. python -m venv deploy_env
  2. source deploy_env/bin/activate
  3. python -m pip install --upgrade pip
  4. Create these requirements in path requirements/dev.txt
# prod
pandas~=1.4.3
scikit-learn~=1.1.1
matplotlib~=3.5.2
seaborn~=0.11.2
dvc~=2.12.0
dvc[gs]
pyyaml

# dev
jupyter
flake8~=4.0.1
black~=22.6.0
  5. pip install -r requirements/dev.txt
  6. dvc init
  7. dvc remote add dataset-track gs://model-data-tracker-775/dataset
  8. dvc remote add model-track gs://model-data-tracker-775/model
  9. dvc add dataset/finantials.csv --to-remote -r dataset-track
  10. dvc add model/model.pkl --to-remote -r model-track
  11. It generates a .dvc file like this one:
outs:
- etag: 4a265d63db921dcca137b881e47eab6d
  size: 593277
  path: finantials.csv
  12. Create this script in the src/prepare.py path:
from dvc import api
import pandas as pd


finantials_data_path = api.get_url("dataset/finantials.csv", remote="dataset-track")
finantials_df = pd.read_csv(finantials_data_path)
  13. rm dataset/finantials.csv
  14. python src/prepare.py

Expected

An error caused by looking up a nonexistent "md5" key:

 $ python src/prepare.py
Traceback (most recent call last):
  File "../src/prepare.py", line 5, in <module>
    finantials_data_path = api.get_url("dataset/finantials.csv", remote="dataset-tracker")
  File "../deploy_env/lib/python3.9/site-packages/dvc/api/data.py", line 31, in get_url
    md5 = dvc_info["md5"]
KeyError: 'md5'
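
To illustrate the failure outside of DVC, here is a minimal sketch that uses a plain dict modeled on the "outs" entry of the .dvc file above. The `HASH_KEYS` tuple and the `find_hash` helper are hypothetical names introduced for illustration; this is not DVC's code or its actual fix, only a way to show why a hard-coded "md5" lookup breaks when the file records an "etag" instead.

```python
# Entry modeled on the "outs" item of the .dvc file shown in this report.
dvc_info = {
    "etag": "4a265d63db921dcca137b881e47eab6d",
    "size": 593277,
    "path": "finantials.csv",
}

# Hypothetical list of hash field names to try; not taken from DVC's source.
HASH_KEYS = ("md5", "etag")


def find_hash(info):
    """Return (key, value) for the first known hash field present in info."""
    for key in HASH_KEYS:
        if key in info:
            return key, info[key]
    raise KeyError(f"none of {HASH_KEYS} found in {sorted(info)}")


# The direct lookup from the traceback fails on this entry.
try:
    dvc_info["md5"]
    direct_lookup_failed = False
except KeyError:
    direct_lookup_failed = True

hash_name, hash_value = find_hash(dvc_info)
print(direct_lookup_failed, hash_name, hash_value)
```

The direct `dvc_info["md5"]` lookup raises the same KeyError as in the traceback, while the tolerant lookup finds the "etag" value.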

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.12.0 (pip)
---------------------------------
Platform: Python 3.9.12 on macOS-12.4-arm64-arm-64bit
Supports:
        gs (gcsfs = 2022.5.0),
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: gs, gs
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Additional Information (if any):
When I run the pull command, it completes without any error:

$ dvc pull dataset/finantials.csv.dvc -r dataset-track            
A       dataset/finantials.csv                                         
1 file added and 1 file fetched  
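
Since the pull succeeds, a possible stopgap is to pull the file and read it locally instead of going through api.get_url. The sketch below assumes the `dvc` CLI is on the PATH and that the .dvc file sits next to the tracked file, as in this report; `build_pull_command` and `pull_then_read` are hypothetical helper names, not part of DVC.

```python
import subprocess


def build_pull_command(target, remote):
    """Build the same `dvc pull` invocation shown above as an argument list."""
    return ["dvc", "pull", target, "-r", remote]


def pull_then_read(path, remote):
    """Pull the tracked file into the workspace, then load it with pandas."""
    # Imported lazily so build_pull_command can be used without pandas installed.
    import pandas as pd

    # check=True raises CalledProcessError if `dvc pull` exits with an error.
    subprocess.run(build_pull_command(f"{path}.dvc", remote), check=True)
    return pd.read_csv(path)
```

In src/prepare.py this would be called as `finantials_df = pull_then_read("dataset/finantials.csv", "dataset-track")`. It downloads the whole file into the workspace, so it sidesteps the bug but does not replace reading from the remote URL.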

Labels

A: api, bug, p1-important, regression
