Closed
Bug Report
Description
I'm trying to use dvc.api.get_url to read a DataFrame stored in Google Cloud Storage. The generated .csv.dvc file contains an "etag" key instead of an "md5" key, but api.get_url() looks up the "md5" key and fails with an error.
Reproduce
These are the steps I followed on macOS with zsh:
python -m venv deploy_env
source deploy_env/bin/activate
python -m pip install --upgrade pip
Create these requirements in requirements/dev.txt:
# prod
pandas~=1.4.3
scikit-learn~=1.1.1
matplotlib~=3.5.2
seaborn~=0.11.2
dvc~=2.12.0
dvc[gs]
pyyaml
# dev
jupyter
flake8~=4.0.1
black~=22.6.0
pip install -r requirements/dev.txt
dvc init
dvc remote add dataset-track gs://model-data-tracker-775/dataset
dvc remote add model-track gs://model-data-tracker-775/model
dvc add dataset/finantials.csv --to-remote -r dataset-track
dvc add model/model.pkl --to-remote -r model-track
- It generates a .dvc file like this one:
outs:
- etag: 4a265d63db921dcca137b881e47eab6d
size: 593277
path: finantials.csv
- Create this script at src/prepare.py:
from dvc import api
import pandas as pd
finantials_data_path = api.get_url("dataset/finantials.csv", remote="dataset-track")
finantials_df = pd.read_csv(finantials_data_path)
rm dataset/finantials.csv
python src/prepare.py
Expected
I expected api.get_url() to return the remote URL of the file. Instead, it fails while looking up a nonexistent "md5" key:
$ python src/prepare.py
Traceback (most recent call last):
File "../src/prepare.py", line 5, in <module>
finantials_data_path = api.get_url("dataset/finantials.csv", remote="dataset-tracker")
File "../deploy_env/lib/python3.9/site-packages/dvc/api/data.py", line 31, in get_url
md5 = dvc_info["md5"]
KeyError: 'md5'
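The failing lookup can be reproduced without DVC. The sketch below is only an illustration: dvc_info is a plain dict copied from the .dvc file above (not DVC's internal object), and find_hash is a hypothetical helper, with the assumption that "md5" and "etag" are the only hash keys of interest here.

```python
# Plain dict mirroring the "outs" entry of the generated .dvc file.
# This is an illustration, not DVC's internal data structure.
dvc_info = {
    "etag": "4a265d63db921dcca137b881e47eab6d",
    "size": 593277,
    "path": "finantials.csv",
}


def find_hash(info, names=("md5", "etag")):
    """Hypothetical helper: return (name, value) for the first hash key present, else None."""
    for name in names:
        if name in info:
            return name, info[name]
    return None


# The direct lookup from the traceback fails because only "etag" is present.
try:
    dvc_info["md5"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
print(find_hash(dvc_info))
```

A lookup that tolerates either key avoids the KeyError, although whether an etag can stand in for an md5 when building the remote URL is a separate question for the DVC maintainers.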
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.12.0 (pip)
---------------------------------
Platform: Python 3.9.12 on macOS-12.4-arm64-arm-64bit
Supports:
gs (gcsfs = 2022.5.0),
webhdfs (fsspec = 2022.5.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.5.1),
https (aiohttp = 3.8.1, aiohttp-retry = 2.5.1)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: gs, gs
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Additional Information (if any):
When I run the dvc pull command, it completes without any error:
$ dvc pull dataset/finantials.csv.dvc -r dataset-track
A dataset/finantials.csv
1 file added and 1 file fetched
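Since the pull succeeds, one possible workaround is to pull the file into the workspace and read it locally instead of calling api.get_url(). This is a rough sketch under that assumption: build_pull_command and read_csv_via_pull are hypothetical helper names, and the sketch assumes the dvc executable is on PATH and pandas is installed.

```python
import subprocess


def build_pull_command(dvc_file, remote):
    """Return the same dvc pull invocation as above, as an argument list."""
    return ["dvc", "pull", dvc_file, "-r", remote]


def read_csv_via_pull(dvc_file, data_path, remote):
    """Pull the tracked file into the workspace, then load it with pandas."""
    # check=True raises CalledProcessError if dvc pull exits with a non-zero status.
    subprocess.run(build_pull_command(dvc_file, remote), check=True)
    import pandas as pd  # imported here so build_pull_command works without pandas

    return pd.read_csv(data_path)
```

For this report, the call would be read_csv_via_pull("dataset/finantials.csv.dvc", "dataset/finantials.csv", "dataset-track").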