pull: rate limit not handled with HTTP remote #50

Open
sisp opened this issue May 30, 2023 · 14 comments

sisp commented May 30, 2023

Bug Report

Description

dvc pull aborts with an error when an HTTP remote enforces a rate limit and responds with 429 Too Many Requests. This is problematic because it makes it impossible to pull the complete data from such a remote.

Reproduce

Here is a Starlette-based web app which implements a simple HTTP backend for DVC including a rate limit (1 request per minute, just to make sure it kicks in) for GET requests. Copy this code to, e.g., app.py.

from pathlib import Path
from typing import Tuple

from ratelimit import RateLimitMiddleware, Rule
from ratelimit.backends.simple import MemoryBackend
from ratelimit.types import Scope
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import FileResponse, Response

UPLOAD_ROOT_DIR = Path.cwd() / "dvc-backend"


async def auth(scope: Scope) -> Tuple[str, str]:
    return "user", "default"


async def upload(request: Request) -> Response:
    path = request.path_params["path"]
    upload_path = UPLOAD_ROOT_DIR / path
    upload_path.parent.mkdir(parents=True, exist_ok=True)
    with upload_path.open("wb") as f:
        async for chunk in request.stream():
            f.write(chunk)
    return Response(status_code=201)


async def download(request: Request) -> Response:
    path = request.path_params["path"]
    upload_path = UPLOAD_ROOT_DIR / path
    if not upload_path.exists():
        return Response(status_code=404)
    return FileResponse(upload_path)


app = Starlette(debug=True)
app.add_route("/{path:path}", upload, ["POST"])
app.add_route("/{path:path}", download, ["GET"])
app = RateLimitMiddleware(
    app,
    auth,
    MemoryBackend(),
    {
        r"^/": [Rule(minute=1, method="get")],
    },
)

Example:

  1. Create a new virtual env and activate it:

    python -m venv .venv
    . .venv/bin/activate
  2. Install needed packages:

    pip install dvc starlette asgi-ratelimit uvicorn
  3. Initialize the project with DVC:

    dvc init
  4. Set the local web server as the DVC remote:

    dvc remote add -d local http://127.0.0.1:8000
  5. Create a folder data and add some files:

    mkdir data
    for i in {1..10}; do echo "$i" > data/data_$i.txt; done
  6. Track those files individually with DVC:

    dvc add data/*.txt
  7. In a separate terminal window, activate the virtual env and start the web app:

    . .venv/bin/activate
    uvicorn app:app
  8. Push the files to the DVC remote:

    dvc push
  9. Delete the local files and the local DVC cache:

    rm -rf data/*.txt .dvc/cache
  10. Pull the files from the DVC remote again and observe this error:

    $ dvc pull
    ERROR: failed to transfer '84bc3da1b3e33a18e8d5e1bdd7a18d7a' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/84/bc3da1b3e33a18e8d5e1bdd7a18d7a')
    ERROR: failed to transfer 'b026324c6904b2a9cb4b88d6d61c81d1' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/b0/26324c6904b2a9cb4b88d6d61c81d1')
    ERROR: failed to transfer '9ae0ea9e3c9c6e1b9b6252c8395efdc1' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/9a/e0ea9e3c9c6e1b9b6252c8395efdc1')
    ERROR: failed to transfer '1dcca23355272056f04fe8bf20edfce0' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/1d/cca23355272056f04fe8bf20edfce0')
    ERROR: failed to transfer '26ab0db90d72e28ad0ba1e22ee510510' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/26/ab0db90d72e28ad0ba1e22ee510510')
    ERROR: failed to transfer '31d30eea8d0968d6458e0ad0027c9f80' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/31/d30eea8d0968d6458e0ad0027c9f80')
    ERROR: failed to transfer 'c30f7472766d25af1dc80b3ffc9a58c7' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/c3/0f7472766d25af1dc80b3ffc9a58c7')
    ERROR: failed to transfer '7c5aba41f53293b712fd86d08ed5b36e' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/7c/5aba41f53293b712fd86d08ed5b36e')
    ERROR: failed to transfer '6d7fce9fee471194aa8b5b6e47267f03' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/6d/7fce9fee471194aa8b5b6e47267f03')
    ERROR: failed to transfer '48a24b70a0b376535542b996af517398' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/48/a24b70a0b376535542b996af517398')
    ERROR: failed to pull data from the cloud - 10 files failed to download
  11. Strangely, running dvc pull again while the rate limit is still active produces a different error, which reports the files as missing both locally and on the remote:

    $ dvc pull
    WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
    name: data/data_3.txt, md5: 6d7fce9fee471194aa8b5b6e47267f03
    name: data/data_9.txt, md5: 7c5aba41f53293b712fd86d08ed5b36e
    name: data/data_1.txt, md5: b026324c6904b2a9cb4b88d6d61c81d1
    name: data/data_10.txt, md5: 31d30eea8d0968d6458e0ad0027c9f80
    name: data/data_7.txt, md5: 84bc3da1b3e33a18e8d5e1bdd7a18d7a
    name: data/data_6.txt, md5: 9ae0ea9e3c9c6e1b9b6252c8395efdc1
    name: data/data_2.txt, md5: 26ab0db90d72e28ad0ba1e22ee510510
    name: data/data_4.txt, md5: 48a24b70a0b376535542b996af517398
    name: data/data_5.txt, md5: 1dcca23355272056f04fe8bf20edfce0
    name: data/data_8.txt, md5: c30f7472766d25af1dc80b3ffc9a58c7
    10 files failed
    ERROR: failed to pull data from the cloud - Checkout failed for following targets:
    .../data/data_1.txt
    .../data/data_10.txt
    .../data/data_4.txt
    .../data/data_2.txt
    .../data/data_9.txt
    .../data/data_3.txt
    .../data/data_8.txt
    .../data/data_7.txt
    .../data/data_5.txt
    .../data/data_6.txt
    Is your cache up to date?
    <https://error.dvc.org/missing-files>
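
For reference, the URLs in the errors above can be matched to the files created in step 5. The sketch below is inferred only from the log output (the first two hex characters of the md5 become a directory, the remaining 30 the file name) and is not taken from DVC's source; the function names object_path and object_url are made up for illustration.

```python
import hashlib


def object_path(md5_hex: str) -> str:
    """Split an md5 hex digest into '<first 2 chars>/<remaining 30 chars>'.

    Layout inferred from the URLs in the logs; an assumption, not DVC's API.
    """
    if len(md5_hex) != 32:
        raise ValueError("expected a 32-character md5 hex digest")
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


def object_url(remote: str, content: bytes) -> str:
    """Build the URL under which a file with this content would be requested."""
    digest = hashlib.md5(content).hexdigest()
    return f"{remote.rstrip('/')}/{object_path(digest)}"


# `echo "1" > data/data_1.txt` writes the bytes b"1\n".
print(object_url("http://127.0.0.1:8000", b"1\n"))
# prints http://127.0.0.1:8000/b0/26324c6904b2a9cb4b88d6d61c81d1
```

The printed URL is the one reported for data/data_1.txt (md5 b026324c6904b2a9cb4b88d6d61c81d1) in the output of steps 10 and 11.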

Expected

DVC should inspect the Retry-After header of the 429 response and retry the request once the indicated delay has passed.
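
A minimal sketch of that behaviour, using only the standard library. It is not how DVC, fsspec, or aiohttp implement retries; parse_retry_after and fetch_with_retry are hypothetical helpers, and the request callable stands in for whatever performs the actual HTTP GET. Retry-After may carry either a number of seconds or an HTTP date, so both forms are handled.

```python
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Mapping, Optional, Tuple


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds encoded in a Retry-After value, or None."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "60"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):  # unparseable value
        return None
    if when.tzinfo is None:  # treat naive dates as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def fetch_with_retry(
    request: Callable[[], Tuple[int, Mapping[str, str]]],
    max_attempts: int = 5,
    default_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call request() until it stops returning 429 or attempts run out.

    request is a hypothetical callable returning (status, headers).
    """
    status = 429
    for attempt in range(max_attempts):
        status, headers = request()
        if status != 429:
            return status
        if attempt + 1 < max_attempts:
            # Header names are case-insensitive, so compare in lower case.
            raw = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
            delay = parse_retry_after(raw)
            sleep(default_delay if delay is None else delay)
    return status
```

In the reproduction above, every GET after the first within a minute would be answered with 429, so a client along these lines would wait for the advertised delay before each retry instead of failing all ten transfers at once.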

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.58.1 (pip)
-------------------------
Platform: Python 3.9.13 on Linux-5.13.0-48-generic-x86_64-with-glibc2.31
Subprojects:
	dvc_data = 0.51.0
	dvc_objects = 0.22.0
	dvc_render = 0.5.3
	dvc_task = 0.2.1
	scmrepo = 1.0.3
Supports:
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Config:
	Global: $HOME/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: https, https, http
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/25682416b4608be4a9b899b1736e0974

Additional Information (if any):

Here is the output of the two dvc pull commands with the --verbose flag:

$ dvc pull -v
2023-05-30 15:57:18,132 DEBUG: v2.58.1 (pip), CPython 3.9.13 on Linux-5.13.0-48-generic-x86_64-with-glibc2.31
2023-05-30 15:57:18,133 DEBUG: command: .venv/bin/dvc pull -v
2023-05-30 15:57:18,658 DEBUG: Preparing to transfer data from 'http://127.0.0.1:8000' to '.dvc/cache'
2023-05-30 15:57:18,658 DEBUG: Preparing to collect status from '.dvc/cache'
2023-05-30 15:57:18,659 DEBUG: Collecting status from '.dvc/cache'
2023-05-30 15:57:18,662 DEBUG: Preparing to collect status from 'http://127.0.0.1:8000'
2023-05-30 15:57:18,663 DEBUG: Collecting status from 'http://127.0.0.1:8000'
2023-05-30 15:57:18,663 DEBUG: Querying 10 oids via object_exists
2023-05-30 15:57:18,753 DEBUG: Removing '.dvc/cache/26/.6vCKFo5bUhEfpoTkxmf3Zc.tmp'
2023-05-30 15:57:18,754 ERROR: failed to transfer '26ab0db90d72e28ad0ba1e22ee510510' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/26/ab0db90d72e28ad0ba1e22ee510510')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 306, in transfer
    _try_links(
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 247, in _try_links
    return copy(from_fs, from_path, to_fs, to_path, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 93, in copy
    return _get(
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 197, in _get
    return _get_one(from_paths[0], to_paths[0])
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 189, in _get_one
    return get_file(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 69, in func
    return wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 41, in wrapped
    res = fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 550, in get_file
    self.fs.get_file(from_info, to_info, callback=callback, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 115, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 100, in sync
    raise return_result
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 55, in _runner
    result[0] = await coro
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/26/ab0db90d72e28ad0ba1e22ee510510')

2023-05-30 15:57:18,776 DEBUG: Removing '.dvc/cache/48/.Wg4wi9LR5ad4q6hRJGEQrj.tmp'
2023-05-30 15:57:18,777 DEBUG: Removing '.dvc/cache/1d/.nu6iYNSS5J3o6YG7hyGB5H.tmp'
2023-05-30 15:57:18,779 DEBUG: Removing '.dvc/cache/b0/.7N85YqAnsVAxTLTjqJ8xv3.tmp'
2023-05-30 15:57:18,780 DEBUG: Removing '.dvc/cache/31/.8kMLju7vJgFidrZN7UvZRV.tmp'
2023-05-30 15:57:18,782 DEBUG: Removing '.dvc/cache/9a/.gXNbdnwFuuiheSCGP4vABf.tmp'
2023-05-30 15:57:18,783 DEBUG: Removing '.dvc/cache/c3/.3YuTNggWC7WxTLi2BrpcVD.tmp'
2023-05-30 15:57:18,785 DEBUG: Removing '.dvc/cache/84/.hK2ttvhpToa6JUTFjPQfTy.tmp'
2023-05-30 15:57:18,786 DEBUG: Removing '.dvc/cache/6d/.U5EME8JM3Mno5xVVjTbkfu.tmp'
2023-05-30 15:57:18,787 DEBUG: Removing '.dvc/cache/7c/.2WWom85ncH7gbY2hqbKWCv.tmp'
2023-05-30 15:57:18,788 ERROR: failed to transfer '48a24b70a0b376535542b996af517398' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/48/a24b70a0b376535542b996af517398')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/48/a24b70a0b376535542b996af517398')

2023-05-30 15:57:18,788 ERROR: failed to transfer '1dcca23355272056f04fe8bf20edfce0' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/1d/cca23355272056f04fe8bf20edfce0')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/1d/cca23355272056f04fe8bf20edfce0')

2023-05-30 15:57:18,789 ERROR: failed to transfer 'b026324c6904b2a9cb4b88d6d61c81d1' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/b0/26324c6904b2a9cb4b88d6d61c81d1')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/b0/26324c6904b2a9cb4b88d6d61c81d1')

2023-05-30 15:57:18,790 ERROR: failed to transfer '31d30eea8d0968d6458e0ad0027c9f80' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/31/d30eea8d0968d6458e0ad0027c9f80')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/31/d30eea8d0968d6458e0ad0027c9f80')

2023-05-30 15:57:18,790 ERROR: failed to transfer '9ae0ea9e3c9c6e1b9b6252c8395efdc1' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/9a/e0ea9e3c9c6e1b9b6252c8395efdc1')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/9a/e0ea9e3c9c6e1b9b6252c8395efdc1')

2023-05-30 15:57:18,791 ERROR: failed to transfer 'c30f7472766d25af1dc80b3ffc9a58c7' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/c3/0f7472766d25af1dc80b3ffc9a58c7')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/c3/0f7472766d25af1dc80b3ffc9a58c7')

2023-05-30 15:57:18,791 ERROR: failed to transfer '84bc3da1b3e33a18e8d5e1bdd7a18d7a' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/84/bc3da1b3e33a18e8d5e1bdd7a18d7a')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/84/bc3da1b3e33a18e8d5e1bdd7a18d7a')

2023-05-30 15:57:18,792 ERROR: failed to transfer '6d7fce9fee471194aa8b5b6e47267f03' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/6d/7fce9fee471194aa8b5b6e47267f03')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/6d/7fce9fee471194aa8b5b6e47267f03')

2023-05-30 15:57:18,793 ERROR: failed to transfer '7c5aba41f53293b712fd86d08ed5b36e' - 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/7c/5aba41f53293b712fd86d08ed5b36e')
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_objects/executors.py", line 132, in batch_coros
    result = fut.result()
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/generic.py", line 207, in _get_one_coro
    return await get_coro(from_path, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 84, in func
    return await wrapped(path1, path2, **kw)
  File ".venv/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 52, in wrapped
    res = await fn(*args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 248, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 214, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('http://127.0.0.1:8000/7c/5aba41f53293b712fd86d08ed5b36e')

2023-05-30 15:57:18,793 DEBUG: failed to protect '.dvc/cache/26/ab0db90d72e28ad0ba1e22ee510510' - [Errno 2] No such file or directory: '.dvc/cache/26/ab0db90d72e28ad0ba1e22ee510510'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/26/ab0db90d72e28ad0ba1e22ee510510'

2023-05-30 15:57:18,794 DEBUG: failed to protect '.dvc/cache/48/a24b70a0b376535542b996af517398' - [Errno 2] No such file or directory: '.dvc/cache/48/a24b70a0b376535542b996af517398'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/48/a24b70a0b376535542b996af517398'

2023-05-30 15:57:18,795 DEBUG: failed to protect '.dvc/cache/1d/cca23355272056f04fe8bf20edfce0' - [Errno 2] No such file or directory: '.dvc/cache/1d/cca23355272056f04fe8bf20edfce0'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/1d/cca23355272056f04fe8bf20edfce0'

2023-05-30 15:57:18,796 DEBUG: failed to protect '.dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1' - [Errno 2] No such file or directory: '.dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/b0/26324c6904b2a9cb4b88d6d61c81d1'

2023-05-30 15:57:18,797 DEBUG: failed to protect '.dvc/cache/31/d30eea8d0968d6458e0ad0027c9f80' - [Errno 2] No such file or directory: '.dvc/cache/31/d30eea8d0968d6458e0ad0027c9f80'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/31/d30eea8d0968d6458e0ad0027c9f80'

2023-05-30 15:57:18,798 DEBUG: failed to protect '.dvc/cache/9a/e0ea9e3c9c6e1b9b6252c8395efdc1' - [Errno 2] No such file or directory: '.dvc/cache/9a/e0ea9e3c9c6e1b9b6252c8395efdc1'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/9a/e0ea9e3c9c6e1b9b6252c8395efdc1'

2023-05-30 15:57:18,798 DEBUG: failed to protect '.dvc/cache/c3/0f7472766d25af1dc80b3ffc9a58c7' - [Errno 2] No such file or directory: '.dvc/cache/c3/0f7472766d25af1dc80b3ffc9a58c7'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/c3/0f7472766d25af1dc80b3ffc9a58c7'

2023-05-30 15:57:18,799 DEBUG: failed to protect '.dvc/cache/84/bc3da1b3e33a18e8d5e1bdd7a18d7a' - [Errno 2] No such file or directory: '.dvc/cache/84/bc3da1b3e33a18e8d5e1bdd7a18d7a'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/84/bc3da1b3e33a18e8d5e1bdd7a18d7a'

2023-05-30 15:57:18,799 DEBUG: failed to protect '.dvc/cache/6d/7fce9fee471194aa8b5b6e47267f03' - [Errno 2] No such file or directory: '.dvc/cache/6d/7fce9fee471194aa8b5b6e47267f03'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/6d/7fce9fee471194aa8b5b6e47267f03'

2023-05-30 15:57:18,799 DEBUG: failed to protect '.dvc/cache/7c/5aba41f53293b712fd86d08ed5b36e' - [Errno 2] No such file or directory: '.dvc/cache/7c/5aba41f53293b712fd86d08ed5b36e'
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc_data/hashfile/db/local.py", line 119, in protect
    os.chmod(path, self.CACHE_MODE)
FileNotFoundError: [Errno 2] No such file or directory: '.dvc/cache/7c/5aba41f53293b712fd86d08ed5b36e'

2023-05-30 15:57:18,817 ERROR: failed to pull data from the cloud - 10 files failed to download
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File ".venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 65, in wrapper
    return f(repo, *args, **kwargs)
  File ".venv/lib/python3.9/site-packages/dvc/repo/pull.py", line 35, in pull
    processed_files_count = self.fetch(
  File ".venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 65, in wrapper
    return f(repo, *args, **kwargs)
  File ".venv/lib/python3.9/site-packages/dvc/repo/fetch.py", line 125, in fetch
    raise DownloadError(failed_count)
dvc.exceptions.DownloadError: 10 files failed to download

2023-05-30 15:57:18,820 DEBUG: Analytics is enabled.
2023-05-30 15:57:18,876 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp01ul_m25']'
2023-05-30 15:57:18,877 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp01ul_m25']'
$ dvc pull -v
2023-05-30 15:57:35,317 DEBUG: v2.58.1 (pip), CPython 3.9.13 on Linux-5.13.0-48-generic-x86_64-with-glibc2.31
2023-05-30 15:57:35,317 DEBUG: command: .venv/bin/dvc pull -v
2023-05-30 15:57:35,926 DEBUG: Preparing to transfer data from 'http://127.0.0.1:8000' to '.dvc/cache'
2023-05-30 15:57:35,926 DEBUG: Preparing to collect status from '.dvc/cache'
2023-05-30 15:57:35,927 DEBUG: Collecting status from '.dvc/cache'
2023-05-30 15:57:35,930 DEBUG: Preparing to collect status from 'http://127.0.0.1:8000'
2023-05-30 15:57:35,930 DEBUG: Collecting status from 'http://127.0.0.1:8000'
2023-05-30 15:57:35,931 DEBUG: Querying 10 oids via object_exists
2023-05-30 15:57:36,021 WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: data/data_8.txt, md5: c30f7472766d25af1dc80b3ffc9a58c7
name: data/data_3.txt, md5: 6d7fce9fee471194aa8b5b6e47267f03
name: data/data_10.txt, md5: 31d30eea8d0968d6458e0ad0027c9f80
name: data/data_1.txt, md5: b026324c6904b2a9cb4b88d6d61c81d1
name: data/data_9.txt, md5: 7c5aba41f53293b712fd86d08ed5b36e
name: data/data_7.txt, md5: 84bc3da1b3e33a18e8d5e1bdd7a18d7a
name: data/data_4.txt, md5: 48a24b70a0b376535542b996af517398
name: data/data_5.txt, md5: 1dcca23355272056f04fe8bf20edfce0
name: data/data_6.txt, md5: 9ae0ea9e3c9c6e1b9b6252c8395efdc1
name: data/data_2.txt, md5: 26ab0db90d72e28ad0ba1e22ee510510
2023-05-30 15:57:36,046 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,047 DEBUG: Removing 'data/.FvXRuHSCcQHNqSq7z6DyFF.tmp'
2023-05-30 15:57:36,048 DEBUG: Removing 'data/.FvXRuHSCcQHNqSq7z6DyFF.tmp'
2023-05-30 15:57:36,048 DEBUG: Removing '.dvc/cache/.BuFCdoskczjJpnC28TozdV.tmp'
2023-05-30 15:57:36,052 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,053 DEBUG: Removing 'data/.DpTnPhedxstHQPHHVFAXSZ.tmp'
2023-05-30 15:57:36,054 DEBUG: Removing 'data/.DpTnPhedxstHQPHHVFAXSZ.tmp'
2023-05-30 15:57:36,054 DEBUG: Removing '.dvc/cache/.j9BqZ2pbEoKGZzE6EfbcAn.tmp'
2023-05-30 15:57:36,058 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,058 DEBUG: Removing 'data/.c5GocTuyXV7BGZ4mTc7Z94.tmp'
2023-05-30 15:57:36,059 DEBUG: Removing 'data/.c5GocTuyXV7BGZ4mTc7Z94.tmp'
2023-05-30 15:57:36,060 DEBUG: Removing '.dvc/cache/.Mvrtu4W4dPZnrXEWHyZSqa.tmp'
2023-05-30 15:57:36,063 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,064 DEBUG: Removing 'data/.XPvuqUEvc4MJoN2syaPjYR.tmp'
2023-05-30 15:57:36,064 DEBUG: Removing 'data/.XPvuqUEvc4MJoN2syaPjYR.tmp'
2023-05-30 15:57:36,065 DEBUG: Removing '.dvc/cache/.YwviXYUbS6NFk62YKDvTLH.tmp'
2023-05-30 15:57:36,069 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,070 DEBUG: Removing 'data/.KZKwNL9AkEhLeVmaFaGko8.tmp'
2023-05-30 15:57:36,071 DEBUG: Removing 'data/.KZKwNL9AkEhLeVmaFaGko8.tmp'
2023-05-30 15:57:36,071 DEBUG: Removing '.dvc/cache/.EjsvCrFxy5p6ohALK3AaDa.tmp'
2023-05-30 15:57:36,076 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,077 DEBUG: Removing 'data/.YMsSX7eCVcyXfAWtWK44Pu.tmp'
2023-05-30 15:57:36,078 DEBUG: Removing 'data/.YMsSX7eCVcyXfAWtWK44Pu.tmp'
2023-05-30 15:57:36,078 DEBUG: Removing '.dvc/cache/.S2kSBiWXRPxt9kSYGjJa7P.tmp'
2023-05-30 15:57:36,082 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,083 DEBUG: Removing 'data/.WdZuyA7QDyLvh94nFaaurA.tmp'
2023-05-30 15:57:36,084 DEBUG: Removing 'data/.WdZuyA7QDyLvh94nFaaurA.tmp'
2023-05-30 15:57:36,084 DEBUG: Removing '.dvc/cache/.2nxHqK3jyEimPCUaH5UYyj.tmp'
2023-05-30 15:57:36,087 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,088 DEBUG: Removing 'data/.k6Qqg2PWAb2aXz8CuCEm8z.tmp'
2023-05-30 15:57:36,089 DEBUG: Removing 'data/.k6Qqg2PWAb2aXz8CuCEm8z.tmp'
2023-05-30 15:57:36,089 DEBUG: Removing '.dvc/cache/.NnNyqdqNUBxg9oirW8bui2.tmp'
2023-05-30 15:57:36,092 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,093 DEBUG: Removing 'data/.5tAH8NDCP6WsEufeDyaNLZ.tmp'
2023-05-30 15:57:36,094 DEBUG: Removing 'data/.5tAH8NDCP6WsEufeDyaNLZ.tmp'
2023-05-30 15:57:36,094 DEBUG: Removing '.dvc/cache/.ijfp6wqNhLjd5s7YgrCNrw.tmp'
2023-05-30 15:57:36,097 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-05-30 15:57:36,098 DEBUG: Removing 'data/.Di5FPPJjszPLisiQnCJzni.tmp'
2023-05-30 15:57:36,099 DEBUG: Removing 'data/.Di5FPPJjszPLisiQnCJzni.tmp'
2023-05-30 15:57:36,099 DEBUG: Removing '.dvc/cache/.MJgZ5LhmrFGLgyhocSv9gc.tmp'
10 files failed
2023-05-30 15:57:36,101 ERROR: failed to pull data from the cloud - Checkout failed for following targets:
data/data_4.txt
data/data_6.txt
data/data_9.txt
data/data_7.txt
data/data_2.txt
data/data_1.txt
data/data_10.txt
data/data_5.txt
data/data_8.txt
data/data_3.txt
Is your cache up to date?
<https://error.dvc.org/missing-files>
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File ".venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 65, in wrapper
    return f(repo, *args, **kwargs)
  File ".venv/lib/python3.9/site-packages/dvc/repo/pull.py", line 47, in pull
    stats = self.checkout(
  File ".venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 65, in wrapper
    return f(repo, *args, **kwargs)
  File ".venv/lib/python3.9/site-packages/dvc/repo/checkout.py", line 109, in checkout
    raise CheckoutError(stats["failed"], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
data/data_4.txt
data/data_6.txt
data/data_9.txt
data/data_7.txt
data/data_2.txt
data/data_1.txt
data/data_10.txt
data/data_5.txt
data/data_8.txt
data/data_3.txt
Is your cache up to date?
<https://error.dvc.org/missing-files>

2023-05-30 15:57:36,105 DEBUG: Analytics is enabled.
2023-05-30 15:57:36,162 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmph5guxsgc']'
2023-05-30 15:57:36,164 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmph5guxsgc']

It may be possible that the problem needs to be solved within fsspec, but I think raising this issue here is a good starting point for the initial discussion, to track it, and to reduce duplicate issues.

@daavoo

daavoo commented Jun 6, 2023

Hi @sisp , thanks for the detailed report.

The request makes sense but we discussed internally that implementing the right solution at the right level might be tricky.

In your specific case, what is the request limit on the server?
Have you tried to reduce the number of --jobs in pull as a workaround?

@sisp
Contributor Author

sisp commented Jun 6, 2023

@daavoo The rate limit is 1000 requests per 15 seconds. To reduce the risk of running into the limit, we use --jobs 1, but that's not a reliable workaround because it merely slows down the download but doesn't handle rate limit responses when they occur.

Do you know where the problem would need to be fixed? In dvc or one of the dvc-* packages, or in fsspec, or elsewhere?

@daavoo

daavoo commented Jun 6, 2023

Do you know where the problem would need to be fixed? In dvc or one of the dvc-* packages, or in fsspec, or elsewhere?

It is unclear if it would be an acceptable or simple change to make in fsspec.

I haven't looked into the details, but it would probably make sense to handle it in dvc-http by extending or modifying:

https://github.com/iterative/dvc-http/blob/2cd12c235435e52813360cdf05c7ba18ea01337c/dvc_http/__init__.py#L100C46-L105

@sisp
Contributor Author

sisp commented Jun 6, 2023

Okay. If you don't have the capacity to make this change, I could look into it and attempt a contribution if somebody could review it then.

@efiop
Contributor

efiop commented Jun 24, 2023

@sisp You should be able to just run dvc fetch/pull in a loop until it no longer errors out. It is designed to recover from any errors and to download only the files that are still missing on subsequent runs.
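
A minimal sketch of the loop described above, assuming that `dvc pull` exits with a non-zero status when some files fail. The helper name `run_until_success`, the attempt limit, and the delay are illustrative choices, not part of DVC:

```python
import subprocess
import time
from typing import Callable, Sequence


def run_until_success(
    cmd: Sequence[str],
    max_attempts: int = 10,
    delay_seconds: float = 15.0,
    run: Callable = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-run cmd until it exits with status 0; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        result = run(list(cmd))
        if result.returncode == 0:
            return attempt
        # Wait before the next run so the server's rate-limit window can reset.
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise RuntimeError(
        f"{' '.join(cmd)} still failing after {max_attempts} attempts"
    )


# Hypothetical usage (not executed here):
# run_until_success(["dvc", "pull"])
```

The `run` and `sleep` parameters exist only so that the loop can be exercised without invoking DVC or actually waiting.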

@efiop
Contributor

efiop commented Jun 24, 2023

Also, http remote is very basic and we really didn't want to try to support all custom https servers with their quirks. Maybe you could consider using some s3-compatible server for handling the data? (e.g. minio).

@efiop efiop transferred this issue from iterative/dvc Jun 24, 2023
@shcheklein
Member

@efiop is right about simplicity and that fetch / pull can be used in a loop. We do have an example, though (GDrive), where the whole implementation is about 10 lines of code (plus a wrapper on top of the operations): https://github.com/iterative/PyDrive2/blob/main/pydrive2/test/test_util.py#L46-L54

The problem here is that HTTP is a general remote, and while 429 is the most common code for rate limiting, it may not be the only one. I'm not sure how often servers use 429, whether it is always safe to retry in those cases, etc. @sisp do you have some information about that?

Overall, I feel that it can be a convenience feature if 429 is indeed used very often and it's always safe to retry on it.
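
To make the 429 question concrete, here is a rough sketch of what "retry on 429" could look like for idempotent GET requests. It is not the dvc-http or fsspec implementation. The `send` callable, the `(status, headers, body)` tuple shape, and the backoff defaults are assumptions made for illustration. RFC 6585 allows a 429 response to carry a `Retry-After` header, which the sketch honors when it is given in seconds:

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape for this sketch: (status, headers, body).
Response = Tuple[int, Mapping[str, str], bytes]


def get_with_429_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() again while it returns 429, up to max_attempts calls."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        status, headers, body = send()
        # Any non-429 response, or the last allowed attempt, is returned as-is.
        if status != 429 or attempt >= max_attempts:
            return status, headers, body
        # Retry-After may also be an HTTP date; this sketch only handles
        # integer seconds and otherwise falls back to exponential backoff.
        # Real clients expose case-insensitive headers; a plain dict does not.
        retry_after = headers.get("Retry-After", "").strip()
        if retry_after.isdigit():
            wait = float(retry_after)
        else:
            wait = default_wait * 2 ** (attempt - 1)
        sleep(wait)
        attempt += 1
```

Only 429 is retried here, so servers that signal rate limiting with other status codes would still fail immediately, which is the concern raised above.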

@skshetry
Member

skshetry commented Jun 24, 2023

There's also a question whether dvc should respect that at all. I'd say dvc should be aggressive on downloads/uploads and provide means to control that and it does so via --jobs.

@sisp
Contributor Author

sisp commented Jun 24, 2023

@sisp You should be able to just run dvc fetch/pull in a loop until it no longer errors out. It is designed to recover from any errors and to download only the files that are still missing on subsequent runs.

That DX is awful, both locally and especially in a CI job. It's at best a poor workaround.

Also, http remote is very basic and we really didn't want to try to support all custom https servers with their quirks. Maybe you could consider using some s3-compatible server for handling the data? (e.g. minio).

No, let me elaborate. My DVC remote is GitLab's generic package registry on our corporate self-hosted GitLab instance. This approach has significant advantages over, e.g., S3:

  • Unified access and permissions management through GitLab's project/group membership management system where member roles imply also access permissions on the DVC remote (i.e. the generic package registry)

  • Usage of well-known GitLab access tokens:

    • PAT for user access (typically locally and interactively)
    • Deploy token for programmatic access from a service
    • CI job token for access in a CI job (which implicitly inherits the access permissions of the user triggering a pipeline)
  • No need for extra infrastructure, not even managed one like S3

  • No need for complex identity federation

  • Inheritance of security features from our self-hosted GitLab instance (e.g. impossible to set public access to the package registry)

See also a feature request for first-class support of DVC in GitLab that I've submitted.

The problem here is that HTTP is a general remote, and while 429 is the most common code for rate limiting, it may not be the only one. I'm not sure how often servers use 429, whether it is always safe to retry in those cases, etc. @sisp do you have some information about that?

HTTP 429 looks pretty standard to me. I think it should be supported; custom rate-limit responses should not, though.

There's also a question whether dvc should respect that at all. I'd say dvc should be aggressive on downloads/uploads and provide means to control that and it does so via --jobs.

That's just a hack and not reliable. See also my remark about DX.

@shcheklein
Member

No, let me elaborate. My DVC remote is GitLab's generic package registry on our corporate self-hosted GitLab instance. This approach has significant advantages over, e.g., S3:

Sounds really great to me!

Is the rate limiting the only missing piece for this? Does auth come "automatically" in this case? Do GET and POST work out-of-the-box?

See also a feature request for first-class support of DVC in GitLab that I've submitted.

thanks! also a good feature request.

HTTP 429 looks pretty standard to me. I think it should be supported; custom rate-limit responses should not, though.

yes, it makes sense, I'm just curious about some real world stats on this (e.g. in GDrive as you can see it's not 429).

Overall, again, I think that if this gives us better integration with GitLab storage and the scope is not huge, it's worth adding as a convenience feature. I feel we would need to make it optional, though, to be safe.

@skshetry
Member

skshetry commented Jun 24, 2023

@sisp, does this work in 3.0? We changed the remote structure from root <remote_url>/ to <remote_url>/files/md5/.

@sisp
Contributor Author

sisp commented Jun 24, 2023

Is the rate limiting the only missing piece for this?

Rate limiting and per-file size limit, although the latter is typically not problematic because large files harm deduplication and thus should be avoided.

Does auth come "automatically" in this case? Do GET and POST work out-of-the-box?

Yes, auth comes automatically. Only the method needs to be set to PUT and the auth header needs to match the token type. It's really smooth.
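
As an illustration of "the auth header needs to match the token type", here is a small sketch. The header names are the ones I understand GitLab's package registry documentation to use and should be verified against your GitLab version. The function name and the returned dictionary shape are hypothetical and are not DVC configuration:

```python
# Assumed GitLab header names per token type; verify against the GitLab docs.
GITLAB_TOKEN_HEADERS = {
    "personal_access_token": "PRIVATE-TOKEN",
    "deploy_token": "DEPLOY-TOKEN",
    "ci_job_token": "JOB-TOKEN",
}


def upload_request_settings(token_type: str, token: str) -> dict:
    """Return the HTTP method and auth header for an upload request."""
    try:
        header_name = GITLAB_TOKEN_HEADERS[token_type]
    except KeyError:
        raise ValueError(f"unknown token type: {token_type!r}") from None
    # Uploads use PUT, as stated in the comment above.
    return {"method": "PUT", "headers": {header_name: token}}
```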

@sisp, does this work in 3.0? We changed the remote structure from root <remote_url>/ to <remote_url>/files/md5/.

🙈 Oh no! No, this doesn't work because now there are 4 path elements and GitLab's generic package registry supports only 3. Before, the cache structure was mapped to the generic package registry API URL path template

/projects/:id/packages/generic/:package_name/:package_version/:file_name

as follows:

  • :package_name: dvc (or any other arbitrary name)
  • :package_version: The first 2 characters of a file's content hash
  • :file_name: The remaining 30 characters of a file's content hash
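
The mapping above can be sketched as follows. The function name and the example project ID are made up for illustration, and the sketch assumes a plain 32-character MD5 hex digest:

```python
def generic_package_path(
    project_id: str, md5_hex: str, package_name: str = "dvc"
) -> str:
    """Map a content hash onto the generic package registry URL path."""
    if len(md5_hex) <= 2:
        raise ValueError("hash is too short to split into version and file name")
    package_version = md5_hex[:2]  # first 2 characters of the hash
    file_name = md5_hex[2:]        # remaining characters of the hash
    return (
        f"/projects/{project_id}/packages/generic/"
        f"{package_name}/{package_version}/{file_name}"
    )


# Hypothetical project ID 42 and the MD5 of empty content:
print(generic_package_path("42", "d41d8cd98f00b204e9800998ecf8427e"))
# /projects/42/packages/generic/dvc/d4/1d8cd98f00b204e9800998ecf8427e
```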

Is there any way to configure a mapping between the local cache and the remote cache? We'd need to strip /files/md5 or map it to, e.g., /dvc-files-md5.
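
For illustration only, the requested mapping could look roughly like this. DVC does not expose such a hook as far as this thread shows, so the function name and the idea of applying it to a remote-relative path are assumptions:

```python
from pathlib import PurePosixPath


def to_three_segments(
    remote_rel_path: str,
    prefix: str = "files/md5",
    flattened: str = "dvc-files-md5",
) -> str:
    """Collapse '<prefix>/<2 chars>/<rest>' into '<flattened>/<2 chars>/<rest>'."""
    parts = PurePosixPath(remote_rel_path.strip("/")).parts
    prefix_parts = PurePosixPath(prefix).parts
    if parts[: len(prefix_parts)] != prefix_parts:
        raise ValueError(f"path does not start with {prefix!r}")
    rest = parts[len(prefix_parts):]
    if len(rest) != 2:
        raise ValueError("expected exactly two segments after the prefix")
    # Three segments fit :package_name/:package_version/:file_name.
    return "/".join((flattened, *rest))
```

With the defaults, `files/md5/d4/1d8c...` becomes `dvc-files-md5/d4/1d8c...`, which again has three path elements.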

Or would you be open to having a new remote plugin for GitLab which would be based on the HTTP remote plugin but could handle peculiarities like this path mapping? I'd contribute it; I really don't want to lose the GitLab integration.

@shcheklein
Member

Or would you be open to having a new remote plugin for GitLab which would be based on the HTTP remote plugin but could handle peculiarities like this path mapping? I'd contribute it; I really don't want to lose the GitLab integration.

Yes, I think it can work. They all share the same structure, and I think nothing prevents us from creating a new remote type.

@sisp
Contributor Author

sisp commented Jun 27, 2023

Nice! I'll keep you posted on the DVC GitLab plugin.
