
push/pull: fails to finish querying the remote (azure blob) #7337

Closed
rubenpraets opened this issue Feb 2, 2022 · 8 comments
Labels: A: data-sync (related to dvc get/fetch/import/pull/push), bug (did we break something?), fs: azure (related to the Azure filesystem), research

Comments

@rubenpraets

Bug Report

Description

When trying to push a new directory structure containing relatively many small files (e.g. 18k files of ~50 kB) to our Azure Blob Storage remote (which contains ~1.2 million files), the operation quite reliably fails while querying the remote for existing hashes. See below for a full stack trace.

On my colleague's PC, the error consistently occurs after roughly 6-8 minutes of querying, when the operation is nearly finished. Occasionally such a push succeeds, but at best once in every ten tries.
On my own PC I have had the same problem in the past (albeit less reliably), but right now I am unable to reproduce it. The query step also ran a little faster on my machine, completing in about 5 minutes.

My colleague also hits this problem when trying to pull data from the remote; there too it fails while querying for hashes.

Reproduce

This is probably hard to reproduce, but suppose ./data is a directory with a lot of new files to be pushed to a remote that already contains many more files:

  1. dvc add ./data
  2. dvc push ./data.dvc

Expected

The new data is successfully pushed to the remote.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.8 on Windows-10-10.0.22000-SP0
Supports:
        azure (adlfs = 2021.10.0, knack = 0.8.2, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.8.3),
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink
Cache directory: NTFS on C:\
Caches: local
Remotes: azure
Workspace directory: NTFS on C:\
Repo: dvc, git

Additional Information (if any):

We have already tried upgrading the libraries involved in the stack trace to their newest versions, to no avail. If anything, this led to additional errors being printed about unavailable link types and the like, though I don't think those are related to the problem above.

On the aiohttp GitHub repository there are a number of issues about the same error we get, aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed. In particular the last part of the discussion at aio-libs/aiohttp#3904 caught my eye, since it mentions both Azure Blob Storage and connections being closed after 5 minutes. The suggested fix is to add a Connection: keep-alive header to the request, so I went digging and patched the aiohttp code to add this header to all requests. At first glance this seemed to solve the problem, but sadly it is back again. I don't know whether this could be done in a cleaner way from the DVC code, but I'm willing to try if it works.
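For reference, the workaround discussed in aio-libs/aiohttp#3904 boils down to attaching a Connection: keep-alive header to every outgoing request. A minimal stdlib sketch of what that amounts to on a single request (the URL is made up for illustration; the real change would have to reach the aiohttp session that adlfs/azure-core create internally, which is why we resorted to patching aiohttp directly):

```python
import urllib.request

# Build a request that explicitly asks the server to keep the TCP
# connection open between requests (hypothetical URL, illustration only).
req = urllib.request.Request(
    "https://example.blob.core.windows.net/container?comp=list",
    headers={"Connection": "keep-alive"},
)

# The header travels with this one request; an equivalent fix for DVC
# would need to set it as a default header on the underlying aiohttp
# client session rather than on each request by hand.
print(req.get_header("Connection"))  # -> keep-alive
```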

As a last resort I told my colleague to push new files in smaller chunks. This makes DVC query individual hashes from the remote (via object_exists) instead of listing it entirely (via traverse), which works for now. I'm not sure, though, whether this method would hit the same problem if it had to query a larger number of files and go past the 5-minute mark.

In any case, my colleague is now forced to upload data in chunks of ~2000 files, which is really not practical considering that he routinely needs to upload tens of thousands of files. I hope you can help us out; if not, we would appreciate any pointers to where we might find help.
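For completeness, the batching workaround can at least be scripted. A rough sketch (the chunk size and the commented dvc invocations are illustrative; 2000 is simply the batch size that has worked for us so far):

```python
from typing import Iterator, List

def chunked(paths: List[str], size: int = 2000) -> Iterator[List[str]]:
    """Yield successive batches of at most `size` paths."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]

# Each batch would then be added and pushed separately, e.g.:
#   for batch in chunked(all_new_files):
#       subprocess.run(["dvc", "add", *batch], check=True)
#       subprocess.run(["dvc", "push"], check=True)
# Small batches keep DVC below its traverse threshold, so it queries
# hashes individually via object_exists instead of listing the remote.

print(sum(1 for _ in chunked([str(i) for i in range(5000)])))  # -> 3
```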

Verbose output from a dvc push (after updating the libraries):

2022-01-31 14:57:41,969 DEBUG: Adding 'C:\Users\piete\PycharmProjects\training_pipeline_v3\.dvc\config.local' to gitignore file.
2022-01-31 14:57:41,981 DEBUG: Adding 'C:\Users\piete\PycharmProjects\training_pipeline_v3\.dvc\tmp' to gitignore file.
2022-01-31 14:57:41,987 DEBUG: Adding 'C:\Users\piete\PycharmProjects\training_pipeline_v3\.dvc\cache' to gitignore file.
2022-01-31 14:57:43,163 DEBUG: Preparing to transfer data from 'C:\Users\piete\PycharmProjects\training_pipeline_v3\.dvc\cache' to 'dvc/data_v3'
2022-01-31 14:57:43,163 DEBUG: Preparing to collect status from 'dvc/data_v3'
2022-01-31 14:57:43,180 DEBUG: Collecting status from 'dvc/data_v3'
2022-01-31 14:57:51,618 DEBUG: Querying 34 hashes via object_exists
2022-01-31 14:57:53,094 DEBUG: Querying 1 hashes via object_exists
2022-01-31 14:57:56,262 DEBUG: Estimated remote size: 1241088 files
2022-01-31 14:57:56,262 DEBUG: Querying '8678' hashes via traverse
2022-01-31 15:05:28,350 ERROR: unexpected error - Response payload is not completed
------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 394, in load_body
    self._content = await self.internal_response.read()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\client_reqrep.py", line 1036, in read
    self._body = await self.content.read()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 375, in read
    block = await self.readany()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 397, in readany
    await self._wait("readany")
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\main.py", line 55, in main
    ret = cmd.do_run()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\command\base.py", line 45, in do_run
    return self.run()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\command\data_sync.py", line 57, in run
    processed_files_count = self.repo.push(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\repo\__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\repo\push.py", line 56, in push
    pushed += self.cloud.push(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\data_cloud.py", line 85, in push
    return transfer(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\transfer.py", line 153, in transfer
    status = compare_status(src, dest, obj_ids, check_deleted=False, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\status.py", line 158, in compare_status
    dest_exists, dest_missing = status(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\status.py", line 131, in status
    exists.update(odb.hashes_exist(hashes, name=odb.fs_path, **kwargs))
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 499, in hashes_exist
    remote_hashes = set(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 334, in _list_hashes_traverse
    yield from itertools.chain.from_iterable(in_remote)
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 324, in list_with_update
    return list(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 215, in _list_hashes
    for path in self._list_paths(prefix, progress_callback):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 195, in _list_paths
    for file_info in self.fs.find(fs_path, prefix=prefix):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\fsspec_wrapper.py", line 179, in find
    files = self.fs.find(with_prefix, prefix=self.path.parts(path)[-1])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 965, in find
    return sync(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\fsspec\asyn.py", line 71, in sync
    raise return_result
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\fsspec\asyn.py", line 25, in _runner
    result[0] = await coro
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 999, in _find
    infos = await self._details([b async for b in blobs])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 999, in <listcomp>
    infos = await self._details([b async for b in blobs])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 154, in __anext__
    return await self.__anext__()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 157, in __anext__
    self._page = await self._page_iterator.__anext__()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 99, in __anext__
    self._response = await self._get_next(self.continuation_token)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\aio\_list_blobs_helper.py", line 78, in _get_next_cb
    process_storage_error(error)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\response_handlers.py", line 89, in process_storage_error
    raise storage_error
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\aio\_list_blobs_helper.py", line 71, in _get_next_cb
    return await self._command(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_generated\aio\operations\_container_operations.py", line 1456, in list_blob_flat_segment
    pipeline_response = await self._client._pipeline.run(request, stream=False, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 215, in run
    return await first_node.send(pipeline_request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  [Previous line repeated 4 more times]
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\policies\_redirect_async.py", line 64, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 125, in send
    raise err
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 99, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 56, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 116, in send
    await self._sender.send(request.http_request, **request.context.options),
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\base_client_async.py", line 180, in send
    return await self._transport.send(request, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 251, in send
    await response.load_body()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 398, in load_body
    raise IncompleteReadError(err, error=err)
azure.core.exceptions.IncompleteReadError: Response payload is not completed
------------------------------------------------------------
2022-01-31 15:05:28,664 DEBUG: Adding 'C:\Users\piete\PycharmProjects\training_pipeline_v3\.dvc\config.local' to gitignore file.
2022-01-31 15:05:28,681 DEBUG: Adding 'C:\Users\piete\PycharmProjects\training_pipeline_v3\.dvc\tmp' to gitignore file.
2022-01-31 15:05:28,692 DEBUG: Adding 'C:\Users\piete\PycharmProjects\training_pipeline_v3\.dvc\cache' to gitignore file.
2022-01-31 15:05:28,698 DEBUG: [Errno 129] no more link types left to try out: [Errno 129] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 129] reflink is not supported on 'Windows'
------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 394, in load_body
    self._content = await self.internal_response.read()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\client_reqrep.py", line 1036, in read
    self._body = await self.content.read()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 375, in read
    block = await self.readany()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 397, in readany
    await self._wait("readany")
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\main.py", line 55, in main
    ret = cmd.do_run()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\command\base.py", line 45, in do_run
    return self.run()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\command\data_sync.py", line 57, in run
    processed_files_count = self.repo.push(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\repo\__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\repo\push.py", line 56, in push
    pushed += self.cloud.push(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\data_cloud.py", line 85, in push
    return transfer(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\transfer.py", line 153, in transfer
    status = compare_status(src, dest, obj_ids, check_deleted=False, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\status.py", line 158, in compare_status
    dest_exists, dest_missing = status(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\status.py", line 131, in status
    exists.update(odb.hashes_exist(hashes, name=odb.fs_path, **kwargs))
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 499, in hashes_exist
    remote_hashes = set(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 334, in _list_hashes_traverse
    yield from itertools.chain.from_iterable(in_remote)
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 324, in list_with_update
    return list(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 215, in _list_hashes
    for path in self._list_paths(prefix, progress_callback):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 195, in _list_paths
    for file_info in self.fs.find(fs_path, prefix=prefix):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\fsspec_wrapper.py", line 179, in find
    files = self.fs.find(with_prefix, prefix=self.path.parts(path)[-1])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 965, in find
    return sync(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\fsspec\asyn.py", line 71, in sync
    raise return_result
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\fsspec\asyn.py", line 25, in _runner
    result[0] = await coro
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 999, in _find
    infos = await self._details([b async for b in blobs])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 999, in <listcomp>
    infos = await self._details([b async for b in blobs])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 154, in __anext__
    return await self.__anext__()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 157, in __anext__
    self._page = await self._page_iterator.__anext__()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 99, in __anext__
    self._response = await self._get_next(self.continuation_token)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\aio\_list_blobs_helper.py", line 78, in _get_next_cb
    process_storage_error(error)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\response_handlers.py", line 89, in process_storage_error
    raise storage_error
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\aio\_list_blobs_helper.py", line 71, in _get_next_cb
    return await self._command(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_generated\aio\operations\_container_operations.py", line 1456, in list_blob_flat_segment
    pipeline_response = await self._client._pipeline.run(request, stream=False, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 215, in run
    return await first_node.send(pipeline_request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  [Previous line repeated 4 more times]
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\policies\_redirect_async.py", line 64, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 125, in send
    raise err
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 99, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 56, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 116, in send
    await self._sender.send(request.http_request, **request.context.options),
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\base_client_async.py", line 180, in send
    return await self._transport.send(request, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 251, in send
    await response.load_body()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 398, in load_body
    raise IncompleteReadError(err, error=err)
azure.core.exceptions.IncompleteReadError: Response payload is not completed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 28, in _link
    func(from_path, to_path)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\local.py", line 148, in reflink
    System.reflink(from_info, to_info)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\system.py", line 114, in reflink
    raise OSError(
OSError: [Errno 129] reflink is not supported on 'Windows'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 129] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 129] no more link types left to try out
------------------------------------------------------------
2022-01-31 15:05:28,701 DEBUG: Removing 'C:\Users\piete\PycharmProjects\.3EvryqNzDqDxdTtvMemCNj.tmp'
2022-01-31 15:05:28,711 DEBUG: Removing 'C:\Users\piete\PycharmProjects\.3EvryqNzDqDxdTtvMemCNj.tmp'
2022-01-31 15:05:28,713 DEBUG: [Errno 129] no more link types left to try out: [Errno 129] 'symlink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [WinError 1314] A required privilege is not held by the client: 'C:\\Users\\piete\\PycharmProjects\\training_pipeline_v3\\.dvc\\cache\\.7KdR4X4h4xs7K2tQhBeJVw.tmp' -> 'C:\\Users\\piete\\PycharmProjects\\.3EvryqNzDqDxdTtvMemCNj.tmp'
------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 394, in load_body
    self._content = await self.internal_response.read()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\client_reqrep.py", line 1036, in read
    self._body = await self.content.read()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 375, in read
    block = await self.readany()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 397, in readany
    await self._wait("readany")
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\aiohttp\streams.py", line 304, in _wait
    await waiter
aiohttp.client_exceptions.ClientPayloadError: Response payload is not completed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\main.py", line 55, in main
    ret = cmd.do_run()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\command\base.py", line 45, in do_run
    return self.run()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\command\data_sync.py", line 57, in run
    processed_files_count = self.repo.push(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\repo\__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\repo\push.py", line 56, in push
    pushed += self.cloud.push(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\data_cloud.py", line 85, in push
    return transfer(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\transfer.py", line 153, in transfer
    status = compare_status(src, dest, obj_ids, check_deleted=False, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\status.py", line 158, in compare_status
    dest_exists, dest_missing = status(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\status.py", line 131, in status
    exists.update(odb.hashes_exist(hashes, name=odb.fs_path, **kwargs))
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 499, in hashes_exist
    remote_hashes = set(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 334, in _list_hashes_traverse
    yield from itertools.chain.from_iterable(in_remote)
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
  File "C:\Users\piete\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 324, in list_with_update
    return list(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 215, in _list_hashes
    for path in self._list_paths(prefix, progress_callback):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\objects\db\base.py", line 195, in _list_paths
    for file_info in self.fs.find(fs_path, prefix=prefix):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\fsspec_wrapper.py", line 179, in find
    files = self.fs.find(with_prefix, prefix=self.path.parts(path)[-1])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 965, in find
    return sync(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\fsspec\asyn.py", line 71, in sync
    raise return_result
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\fsspec\asyn.py", line 25, in _runner
    result[0] = await coro
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 999, in _find
    infos = await self._details([b async for b in blobs])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\adlfs\spec.py", line 999, in <listcomp>
    infos = await self._details([b async for b in blobs])
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 154, in __anext__
    return await self.__anext__()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 157, in __anext__
    self._page = await self._page_iterator.__anext__()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\async_paging.py", line 99, in __anext__
    self._response = await self._get_next(self.continuation_token)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\aio\_list_blobs_helper.py", line 78, in _get_next_cb
    process_storage_error(error)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\response_handlers.py", line 89, in process_storage_error
    raise storage_error
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\aio\_list_blobs_helper.py", line 71, in _get_next_cb
    return await self._command(
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_generated\aio\operations\_container_operations.py", line 1456, in list_blob_flat_seg
ment
    pipeline_response = await self._client._pipeline.run(request, stream=False, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 215, in run
    return await first_node.send(pipeline_request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  [Previous line repeated 4 more times]
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\policies\_redirect_async.py", line 64, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 125, in send
    raise err
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 99, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\policies_async.py", line 56, in send
    response = await self.next.send(request)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 83, in send
    response = await self.next.send(request)  # type: ignore
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\_base_async.py", line 116, in send
    await self._sender.send(request.http_request, **request.context.options),
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\storage\blob\_shared\base_client_async.py", line 180, in send
    return await self._transport.send(request, **kwargs)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 251, in send
    await response.load_body()
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\azure\core\pipeline\transport\_aiohttp.py", line 398, in load_body
    raise IncompleteReadError(err, error=err)
azure.core.exceptions.IncompleteReadError: Response payload is not completed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 28, in _link
    func(from_path, to_path)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\local.py", line 115, in symlink
    System.symlink(from_info, to_info)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\system.py", line 43, in symlink
    os.symlink(source, link_name)
OSError: [WinError 1314] A required privilege is not held by the client: 'C:\\Users\\piete\\PycharmProjects\\training_pipeline_v3\\.dvc\\cache\\.7KdR4X4h4xs7K2tQhBeJVw.tmp' -> 'C:\\Users\\piete\\PycharmProjects\\.3EvryqNzDqDxdTtvMemCNj.tmp'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 129] 'symlink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "c:\users\piete\pycharmprojects\training_pipeline_v3\venv\lib\site-packages\dvc\fs\utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 129] no more link types left to try out
------------------------------------------------------------
2022-01-31 15:05:28,735 DEBUG: Removing 'C:\Users\piete\PycharmProjects\.3EvryqNzDqDxdTtvMemCNj.tmp'
2022-01-31 15:05:28,735 DEBUG: Removing 'C:\Users\piete\PycharmProjects\training_pipeline_v3\.dvc\cache\.7KdR4X4h4xs7K2tQhBeJVw.tmp'
2022-01-31 15:05:28,748 DEBUG: Version info for developers:
DVC version: 2.9.3 (pip)
---------------------------------
Platform: Python 3.8.8 on Windows-10-10.0.22000-SP0
Supports:
        azure (adlfs = 2021.10.0, knack = 0.8.2, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.8.3),
        webhdfs (fsspec = 2022.1.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink
Cache directory: NTFS on C:\
Caches: local
Remotes: azure
Workspace directory: NTFS on C:\
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-01-31 15:05:28,753 DEBUG: Analytics is enabled.
2022-01-31 15:05:29,319 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', 'C:\\Users\\piete\\AppData\\Local\\Temp\\tmpv8qbqtt0']'
2022-01-31 15:05:29,339 DEBUG: Spawned '['daemon', '-q', 'analytics', 'C:\\Users\\piete\\AppData\\Local\\Temp\\tmpv8qbqtt0']'
@dtrifiro dtrifiro added bug Did we break something? fs: azure Related to the Azure filesystem research labels Feb 3, 2022
@rubenpraets
Author

Some new info:
Via the Azure portal we have found information about this bug that confirms one of the suspicions above. We are told the following:

This storage account is experiencing client timeouts. That means the Storage system is waiting too long for responses from the client or the client is taking too long to write/read data to/from Storage. In the Storage Account Metrics, this will be reflected as a high E2E latency with a low server latency.

So it seems the connection is indeed closed from the server side. Why that happens I don't know, because from our end it looks like dvc is continually sending requests to the blob storage to list it fully.

I have also tested whether pulling data from the remote gives any trouble on my machine, since my colleague has problems with that too. It seems to work on my end, but I spotted a strange difference when going through the logs included below. Here the estimated remote size is ~500k files, less than half of the estimate my colleague got when pushing (see the logs in my previous comment). Even so, dvc decided to query the 59k images in this data set separately instead of via traverse (for some reason this is not clear from the log, but I saw the progress bar counting up to 59k before the files began downloading).

This all seems very weird to me, as I'd say that with more files to query and a smaller estimated remote size (compared to the push from my previous comment), it would only be logical for dvc to choose the traverse method here as well. What's more, the query for 59k separate files was over in seconds on my machine, while querying via traverse took minutes when I tried to push a data set earlier. With all this I just don't understand why dvc chooses to query via traverse when doing a push.

Could this be part of the reason we hit a timeout, i.e. that dvc's estimates for the cost of querying the remote are somehow wrong?

2022-02-04 15:59:30,970 DEBUG: Adding 'C:\Users\ruben\PycharmProjects\AIP\.dvc\config.local' to gitignore file.
2022-02-04 15:59:30,997 DEBUG: Adding 'C:\Users\ruben\PycharmProjects\AIP\.dvc\tmp' to gitignore file.
2022-02-04 15:59:30,998 DEBUG: Adding 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache' to gitignore file.
2022-02-04 15:59:40,020 DEBUG: Preparing to transfer data from 'dvc/data_v3' to 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 15:59:40,021 DEBUG: Preparing to collect status from 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 15:59:40,022 DEBUG: Collecting status from 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 15:59:40,023 DEBUG: Preparing to collect status from 'dvc/data_v3'
2022-02-04 15:59:40,024 DEBUG: Collecting status from 'dvc/data_v3'
2022-02-04 15:59:58,015 DEBUG: Querying 34 hashes via object_exists
2022-02-04 16:00:17,481 DEBUG: Querying 1 hashes via object_exists
2022-02-04 16:00:18,674 DEBUG: Querying 1 hashes via object_exists
2022-02-04 16:00:26,237 DEBUG: Downloading 'dvc/data_v3/83/9e2292200419ac6115889aca5a5b46.dir' to 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache\83\9e2292200419ac6115889aca5a5b46.dir'
2022-02-04 16:00:26,962 DEBUG: state save (75716768735244371, 1643986826960344576, 2621934) 839e2292200419ac6115889aca5a5b46.dir
2022-02-04 16:00:27,176 DEBUG: Preparing to transfer data from 'dvc/data_v3' to 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 16:00:27,177 DEBUG: Preparing to collect status from 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 16:00:27,178 DEBUG: Collecting status from 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 16:00:27,180 DEBUG: Preparing to collect status from 'dvc/data_v3'
2022-02-04 16:00:27,180 DEBUG: Collecting status from 'dvc/data_v3'
2022-02-04 16:00:43,742 DEBUG: Querying 34 hashes via object_exists
2022-02-04 16:01:05,816 DEBUG: Querying 1 hashes via object_exists
2022-02-04 16:01:06,975 DEBUG: Querying 1 hashes via object_exists
2022-02-04 16:01:14,293 DEBUG: Downloading 'dvc/data_v3/06/8f7b2ef3f761734b051be5dd38bc51.dir' to 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache\06\8f7b2ef3f761734b051be5dd38bc51.dir'
2022-02-04 16:01:15,020 DEBUG: state save (3377699720678138, 1643986875017124864, 2621851) 068f7b2ef3f761734b051be5dd38bc51.dir
2022-02-04 16:01:15,237 DEBUG: Preparing to transfer data from 'dvc/data_v3' to 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 16:01:15,238 DEBUG: Preparing to collect status from 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 16:01:15,282 DEBUG: Collecting status from 'C:\Users\ruben\PycharmProjects\AIP\.dvc\cache'
2022-02-04 16:01:29,588 DEBUG: Preparing to collect status from 'dvc/data_v3'
2022-02-04 16:01:29,635 DEBUG: Collecting status from 'dvc/data_v3'
2022-02-04 16:01:46,008 DEBUG: Querying 34 hashes via object_exists
2022-02-04 16:02:04,962 DEBUG: Querying 2 hashes via object_exists
2022-02-04 16:02:05,807 DEBUG: Indexing new .dir '068f7b2ef3f761734b051be5dd38bc51.dir' with '29542' nested files
2022-02-04 16:02:08,209 DEBUG: Indexing new .dir '839e2292200419ac6115889aca5a5b46.dir' with '29543' nested files
2022-02-04 16:02:12,917 DEBUG: `_list_hashes()` returned max '122.0703125' hashes, skipping remaining results
2022-02-04 16:02:12,918 DEBUG: Estimated remote size: 503808 files
2022-02-04 16:02:12,919 DEBUG: Large remote ('4' hashes < '2519.04' traverse weight), using object_exists for remaining hashes
2022-02-04 16:02:12,920 DEBUG: Querying 4 hashes via object_exists
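
For what it's worth, the decision visible in the log above ("'4' hashes < '2519.04' traverse weight, using object_exists") can be sketched roughly like this. This is a hedged reconstruction from the logged numbers only, not DVC's actual source; the multiplier, constant name, and function name are all assumptions (0.005 is inferred from 2519.04 / 503808):

```python
# Rough sketch of the query-method heuristic implied by the DEBUG log:
# DVC lists one hash prefix, extrapolates the remote size from it, then
# compares the number of hashes to check against a "traverse weight".
TRAVERSE_WEIGHT_MULTIPLIER = 0.005  # assumed: 2519.04 / 503808 in the log

def choose_query_method(num_hashes: int, estimated_remote_size: int) -> str:
    """Pick per-object existence checks vs. listing (traversing) the remote."""
    traverse_weight = estimated_remote_size * TRAVERSE_WEIGHT_MULTIPLIER
    if num_hashes < traverse_weight:
        return "object_exists"  # few hashes left: check each one directly
    return "traverse"           # many hashes: list the whole remote once

# The values from the pull log: 4 remaining hashes, ~503808 estimated files
print(choose_query_method(4, 503808))  # -> object_exists
```

With this model, a push that has tens of thousands of new hashes would tip over the weight and fall back to traversing the whole remote, which matches the behaviour described in the comments above.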

@dtrifiro
Contributor

Hi @rubenpraets, what's the CPU count on the machine you're running push/pull from?

python -c "from multiprocessing import cpu_count; print(cpu_count())"    

@rubenpraets
Author

From my colleague's machine (the one having trouble) we get 8.
My machine (much less trouble): 12.

@daavoo daavoo added the A: data-sync Related to dvc get/fetch/import/pull/push label Feb 22, 2022
@rubenpraets
Author

Hi @dtrifiro @daavoo, is there any progress on this? Is anyone working on it? Do you maybe know of a possible workaround until the problem can be fixed for good?

@thunfischtoast

I'm working on a project that stores a very big DVC cache in S3. Sometimes the connection times out while "Querying remote cache", usually with an
ERROR: unexpected error - Unable to locate credentials

What might help is reducing the job count with the -j parameter or pushing one stage at a time, but that is of course no reliable answer, sorry.

@dtrifiro
Contributor

Hi @dtrifiro @daavoo, is there any progress on this? Is anyone working on it? Do you maybe know of a possible workaround until the problem can be fixed for good?

Hi @rubenpraets, sorry for getting back to you so late. I've had a look at the issue, and I think it might be related to the way that the azure python sdk (used by adlfs) uses aiohttp, but I see no easy way to test for a solution (I would try lowering the socket_timeout aiohttp kwarg and maybe adding enable_cleanup_closed=True as in #7460).

I found a related issue here (Azure/azure-sdk-for-python#17974), but it seems it has been solved. I'd try pip install -U azure-storage-blob adlfs to see if it fixes anything before taking it to Microsoft.
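
For reference, the two aiohttp-level knobs mentioned above look like this in isolation. This is a sketch of the aiohttp API only; how azure-core/adlfs would have to be patched to actually pass them through is not shown, and the timeout value is an arbitrary example:

```python
import aiohttp

# sock_read bounds how long a read on an already-open socket may stall
# before aiohttp raises, instead of waiting indefinitely on the server.
timeout = aiohttp.ClientTimeout(sock_read=30)  # seconds; example value

async def fetch_status(url: str) -> int:
    # enable_cleanup_closed forcibly closes transports whose SSL shutdown
    # never completed, which otherwise leak half-closed connections.
    connector = aiohttp.TCPConnector(enable_cleanup_closed=True)
    async with aiohttp.ClientSession(timeout=timeout, connector=connector) as session:
        async with session.get(url) as resp:
            return resp.status
```

The hypothesis would be that a bounded sock_read surfaces the server-side disconnect as a retryable error instead of the IncompleteReadError seen in the traceback.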

@rubenpraets
Author

@dtrifiro Thanks for the pointers, I will look into them shortly. Even though the error might be caused by underlying libraries, I do want to stress that the heuristic for choosing between querying via object_exists and traverse seems to be off as well, at least for azure. In this specific situation object_exists is many times faster than traverse (which is chosen here), and the slow traverse indirectly causes errors in the long-running connections. You might want to look into this as well; it would greatly improve the usability of dvc.

@efiop
Contributor

efiop commented Jan 1, 2023

closing as stale

@efiop efiop closed this as completed Jan 1, 2023