
'Errno 24 - Too many open files' on dvc push #2473

Closed
ChrisHowlin opened this issue Sep 6, 2019 · 35 comments · Fixed by #2866
Labels: bug, p1-important, research

ChrisHowlin commented Sep 6, 2019

Version information

  • DVC version: 0.58.1
  • Platform: MacOS 10.14.6
  • Method of installation: pip within a conda environment

Description

When pushing a directory of ~100 DVC-tracked files to S3, I observe an Errno 24 error from the dvc process.

It looks like dvc is trying to open more files than the OS allows. Checking the file handles for the dvc process I get:

$ lsof -p $DVC_PID | wc -l
412

Looking at the OS limits, a process is limited to 256 open files:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4256
virtual memory          (kbytes, -v) unlimited

A workaround for this is to increase the max open files per process to a larger number (say 4096) by running something like ulimit -n 4096, but I wonder if the ideal solution is for DVC to work within the OS-configured limits by default?

Edit: Updated wording of workaround

efiop commented Sep 6, 2019

Hi @ChrisHowlin !

Thank you for reporting this! 🙂

Could you post the full log, please?

It is 100 files, right? Not 100K? Just making sure I understand you correctly. If it is 100, we might be leaking fds 🙁 Mind also trying dvc version 0.56.0 to see if that works? We've made some changes to dvc since that version that I suspect might be the cause of the leaks.

Thanks,
Ruslan

efiop added labels bug, p0-critical on Sep 6, 2019
ChrisHowlin commented

I checked again, and the number of files being pushed was actually 79 (definitely not thousands 😄 )

I don't have the full logs from the dvc push to hand right now, but I was getting the error below repeatedly for various files. This seemed to start happening as soon as the first files finished uploading.

ERROR: failed to upload '.dvc/cache/2f/25387c98c599ab0de148f437b780ad' to 's3://<s3_path>/2f/25387c98c599ab0de148f437b780ad' - [Errno 24] Too many open files: '/<local_path>/.dvc/cache/2f/25387c98c599ab0de148f437b780ad'
Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!

I don't have access to the environment until Monday, but I might be able to reproduce over the weekend.

efiop commented Sep 9, 2019

Hi @ChrisHowlin ! Were you able to reproduce it again? 🙂

efiop commented Sep 10, 2019

Just for the record, we are doing tests with millions of files and are not reaching the Linux open-fd limit (which is 4K), so I don't think we are seriously leaking fds. But with 256 on Mac, 70 files, and 4*NCPU workers, we might indeed be pushing it a bit. Using a lower --jobs number should make this work, but we might want to take a look at our open fds just to see if there is anything we can close without harm. Would love to hear if that workaround actually works for you, @ChrisHowlin. 🙂

ChrisHowlin commented

I've got some time to have a look again today, and I'm going through the steps to try and reproduce with 0.56.0.

Some observations so far: this error appears right at the start of the push, before I see any progress on the upload bars. Also, tracking the lsof count for the dvc push process over time shows the number of open files decreasing.

I wonder if the problem is not an fd leak, but rather that too many threads are opening files and sockets to S3, beyond the limit the OS defines for the process (default 256)?

I will provide some more logs on the different versions when I have them.

ChrisHowlin commented

I have managed to reproduce on 0.56.0 and have also identified minimal-ish reproduction instructions using test data:

Repro instructions


# Pre-requisites
pip install dvc==0.56.0
ulimit -n 256  # Set open files per proc limit to 256, if not set already

# Create new repo
mkdir dvc_many_files
cd dvc_many_files
git init
dvc init
dvc remote add -d upstream s3://<S3 URL>
mkdir data

# Create 100 files of 100MB using random data
# Note, this is potentially bash and MacOS specific
for i in {1..100}; do dd if=/dev/urandom bs=100000000 count=1 of=data/file$i; done

dvc add -R data

# Error occurs here
dvc push

Repro log for dvc push (on 0.56.0)

~/dvc_many_files_test $ dvc push
+-------------------------------------------+
|                                           |
|     Update available 0.56.0 -> 0.59.2     |
|       Run pip install dvc --upgrade       |
|                                           |
+-------------------------------------------+

Preparing to upload data to 's3://<S3 bucket>/dvc_many_files_test'
Preparing to collect status from s3://<S3 bucket>/dvc_many_files_test
Collecting information from local cache...
[##############################] 100%

Collecting information from remote cache...
[##############################] 100%
[##############################] 100% Analysing status
[                              ] 2% data/file91
ERROR: failed to upload '.dvc/cache/94/f505e87c2ecf8800c18ef183e5fc1f' to 's3://<S3 bucket>/dvc_many_files_test/94/f505e87c2ecf8800c18ef183e5fc1f' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/94/f505e87c2ecf8800c18ef183e5fc1f'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 6% data/file53
ERROR: failed to upload '.dvc/cache/28/33a50139e861ababfaae379ee1dfc6' to 's3://<S3 bucket>/dvc_many_files_test/28/33a50139e861ababfaae379ee1dfc6' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/28/33a50139e861ababfaae379ee1dfc6'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 2% data/file90
[#                             ] 4% data/file47ERROR: failed to upload '.dvc/cache/36/6c860189e1c1bff2d8946c30568a8f' to 's3://<S3 bucket>/dvc_many_files_test/36/6c860189e1c1bff2d8946c30568a8f' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/36/6c860189e1c1bff2d8946c30568a8f'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 4% data/file91
ERROR: failed to upload '.dvc/cache/a2/f1468aeac80f884693b9b208ab01cd' to 's3://<S3 bucket>/dvc_many_files_test/a2/f1468aeac80f884693b9b208ab01cd' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/a2/f1468aeac80f884693b9b208ab01cd'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file83
ERROR: failed to upload '.dvc/cache/ca/13773b5d9feea4a81b9489c7a47a3d' to 's3://<S3 bucket>/dvc_many_files_test/ca/13773b5d9feea4a81b9489c7a47a3d' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/ca/13773b5d9feea4a81b9489c7a47a3d'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 4% data/file91
ERROR: failed to upload '.dvc/cache/e4/1dee6ea60377e6a84c26a830a67360' to 's3://<S3 bucket>/dvc_many_files_test/e4/1dee6ea60377e6a84c26a830a67360' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/e4/1dee6ea60377e6a84c26a830a67360'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 1% data/file84
ERROR: failed to upload '.dvc/cache/90/809bb6e9bcb883ac15de437293c989' to 's3://<S3 bucket>/dvc_many_files_test/90/809bb6e9bcb883ac15de437293c989' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/90/809bb6e9bcb883ac15de437293c989'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 5% data/file47
ERROR: failed to upload '.dvc/cache/52/09af8f9845b5f9c24d1db75d17d85a' to 's3://<S3 bucket>/dvc_many_files_test/52/09af8f9845b5f9c24d1db75d17d85a' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/52/09af8f9845b5f9c24d1db75d17d85a'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 5% data/file91
ERROR: failed to upload '.dvc/cache/03/5b32c997321d49f64f0783b66f1e4e' to 's3://<S3 bucket>/dvc_many_files_test/03/5b32c997321d49f64f0783b66f1e4e' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/03/5b32c997321d49f64f0783b66f1e4e'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 6% data/file52
ERROR: failed to upload '.dvc/cache/ab/898d5a907d0f7c44670d4fb3cf200c' to 's3://<S3 bucket>/dvc_many_files_test/ab/898d5a907d0f7c44670d4fb3cf200c' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/ab/898d5a907d0f7c44670d4fb3cf200c'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 2% data/file84
ERROR: failed to upload '.dvc/cache/bc/ad450a109af254c8be7790b19c5a69' to 's3://<S3 bucket>/dvc_many_files_test/bc/ad450a109af254c8be7790b19c5a69' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/bc/ad450a109af254c8be7790b19c5a69'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 0% data/file97
ERROR: failed to upload '.dvc/cache/3f/0d47fdc3cc700aea48e48a8ded4c5c' to 's3://<S3 bucket>/dvc_many_files_test/3f/0d47fdc3cc700aea48e48a8ded4c5c' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/3f/0d47fdc3cc700aea48e48a8ded4c5c'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 6% data/file52
ERROR: failed to upload '.dvc/cache/2e/b4611efe25405c858da0f62fab3c02' to 's3://<S3 bucket>/dvc_many_files_test/2e/b4611efe25405c858da0f62fab3c02' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/2e/b4611efe25405c858da0f62fab3c02'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 2% data/file50
ERROR: failed to upload '.dvc/cache/5d/6c3144597973b79ea06b368d4b4b24' to 's3://<S3 bucket>/dvc_many_files_test/5d/6c3144597973b79ea06b368d4b4b24' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/5d/6c3144597973b79ea06b368d4b4b24'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 5% data/file51
ERROR: failed to upload '.dvc/cache/4e/7f0799b7972ea0ef8f7edd5816acaa' to 's3://<S3 bucket>/dvc_many_files_test/4e/7f0799b7972ea0ef8f7edd5816acaa' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/4e/7f0799b7972ea0ef8f7edd5816acaa'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 0% data/file97
ERROR: failed to upload '.dvc/cache/ec/e09f9f079b6b52710490f8a125a431' to 's3://<S3 bucket>/dvc_many_files_test/ec/e09f9f079b6b52710490f8a125a431' - Could not connect to the endpoint URL: "https://<S3 bucket>.s3.eu-west-2.amazonaws.com/dvc_many_files_test/ec/e09f9f079b6b52710490f8a125a431?uploadId=zj5.u8YBvrFL7TXZI6ObT8BdosYfHSOPPyXRndYX0i2rUo.Uhy4jMs7D9E9QX6Hk68cZWl1il_OaZ5Fq11lpojAvsL1Ca.ql4epwIeOHEFrj.IP_IAwNT4BKot.uwQyu&partNumber=9"

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 5% data/file47
ERROR: failed to upload '.dvc/cache/a3/ad651500cc5d1bae786c1658eb06be' to 's3://<S3 bucket>/dvc_many_files_test/a3/ad651500cc5d1bae786c1658eb06be' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/a3/ad651500cc5d1bae786c1658eb06be'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 1% data/file97
ERROR: failed to upload '.dvc/cache/23/58acefc522fa7585a1fba6e7c35236' to 's3://<S3 bucket>/dvc_many_files_test/23/58acefc522fa7585a1fba6e7c35236' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/23/58acefc522fa7585a1fba6e7c35236'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 2% data/file86
ERROR: failed to upload '.dvc/cache/2a/1ed0ceca6128c1e55d28b363bded5c' to 's3://<S3 bucket>/dvc_many_files_test/2a/1ed0ceca6128c1e55d28b363bded5c' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/2a/1ed0ceca6128c1e55d28b363bded5c'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 2% data/file92
ERROR: failed to upload '.dvc/cache/f9/d8068ef642c1d61993151a94fa7523' to 's3://<S3 bucket>/dvc_many_files_test/f9/d8068ef642c1d61993151a94fa7523' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/f9/d8068ef642c1d61993151a94fa7523'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 2% data/file92
ERROR: failed to upload '.dvc/cache/d4/7a887106ef626f204fc693395a3c5e' to 's3://<S3 bucket>/dvc_many_files_test/d4/7a887106ef626f204fc693395a3c5e' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/d4/7a887106ef626f204fc693395a3c5e'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 3% data/file50
ERROR: failed to upload '.dvc/cache/1a/bd64ea4334fc8d441236d8a98dd681' to 's3://<S3 bucket>/dvc_many_files_test/1a/bd64ea4334fc8d441236d8a98dd681' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/1a/bd64ea4334fc8d441236d8a98dd681'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file19
ERROR: failed to upload '.dvc/cache/f4/1153ee84d4ce38833cded1f93537f9' to 's3://<S3 bucket>/dvc_many_files_test/f4/1153ee84d4ce38833cded1f93537f9' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/f4/1153ee84d4ce38833cded1f93537f9'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[##                            ] 7% data/file52
ERROR: failed to upload '.dvc/cache/92/6844cee8fbb66f66068a7d86ed770e' to 's3://<S3 bucket>/dvc_many_files_test/92/6844cee8fbb66f66068a7d86ed770e' - Could not connect to the endpoint URL: "https://<S3 bucket>.s3.eu-west-2.amazonaws.com/dvc_many_files_test/92/6844cee8fbb66f66068a7d86ed770e?uploadId=8pDWKXaDGMNhpfCGAl6mekLcuytP6DO15rTla8Gj8pYZQOB2WgVZ8eq7ht8XboEz6IDu8W6sFxqB4hiPRkaYk1XAgDaub.weKDcWcq4XUf84n0i43ap6pPjje057ID5.&partNumber=4"

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 6% data/file47
ERROR: failed to upload '.dvc/cache/7f/0c2f36eed59d8cea3a456368c6fecf' to 's3://<S3 bucket>/dvc_many_files_test/7f/0c2f36eed59d8cea3a456368c6fecf' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/7f/0c2f36eed59d8cea3a456368c6fecf'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[##                            ] 9% data/file51
ERROR: failed to upload '.dvc/cache/d4/3b4c5df21af285f054a611f4595dd2' to 's3://<S3 bucket>/dvc_many_files_test/d4/3b4c5df21af285f054a611f4595dd2' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/d4/3b4c5df21af285f054a611f4595dd2'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 11% data/file47
ERROR: failed to upload '.dvc/cache/50/9c097cf2184038c2f316c02f859ba5' to 's3://<S3 bucket>/dvc_many_files_test/50/9c097cf2184038c2f316c02f859ba5' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/50/9c097cf2184038c2f316c02f859ba5'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 10% data/file91
ERROR: failed to upload '.dvc/cache/f4/2e27370514dddd2d4d6025daa94cea' to 's3://<S3 bucket>/dvc_many_files_test/f4/2e27370514dddd2d4d6025daa94cea' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/f4/2e27370514dddd2d4d6025daa94cea'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file2
ERROR: failed to upload '.dvc/cache/5b/dfda2f9ecaa54927016cbd3d410620' to 's3://<S3 bucket>/dvc_many_files_test/5b/dfda2f9ecaa54927016cbd3d410620' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/5b/dfda2f9ecaa54927016cbd3d410620'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 2% data/file24
ERROR: failed to upload '.dvc/cache/1c/75b63f844cdde39ab9aeaae0a26425' to 's3://<S3 bucket>/dvc_many_files_test/1c/75b63f844cdde39ab9aeaae0a26425' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/1c/75b63f844cdde39ab9aeaae0a26425'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[##                            ] 8% data/file86
ERROR: failed to upload '.dvc/cache/38/bed7aec1fbba7bc22c11e8c870e93b' to 's3://<S3 bucket>/dvc_many_files_test/38/bed7aec1fbba7bc22c11e8c870e93b' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/38/bed7aec1fbba7bc22c11e8c870e93b'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 10% data/file97
ERROR: failed to upload '.dvc/cache/7e/d55d70e886096b4bff8660244adf4d' to 's3://<S3 bucket>/dvc_many_files_test/7e/d55d70e886096b4bff8660244adf4d' - Could not connect to the endpoint URL: "https://<S3 bucket>.s3.eu-west-2.amazonaws.com/dvc_many_files_test/7e/d55d70e886096b4bff8660244adf4d?uploadId=pd0i0Ja49sxE4J7kMTdAPN37ZSiXuhve6birpZmPDSGIAj4fwIsxlXBk3H4ocURh_4jlh_WjsA1FfoZZ_Hl3_wcOKsvRhrdsfWrdeiuFUhoM0wzIT4Q78Mn3fMj7_RZo&partNumber=6"

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 12% data/file91
ERROR: failed to upload '.dvc/cache/90/93e1dfad3c5f259eb85ec02a2b98e7' to 's3://<S3 bucket>/dvc_many_files_test/90/93e1dfad3c5f259eb85ec02a2b98e7' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/90/93e1dfad3c5f259eb85ec02a2b98e7'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 13% data/file53
[###                           ] 10% data/file97ERROR: failed to upload '.dvc/cache/13/1f00077ff1ebfe7a7203273d82490a' to 's3://<S3 bucket>/dvc_many_files_test/13/1f00077ff1ebfe7a7203273d82490a' - Could not connect to the endpoint URL: "https://<S3 bucket>.s3.eu-west-2.amazonaws.com/dvc_many_files_test/13/1f00077ff1ebfe7a7203273d82490a?uploadId=zR6vIoO0S5l11GnLhU0BLkuheJmP9zKyZhnJLB5Ej1.Tco7hExVUF1mhV3JdTGtP5G9zkkBLgMlNpXAy5qXkclwn7xrhZ9iG4Xpboer.d6GQqevCs5joLwSKVdRFjNkg&partNumber=3"

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[####                          ] 14% data/file53
ERROR: failed to upload '.dvc/cache/7c/ee4d6732d838387c3da0721d60c664' to 's3://<S3 bucket>/dvc_many_files_test/7c/ee4d6732d838387c3da0721d60c664' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/7c/ee4d6732d838387c3da0721d60c664'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 3% data/file26
ERROR: failed to upload '.dvc/cache/07/8b99f1a3ecb976100f5ff8d9759668' to 's3://<S3 bucket>/dvc_many_files_test/07/8b99f1a3ecb976100f5ff8d9759668' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/07/8b99f1a3ecb976100f5ff8d9759668'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[####                          ] 16% data/file85
ERROR: failed to upload '.dvc/cache/39/66cd187c1450f9a3f4b37f138368d4' to 's3://<S3 bucket>/dvc_many_files_test/39/66cd187c1450f9a3f4b37f138368d4' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/39/66cd187c1450f9a3f4b37f138368d4'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 4% data/file26
ERROR: failed to upload '.dvc/cache/e5/3884fa1cf4a6a25fb774634a24013d' to 's3://<S3 bucket>/dvc_many_files_test/e5/3884fa1cf4a6a25fb774634a24013d' - Could not connect to the endpoint URL: "https://<S3 bucket>.s3.eu-west-2.amazonaws.com/dvc_many_files_test/e5/3884fa1cf4a6a25fb774634a24013d?uploadId=Jc_bEutZbJB2JcnvI2gbP9k4VHgzSnWxFiWpQFqtZURBEmvZLYutM6Edtt.VPJN0nubp6KUtQ4HOl0NJ5qM2mzRl5TQFbgKbHSYgowzQStENYxT9PJHndwqAQQ8osjqE&partNumber=4"

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file36
ERROR: failed to upload '.dvc/cache/04/46e91089131693ff107a0ecfae7ba8' to 's3://<S3 bucket>/dvc_many_files_test/04/46e91089131693ff107a0ecfae7ba8' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/04/46e91089131693ff107a0ecfae7ba8'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#####                         ] 17% data/file85
ERROR: failed to upload '.dvc/cache/af/af499a3df76772e0b613e2f764ee58' to 's3://<S3 bucket>/dvc_many_files_test/af/af499a3df76772e0b613e2f764ee58' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/af/af499a3df76772e0b613e2f764ee58'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file4
ERROR: failed to upload '.dvc/cache/fc/3d51d2deb8fb8c720fda7442dbe9fe' to 's3://<S3 bucket>/dvc_many_files_test/fc/3d51d2deb8fb8c720fda7442dbe9fe' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/fc/3d51d2deb8fb8c720fda7442dbe9fe'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[####                          ] 14% data/file52
ERROR: failed to upload '.dvc/cache/f4/6366c801ac582633d0861a750a25be' to 's3://<S3 bucket>/dvc_many_files_test/f4/6366c801ac582633d0861a750a25be' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/f4/6366c801ac582633d0861a750a25be'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 13% data/file91
ERROR: failed to upload '.dvc/cache/bc/b0d5fd5cdded766d128fae9c32ff4c' to 's3://<S3 bucket>/dvc_many_files_test/bc/b0d5fd5cdded766d128fae9c32ff4c' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/bc/b0d5fd5cdded766d128fae9c32ff4c'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file34
ERROR: failed to upload '.dvc/cache/e0/ba85929d48d0a150f3f6d3b181ca12' to 's3://<S3 bucket>/dvc_many_files_test/e0/ba85929d48d0a150f3f6d3b181ca12' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/e0/ba85929d48d0a150f3f6d3b181ca12'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[####                          ] 15% data/file52
ERROR: failed to upload '.dvc/cache/f0/0d19e3294331d7254284fd755cdbc9' to 's3://<S3 bucket>/dvc_many_files_test/f0/0d19e3294331d7254284fd755cdbc9' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/f0/0d19e3294331d7254284fd755cdbc9'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[####                          ] 14% data/file51
ERROR: failed to upload '.dvc/cache/40/50627d79382c3abab8ac4436021aa8' to 's3://<S3 bucket>/dvc_many_files_test/40/50627d79382c3abab8ac4436021aa8' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/40/50627d79382c3abab8ac4436021aa8'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file7
ERROR: failed to upload '.dvc/cache/67/b94a63dd146a962feb6944ff775b18' to 's3://<S3 bucket>/dvc_many_files_test/67/b94a63dd146a962feb6944ff775b18' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/67/b94a63dd146a962feb6944ff775b18'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 13% data/file91
ERROR: failed to upload '.dvc/cache/51/a986ed06d5ea6eea398e48d9a7f2b5' to 's3://<S3 bucket>/dvc_many_files_test/51/a986ed06d5ea6eea398e48d9a7f2b5' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/51/a986ed06d5ea6eea398e48d9a7f2b5'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 10% data/file86
ERROR: failed to upload '.dvc/cache/88/d82ec856e5d0c1e2f8c3422f32e111' to 's3://<S3 bucket>/dvc_many_files_test/88/d82ec856e5d0c1e2f8c3422f32e111' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/88/d82ec856e5d0c1e2f8c3422f32e111'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 5% data/file26
ERROR: failed to upload '.dvc/cache/9e/0afa862975614304a5fefe07ec2f14' to 's3://<S3 bucket>/dvc_many_files_test/9e/0afa862975614304a5fefe07ec2f14' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/9e/0afa862975614304a5fefe07ec2f14'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 5% data/file26
ERROR: failed to upload '.dvc/cache/75/194fd4bbc5d039802bb801e297dbf4' to 's3://<S3 bucket>/dvc_many_files_test/75/194fd4bbc5d039802bb801e297dbf4' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/75/194fd4bbc5d039802bb801e297dbf4'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 11% data/file43
ERROR: failed to upload '.dvc/cache/a5/344804c085d536e8d104e41533c57d' to 's3://<S3 bucket>/dvc_many_files_test/a5/344804c085d536e8d104e41533c57d' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/a5/344804c085d536e8d104e41533c57d'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[####                          ] 14% data/file91
ERROR: failed to upload '.dvc/cache/7d/59046fed8e557a41ec8facf49e2d21' to 's3://<S3 bucket>/dvc_many_files_test/7d/59046fed8e557a41ec8facf49e2d21' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/7d/59046fed8e557a41ec8facf49e2d21'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 6% data/file26
ERROR: failed to upload '.dvc/cache/67/49e8b469a645f927569c00952da2d6' to 's3://<S3 bucket>/dvc_many_files_test/67/49e8b469a645f927569c00952da2d6' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/67/49e8b469a645f927569c00952da2d6'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 6% data/file26
ERROR: failed to upload '.dvc/cache/61/5e6683ddfbbf81102d63c6ba796269' to 's3://<S3 bucket>/dvc_many_files_test/61/5e6683ddfbbf81102d63c6ba796269' - Could not connect to the endpoint URL: "https://<S3 bucket>.s3.eu-west-2.amazonaws.com/dvc_many_files_test/61/5e6683ddfbbf81102d63c6ba796269?uploadId=Jra18INvz_FIo9h07t7I_f_WqBzK5qblysvkkp7wNHiBVeJ5f37MdFUd6XQVWPilIrzXemqwMm9f3EbDvW3MOJDjOt7AJCKU1oTuQH8gjOv4jzedU4u8U2FtyfpLrVB6&partNumber=8"

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 12% data/file43
ERROR: failed to upload '.dvc/cache/e2/8b8c31cad20b27fd8d786c26019dfe' to 's3://<S3 bucket>/dvc_many_files_test/e2/8b8c31cad20b27fd8d786c26019dfe' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/e2/8b8c31cad20b27fd8d786c26019dfe'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 6% data/file26
ERROR: failed to upload '.dvc/cache/31/2236cd7f963add79a4a42082b804de' to 's3://<S3 bucket>/dvc_many_files_test/31/2236cd7f963add79a4a42082b804de' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/31/2236cd7f963add79a4a42082b804de'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file15
ERROR: failed to upload '.dvc/cache/26/dada6fb3c142dd8caf65143a701091' to 's3://<S3 bucket>/dvc_many_files_test/26/dada6fb3c142dd8caf65143a701091' - Could not connect to the endpoint URL: "https://<S3 bucket>.s3.eu-west-2.amazonaws.com/dvc_many_files_test/26/dada6fb3c142dd8caf65143a701091?uploadId=Pr0PDy3MYAwcdK61oUGyR.yl8qKT7OSmOsnu.yGz6lCPU4vS3YZGqd0Sj6iII9aON8En2N7ACtO9OvSCGdkbVs_j1JJBGbIEVDzzy0bJrO1fzAo1juqJTMqv9FzTf3Sm&partNumber=10"

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!

[                              ] ?% data/file29ERROR: failed to upload '.dvc/cache/3c/bdb8ba8d22523ff1f9d80438fe805d' to 's3://<S3 bucket>/dvc_many_files_test/3c/bdb8ba8d22523ff1f9d80438fe805d' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/3c/bdb8ba8d22523ff1f9d80438fe805d'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file28
ERROR: failed to upload '.dvc/cache/64/1af74eb5d1c157c4ebd47423655c6e' to 's3://<S3 bucket>/dvc_many_files_test/64/1af74eb5d1c157c4ebd47423655c6e' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/64/1af74eb5d1c157c4ebd47423655c6e'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] 0% data/file9
ERROR: failed to upload '.dvc/cache/de/ec9b8e358242912c7fcb516146c764' to 's3://<S3 bucket>/dvc_many_files_test/de/ec9b8e358242912c7fcb516146c764' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/de/ec9b8e358242912c7fcb516146c764'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[                              ] ?% data/file98
ERROR: failed to upload '.dvc/cache/4c/055a0bac15aa0b28d6e8ff5b9bc537' to 's3://<S3 bucket>/dvc_many_files_test/4c/055a0bac15aa0b28d6e8ff5b9bc537' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/4c/055a0bac15aa0b28d6e8ff5b9bc537'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!

ERROR: failed to upload '.dvc/cache/c0/e9fd46cca1f0fe6dfd759196fc6fc9' to 's3://<S3 bucket>/dvc_many_files_test/c0/e9fd46cca1f0fe6dfd759196fc6fc9' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/c0/e9fd46cca1f0fe6dfd759196fc6fc9'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[###                           ] 10% data/file90
ERROR: failed to upload '.dvc/cache/83/e180e32a4a4d9664f5c2464fd0cf35' to 's3://<S3 bucket>/dvc_many_files_test/83/e180e32a4a4d9664f5c2464fd0cf35' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/83/e180e32a4a4d9664f5c2464fd0cf35'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[####                          ] 14% data/file91
ERROR: failed to upload '.dvc/cache/6e/c586afabe4e13f1adb12f168d7179c' to 's3://<S3 bucket>/dvc_many_files_test/6e/c586afabe4e13f1adb12f168d7179c' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/6e/c586afabe4e13f1adb12f168d7179c'

Having any troubles?. Hit us up at https://dvc.org/support, we are always happy to help!
[#                             ] 4% data/file9
ERROR: failed to upload '.dvc/cache/16/c254f07f818e0dc9c552729d436ba4' to 's3://<S3 bucket>/dvc_many_files_test/16/c254f07f818e0dc9c552729d436ba4' - [Errno 24] Too many open files: '/<user>/dvc_many_files_test/.dvc/cache/16/c254f07f818e0dc9c552729d436ba4'

[...]

~/dvc_many_files_test $

ChrisHowlin commented

Some observations:

  • When trying to reproduce this, it seems significant that the files are of a decent size (100 MB in this case)
  • Reproducing by creating files from /dev/zero instead of /dev/urandom doesn't seem to work (I suspect some compression or similar trick kicks in for such low-entropy files)
  • Some of the logs indicate errors uploading a part of a file (search for partNumber), suggesting multipart, multi-threaded upload.

efiop commented Sep 16, 2019

@ChrisHowlin Sorry for the late response. Thank you for the investigation! 🙂 So to summarize, while we are taking a closer look at this, a workaround for you would be to either increase ulimit or lower --jobs on push/pull.

For the record: I am able to reproduce with your script on my mac.

ChrisHowlin commented

I can confirm the above workarounds. To summarise (for anyone else experiencing this), I was able to work around the issue with either of these commands:

  • dvc push --jobs 8 - Reducing number of jobs to 2*CPUs (8 in my case)
  • ulimit -n 4096 - Increasing OS limit on open files per proc (run dvc push after setting this)

efiop added the p1-important label and removed the p0-critical label on Sep 23, 2019
ghost commented Oct 2, 2019

@efiop, are we leaking some file descriptors? Is this expected on macOS? (We don't have anything related to ulimit in our docs.)

efiop commented Oct 2, 2019

@MrOutis Probably; I haven't looked into this yet. Mac has a really low ulimit for open fds compared to an ordinary Linux distro, and that very often gets in the way, as I've noted above. I would take a look at this first and then look into modifying the docs.

pared commented Nov 13, 2019

Is this error still valid? Maybe there is a similar problem to the one we had in #2600?
Can I take this?

efiop commented Nov 13, 2019

@ChrisHowlin Could you try the newest version of dvc (0.68.1) and see if you are still able to reproduce this issue?

@pared Very similar, indeed! Sure, please go ahead!

pared self-assigned this on Nov 14, 2019
pared commented Nov 14, 2019

@efiop, @ChrisHowlin
I managed to reproduce it with the following script:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo
git init >> /dev/null && dvc init -q

mkdir data
for i in {1..1000}
do
	echo $i >> data/$i
done

dvc add data
dvc remote add -d rmt s3://prd-dvc-test/cache

ulimit -n 25
dvc push -v

Seems that the problem is similar to #2600; will take a closer look inside boto.

efiop commented Nov 15, 2019

@pared setting ulimit to 25 might be way too low in general; we need to be careful about that. What I did for #2600 was monitor the number of open fds from /proc/$PID/, and it was pretty apparent that some stuff was not released in time. Might want to look into caching the boto session as a quick experiment.
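
For anyone wanting to repeat that kind of monitoring, here is a minimal sketch (Linux-only, since it reads /proc; on macOS, lsof -p $PID | wc -l gives the same count):

# Minimal open-fd monitor, assuming a Linux /proc filesystem.
# Usage: python fd_monitor.py <PID of the running `dvc push`>
import os
import sys
import time

def count_fds(pid):
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    return len(os.listdir("/proc/{}/fd".format(pid)))

if __name__ == "__main__":
    pid = int(sys.argv[1])
    while True:
        print(count_fds(pid))
        time.sleep(1)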

pared commented Nov 15, 2019

Probably related:
boto/boto3#1402

pared commented Nov 20, 2019

Even with --jobs=1, the process can have up to 14 open file descriptors. Need to investigate whether it is our fault or boto's.

pared commented Nov 25, 2019

It's not a bug, it's a feature.
— Everyone, at some point in their life

So the thing is that we easily reach the open-descriptor limit on Mac. It goes unnoticed on Linux because Linux's default open-descriptor limit is 4 times bigger than the Mac's.

The reason we are reaching the limit is the default transfer configuration for s3. When we do not provide a Config here, boto3 will default to:

multipart_threshold=8 * MB,
max_concurrency=10,
multipart_chunksize=8 * MB,
num_download_attempts=5,
max_io_queue=100,
io_chunksize=256 * KB,
use_threads=True

(defined here)
s3 is not caching the transfer object in any way. Every time upload_file is called, a new transfer object is created, with a default limit of 10 threads per upload. So, effectively, when we run dvc push --jobs=X we indirectly allow s3 to create 10X threads. It's easy to hit the Mac's 256-file-descriptor limit with that.

I think we should introduce a default upload config in the remote/s3 class:
config = TransferConfig(use_threads=False).
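
For reference, a minimal sketch of what passing such a config to boto3's upload_file looks like (the bucket, key, and filename below are placeholders, not DVC's actual code):

# Sketch: disable boto3's internal thread pool for a single upload.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# With use_threads=False the multipart upload runs sequentially, so
# `dvc push --jobs=X` would open roughly X streams instead of 10X.
config = TransferConfig(use_threads=False)

s3.upload_file(
    Filename="data/file1",                        # placeholder
    Bucket="my-bucket",                           # placeholder
    Key="2f/25387c98c599ab0de148f437b780ad",      # placeholder
    Config=config,
)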

efiop commented Nov 25, 2019

@pared Great investigation!

I think, we should introduce default upload config in remote/s3 class:
config = TransferConfig(use_threads=False).

Wouldn't that slow down the uploading of big multipart files?

pared commented Nov 25, 2019

Also, one more note:
If the file-descriptor limit is too small, we can get another Too many open files error when boto is loading the JSON files defined inside botocore/data.

efiop commented Nov 25, 2019

Also, one more note:
If the file-descriptor limit is too small, we can get another Too many open files error when boto is loading the JSON files defined inside botocore/data.

Great point! But anything less than what Mac has is an unlikely scenario, and we shouldn't worry about it, as lots of other stuff will break anyway.

pared commented Nov 25, 2019

Wouldn't that slow down the uploading of big multipart files?

Reading through the boto docs:
It is not stated whether multipart upload depends on threads being on or off.

So it seems to me that it will certainly be slower, but only because we restrict the effective concurrency to X rather than 10X.

Do you think we should run some performance tests to check whether uploading a few big files is faster with the default s3 settings than with a single thread for each file?

efiop commented Nov 25, 2019

So it seems to me that it will certainly be slower, but only because we restrict the effective concurrency to X rather than 10X.

I also think so, and that would affect any dvc user, even on Linux, which is absolutely no bueno: it would slow down big-file uploads (and probably downloads) for everyone. So what we have here is a tradeoff between "large files" and "large number of files" scenarios, both of which are core scenarios for us. If choosing between lowering --jobs for the s3 remote and altering the aws config, I would choose the former, as it is more obvious to adjust. Maybe there are some better approaches here? 🙂

Do you think we should do some performance tests whether the upload of a few big files is faster when using default s3 settings than when creating single threads for each one?

I bet a ramen that it will be slower for large files 🙂

pared commented Nov 25, 2019

Ok so let's sum up what is going on:

Problem:

For different remotes, the jobs value does not necessarily correspond to the maximum number of open file descriptors. We need to connect those values somehow, to prevent the user from starting an upload they will be unable to finish.

Possible solutions

  1. Just limit the default jobs number for s3 (quick fix, won't solve the problem).
  2. Estimate the maximum number of file descriptors and, if necessary, adjust the number of jobs so that we do not exceed the estimated available number of file descriptors.

I think the proper solution is the second one, though there are a few things to consider:

  • Cache file descriptors are not the only thing open during upload (for example, with s3 the number of cache file descriptors corresponds to the number of open sockets). Also, I don't think we can use the entire available limit; we would probably need to leave some file descriptors for the process to utilize.
  • Also, in the case described above, the user's jobs might get restricted by us, which could be unpleasant for more advanced users who don't understand why we throttle a value they chose. Though a proper description in the documentation might solve this.

Also:
3. We could catch the error and make the user retry with a smaller number of jobs, though I believe this would be frustrating if one had to restart the push 5 times to tune the number of jobs. Also, handling of the error might differ between OSes (we know that on macOS and Linux it's OSError with code 24, but we would need to find out how to handle it on Windows).
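
For illustration, a rough sketch of what option 3 could look like on macOS/Linux (push_all is a hypothetical placeholder for the transfer loop):

# Sketch of option 3: catch EMFILE (Errno 24) and hint at remedies.
import errno
import logging

logger = logging.getLogger("dvc")

def push_all():
    # Hypothetical placeholder: simulate the failure seen in this issue.
    raise OSError(errno.EMFILE, "Too many open files")

try:
    push_all()
except OSError as exc:
    if exc.errno == errno.EMFILE:  # Errno 24 on macOS and Linux
        logger.error(
            "Too many open files: try a lower --jobs value or raise "
            "the limit with `ulimit -n`."
        )
    else:
        raise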

Notes

  • we could use resource.getrlimit(resource.RLIMIT_NOFILE)[0] to access the file-descriptor limit (see the sketch after this list)
  • For most remotes this is not a problem, though it would be good to check whether the other external packages (azure, gs, pyarrow, oss, paramiko) provide built-in optimization and parallelization for upload.
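
As referenced above, a minimal sketch of reading the fd limit and using it to cap a default jobs value (SANITY_FACTOR and default_jobs are illustrative assumptions, not existing DVC code):

# Sketch: derive a default jobs count that respects the soft fd limit.
import os
import resource

# Assumed constant: leaves descriptors free for sockets, config files,
# and whatever else the process needs.
SANITY_FACTOR = 4

def default_jobs():
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    cpu_based = 4 * (os.cpu_count() or 1)  # the current 4*NCPU heuristic
    return max(1, min(cpu_based, soft_limit // SANITY_FACTOR))

print(default_jobs())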

pared commented Nov 28, 2019

Related:
Suor/funcy#81

Suor commented Nov 28, 2019

We could catch the error and make the user retry with a smaller number of jobs,

We can adjust that (Remote.jobs, s3 max_concurrency, or both) automatically, without the user needing to do anything. We can even show a WARNING that the transfer is automatically throttled because of ulimit, and suggest a HINT.

BTW, there are similar scenarios with other resources like SSH/SFTP sessions. Right now we set conservative defaults, which makes dvc slower for many users without them even knowing it, which in turn makes the perception of dvc worse.

pared commented Nov 29, 2019

Ok, after a discussion with @efiop I would like to get back to the implementation idea for this one, as it seems we imagined it differently.

Here are some notes:

  1. Adjustment of jobs should happen only in the case of the default jobs number; we don't want to limit the user if they know what they are doing.
  2. If the user provides a huge number of jobs, we should probably warn them how it could break the push.
  3. In the case of Too many open files, we should catch it and log that reducing jobs or increasing ulimit might help.

How does it sound? @efiop, @Suor?

efiop commented Nov 29, 2019

@pared

  1. Adjustment of jobs should happen only in the case of the default jobs number; we don't want to limit the user if they know what they are doing.

Maybe we could simply set JOBS for S3 to be ulimit / some_sanity_factor instead of 4 * NCPU? Though, I'm pretty sure there might be some other side effects when there are too many jobs even if ulimit is sufficient... 🙁 So maybe 4 * NCPU, but no more than ulimit / some_sanity_factor?

  2. If the user provides a huge number of jobs, we should probably warn them how it could break the push.

But how do we know when it is too many? :) When we hit "Too many open files"? In that case, how is this different from point 3?

EDIT: ah, got it, you mean that it is closer to 1, so we would print the warning if we are over ulimit / sanity_factor.

  3. In the case of Too many open files, we should catch it and log that reducing jobs or increasing ulimit might help.

Sounds good to me 🙂

Suor commented Nov 30, 2019

I think we should decide whether we want dvc to be a little aggressive in order to be fast. If yes, then we should implement automatic degrading with a WARNING and a HINT, and set the defaults high. If not, we might catch the error, stop, and show a hint to either reduce jobs or increase ulimit.

I am in favor of being aggressive here. Reasons:

  • faster by default
  • utilizing as much of the available resources as possible

Suor commented Nov 30, 2019

In the future we might apply this approach to other situations. Network speed and stability might vary (e.g. a user connecting from different locations), available SSH/SFTP connections might vary (taken by another user), the number of available sockets might vary (opened by some other process), etc. We should adapt and work as fast as possible in the current situation; asking a user to reconfigure things each time is tiresome.

Going this way we might drop some config/cmdline options in the future, like --jobs.

efiop commented Nov 30, 2019

@Suor Great points! Though estimating the maximum tolerances might be tricky, as once we reach them, other stuff might break (maybe even in the handling code) :) The "error and hint" approach is simple to get right quickly. Maybe let's do the latter and keep the discussion going on the former?

efiop commented Nov 30, 2019

Not sure we will ever be able to get rid of --jobs: some people might be pushing from multiple repos and could hit system limits that way, so some might want to intentionally slow things down.

pared commented Dec 2, 2019

I think the "aggressive" approach is the only way to go. If we throw an error and a hint telling the user to change the number of jobs, I think that will result in many angry users getting a "fix number of jobs" error after a few minutes of successful upload. That is definitely not a friendly user experience.

pared commented Dec 2, 2019

Also, @Suor, we are talking about auto-degrading in the case of the default jobs number, right? Or do you think it should also be applied when --jobs is provided?

Suor commented Jan 1, 2020

@pared Ideally we won't have a --jobs option at all. We consume whatever resources we can, or use an aggressive default and then adjust on errors. We might provide hints if we can guess that something could be improved by the user, e.g. raising ulimit, but we would never break because of the limits.

A good example is dynamic chunk size in dvc.remote.gs:

  • user doesn't need to configure it
  • we start with an aggressive default (it used to be 100 MB, now 10 MB)
  • fall back to smaller values progressively
  • only fail if the smallest value still doesn't work
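
As an illustration, that fallback pattern looks roughly like this; do_upload and RetriableError are hypothetical placeholders, not the actual dvc.remote.gs names:

# Sketch of the progressive-fallback pattern described above.
class RetriableError(Exception):
    pass

MIN_CHUNK = 256 * 1024  # smallest chunk size we are willing to try

def upload_with_fallback(do_upload, chunk_size=10 * 1024 * 1024):
    while True:
        try:
            return do_upload(chunk_size)
        except RetriableError:
            if chunk_size <= MIN_CHUNK:
                raise  # only fail if the smallest value still doesn't work
            chunk_size //= 2  # degrade and retry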
