Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add: empty files add broken when cache mode is hardlinks #3390

Closed
shcheklein opened this issue Feb 24, 2020 · 2 comments · Fixed by #3428
Closed

add: empty files add broken when cache mode is hardlinks #3390

shcheklein opened this issue Feb 24, 2020 · 2 comments · Fixed by #3428
Assignees
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.

Comments

@shcheklein
Copy link
Member

DVC Version

DVC version: 0.86.5+f67314.mod
Python version: 3.7.6
Platform: Darwin-18.2.0-x86_64-i386-64bit
Binary: False
Package: None
Cache: reflink - supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

Reproduce

dvc config cache.type hardlink
touch foo
dvc add foo

outputs:

ERROR: no possible cache types left to try out.
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Feb 24, 2020
@dmpetrov dmpetrov added bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. and removed triage Needs to be triaged labels Feb 24, 2020
@skshetry skshetry self-assigned this Feb 26, 2020
@skshetry
Copy link
Member

skshetry commented Feb 26, 2020

Found the root cause of this. This happens because of the optimization that we do if the file is empty (i.e. 0 bytes size). We never create a hardlink for the file with 0 bytes size and therefore it fails when we try to verify if the hardlink was created.

dvc/dvc/remote/local.py

Lines 170 to 182 in a9bc65e

# If there are a lot of empty files (which happens a lot in datasets),
# and the cache type is `hardlink`, we might reach link limits and
# will get something like: `too many links error`
#
# This is because all those empty files will have the same checksum
# (i.e. 68b329da9893e34099c7d8ad5cb9c940), therefore, they will be
# linked to the same file in the cache.
#
# From https://en.wikipedia.org/wiki/Hard_link
# * ext4 limits the number of hard links on a file to 65,000
# * Windows with NTFS has a limit of 1024 hard links on a file
#
# That's why we simply create an empty file rather than a link.

I think, is_hardlink() function should be aware of this optimization as well. I'll make a fix tomorrow.

P.S. This only failed if the file was empty (not even a newline). So:

echo > foo && dvc add foo # works because of extra-newline character added by `echo`
touch bar && dvc add bar # does not work as the file is empty

Also, subsequent runs would not fail because before this error is thrown out, the file gets cached (however, deletes file from workspace), i.e.

dvc init --no-scm
dvc config cache.type hardlink
touch foo && dvc add foo  # fails and deletes foo from workspace
touch foo && dvc add foo  # passes

@efiop
Copy link
Contributor

efiop commented Feb 28, 2020

@skshetry Great investigation! 🙏

skshetry added a commit to skshetry/dvc that referenced this issue Mar 10, 2020
efiop pushed a commit that referenced this issue Mar 11, 2020
* add: do not verify hardlink if file is empty

Fixes #3390

* tests: spy

* fixup! tests: spy
@skshetry skshetry reopened this Mar 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants