add: performance and reliability issues #6227

skshetry · 2021-06-25T16:05:30Z

A repeated `dvc add` is not skipped.
```
$ dvc add data
$ dvc add data
```
In 1.X, the second call would have been skipped. DVC also still deletes the file and tries to restore it from the cache, which makes it slower.
DVC uses move-then-checkout logic: it moves the file from the workspace to the cache and then checks it out again, rather than just copying it.

This is slow and can result in data loss if it fails between the two operations.
DVC deletes the stage file before it even adds the files. If the `dvc add` operation fails, the existing pointer file, which is the only way to reach the data, is lost.
DVC resets the stages multiple times (only if multiple targets are provided) and forces stage recollection, which is slow.
Similarly, it resets the internal state of the repo after creating each stage, which also resets dulwich's ignore manager, making it horribly slow with many targets (or with `-R`).

Line 266 in 4e792ae

repo._reset() # pylint: disable=protected-access
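To illustrate why a reset after every stage hurts, here is a minimal sketch. `IgnoreCache` and `add_targets` are hypothetical stand-ins, not DVC or dulwich code; the only assumption is that the cached state is expensive to rebuild and is consulted once per target.

```python
class IgnoreCache:
    """Hypothetical stand-in for an expensive cache such as parsed ignore rules."""

    def __init__(self):
        self.builds = 0
        self._rules = None

    def rules(self):
        if self._rules is None:
            # Pretend this walks the repo and parses every ignore file.
            self.builds += 1
            self._rules = {"*.tmp"}
        return self._rules

    def reset(self):
        self._rules = None


def add_targets(targets, cache, reset_each):
    """Create one 'stage' per target and return how often the cache was rebuilt."""
    for _target in targets:
        cache.rules()  # every stage creation consults the ignore rules
        if reset_each:
            cache.reset()  # current behaviour: reset after each stage
    if not reset_each:
        cache.reset()  # alternative: a single reset once all stages exist
    return cache.builds


per_stage = add_targets(range(1000), IgnoreCache(), reset_each=True)
once = add_targets(range(1000), IgnoreCache(), reset_each=False)
print(per_stage, once)
```

With a reset per stage, the number of rebuilds grows linearly with the number of targets; with a single reset at the end, it stays at one.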

Because of the way we collect and cache stages, we could not collect them for `add` without first removing them from the workspace. Otherwise we would have two identical or similar stages: one collected from the workspace and one just created in memory by `dvc add`. This raised errors during graph checks, so we started deleting the stage files and resetting the stages (a very recent change, see iterative#2886 and iterative#3349).

By deleting the file before we do any checks, we make DVC fragile, and even simple user mistakes can result in data loss. This change should make `add` more reliable and robust.

We have also recently started keeping state for many things, so resetting on each stage wastes a lot of performance, especially on gitignores. We cache dulwich's IgnoreManager, and when it is reset too many times, we waste a lot of time collecting the ignores again (see iterative#6227).

It's hard to say how much this improves things, as it depends heavily on the number of gitignores in the repo (which can be assumed to be quite a few for a dvc repo) and the number of files being added (e.g. a large directory with `-R`). On a directory with 10,000 files (in a dataset-registry repo), creating stages with `dvc add -R` went from 64 files/sec to 1.1k files/sec.
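As a rough sketch of the "check first, replace last" idea, the pointer file can be left alone until the new content has passed validation and been fully written. `write_pointer_atomically` and the `validate` callback are hypothetical names for illustration, not DVC's actual implementation; `validate` stands in for the graph checks.

```python
import os
import tempfile


def write_pointer_atomically(path, new_text, validate):
    """Replace `path` with `new_text` only after `validate` passes.

    `validate` is a hypothetical callback that raises on failure.
    The existing file is untouched until the final os.replace call.
    """
    validate(new_text)  # run all checks before touching the workspace
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(new_text)
        os.replace(tmp, path)  # atomic when both names are on one filesystem
    except BaseException:
        # Clean up the temporary file; the old pointer file is still intact.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

If validation or the write fails, the old pointer file is still on disk, so the data stays reachable.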

* add: do not delete stage files before add
* add tests
* make the test more specific

pared · 2021-09-09T10:32:31Z

> DVC uses move-then-checkout logic. It moves the file from the workspace to the cache and then checks it out again, rather than just using copy.

Wasn't this intended to enforce the cache link type? I guess it would make sense in the case of copy, but what about the other link types?

skshetry · 2021-09-09T11:05:26Z

For other link types, what I suggested was to change the copy behaviour to a move + link that works atomically.
@efiop also suggested using hardlinks instead.
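One way to picture a copy-then-link flow with hardlinks is the sketch below. It is only an illustration under assumptions: `add_to_cache` and both paths are hypothetical, the workspace and cache are assumed to be on the same filesystem (hardlinks require that), and cleanup of temporary names on failure is left out for brevity.

```python
import os
import shutil
import tempfile


def add_to_cache(workspace_path, cache_path):
    """Copy into the cache, then swap the workspace file for a hardlink to it.

    At every step the data exists at `workspace_path`, `cache_path`, or both.
    """
    cache_dir = os.path.dirname(os.path.abspath(cache_path))
    os.makedirs(cache_dir, exist_ok=True)

    # 1. Copy to a temporary name inside the cache, then rename into place,
    #    so a partially written file never appears under the final cache name.
    fd, tmp_cache = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    shutil.copyfile(workspace_path, tmp_cache)
    os.replace(tmp_cache, cache_path)

    # 2. Create the hardlink under a temporary name next to the workspace
    #    file, then atomically replace the original with it.
    tmp_link = workspace_path + ".link.tmp"
    os.link(cache_path, tmp_link)
    os.replace(tmp_link, workspace_path)
```

Unlike move-then-checkout, a failure at any step here leaves the original workspace file in place.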

dberenbaum · 2022-03-17T18:36:36Z

@skshetry Do you think we should include this as part of the data epic?

skshetry · 2024-08-19T12:57:14Z

Closed by

and, released in https://github.com/iterative/dvc/releases/tag/3.54.0.

This was referenced Jun 28, 2021

add: do not delete stage files before add #6239

Merged

Common UI improvements #5392

Closed

pared added the performance (improvement over resource / time consuming tasks), ui (user interface / interaction), and enhancement (Enhances DVC) labels Sep 9, 2021

skshetry mentioned this issue Nov 15, 2021

add: incredibly slow #6977

Closed

daavoo added the `A: data-management` label (related to dvc add/checkout/commit/move/remove) Feb 22, 2022

skshetry mentioned this issue Jul 30, 2024

state/cache: implement get_many/set_many iterative/dvc-data#522

Merged

skshetry closed this as completed Aug 19, 2024
