Description
Bug Report
repro: long strings dumped to dvc.lock contain extra space upon load
Edit:
In the PR, I replaced float("inf") with sys.maxsize, as suggested in #9397 (comment).
Description
The actual bug seems to originate in ruamel.yaml, but we should mitigate it here.
How it manifests:
I have this stage in dvc.yaml:
faulty_stage:
  cmd: >-
    echo imruneverytime > faulty.txt
  params:
    - fault_parameter_name
  outs:
    - faulty.txt
and this in params.yaml:
fault_parameter_name: |
  This is a prompt.
  This is a prompt.
  This is a prompt.
  This is a prompt.
  This is a prompt.
Even though nothing has changed, dvc sees the stage as changed, but then realizes the stage is cached and restores its outputs from the cache.
# Running for the first time, expected
bash-5.2$ dvc repro -s faulty_stage
Running stage 'faulty_stage':
> echo imruneverytime > faulty.txt
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
# Nothing changes here, yet it tries to run, but finally does not, because of the previous run being cached
bash-5.2$ dvc repro -s faulty_stage
Stage 'faulty_stage' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
bash-5.2$
Why it occurs:
When long lines are dumped to dvc.lock, they get wrapped.
The issue happens (only sometimes, I think) when the wrap occurs right after a "\n" character.
When the YAML is then loaded, the string contains an additional space.
You can see this happen directly in ruamel.yaml.
This can be worked around by adding this line, which makes the string get dumped on a single line:
yaml.width = float("inf")
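A minimal sketch for checking the round trip outside DVC, assuming the third-party ruamel.yaml package is installed. The helper name roundtrip and the key "param" are made up for illustration and are not DVC code. Whether the default width actually corrupts the value may depend on the installed ruamel.yaml version, so the script only prints what it observes.

```python
import io
import sys


def roundtrip(value, width=None):
    """Dump {"param": value} with ruamel.yaml, load it back, return the loaded value.

    `roundtrip` and the key "param" are illustrative names, not part of DVC.
    """
    from ruamel.yaml import YAML  # third-party package, assumed to be installed

    yaml = YAML()
    yaml.default_flow_style = False
    if width is not None:
        yaml.width = width  # column at which ruamel.yaml wraps long scalars

    buf = io.StringIO()
    yaml.dump({"param": value}, buf)
    return str(yaml.load(buf.getvalue())["param"])


# The value that the `|` block in params.yaml should produce.
PROMPT = "This is a prompt.\n" * 5

if __name__ == "__main__":
    # None keeps ruamel.yaml's default width; sys.maxsize effectively disables wrapping.
    for width in (None, sys.maxsize):
        same = roundtrip(PROMPT, width) == PROMPT
        print(f"width={width!r}: round trip preserved the value: {same}")
```

If the first line reports False and the second reports True, the wrapping is what changes the value, which matches the behaviour described above.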

How it can be solved:
This can probably be solved by adding this line
yaml.width = float("inf")
here: dvc/dvc/utils/serialize/_yaml.py, line 48 in 7d14acb
This solves my issue, and while I didn't do any real testing of this, I don't think this should cause any issues elsewhere.
Output of dvc doctor:
DVC version: 3.58.0 (pip)
-------------------------
Platform: Python 3.11.9 on macOS-14.3-arm64-arm-64bit
Subprojects:
    dvc_data = 3.16.7
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.40.2
    scmrepo = 3.3.9
Supports:
    http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
    https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1)
Config:
    Global: /Users/jpawlowski/Library/Application Support/dvc
    System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/c7d0d0e3856141270a471060952e5a84
This will hopefully be a one line fix, and I can add a PR for it today.
Best regards!