Long strings dumped to dvc.lock contain extra space upon load #10668

Closed
@janpawlowskiof

Bug Report

repro: long strings dumped to dvc.lock contain extra space upon load

Edit:

In the PR, I replaced float("inf") with sys.maxsize, as suggested in #9397 (comment).

Description

The actual bug seems to originate in ruamel.yaml, but we should mitigate it here.

How it manifests:

I have this stage in dvc.yaml

  faulty_stage:
    cmd: >-
      echo imruneverytime > faulty.txt
    params:
      - fault_parameter_name
    outs:
      - faulty.txt

and this in params.yaml

fault_parameter_name: |
  This is a prompt.
  This is a prompt.
  This is a prompt.
  This is a prompt.
  This is a prompt.

Even though nothing has changed, dvc sees the stage as changed on every repro, then realizes the stage is cached and checks out its outputs from the cache.

# Running for the first time, expected

bash-5.2$ dvc repro -s faulty_stage
Running stage 'faulty_stage':
> echo imruneverytime > faulty.txt
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.

# Nothing has changed here, yet dvc tries to run the stage; it is skipped only because the previous run is cached

bash-5.2$ dvc repro -s faulty_stage
Stage 'faulty_stage' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
bash-5.2$ 

Why it occurs:

When long lines are dumped to dvc.lock, they get wrapped.
The issue happens (only sometimes, I think) when the wrap occurs right after a "\n" character.
When the YAML is then loaded, the string contains an additional space.

You can see this happen directly in ruamel.yaml:
Image

This can be worked around by adding the following line, which makes the string get dumped on a single line.

yaml.width = float("inf")
Image
You can see this fixes the bug.

How it can be solved:

This can probably be solved by adding this line

yaml.width = float("inf")

here

yaml = YAML()

This solves my issue, and while I didn't do any real testing of this, I don't think this should cause any issues elsewhere.

Output of dvc doctor:

DVC version: 3.58.0 (pip)
-------------------------
Platform: Python 3.11.9 on macOS-14.3-arm64-arm-64bit
Subprojects:
        dvc_data = 3.16.7
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.9
Supports:
        http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1)
Config:
        Global: /Users/jpawlowski/Library/Application Support/dvc
        System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/c7d0d0e3856141270a471060952e5a84

This will hopefully be a one-line fix, and I can add a PR for it today.

Best regards!

Labels: bug (Did we break something?), upstream (Issues which need to be resolved in an upstream dependency)