Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp run: --set-params forces YAML 1.2 (breaks code that uses 1.1) #5971

Closed
flippedcoder opened this issue May 5, 2021 · 36 comments
Closed
Labels
A: experiments Related to dvc exp discussion requires active participation to reach a conclusion enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint product: VSCode Integration with VSCode extension research

Comments

@flippedcoder
Copy link

flippedcoder commented May 5, 2021

Bug Report

dvc exp run --set-param lr=0.000001

Here's the error:

Traceback (most recent call last):
  File "/Users/milecia/Repos/checkpoints-tutorial/train.py", line 123, in <module>
    main()
  File "/Users/milecia/Repos/checkpoints-tutorial/train.py", line 110, in main
    train(model, x_batch, y_batch, params["lr"], params["weight_decay"])
  File "/Users/milecia/Repos/checkpoints-tutorial/train.py", line 45, in train
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
  File "/Users/milecia/Repos/checkpoints-tutorial/.venv/lib/python3.9/site-packages/torch/optim/adam.py", line 36, in __init__
    if not 0.0 <= lr:
TypeError: '<=' not supported between instances of 'float' and 'str'

Description

The problem is that the yaml library only supports yaml 1.1 when DVC uses yaml1.2. This issue seems to happen when exponents are generated.

The data science community and the Python ecosystem in general is stuck in YAML1.1. We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).

Reproduce

  1. Go to the checkpoints tutorial repo
  2. Replace from ruamel.yaml import YAML with import yaml
  3. Replace these 2 lines yaml=YAML(typ='safe') params = yaml.load(f) with params = yaml.safe_load(f)
  4. dvc exp run --set-param lr=0.000001
  5. The lr in params.yaml should be updated to an exponent and you get the error above

Expected

Running dvc exp run --set-param lr=0.000001 should start running the experiment with the new lr

Environment information

Output of dvc doctor:

DVC version: 2.1.0 (brew)
---------------------------------
Platform: Python 3.9.4 on macOS-11.3-x86_64-i386-64bit
Supports: azure, gdrive, gs, http, https, s3, ssh, oss, webdav, webdavs
@pmrowla
Copy link
Contributor

pmrowla commented May 5, 2021

This is by design and is documented: https://dvc.org/doc/command-reference/metrics#supported-file-formats (yaml 1.2 is also specified for plots and params)

See also #4281 and the associated PR


Regarding the checkpoints tutorial, we should really be using ruamel and not PyYAML in any of the iterative example repos for this specific reason.

@dberenbaum
Copy link
Collaborator

dberenbaum commented May 5, 2021

@skshetry I know you suggested this issue. Was there something you had in mind?

Edit: I missed your response in Slack, which I'm pasting below.

@dave Berenbaum, it's more of a compatibility issue. The data science community and the python ecosystem in general is stuck in YAML1.1. We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).

@dberenbaum
Copy link
Collaborator

Also related to #5777. Both issues seem to stem from dvc parsing the provided values instead of replacing them as is.

@dberenbaum
Copy link
Collaborator

The data science community and the Python ecosystem in general is stuck in YAML1.1.

Agreed, and it's not obvious when you get the error that it's a YAML 1.1/1.2 issue. We can change the tutorials, but it doesn't help users who run into the same error.

@dmpetrov
Copy link
Member

dmpetrov commented May 5, 2021

We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides.

That seems like a viable solution.

We can change the tutorials, but it doesn't help users who run into the same error.

Yes. The tutorials can help mitigate the issue but it isa not a solution. A good portion of users will get the error even the ones who do not use Python at all.

@jorgeorpinel jorgeorpinel changed the title DVC is not compatible with yaml 1.1 exp run: --set-params is not compatible with yaml 1.1 May 5, 2021
@jorgeorpinel jorgeorpinel added the bug Did we break something? label May 5, 2021
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented May 5, 2021

DVC is not compatible with yaml 1.1

This is by design and is documented

Agree. Maybe the problem (or part of) begins with the crash and tracelog, which seems to indicate that the error happened in the user code. I also wonder what it would look like if the user code wasn't Python...

Could we start by catching it somehow and presenting a meaningful error message like ERROR: Please update your 'param.yam' file to YAML 1.2 ? To my mind that would remove the "bug" part from this issue.

@pmrowla
Copy link
Contributor

pmrowla commented May 6, 2021

The error here is happening in the (example) user code - the params.yaml is already YAML 1.2, the problem is that the user tries to open it with PyYAML (which only supports 1.1).

@skshetry skshetry added enhancement Enhances DVC p1-important Important, aka current backlog of things to do research and removed bug Did we break something? labels May 6, 2021
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented May 6, 2021

I see. I thought this only happened when the user code is run via exp run (not manually, for example).

Yeah nvmd I got it now: exp run -P makes the params file YAML 1.2, which causes the cmd to fail.

@jorgeorpinel jorgeorpinel changed the title exp run: --set-params is not compatible with yaml 1.1 exp run: --set-params forces YAML 1.2 (breaks code that uses 1.1) May 6, 2021
@dberenbaum
Copy link
Collaborator

Since 0.000001 is valid in both YAML 1.1 and 1.2, how is a user supposed to know that DVC will convert this to 1e-06? I don't think it's clear how the values passed to -S get replaced. What about non-YAML params files where it's not obvious that the values in -S are first parsed as YAML? Are the specs for all of the supported formats so similar that this will never be an issue?

@pmrowla
Copy link
Contributor

pmrowla commented May 7, 2021

Other than the CLI value for -S being parsed as YAML, this issue doesn't affect non-YAML params files. They are written back into the params file using libraries specific to each format, since each format has its own specification for how a floating point number should be serialized.

The point is that however ruamel decides to serialize 0.000001 (in decimal or exponential form) it should not matter since they are both equivalent and valid YAML 1.2. If the user wants to use YAML params files, they should be processing them as YAML 1.2 in their own code.

edit: regarding parsing scalar values passed to -S according to YAML 1.2 rules using ruamel, this should not cause any issues with users writing values "intended" to be JSON scalars. YAML 1.2 specifically exists to bring it in line with JSON (JSON is a formal subset of YAML 1.2). Parsing them as YAML 1.1 would cause issues, as 1e-06 is a float in JSON (and YAML 1.2), but is a string in YAML 1.1

for TOML, they only implement a partial set of the JSON rules for parsing scientific notation. For any users that are actually using TOML params, they will likely be entering floats as decimals in the first place (because of TOML's very limited support for scientific notation), in which case we will still handle them properly.

@pmrowla
Copy link
Contributor

pmrowla commented May 7, 2021

Also, regarding:

We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).

I don't think having a modified YAML serialization library specific to --set-param is the answer. It does not make sense for DVC to be mixing different YAML versions between params files modified with --set-param, dvc.lock and dvc.yaml

@jorgeorpinel
Copy link
Contributor

If the user wants to use YAML params files, they should be processing them as YAML 1.2 in their own code

But I think that is the problem. We are expecting/forcing users to do that.

@pmrowla
Copy link
Contributor

pmrowla commented May 7, 2021

I don't see why that is a problem, or how it is any different from how we also force users to format the data within their JSON/Python/TOML params and metrics files in a specific way.

e:
If the user's existing params or metrics data is structured with a list as the top-most node, we force them to change it, as DVC params/metrics only work if the top-most node is a key:value dictionary. IMO requiring a specific data format (YAML 1.2) is just as reasonable as requiring that the data be structured in a specific way. Either way, we are forcing the user to make changes to their existing code for handling their params/metrics.

@dberenbaum
Copy link
Collaborator

Also, regarding:

We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).

I don't think having a modified YAML serialization library specific to --set-param is the answer. It does not make sense for DVC to be mixing different YAML versions between params files modified with --set-param, dvc.lock and dvc.yaml

I don't think the suggestion requires mixing YAML versions since the proposed format is valid YAML 1.2, nor does it have to only apply to --set-param. Is there some spec or standard about the preferred format for floats in YAML 1.2? See https://yaml.org/spec/1.2/spec.html#id2804092, which details the regex match for floats but doesn't recommend any format. If not, it seems like this is just a choice made by ruamel.yaml (which as far as I can tell just uses the Python repr/str format), which isn't apparent to users.

I know it's messy to have DVC define the format (maybe we can raise the idea of a backwards-compatible style with the ruamel.yaml maintainers if you feel that strongly), but I don't think we can ignore the usability concerns strictly out of principle. I can't even find a YAML 1.2 parser for R, for example. Open to other solutions, but I don't think doing nothing is an option.

@pmrowla
Copy link
Contributor

pmrowla commented May 8, 2021

So are we going to document that params and metrics files should be YAML 1.1, but dvc.lock and dvc.yaml should be YAML 1.2? This seems like a weird inconsistency for DVC to have.

Also, what do we do if a user does dvc run --set-param 1e-06? If we are going off of YAML 1.1 rules, this is explicitly a string. If we then go and serialize this as a string in a JSON file, it will break the user's code.

@jorgeorpinel jorgeorpinel added the discussion requires active participation to reach a conclusion label May 8, 2021
@jorgeorpinel
Copy link
Contributor

I don't see why that is a problem

@pmrowla this was deemed problematic by leadership AFAIR. I would agree that it's not great UX, especially if YAML 1.1 is still dominant (e.g. yaml Python module). And it's very hard to understand what went wrong in these situations.

we also force users to format the data within their JSON/Python/TOML params and metrics files

Not sure that's formatting if you're referring to data structure / schema.

requiring a specific data format is just as reasonable as requiring that the data be structured in a specific way

Schema reqs are separate from the format version, I think. The problem here is that exp run -S updates a dependency (in a way), and the project may not be able to support it.

But good Q: can the same issue happen with other formats? e.g. you use a py2 params file and exp run -P turns it into py3. If so maybe this issue it a bit broader than just YAML...

document that params and metrics files should can be YAML 1.1, but dvc.yaml should be YAML 1.2? This seems like a weird inconsistency

🤺 good point, that does seem a little inconsistent. However I don't think dvc.yaml (a DVC-specific schema) and params/metrics files (glorified INI files) or plots files (data maps) are in the same category. We could simply not specify a version for all the latter (we currently don't for plots BTW).

what do we do if a user does dvc run --set-param 1e-06

🤺🤺🤺

@jorgeorpinel
Copy link
Contributor

And it's very hard to understand what went wrong in these situations.

Is there at least something we can do about this ☝️ ?

@pmrowla
Copy link
Contributor

pmrowla commented May 8, 2021

🤺 good point, that does seem a little inconsistent. However I don't think dvc.yaml (a DVC-specific schema) and params/metrics files (glorified INI files) or plots files (data maps) are in the same category. We could simply not specify a version for all the latter (we currently don't for plots BTW).

It does matter for params/metrics/plots though, since the YAML version affects whether or not some values are treated as floats or strings. We show (numeric) deltas in diffs, and we plot numeric values. Whether the parsed type of 1e-06 is a supposed to be string or a float affects how it can (or cannot) be diffed and plotted.

The type also affects DVC pipeline behavior. If a DVC-tracked YAML 1.2 param dependency is written to dvc.lock as the floating point value 0.000001, changing it to 1e-06 in params.yaml would not trigger the stage to be run during dvc repro (since the param values are numerically equivalent). If we are handling params as YAML 1.1, then changing it to 1e-06 in params.yaml should trigger the stage to be rerun, since the floating point number 0.000001 is not equal to the string "1e-06"


Is there at least something we can do about this ☝️ ?

There is no way for us to tell what YAML version a user intended unless they explicitly use the optional
%YAML 1.1 or %YAML 1.2 identifier at the top of their files (which no one ever uses, and almost no YAML serialization library writes by default).

@pmrowla
Copy link
Contributor

pmrowla commented May 8, 2021

Just to be clear, I'm not a data scientist, and at the end of the day it doesn't really matter to me whether we use YAML 1.1 or 1.2. The main issue here is that we need to pick one and explicitly tell people that's what they need to use. Trying to halfway support both 1.1 and 1.2 should not be an option for the reasons I've already outlined.

I do think it makes more sense for us to use 1.2 since 1.2's parsing rules are consistent with JSON's, and removes any ambiguity about how exp run --set-param values should be parsed. (And we already had a similar 1.1 vs 1.2 discussion 6 months ago and decided to go with 1.2, and documented that decision)

@dberenbaum
Copy link
Collaborator

We could try to be compatible by generating exponents with the dot at least, that way we could be compatible on both sides. I am not sure how straightforward the fix this will be though. (and this should only affect --set-param, not the dvc.lock and yaml files).

@pmrowla This suggestion still uses YAML 1.2. I don't think anyone has suggested that DVC use YAML 1.1, just that users may be doing so.

The only issues I can find with this suggestion are:

  1. It means DVC has to include some minor manual formatting. Again, I'm open to raising this with the ruamel.yaml project since it doesn't look like there's any particular justification for their current format, and I don't see much value in being intentionally backwards-incompatible when equally viable backwards-compatible solutions exist.
  2. It creates a difference between --set-param and other files like dvc.lock. Again, I'm open to discussing whether it should apply everywhere.

Let me know if you see other issues or have other suggestions.

@pmrowla
Copy link
Contributor

pmrowla commented May 10, 2021

@dberenbaum it still doesn't address the issue of how DVC should parse exponents that a user enters without the dot on the command line via --set-param or in YAML params/metrics files (when diffing/plotting)?

If a user that is using YAML 1.1 enters --set-param foo=1e-06 we will parse it as a YAML 1.2 float. On serialization (with the proposed change), we will we change it to foo: 1.0e-06 to keep backwards compatibility with YAML 1.1's float syntax. Except that what we really just did was break the user's YAML 1.1 file by changing an explicit string to an explicit floating point number


I don't think anyone has suggested that DVC use YAML 1.1, just that users may be doing so.

This is the actual problem (that the expectation is it's ok for DVC to use 1.2 while users are on 1.1). Setting aside the floating point edge case discussion, if DVC is not using the same YAML version as the user, and tries to do "backwards compatible YAML 1.2", it introduces ambiguity and potentially unexpected behavior with regard to pipeline reproduction.

If a user is mixing YAML 1.1 boolean strings in their params files it will break DVC behavior. In YAML 1.1, yes/true and no/false are supposed to be equivalent when DVC evaluates whether or not a param dependency has changed (and a stage needs to be rerun). If DVC just parses everything as YAML 1.2, yes and true are no longer equivalent, since yes is now an explicit string and not an explicit boolean.

Even if the user is not mixing yes/true usage, if they forget to use consistent capitalization it will still be an issue, since by YAML 1.1 rules, Yes and yes are equivalent (both evaluate to boolean True), but by YAML 1.2 rules they are not (both are explicit non-equal strings)


Basically, yes, modifying DVC's YAML serializer to include the dot when writing floats in scientific notation will close this particular ticket. But this is not an actual solution to the underlying problem, which is that YAML 1.1 and 1.2 are two different things (that differ in more ways than just "serializing scientific notation") and that if we try to support both at the same time, there will be more tickets just like this one that get raised in the future.

I can't even find a YAML 1.2 parser for R, for example.

But R does have more than one JSON parser, which is conveniently another supported params/metrics format in DVC, and does not have any of the version/inconsistency issues you get with YAML.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented May 10, 2021

On the Q of scientific notation, do we actually expect users to need that? If not and if not allowing that would reduce the UX problems, I think we should consider it.

@pmrowla
Copy link
Contributor

pmrowla commented May 12, 2021

I'd argue that it's difficult to degug in the sense that understanding why your code which previously worked now has some bug after using dvc exp. It looks as it DVC broke it.

Again I would like to stress here that the user's code "worked" before using dvc exp only in the sense that they would successfully parse their params file using a YAML 1.1 parser. However, their "working" code was already broken/incompatible with DVC in the sense that parsing things as YAML 1.1 will cause the unexpected pipeline behavior explained in #5971 (comment)


there are already a couple of other issues where parsing causes other issues (#5777 and #5868).

I wouldn't consider either of those related to this issue


  1. Detect the version of the YAML file

This is essentially impossible

@iesahin
Copy link

iesahin commented May 12, 2021

I think one part of the problem is the difference between YAML 1.1 and YAML 1.2 looks minuscule from the user POV.

It should have been called YAML 2 if it's backward incompatible. I feel bad when technically half-baked formats become so popular. Markdown is another such case.

Is there a way to support both YAML 1.1 and 1.2, in all .yaml files, including dvc.yaml and dvc.lock? What are the edge cases?

Supposing, hypothetically, we create another markup language compatible with both versions, what are the ambiguous cases to decide? Can we discuss and document them one-by-one? Is it feasible?

@dberenbaum
Copy link
Collaborator

dberenbaum commented May 12, 2021

They are not specific to --set-param, but I thought they were about parsing. I thought that all of these issues occur because DVC parses the provided values, potentially changing their representation in the process, rather than taking the values as string literals. From what I read:

Am I misunderstanding?

There's nothing inherently wrong about the current parsing implementation, and it has value (like sorting experiments and deduplicating params values with different formatting). I just wanted to clarify how I see these issues as related, and that it is nonobvious to users that DVC doesn't replace the text as is, which can lead to unexpected results.

@casperdcl
Copy link
Contributor

I feel bad when technically half-baked formats become so popular.

being half-baked is a feature not a bug :) Same reason why Esperanto probably won't manage to replace English as lingua franca

@iesahin
Copy link

iesahin commented May 12, 2021

being half-baked is a feature not a bug :) Same reason why Esperanto probably won't manage to replace English as lingua franca

Then there is no reason to perpetuate this feature in DVC maybe :) Natural languages are inherently half-baked, because our understanding of the world we live in is half-baked, but let's not dive into this discussion here :)

If it's possible to cover differences between 1.1 and 1.2 in a sensible manner within a feasible update, it may be better not to leave some users in buggy cold hands of YAML 1.1. Otherwise, let them learn to upgrade their code. I just would like to know how much work would be needed to cover the edge cases before making my mind.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented May 13, 2021

To summarize the problem is the perceived UX. It can feel as if DVC is breaking your code and it's unclear what actually happened from the traceback. I also agree not to support YAML 1.1

But if there's no way to prevent or catch these situations in order to present a clear warning or error message, then the only option left is to be very emphatic in all the related docs *and* in the command help outputs that we hate DVC is not compatible with YAML 1.1

@dmpetrov
Copy link
Member

Create a hack that will make dvc write 1.1 compatible numbers for that edge case. We could invest some time into it to research further, but so far:

@efiop how hard it would be to implement this hack?

PS: I'm wondering about this issue since in some projects the transition to yaml 2.0 might be not as straightforward as in our simple code examples. Also, some of the arguments above assume the user's code is written in Python which might be also not correct in some cases (OpenCV library uses yaml 1.0, not even 1.1.).

@skshetry skshetry added the A: experiments Related to dvc exp label May 27, 2021
@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Aug 24, 2021
@mattseddon mattseddon added the product: VSCode Integration with VSCode extension label Dec 15, 2021
bobertlo added a commit to bobertlo/dvc.org that referenced this issue May 21, 2022
See iterative/dvc#5971

adds note to parameter values section metioning scientific notation for SEO
use ruamel.yaml in example instead of PyYAML and add note
bobertlo added a commit to bobertlo/dvc.org that referenced this issue May 24, 2022
See iterative/dvc#5971

use ruamel.yaml in examples instead of PyYAML and add warnings

add note to parameter values section metioning scientific notation for SEO
daavoo pushed a commit to iterative/dvc.org that referenced this issue May 24, 2022
…3579)

See iterative/dvc#5971

use ruamel.yaml in examples instead of PyYAML and add warnings

add note to parameter values section metioning scientific notation for SEO
@dberenbaum
Copy link
Collaborator

Another occurrence of this issue and a nice summary of the problem:

OK, I think I understand now. The conclusion is that the params.yaml you produce requires a YAML 1.2 compatible reader. And PyYAML is not, which is what I use in my code currently. Switching to ruaml.yaml will fix it. Got it :-)

Originally posted by @aschuh-hf in #8466 (comment)

@dberenbaum dberenbaum added p3-nice-to-have It should be done this or next sprint and removed p2-medium Medium priority, should be done, but less important labels Feb 6, 2023
@mattseddon mattseddon closed this as not planned Won't fix, can't repro, duplicate, stale Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp discussion requires active participation to reach a conclusion enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint product: VSCode Integration with VSCode extension research
Projects
None yet
Development

No branches or pull requests