exp run: --set-params forces YAML 1.2 (breaks code that uses 1.1) #5971
Comments
This is by design and is documented: https://dvc.org/doc/command-reference/metrics#supported-file-formats (YAML 1.2 is also specified for plots and params). See also #4281 and the associated PR. Regarding the checkpoints tutorial, we should really be using ruamel and not PyYAML in any of the iterative example repos for this specific reason. |
@skshetry I know you suggested this issue. Was there something you had in mind? Edit: I missed your response in Slack, which I'm pasting below.
|
Also related to #5777. Both issues seem to stem from dvc parsing the provided values instead of replacing them as is. |
Agreed, and it's not obvious when you get the error that it's a YAML 1.1/1.2 issue. We can change the tutorials, but it doesn't help users who run into the same error. |
That seems like a viable solution.
Yes. The tutorials can help mitigate the issue, but that is not a solution. A good portion of users will get the error, even those who do not use Python at all. |
exp run: --set-params is not compatible with yaml 1.1
Agree. Maybe the problem (or part of) begins with the crash and tracelog, which seems to indicate that the error happened in the user code. I also wonder what it would look like if the user code wasn't Python... Could we start by catching it somehow and presenting a meaningful error message like |
The error here is happening in the (example) user code - the params.yaml is already YAML 1.2, the problem is that the user tries to open it with PyYAML (which only supports 1.1). |
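For readers landing here, a minimal sketch of why the same scalar ends up with different types in the two YAML versions. It uses only the standard library. The two patterns are simplified versions of PyYAML's YAML 1.1 float resolver and the YAML 1.2 core schema float rule, and the helper name float_support is made up for illustration.

```python
import re

# Simplified YAML 1.1 float rule, modeled on PyYAML's implicit resolver:
# the mantissa must contain a dot and the exponent must carry a sign.
YAML11_FLOAT = re.compile(
    r"^[-+]?(?:[0-9][0-9_]*\.[0-9_]*|\.[0-9][0-9_]*)(?:[eE][-+][0-9]+)?$"
)

# Float rule of the YAML 1.2 core schema: dot and exponent sign are optional.
# (Plain integers also match; the core schema tries its int rule first.)
YAML12_FLOAT = re.compile(
    r"^[-+]?(?:\.[0-9]+|[0-9]+(?:\.[0-9]*)?)(?:[eE][-+]?[0-9]+)?$"
)


def float_support(scalar: str) -> dict:
    """Report which YAML version would resolve the scalar as a float."""
    return {
        "yaml_1_1": bool(YAML11_FLOAT.match(scalar)),
        "yaml_1_2": bool(YAML12_FLOAT.match(scalar)),
    }


# repr(0.000001) in Python is "1e-06", the dotless form discussed in this issue.
for scalar in (repr(0.000001), "1.0e-06", "0.000001"):
    print(scalar, float_support(scalar))
```

Under these assumptions, only the dotless exponent form is a float for one version and a plain string for the other, which is why a YAML 1.1 loader hands the user code a str.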
Yeah nvmd I got it now: |
Since |
Other than the CLI value for The point is that however ruamel decides to serialize edit: regarding parsing scalar values passed to for TOML, they only implement a partial set of the JSON rules for parsing scientific notation. For any users that are actually using TOML params, they will likely be entering floats as decimals in the first place (because of TOML's very limited support for scientific notation), in which case we will still handle them properly. |
Also, regarding:
I don't think having a modified YAML serialization library specific to |
But I think that is the problem. We are expecting/forcing users to do that. |
I don't see why that is a problem, or how it is any different from how we also force users to format the data within their JSON/Python/TOML params and metrics files in a specific way. e: |
I don't think the suggestion requires mixing YAML versions since the proposed format is valid YAML 1.2, nor does it have to only apply to I know it's messy to have DVC define the format (maybe we can raise the idea of a backwards-compatible style with the |
So are we going to document that params and metrics files should be YAML 1.1, but Also, what do we do if a user does |
@pmrowla this was deemed problematic by leadership AFAIR. I would agree that it's not great UX, especially if YAML 1.1 is still dominant (e.g.
Not sure that's formatting if you're referring to data structure / schema.
Schema reqs are separate from the format version, I think. The problem here is that But good Q: can the same issue happen with other formats? e.g. you use a py2 params file and
🤺 good point, that does seem a little inconsistent. However I don't think dvc.yaml (a DVC-specific schema) and params/metrics files (glorified INI files) or plots files (data maps) are in the same category. We could simply not specify a version for all the latter (we currently don't for plots BTW).
🤺🤺🤺 |
Is there at least something we can do about this ☝️ ? |
It does matter for params/metrics/plots though, since the YAML version affects whether or not some values are treated as floats or strings. We show (numeric) deltas in diffs, and we plot numeric values. Whether the parsed type of The type also affects DVC pipeline behavior. If a DVC-tracked YAML 1.2 param dependency is written to
There is no way for us to tell what YAML version a user intended unless they explicitly use the optional |
Just to be clear, I'm not a data scientist, and at the end of the day it doesn't really matter to me whether we use YAML 1.1 or 1.2. The main issue here is that we need to pick one and explicitly tell people that's what they need to use. Trying to halfway support both 1.1 and 1.2 should not be an option for the reasons I've already outlined. I do think it makes more sense for us to use 1.2 since 1.2's parsing rules are consistent with JSON's, and removes any ambiguity about how |
@pmrowla This suggestion still uses YAML 1.2. I don't think anyone has suggested that DVC use YAML 1.1, just that users may be doing so. The only issues I can find with this suggestion are:
Let me know if you see other issues or have other suggestions. |
@dberenbaum it still doesn't address the issue of how DVC should parse exponents that a user enters without the dot on the command line via If a user that is using YAML 1.1 enters
This is the actual problem (that the expectation is it's ok for DVC to use 1.2 while users are on 1.1). Setting aside the floating point edge case discussion, if DVC is not using the same YAML version as the user, and tries to do "backwards compatible YAML 1.2", it introduces ambiguity and potentially unexpected behavior with regard to pipeline reproduction. If a user is mixing YAML 1.1 boolean strings in their params files it will break DVC behavior. In YAML 1.1, Even if the user is not mixing Basically, yes, modifying DVC's YAML serializer to include the dot when writing floats in scientific notation will close this particular ticket. But this is not an actual solution to the underlying problem, which is that YAML 1.1 and 1.2 are two different things (that differ in more ways than just "serializing scientific notation") and that if we try to support both at the same time, there will be more tickets just like this one that get raised in the future.
But R does have more than one JSON parser, which is conveniently another supported params/metrics format in DVC, and does not have any of the version/inconsistency issues you get with YAML. |
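To make the serializer change mentioned in this comment concrete, here is a rough sketch of writing floats so that scientific notation always includes the dot. This is not how ruamel or DVC serialize values; format_float_with_dot is a hypothetical helper that only post-processes Python's repr().

```python
def format_float_with_dot(value: float) -> str:
    """Render a float so that any exponent form keeps a dot in the mantissa."""
    text = repr(value)  # e.g. "1e-06", "2.5e-07", "0.5"
    mantissa, marker, exponent = text.partition("e")
    # Only exponent forms without a dot need patching; "0.5" is left untouched.
    if marker and "." not in mantissa:
        mantissa += ".0"
    return mantissa + marker + exponent


for value in (0.000001, 2.5e-07, 0.5, 1e16):
    print(repr(value), "->", format_float_with_dot(value))
```

The result still parses to the same number, so it stays valid for a YAML 1.2 reader. As the comment says, it only covers the scientific notation case and none of the other 1.1/1.2 differences.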
On the Q of scientific notation, do we actually expect users to need that? If not and if not allowing that would reduce the UX problems, I think we should consider it. |
Again I would like to stress here that the user's code "worked" before using
I wouldn't consider either of those related to this issue
This is essentially impossible |
I think one part of the problem is the difference between YAML 1.1 and YAML 1.2 looks minuscule from the user POV. It should have been called YAML 2 if it's backward incompatible. I feel bad when technically half-baked formats become so popular. Markdown is another such case. Is there a way to support both YAML 1.1 and 1.2, in all Supposing, hypothetically, we create another markup language compatible with both versions, what are the ambiguous cases to decide? Can we discuss and document them one-by-one? Is it feasible? |
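As a starting point for listing the ambiguous cases one by one, here is a hedged sketch covering a few well-known ones besides the exponent case this issue is about. The word list follows PyYAML's YAML 1.1 resolver, the patterns are simplified, and ambiguous_reason is a made-up helper, so treat it as an incomplete checklist rather than a full comparison of the two specs.

```python
import re


def _case_variants(words):
    """Return the lower, Capitalized and UPPER spellings of each word."""
    return {v for w in words for v in (w.lower(), w.capitalize(), w.upper())}


# Booleans as resolved by PyYAML (YAML 1.1) vs. the YAML 1.2 core schema.
BOOL_11 = _case_variants(["yes", "no", "on", "off", "true", "false"])
BOOL_12 = _case_variants(["true", "false"])

SEXAGESIMAL_11 = re.compile(r"^[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+$")  # 1:30
OCTAL_11 = re.compile(r"^[-+]?0[0-7_]+$")  # 010
OCTAL_12 = re.compile(r"^0o[0-7]+$")  # 0o10


def ambiguous_reason(scalar: str):
    """Return why a plain scalar differs between versions, or None."""
    if scalar in BOOL_11 and scalar not in BOOL_12:
        return "bool in 1.1, string in 1.2"
    if SEXAGESIMAL_11.match(scalar):
        return "base-60 int in 1.1, string in 1.2"
    if OCTAL_11.match(scalar):
        return "octal int in 1.1, decimal int in 1.2"
    if OCTAL_12.match(scalar):
        return "string in 1.1, octal int in 1.2"
    return None


for scalar in ("yes", "true", "1:30", "010", "0o10", "hello"):
    print(scalar, "->", ambiguous_reason(scalar))
```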
They are not specific to
Am I misunderstanding? There's nothing inherently wrong about the current parsing implementation, and it has value (like sorting experiments and deduplicating params values with different formatting). I just wanted to clarify how I see these issues as related, and that it is nonobvious to users that DVC doesn't replace the text as is, which can lead to unexpected results. |
being half-baked is a feature not a bug :) Same reason why Esperanto probably won't manage to replace English as lingua franca |
Then there is no reason to perpetuate this feature in DVC maybe :) Natural languages are inherently half-baked, because our understanding of the world we live in is half-baked, but let's not dive into this discussion here :) If it's possible to cover the differences between 1.1 and 1.2 in a sensible manner within a feasible update, it may be better not to leave some users in the buggy cold hands of YAML 1.1. Otherwise, let them learn to upgrade their code. I would just like to know how much work would be needed to cover the edge cases before making up my mind. |
To summarize, the problem is the perceived UX. It can feel as if DVC is breaking your code, and it's unclear from the traceback what actually happened. I also agree not to support YAML 1.1. But if there's no way to prevent or catch these situations in order to present a clear warning or error message, then the only option left is to be very emphatic in all the related docs *and* in the command help outputs that |
@efiop how hard would it be to implement this hack? PS: I'm wondering about this issue since in some projects the transition to yaml 2.0 might not be as straightforward as in our simple code examples. Also, some of the arguments above assume the user's code is written in Python, which might also not be correct in some cases (the OpenCV library uses YAML 1.0, not even 1.1). |
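On the clearer error message idea raised above: DVC cannot see inside user code, but user code that loads params with a YAML 1.1 parser can fail early with a hint instead of a confusing traceback later. A minimal sketch, assuming the params have already been loaded into plain Python values; check_numeric_param is a hypothetical helper, not part of DVC or any YAML library.

```python
def check_numeric_param(name, value):
    """Return the param as a float, or fail with a hint about YAML versions."""
    if isinstance(value, bool):
        raise TypeError(f"param {name!r} is a boolean, expected a number")
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            float(value)
        except ValueError:
            raise TypeError(f"param {name!r} is not numeric: {value!r}") from None
        # The text is a valid number, so the loader most likely disagreed
        # with the writer about the YAML version.
        raise TypeError(
            f"param {name!r} was loaded as the string {value!r}. It may have "
            "been written as YAML 1.2 but read with a YAML 1.1 parser."
        )
    raise TypeError(f"param {name!r} has unsupported type {type(value).__name__}")


print(check_numeric_param("lr", 0.001))
try:
    check_numeric_param("lr", "1e-06")
except TypeError as exc:
    print(exc)
```

Whether to raise or to silently coerce with float() is a design choice; raising keeps the type mismatch visible, which is the point made in the summary above.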
See iterative/dvc#5971: adds note to parameter values section mentioning scientific notation for SEO; use ruamel.yaml in example instead of PyYAML and add note
See iterative/dvc#5971: use ruamel.yaml in examples instead of PyYAML and add warnings; add note to parameter values section mentioning scientific notation for SEO
…3579) See iterative/dvc#5971: use ruamel.yaml in examples instead of PyYAML and add warnings; add note to parameter values section mentioning scientific notation for SEO
Another occurrence of this issue and a nice summary of the problem:
Originally posted by @aschuh-hf in #8466 (comment) |
Bug Report
dvc exp run --set-param lr=0.000001
Here's the error:
Description
The problem is that the yaml library only supports YAML 1.1, while DVC uses YAML 1.2. This issue seems to happen when exponents are generated.
The data science community and the Python ecosystem in general are stuck on YAML 1.1. We could try to be compatible by at least generating exponents with the dot, so that the output works on both sides. I am not sure how straightforward this fix will be, though (and this should only affect --set-param, not the dvc.lock and yaml files).
Reproduce
Replace
from ruamel.yaml import YAML
with
import yaml
Replace
yaml=YAML(typ='safe')
params = yaml.load(f)
with
params = yaml.safe_load(f)
lr in params.yaml should be updated to an exponent and you get the error above
Expected
Running dvc exp run --set-param lr=0.000001 should start running the experiment with the new lr
Environment information
Output of dvc doctor: