Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix line ending bug #90

Open
wants to merge 25 commits into
base: main
Choose a base branch
from
Open

Conversation

kekavc24
Copy link
Contributor

@kekavc24 kekavc24 commented Jul 1, 2024

I debugged #89 further while submitting #91 and noted several issues that are closely coupled:

  • Trailing line-break only applied to YamlList or YamlMap and never a YamlScalar
  • Dangling comments and line-breaks left behind if a comment spans multiple lines.

This PR:

  1. Applies a line break after each YamlNode encoded within a block.
  2. Fixes Literal and folded strings cannot handle "\n \n" at the end of a string #89.
  3. Introduces a normalizeEncodeBlock function that dynamically prunes the additional line break if:
  4. Introduces a skipAndExtractComments function which greedily skips any comments and whitespace belonging to the YamlNode being replaced.

Further changes made in #93


  • I’ve reviewed the contributor guide and applied the relevant portions to this PR.
Contribution guidelines:

Note that many Dart repos have a weekly cadence for reviewing PRs - please allow for some latency before initial review feedback.

@kekavc24
Copy link
Contributor Author

kekavc24 commented Jul 4, 2024

@jonasfj I have improved this PoC further by:

1. Lazily looking ahead for comments as that would bring the draft closer to the how YamlEditor currently does its edits.
2. Refactored the functions that edit YamlNode that are encoded as block elements to take advantage of the skipAndExtractComments which makes code a bit straight-forward and easy to understand (entirely subjective until you take a look) 😅

All in all, the existing and currently failing tests kinda vouch for this PoC but I'm ready to push its limit to prove its worth with stronger and more convoluted tests.

See #90, #93 & #94

@jonasfj
Copy link
Member

jonasfj commented Jul 4, 2024

I'm down for taking a look at this.
Currently, traveling and I fear this requires a bit of attention to understand. So I'll try to take a look next week.

I saw you had nice comments explaining some of the methods. That's awesome.

Feel free to give some review hints in the PR description :)

Also if any of these changes could be done independently, please do consider submitting smaller independent PRs. It's much easier to review, which often leads to faster turnaround time.

@kekavc24 kekavc24 force-pushed the fix-line-ending-bug branch from 2555b32 to c7aec85 Compare July 5, 2024 09:35
@kekavc24 kekavc24 marked this pull request as ready for review July 5, 2024 10:07
@kekavc24
Copy link
Contributor Author

kekavc24 commented Jul 5, 2024

I've tried to split the 22 commits over 3 PRs based on the changes I made. #90 -> #93 -> #94.

/// [YamlList] or a top-level [YamlScalar] of the [yaml] string provided.
///
/// [currentEndOffset] represents the end offset of [YamlScalar] or [YamlList]
/// or [YamlMap] being replaced, that is, `end + 1`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably better illustrated with an example.

/// ```dart
/// final yaml = 'my_key: "replaced-value"\n';
/// //                                    ^--- currentEndOffset points to the newline
/// fimal currentEndOffset = 24;
/// ```

ofcourse, I might have counted wrong here 🤣

Also does this mean that currentEndOffset can be greater than the length of the yaml document? (currentEndOffset == yaml.length, so yaml[currentEndOffset] might throw)

Example: k: v will have currentEndOffset: 4 (if I'm replacing v, right).

Copy link
Member

@jonasfj jonasfj Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, if I'm asking dumb questions here, but it might be worth making some examples.

And I think we should make sure we're doing it right, if we get he wrong invariants we'll have a lot of bugs to fix :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also does this mean that currentEndOffset can be greater than the length of the yaml document? (currentEndOffset == yaml.length, so yaml[currentEndOffset] might throw)

currentEndOffset shouldn't be greater than yaml.length when passed in as an argument. May be I should throw if it is?


Example: k: v will have currentEndOffset: 4 (if I'm replacing v, right).

I think 3 since it's always zero-based.


Sorry, if I'm asking dumb questions here, but it might be worth making some examples.

And I think we should make sure we're doing it right, if we get he wrong invariants we'll have a lot of bugs to fix :D

Sure, will add examples.

lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/editor.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
lib/src/utils.dart Outdated Show resolved Hide resolved
Comment on lines 388 to 401
/// Normalizes an encoded [YamlNode] encoded as a string by pruning any
/// dangling line-breaks.
///
/// This function checks the last `YamlNode` of the [update] that is a
/// `YamlScalar` and removes any unwanted line-break within the
/// [updateAsString].
///
/// This is achieved by obtaining the chunk of the [yaml] that is after the
/// current node being replaced using its [nodeToReplaceEndOffset]. If:
/// 1. The chunk has any trailing line-break then the it is left untouched.
/// 2. The node being replaced with [update] is not the last node, then it
/// is left untouched.
/// 3. The terminal node in [update] is a `YamlScalar`, that is,
/// the last [YamlNode] within the [update] that is not a collection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider adding some example.

Also why is it that we don't do this normalization when we create the updateAsString? Wouldn't that be more correct -- I don't know -- I probably don't understand all the context here, that's why I'm asking :D

Copy link
Contributor Author

@kekavc24 kekavc24 Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I’ll add an example.

I do that because yamlEncodeBlock has no context of which method called it or the current structure of the yaml document. It could be:

  1. updateInList or _appendToBlockList or _insertInBlockList for lists
  2. _replaceInBlockMap or _addToBlockMap for block maps
  3. Yaml.update for top level YamlNode.

Thus, the callers should make a call to the function to prune the trailing line break because they have an idea of the yaml's structure.

Additionally, I found it quite advantageous on our part to try and include line breaks for an existing YamlNode being replaced to edits. This reduces bugs and removes the need to try and guess where the next line break is.

It’s better to include it and check later if it was there just before suggesting an edit using a definite endOffset

@kekavc24 kekavc24 closed this Jul 17, 2024
@kekavc24 kekavc24 deleted the fix-line-ending-bug branch July 17, 2024 07:55
@kekavc24 kekavc24 restored the fix-line-ending-bug branch October 22, 2024 14:00
@kekavc24 kekavc24 reopened this Oct 22, 2024
@kekavc24 kekavc24 marked this pull request as draft October 22, 2024 14:02
@jonasfj
Copy link
Member

jonasfj commented Oct 22, 2024

Feel free to mark comments as "Resolved" once you've:

  • addressed the comment, or,
  • replied explaining why you don't want to do anything (not all comments need be addressed).

I think it would be useful to add some examples in the code. Especially for skipAndExtractCommentsInBlock and normalizeEncodeBlock.
I'm not sure I fully understand what they are even supposed to do. To be fair, maybe I just didn't look hard enough at it yet 🤣

> Update function doc and inline comments
> Make function return named record
> Prefer named parameters
> Refactor existing code referencing function
@kekavc24 kekavc24 marked this pull request as ready for review October 26, 2024 17:39
@kekavc24
Copy link
Contributor Author

kekavc24 commented Oct 26, 2024

I've added commits I had moved to the other PRs. Without the changes, the skipAndExtractCommentsInBlock and normalizeEncodedBlock won't make sense. (see failing tests for bonus points)

The existing code was only scanning for the first instance of # and \n but yaml allow for comments spanning multiple lines. A key issue was with blocks lists in yaml since it scanned for -. But comments can also have -? Case in point:

# Existing code works great here for edits
- valueToRemove # Comment with hyphen (-)
- nextValue


# Existing code fails horribly here for edits

- valueToRemove # Comment with hyphen (-)
                     # but spanning (oops, hyphen "-")
                     #
                     # multiple lines (another hyphen "-")
- nextValue

skipAndExtractCommentsInBlock:

  1. Scans ahead until it skips all comments whether it's a single line or spans multiple lines
  2. Can do it lazily (remembers the first \n it saw while scanning ahead if no comments are present) but also greedily (skips both \n and/or comments)

I realized we can do it lazily when making updates within maps or lists i.e for inserts or splices etc. And greedily exclusively for removing elements in maps/lists.

normalizeEncodedBlock just helps us to avoid always falling back to double quotes even for string that can be styled as folded or literal

@kekavc24 kekavc24 requested a review from jonasfj October 26, 2024 18:20
/// Returns the `endOffset` of the last comment extracted that is `end + 1`
/// and a `List<String> comments`. It is recommended (but not necessary) that
/// the caller checks the `endOffset` is still within the bounds of the [yaml].
({int endOffset, List<String> comments}) skipAndExtractCommentsInBlock(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give me a few examples of how this works.

Or maybe a few unit tests, even if we don't keep them, I'd like to understand a bit more what the intention here is.

It seems like it takes a string slice, and returns:

  • Comments from within the slice (not sure why)
  • Offset where the next non-comment thing in the slice begins, or ends? (or did I misunderstand that?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it takes a string slice, and returns:

  • Comments from within the slice (not sure why)
  • Offset where the next non-comment thing in the slice begins, or ends? (or did I misunderstand that?)

Yes. It returns where the non-comment starts but with a nuanced approach based on the bool argument you pass into the greedy parameter.


Could you give me a few examples of how this works.

Yeah, sure. See the examples below.

- valueToRemove # Comment with hyphen (-)
                # but spanning (oops, hyphen "-")

- nextValue
  • greedy as true returns the offset of the - indicating the start of the next value in the list.
  • greedy as false returns the offset of the last \n (line-break) + 1 which is just after the last closing bracket )

Or maybe a few unit tests, even if we don't keep them, I'd like to understand a bit more what the intention here is.

A nice example even without the unit tests would be an element in a list or map with multi-line comments being removed. For example:

Performing YamlEditor(yaml).remove([0]); on the string below:

- valueToRemove # Comment with hyphen (-)
                     # but spanning (oops, hyphen "-")
                     #
                     # multiple lines (another hyphen "-")
- nextValue


# Expected output without this PR

                     # but spanning (oops, hyphen "-")
                     #
                     # multiple lines (another hyphen "-")
- nextValue


# Expected output with PR
- nextValue

The offending code can be found here which is similar to _removeFromBlockMap:

if (start > 0) {
final lastHyphen = yaml.lastIndexOf('-', start - 1);
final lastNewLine = yaml.lastIndexOf('\n', start - 1);
if (lastHyphen > lastNewLine) {
start = lastHyphen + 2;
/// If there is a `-` before the node, we need to check if we have
/// to update the indentation of the next node.
if (index < list.length - 1) {
/// Since [end] is currently set to the next new line after the current
/// node, check if we see a possible comment first, or a hyphen first.
/// Note that no actual content can appear here.
///
/// We check this way because the start of a span in a block list is
/// the start of its value, and checking from the back leaves us
/// easily confused if there are comments that have dashes in them.
final nextHash = yaml.indexOf('#', end);
final nextHyphen = yaml.indexOf('-', end);
final nextNewLine = yaml.indexOf('\n', end);
/// If [end] is on the same line as the hyphen of the next node
if ((nextHash == -1 || nextHyphen < nextHash) &&
nextHyphen < nextNewLine) {
end = nextHyphen;
}
}
} else if (lastNewLine > lastHyphen) {
start = lastNewLine + 1;
}
}
return SourceEdit(start, end - start, '');

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disclaimer: These thoughts are not fully conceived, I'm still trying to understand.


So if the input for YamlEditor(yaml).remove([0]) is:

- valueToRemove      # Comment with hyphen (-)
                     # but spanning (oops, hyphen "-")
                     #
                     # multiple lines (another hyphen "-")
  # nextValue is awesome:
- nextValue

Then with this PR we'd get:

- nextValue

Right?


I'm just highlighting this example, because detecting that a comment is spanning multiple lines is difficult.

It's kind of obvious to us humans that the comment # nextValue is awesome: is associated with - nextValue and that all the comments with equal indentation are related to - valueToRemove.

But if we tried to make a heuristic to detect this, such heuristic would fail if comments aren't perfectly aligned 🙈 I can see no end of corner cases 🤣


So maybe we have to ask, when faced with YamlEditor(yaml).remove([0]) on:

# A
  # B
- valueToRemove # C
                # D
# E

# F
  # G
- nextValue # H
# I

What comments do we wish to remove / preserve?

At the top of my head, I'm thinking:

  • H and I should definition be preserved.
  • Removing C seems reasonable.
  • Do we want to treat A and B the same way?
    • Would we want to detect that A and B are different based on indentation?
  • Do we want to treat D, E, F and G the same way?
    • Would we want to detect that D and E are not the same based on indentation?
    • Would we want to detect that E and F are not the same based on empty line?
    • Would we want to detect that F and G are not the same based on indentation?

I'm suspecting that maybe it's simplest to say that:

  • A and B are treated the same way: we preserve them.
  • D, E, F and G are treated the same way: we preserve them.

It could be that it's best to say retain comments, unless they are inside the value we're removing (or on the same line). We risk leaving some half gabled comments, but which is worse:

  • Removing too many comments?
  • Retaining half a comment related to a value that's been removed?

If we make heuristics for all this to be more accurate, don't we risk things becoming extremely complicated? 🤣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right?

Yes.

Funnily enough, I totally understand your point of view 😄. That’s how my head was spinning before I wrote skipAndExtractComments.

However, I reduced this PR’s scope to actually do what YamlEditor currently WANTS to do while ensuring we can have folded/literal strings. This is still a PoC.

From the PR, you can see I didn’t change how YamlEditor works just nudged it to do what existing code wants to do.

Personally, I don’t see it as complexity but rather as a challenge to live up to the package’s description the RIGHT WAY since if we provide a way to preserve comments, wouldn’t it be great to at least provide a subjective way to remove them?

If you are open to it, I can come up with an issue by the end of the week with my view and where we can both float ideas on how to tackle this and once we settle on something I will submit a PR. If not, then please indicate what we can keep from this PR and what not to keep.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, then please indicate what we can keep from this PR and what not to keep.

I'll have to admit that I'm not entirely sure what this PR does 🙈

Also that we have things like skipAndExtractCommentsInBlock but we only use endOffset from it, scares me a bit. It might actually makes sense, if I understood what the future plans for skipAndExtractCommentsInBlock was 🤣

If you are open to it, I can come up with an issue by the end of the week with my view and where we can both float ideas on how to tackle this and once we settle on something I will submit a PR.

I think that's a good idea!

Let's outline:

  • What problems we're trying to solve.
  • What the limitations to our approach will be.
  • What will the downsides be? (if we're doing heuristics, there will be scenarios where it works less than perfect, what do those scenarios look like).

once we settle on something I will submit a PR.

Small PRs are MUCH easier to land!

If we can't do it all in one PR, maybe we should create an intermediate branch, try to land small PRs there and then eventually merge the branch into main.

There isn't a whole lot of development on main, so it's not like there'll be a lot of merge conflicts 🤣 (benefits of working in a small repository).

Of course, there is a non-trival risk that we get stuck somewhere along the way don't land anything.

But if we try this route we could perhaps consider writing a test cases upfront.
Maybe we could extend the setup in test/testdata to support more kinds of tests, and maybe also support tests that are known to fail now, but that we want to fix.

Sometimes it's a lot easier to discuss things through examples.

Personally, I don’t see it as complexity but rather as a challenge to live up to the package’s description the RIGHT WAY since if we provide a way to preserve comments, wouldn’t it be great to at least provide a subjective way to remove them?

Yes and no. Also scratching my head and wondering how ambitious a package description we wrote 🙈

Some background, this package started as part of GSoC, the original aim was to enable us to create a dart pub add <package> command.

Since then it's also be used for dart pub upgrade --major-versions, dart pub upgrade --tighten, and a few other things left and right.

We needed some way to preserve comments, because while JSON is a subset of YAML, I don't think anyone would like to use dart pub add if it parsed your pubspec.yaml added the package and then saved pubspec.yaml formatted as JSON. As people frequently have comments in their pubspec.yaml.

But we also don't really care to be perfectionist. Ideally, we just want to not cause comments and whitespace from parts of the YAML file unrelated to the modification we're making from being changed. But if a comment next to the thing we're changing becomes garbled or whitespace changes a bit, then perhaps that's okay'ish.

In theory I wouldn't object to this package having support for reading, stripping, modifying and removing comments. In theory I also wouldn't mind this package using heuristics to guess whether a comment belongs to a value being removed or not.

In practice this package is used to make small changes to YAML configuration files. If we don't limit the scope of the package, then it'll have an endless stream of bugs and feature requests.

It might be better declare that some behavior or feature is out of scope, than it is to have a buggy implementation. It's not like we have lot resources to invest in the maintenance of this package.

IMO, we should aim for (in order of priority):

  • (1) Reliably produce:
    • Valid YAML
    • A semantically correct mutation
      (_performEdit ensures this, by parsing and comparing the result, throwing if there are errors)
  • (2) Avoid internal errors
    • Bugs
    • Inconsistencies caught by _performEdit
  • (3) Stable API and predictable behavior.
  • (4) Preserve comments and whitespace in parts of the YAML document unrelated to the mutation.
  • (5) Reasonable heuristics for preservation or removal of comments and whitespace in part of the YAML document affected by the mutation.

I think that when adding features or tweaking (5) we should consider what implication it might have for maintainability and remember that the objective is to modify small configuration files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to admit that I'm not entirely sure what this PR does 🙈

  1. We can now encode strings with a trailing line-break as ScalarStyle.FOLDED or ScalarStyle.LITERAL instead of falling back to double quotes. See:

  2. Because we can do 1 above, it comes with a cost. The edit we suggest may have a line-break already since we ensure we preserve any dangling line-breaks after calling normalizeEncodedBlock which:

    • Preserves line-breaks for ScalarStyle.FOLDED or ScalarStyle.LITERAL or if the value of the YamlScalar itself has a line-break
    • Checks if the node being replaced had a line-break and preserves it in the new encoded block

The cost is to ensure we include any existing line-break in the edit without trying to introduce any breaking behavior to the current YamlEditor functionality.

From my previous comment, YamlEditor CURRENTLY wants to get rid of inline comments in the terminal node but does it poorly causing issues with edits with some of the tests actually enabling the bug. (See some of the failing tests when this PR is run. I’ll highlight them on request)

Thus, why I added skipAndExtractComments which skips the inline comments if that’s what YamlEditor wanted to do.

I must admit, maybe the function is a bit overzealous and returns the comments it skipped but I thought why not.

Also, this PR was just a PoC. Nothing more than that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kekavc24 how about you start an issue outline what we want to do.
And we try to reach consensus on that first.

I think perhaps it's best to try write tests and make small changes (many small PRs are infinitely better, than large ones). It's more work to split logic into separate PRs. But the challenge is also getting there without introducing bugs (actually, it's getting there while convincing maintainers of this package that we're not introducing new bugs).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Literal and folded strings cannot handle "\n \n" at the end of a string
2 participants