Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Text.pm regression since v0.58: YAML::Tiny failed to classify line ... #365

Closed
jonassmedegaard opened this issue Jul 3, 2022 · 10 comments
Closed

Comments

@jonassmedegaard
Copy link
Contributor

jonassmedegaard commented Jul 3, 2022

Text.pm chokes on a text containing a markdown horisontal ruler at the very top of the file.

This file was processed correctly until and including po4a v0.57 but fails with v0.58 onwards.

Problem is the introduction of the YAML Front Matter parser: Code wrongly treats leading triple-dash as always meaning beginning of a YAML header, but YAML metadata is optional for markdown text and the proper is therefore to only try parse as YAML but gracefully fallback to parsing as markdown in case of failure.

For now I have applied the workaround of using four dashes, but I am confident that the Text.pm module should be fixed.

@jonassmedegaard jonassmedegaard changed the title Text.pm regression since v0.57: YAML::Tiny failed to classify line ... Text.pm regression since v0.58: YAML::Tiny failed to classify line ... Jul 3, 2022
@mquinson
Copy link
Owner

mquinson commented Jul 3, 2022

Hello, thanks for reporting.

Is it really an issue? The only other possibility would be to allow po4a to silently fail on malformated YAML headers, and I feel bad about it. The experience show that people only look at po4a error messages when the build fails. I'm afraid of people not noticing important breakages if we allow the YAML header to fail.

If you insist that your file should be allowed to have --- in the header (please tell me why you need it), then I'll add an option to allow you to explicitly ignore the --- in the header and not even try to parse the YAML after it.

Thanks

@jonassmedegaard
Copy link
Contributor Author

I understand that this is a cornercase: I can only imagine it being rare to have markdown content that begins with a horisontal ruler - and additionally expressing that using exactly three dashes.

I do think, however, that it is not totally weird to have scenarios like mine: a markdown text being a footer, where that footer should have a horisontal ruler as its very first "word".
Also, I do think it is not uncommon for users of markdown to use YAML optionally - i.e. include YAML headers in some files but not others. Which means dsabling YAML parsing would be less ideal.

How about by default to parse YAML loosely, and have a strict option that makes it fail instead of gracefully processing as markdown?

I mean, fundamentally markdown was intended as failing gracefully, deliberately choosing markup that would be reasonably readable if not understood or mistyped.
(that's the reason I propose loose YAML parsing to be default, not as you implemented till now having YAML parsing be strict)

@mquinson
Copy link
Owner

mquinson commented Jul 3, 2022

I understand your point, but I'm still not convinced. Maintainers don't look at po4a output, even less at the files produced by po4a. I don't think it's safe to go for the loose default. I will extend the error message that you get when the file begins with ---. Take this as a proposal for now, and keep giving your opinion please.

mquinson added a commit that referenced this issue Jul 3, 2022
This is proposing a potential solution to
#365
@jonassmedegaard
Copy link
Contributor Author

What you propose is to continue letting po4a dictate how to write content. I find that ugly - sure, only for a cornercase, but still.

Please if you do not want to implement loose-by-default then at least implement loose-as-option and have the error message mention that option as an alternative.

@mquinson
Copy link
Owner

mquinson commented Jul 4, 2022

I'm 100% OK with the implementation of loose-as-option, and won't close that bug
until it's done. But it's not completely trivial either. I guess that one would
have to save the lines and references that you get from $self->shiftline() in
Text.pm::parse_markdown_yaml_front_matter() and then reinject them with
$self->unshiftline() if that's a failure.

But I'm still wondering what kind of data structure I should use to save them.
Either a list of vectors (ctn,ref), or a plain list with an alternance of both
elements. I'm also wondering whether I should unshift starting from the oldest
shifted element, or in reverse order. I guess reverse order.

If "someone" comes up with a patch implementing the loose-by-default (ie, just
continuing when the YAML does not parse correctly), I'll play with the po4a
option things to implement the explicit-loose (ie, activate the provided
behaviour only if explicitly asked by the user).

Does it sound like a plan? ;)

@jonassmedegaard
Copy link
Contributor Author

jonassmedegaard commented Jul 4, 2022

Sounds like a plan to me.

I wonder (without volunteering to dive in, however) if it really needs to be this complicated to solve:

Current code does something like this:

  1. Unconditionally call function parse_markdown_yaml_front_matter()
  2. Process any top lines hinted as if maybe perhaps being YAML
  3. Call YAML::Tiny (which fails if in fact it wasn't YAML after all)

Seems to me a simple solution would be to call YAML::Tiny twice, like this:

  1. Try call YAML::Tiny, and if it succeeds then call function parse_markdown_yaml_front_matter()
  2. Process any top lines hinted as if maybe perhaps being YAML (but known ahead that in fact it is)
  3. Call YAML::Tiny (which is known to not fail)

More elegant might be to only call YAML::Tiny once, passing its result to the function - or get rid of the separate function altogether: It is only called once ;-)

@mquinson
Copy link
Owner

mquinson commented Jul 6, 2022

One difficulty is to get the input from where it is (using TransTractor.pm::shiftline()), try to use YAML::Tiny on that content, and if it fails, reinject the data within the TransTractor so that the classical parser gets it.

I was only describing this part "pop the data, but make sure you can reinject it in case of problem" earlier. This is perfectly complementary with what you describe here.

@jonassmedegaard
Copy link
Contributor Author

Oh, I understand the complication now.

Sorry, if core code has no way of "rolling back" if hitting a dead end, then indeed my "simple approach" is not so simple after all. :-(

@mquinson
Copy link
Owner

mquinson commented Jul 6, 2022

the core code (TransTractor) has a rollback solution, but you have to unshift content+reference in the right order. Not necessarily difficult, but it has to be taken into account.

@jonassmedegaard
Copy link
Contributor Author

jonassmedegaard commented Jul 6, 2022

Right. I clearly didn't account for that (I have "only" juggled Perl for ~25 years - haven't familiarized myself with iterators yet).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants