Phase 1 of new HTML parser #2602

Merged
merged 17 commits into from
Jul 11, 2023

tabatkins
Collaborator

This part's fairly simple: I just run the entire document through the new HTML parser and immediately reserialize it. Benefits:

  • Detects some markup mistakes, like duplicated attributes
  • Correctly adds line numbers to everything in your source. (Some elements coming from metadata or datablocks still won't have correct line numbers.)
  • Handles markdown code spans, code blocks, CSS <<type>> autolinks, and CSS ''maybe'' autolinks correctly, in the parser, rather than hackily via a regex.
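To give a feel for the first two benefits, here's a minimal sketch of catching duplicated attributes and reporting the source line they're on. It uses Python's standard-library `html.parser`, not the new parser from this PR; `DuplicateAttrFinder` and `find_duplicate_attrs` are names made up for the illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttrFinder(HTMLParser):
    """Hypothetical helper: records (line, tag, attribute) for every
    attribute name that appears more than once on a single start tag."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports where the tag currently being handled starts.
        line, _col = self.getpos()
        # attrs is a list of (name, value) pairs, repeats included.
        counts = Counter(name for name, _value in attrs)
        for name, count in counts.items():
            if count > 1:
                self.problems.append((line, tag, name))


def find_duplicate_attrs(source):
    """Return a list of (line, tag, attribute) tuples for duplicated attributes."""
    finder = DuplicateAttrFinder()
    finder.feed(source)
    finder.close()
    return finder.problems


if __name__ == "__main__":
    sample = '<p>ok</p>\n<a href="x" class="a" href="y">link</a>\n'
    for line, tag, name in find_duplicate_attrs(sample):
        print(f"line {line}: <{tag}> repeats the '{name}' attribute")
```

The real parser does much more than this (it also handles the Markdown and CSS autolink syntaxes listed above), so treat this purely as an illustration of the kind of mistake it can now flag.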

It then still runs the datablock, Markdown, and finally the existing HTML parser over the spec, so this is probably slightly slower at the moment, but those passes will be eaten by the new parser in later phases.

Known issues:

If you were working around the hacky ` parsing by using \` inside of a code block
(which shouldn't have any parsing done inside),
the new parser no longer does any parsing inside, so the \ is now left in your code as-is.
Fix: Remove the spare \ so you just have valid JS again.
(cookie-store, idle-detection)
If you weren't working around this, your spec was probably broken,
and now it's fixed!
(local-font-access)

The HTML parser runs before the Markdown parser right now
(fully integrating the two together is the next project),
so a tag broken across a line inside a blockquote
will parse incorrectly
(it's closed prematurely by the blockquote's > at the start of the next line).
Fix: Just put the whole tag on one line for now.
(scroll-to-text-fragment, web-animations-2)
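If you want to check a spec for this before regenerating, something like the following rough sketch might help. It's a line-based heuristic, not part of Bikeshed; `find_split_tags_in_blockquotes` and both regexes are made up for the illustration, and it assumes blockquote lines start with optional whitespace followed by `>`.

```python
import re

# A "<" that starts a tag (a letter or "/" follows) with no ">" after it
# before the end of the line, i.e. a tag left open at the line break.
UNCLOSED_TAG = re.compile(r"<[A-Za-z/][^<>]*$")
# A Markdown blockquote marker at the start of a line.
BLOCKQUOTE_MARKER = re.compile(r"^\s*>")


def find_split_tags_in_blockquotes(source):
    """Return 1-based line numbers where a tag is still open at the end of a
    blockquote line and the next line also begins with a blockquote marker."""
    lines = source.splitlines()
    hits = []
    for i in range(len(lines) - 1):
        current, following = lines[i], lines[i + 1]
        if not BLOCKQUOTE_MARKER.match(current):
            continue
        if not BLOCKQUOTE_MARKER.match(following):
            continue
        # Drop the leading marker so it isn't mistaken for a tag's closing ">".
        body = BLOCKQUOTE_MARKER.sub("", current, count=1)
        if UNCLOSED_TAG.search(body):
            hits.append(i + 1)
    return hits


if __name__ == "__main__":
    sample = (
        "> See <a\n"
        '> href="https://example.com/">the link</a>.\n'
        "> All <b>fine</b> here.\n"
    )
    print(find_split_tags_in_blockquotes(sample))
```

It will miss some cases (a `>` inside a quoted attribute value, for instance) and doesn't know about code blocks, so it's only a first pass for finding tags to pull back onto one line.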

Previously, ''&lt;foo>'' would make a maybe autolink to <foo> as a value.
Now it's equivalent to <css>&amp;lt;foo></css>,
which is broken,
but arguably it was always broken in the first place.
Fix: change to ''<foo>''.
(css-properties-values-api)

The Markdown behavior of "if you have spaces at both the start and end of a code span, remove one from each side" is now properly implemented. A few specs relied on it removing any amount of any whitespace, so now there's an extra space sometimes if you did a linebreak inside your code span for some reason.
Fix: Put the code span all on one line, or at least don't linebreak between the end of the content and the closing ticks.
(The serializer still collapses starting/ending whitespace down to a single character to make the content look better, so that might still be doing more stripping than you want too.)
(webpackage)
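For reference, the code-span whitespace rule as CommonMark describes it looks roughly like this. It's a sketch of the rule only, not the parser's actual code; `normalize_code_span` is a hypothetical name, and it deliberately leaves out the serializer's later whitespace collapsing mentioned above.

```python
def normalize_code_span(raw):
    """Apply the CommonMark whitespace rule to the raw text found between
    the opening and closing backticks of a code span."""
    # Line endings are converted to spaces first.
    text = raw.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
    # Remove exactly one space from each side, but only when both sides
    # have one and the content isn't made up entirely of spaces.
    if (
        len(text) >= 2
        and text[0] == " "
        and text[-1] == " "
        and text.strip(" ")
    ):
        text = text[1:-1]
    return text


if __name__ == "__main__":
    print(repr(normalize_code_span(" foo ")))    # one space stripped per side
    print(repr(normalize_code_span("  foo  ")))  # only one space stripped per side
    print(repr(normalize_code_span("foo\n")))    # line break becomes a trailing space
```

The last case is the one that bit a few specs: a linebreak right before the closing ticks turns into a space, and since there's no matching space at the start, nothing gets stripped.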

tabatkins merged commit ad86d38 into main on Jul 11, 2023
tabatkins deleted the parser-v2 branch on July 11, 2023 20:37