Phase 1 of new HTML parser #2602

tabatkins · 2023-07-10T23:25:48Z

This part's fairly simple - I just run the entire document thru the new HTML parser and immediately reserialize it. Benefits:

Detects some markup mistakes, like duplicated attributes
Correctly adds line numbers to everything in your source. (Some elements coming from metadata or datablocks still won't have correct line numbers.)
Handles markdown code spans, code blocks, CSS <<type>> autolinks, and CSS ''maybe'' autolinks correctly, in the parser, rather than hackily via a regex.
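To make the first two benefits concrete, here is a minimal sketch of duplicate-attribute detection with line numbers. It uses Python's standard-library `html.parser` as a stand-in, not the new parser from this PR; `DuplicateAttrFinder` and `find_duplicate_attrs` are hypothetical names made up for the illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class DuplicateAttrFinder(HTMLParser):
    """Records (line, tag, attribute) for every attribute repeated on one tag."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports the position of the tag currently being handled.
        line, _offset = self.getpos()
        counts = Counter(name for name, _value in attrs)
        for name, count in counts.items():
            if count > 1:
                self.problems.append((line, tag, name))


def find_duplicate_attrs(source):
    finder = DuplicateAttrFinder()
    finder.feed(source)
    finder.close()
    return finder.problems


sample = '<p>ok</p>\n<a href="x" href="y">link</a>\n'
problems = find_duplicate_attrs(sample)
```

Because the check happens during parsing, each report can carry the source line of the offending tag, which is the same reason a real parser can attach line numbers to everything it sees.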

After that, it still runs the datablock pass, the Markdown pass, and finally the existing HTML parser over the spec, so this is probably slightly slower at the moment. Those passes will be absorbed into the new parser in later phases.

Known issues:

If you were working around the hacky ` parsing by using \` inside of a code block
(which shouldn't have any parsing done inside),
the parser no longer does any parsing inside, so the \ is left in your output.
Fix: Remove the spare \ so you just have valid JS again.
(cookie-store, idle-detection)
If you weren't working around this, your spec was probably broken,
and now it's fixed!
(local-font-access)
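If you want to hunt for leftover workarounds, a rough sketch like the following lists every line containing `\``. It is a plain text scan and not part of Bikeshed; `find_escaped_backticks` is a hypothetical helper, and it does not know whether a hit is actually inside a code block, so review each result by hand.

```python
def find_escaped_backticks(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line containing \\`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "\\`" in line:
            hits.append((lineno, line.strip()))
    return hits


sample = "<pre class=lang-js>\nconst s = \\`hello\\`;\n</pre>\n"
hits = find_escaped_backticks(sample)
```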

The HTML parser runs before the Markdown parser right now
(fully integrating the two together is the next project),
so a tag broken across a line inside a blockquote
will parse incorrectly
(it's closed prematurely by the blockquote's > at the start of the next line)
Fix: just put the whole tag on one line for now.
(scroll-to-text-fragment, web-animations-2)
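The failure mode can be reproduced with any HTML parser that runs before blockquote markers are stripped. The sketch below uses Python's standard-library `html.parser` as a stand-in for the new parser (an assumption for illustration only); `StartTagCollector` and `start_tags` are hypothetical names.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects each start tag together with the attributes the parser saw."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def start_tags(source):
    collector = StartTagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags


# The tag is split across two blockquote lines; the second line's leading ">"
# is read as the end of the start tag, so the title attribute becomes text.
broken = '> <a href="https://example.com"\n> title="x">link</a>'

# The suggested fix: keep the whole tag on one line.
fixed = '> <a href="https://example.com" title="x">link</a>'

broken_tags = start_tags(broken)
fixed_tags = start_tags(fixed)
```

In the broken version the `<a>` tag ends up with only its `href`, and `title="x">` leaks into the text content; in the fixed version both attributes survive.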

Previously, ''<foo>'' would make a maybe autolink to <foo> as a value.
Now it's equivalent to <css>&lt;foo></css>,
which is broken,
but arguably it was always broken in the first place.
Fix: change to ''<foo>''.
(css-properties-values-api)

The Markdown behavior of "if you have spaces at both the start and end of a code span, remove one from each side" is now properly implemented. A few specs relied on it removing any amount of any whitespace, so now there's an extra space sometimes if you did a linebreak inside your code span for some reason.
Fix: Put the code span all on one line, or at least don't linebreak between the end of the content and the closing ticks.
(The serializer still collapses starting/ending whitespace down to a single character to make the content look better, so that might still be doing more stripping than you want too.)
(webpackage)
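For reference, the rule can be sketched as below. This follows the CommonMark description of code spans (line endings become spaces, then one space is stripped from each side only if both sides have one and the content is not all spaces); it is not Bikeshed's actual code, and `normalize_code_span` and `old_behavior` are hypothetical names.

```python
def normalize_code_span(content: str) -> str:
    """Apply the CommonMark code-span whitespace rule to raw span content."""
    # Line endings inside a code span are treated as spaces.
    text = content.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
    starts_and_ends_with_space = (
        len(text) >= 2 and text[0] == " " and text[-1] == " "
    )
    # Strip exactly one space per side, and only if the span is not all spaces.
    if starts_and_ends_with_space and text.strip(" ") != "":
        text = text[1:-1]
    return text


def old_behavior(content: str) -> str:
    """What some specs relied on: removing any amount of any whitespace."""
    return content.strip()


one_each = normalize_code_span(" foo ")
two_each = normalize_code_span("  foo  ")
trailing_break = normalize_code_span("foo\n")
```

The last case is the one described above: with a linebreak between the content and the closing ticks, the content has a trailing space but no leading one, so nothing is stripped and the extra space remains, whereas `old_behavior("foo\n")` would have returned plain `foo`.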

tabatkins added 17 commits June 22, 2023 12:34

Update parser.py to latest version from branch

08c19a0

Add a folder filter to testing for easier targeting.

464f97d

Move the mixed-indent check to later in the pipeline, so I can do some processing first.

e739a77

Auto-omit bs-* attributes from serialization so I don't have to strip them manually.

62f6135

Copy over the escaping functions to avoid some circular import issues.

40f9218

Eagerly parse (and ignore the contents of) datablocks for now.

2fe072e

Parse CSS maybes, and condition both CSS and Markdown behind the appropriate metadata.

5ac01ba

Hackily allow self-closing SVG and MathML. Will allow this via proper context later.

b5b38a9

Fix ws handling in markdown code spans and blocks.

ebac687

Stop removing comments; it breaks things. Just blank them out and let the existing comment-handler code do it for now.

44e1ee4

Add more parser entry points.

fe449f1

Invoke the parser early; after extracting metadata but before doing anything else. Remove the other hacky pre-processing things (markdown code spans, CSS type and maybe links) and handle them properly in the parser instead.

332d25e

Fix lint

85cbd4f

Pull out the 'starts with an <' code to a function for easier early-exits, and properly handle failures of the raw block stuff.

d1c74f1

rebase tests

bbf1f9a

black

93b6a8f

lint

a64e0e0

tabatkins merged commit ad86d38 into main Jul 11, 2023

tabatkins deleted the parser-v2 branch July 11, 2023 20:37

tabatkins mentioned this pull request Jul 11, 2023

Release Notes #1773

Open
