Ensure `}` and `{` are valid boundary characters #17001

RobinMalfait · 2025-03-06T16:38:57Z

In a lot of templating languages, {…} is used to escape the template language to write some code logic, this means that if you have input that looks like this:

<div class="flex items-center mx-4{% if session.isValid %}{% else %} h-4{% endif %}"></div>

That the if condition will be gone once it's compiled to HTML, but it also means that the } and { characters next to mx-4{ and h-4 should be valid boundary characters because Tailwind CSS typically operates on the source code, not the compiled code.

I didn't create a pre processor for this because unfortunately this syntax is often embedded in .html files, and I would love to not have to create a pre processor for .html files.

You might notice that there are other changes in this PR, they are all related to boundary characters, let me explain:

In the named utility machine, we repeated the allowed boundary characters. We will now re-use the shared function so that it's all in one location.
We still check for a few additional characters such as : because of JS keys, / because it could be a start of a modifier and ! because the named utility could be important.

This actually exposed a bug where the class attribute in <div class="…">…</div> was not extracted (which we typically want). However, in languages like Svelte, this is a real thing you can do: <div class:px-4="condition"></div>. The pre processor will make sure to convert this input to <div class px-4="condition"></div> but that still means that we need to extract the px-4.

if a utility ends in an number this worked as expected, but <div class:flex="condition"></div> did not work. This PR also fixes that by adding a dedicated test and it's also the reason why you see a bunch of class candidates in the tests.

Fixes: #16999

Test plan

Added new test
Existing tests pass

We were duplicating the valid boundary characters in the `named_utility_machine`, let's re-use the `is_valid_after_boundary` such that the valid boundary characters are listed in a single spot. However, in the `named_utility_machine` there are a few characters that _are_ valid as boundary characters because this machine on it's own is incomplete, there could be any of these characters as well: - `:`, because of JS object keys - `/`, because of modifiers - `!`, because of important

A `.` is not a valid boundary character anymore so these tests start failing. However it is safe to delete these because: 1. The Pug syntax tests live in the main extractor already 2. The tests in the main extractor will run through the pre processing step

Now that we re-use the boundary characters, it does mean that the `class` in `<div class="…"></div>` is also extracted. This wasn't the case before, however there are languages where we do need this, e.g.: In Svelte you can do this: ```svelte <div class:px-4="bool"></div> ``` We pre-process this, such that it becomes: ```svelte <div class px-4="bool"></div> ``` But we still need to extract the `px-4` here. Best part is, if this `px-4` was just a `flex` instead, then it wouldn't have worked (due to hitting this current code path instead of the one for numbers). Added an additional test for: ```svelte <div class:flex="bool"></div> ```

thecrypticace · 2025-03-06T19:08:25Z

crates/oxide/src/extractor/named_utility_machine.rs

                            }
+
+                            // Still valid characters
+                            cursor.advance()


This is fine for now but we'll want to perf test this. I wouldn't be surprised if this implementation causes a bit of a hit

Yep will do some classification next, and I still have a WIP branch to get rid of the Cursor

thecrypticace · 2025-03-06T19:08:34Z

crates/oxide/src/extractor/named_utility_machine.rs

+                            } =>
+                            {
+                                return self.done(self.start_pos, cursor)
+                            }


ditto here too

This PR cleans up the boundary character checking by using similar classification techniques as we used for other classification problems. For starters, this moves the boundary related items to its own file, next we setup the classification enum. Last but not least, we removed `}` as an _after_ boundary character, and instead handle that situation in the Ruby pre processor where we need it. This means the `%w{flex}` will still work in Ruby files. --- This PR is a followup for #17001, the main goal is to clean up some of the boundary character checking code. The other big improvement is performance. Changing the boundary character checking to use a classification instead results in: Took the best score of 10 runs each: ```diff - CandidateMachine: Throughput: 311.96 MB/s + CandidateMachine: Throughput: 333.52 MB/s ``` So a ~20MB/s improvement. # Test plan 1. Existing tests should pass. Due to the removal of `}` as an after boundary character, some tests are updated. 2. Added new tests to ensure the Ruby pre processor still works as expected. --------- Co-authored-by: Jordan Pittman <jordan@cryptica.me>

RobinMalfait added 4 commits March 6, 2025 17:36

add assert_extract_candidates_contains helper

41eecaa

add failing test for twig syntax

84f3fe3

ensure }…{ are valid boundary characters

ea2eec5

cleanup unused import

7258503

RobinMalfait requested a review from a team as a code owner March 6, 2025 16:38

update changelog

609b0af

RobinMalfait marked this pull request as draft March 6, 2025 16:44

RobinMalfait added 4 commits March 6, 2025 19:22

add additional test case

f0ba1d2

RobinMalfait marked this pull request as ready for review March 6, 2025 19:04

thecrypticace reviewed Mar 6, 2025

View reviewed changes

crates/oxide/src/extractor/named_utility_machine.rs

} =>

{

return self.done(self.start_pos, cursor)

}

Copy link

Contributor

thecrypticace Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto here too

thecrypticace approved these changes Mar 6, 2025

View reviewed changes

RobinMalfait merged commit 57e91a6 into main Mar 6, 2025
5 checks passed

RobinMalfait deleted the fix/issue-16999 branch March 6, 2025 19:40

This was referenced Mar 6, 2025

Missing classes because of { boundary and new extractor #16999

Closed

Improve boundary classification #17005

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure `}` and `{` are valid boundary characters #17001

Ensure `}` and `{` are valid boundary characters #17001

RobinMalfait commented Mar 6, 2025 •

edited

Loading

thecrypticace Mar 6, 2025

RobinMalfait Mar 6, 2025

thecrypticace Mar 6, 2025

Ensure } and { are valid boundary characters #17001

Ensure } and { are valid boundary characters #17001

Conversation

RobinMalfait commented Mar 6, 2025 • edited Loading

Test plan

thecrypticace Mar 6, 2025

Choose a reason for hiding this comment

RobinMalfait Mar 6, 2025

Choose a reason for hiding this comment

thecrypticace Mar 6, 2025

Choose a reason for hiding this comment

Ensure `}` and `{` are valid boundary characters #17001

Ensure `}` and `{` are valid boundary characters #17001

RobinMalfait commented Mar 6, 2025 •

edited

Loading