Replaced the `String` in `Token` with `&'a str` #559

rben01 · 2024-09-09T04:15:32Z

Unfortunately, due to borrow checker limitations, this required moving the `input` fields out of both `Parser` and `Tokenizer`: with the immutable borrow in place, there is no way to tell Rust that a mutable borrow won't touch the input. The underlying issue is that returning a `Token<'_>` that borrows from `&self` really trips up the borrow checker in a way that a non-borrowing `Token` doesn't.
So, the input is now provided as a shared ref argument to all the methods that used to refer to `&self.input` (now either a `&str` or a `&[Token]`).
And there were a lot... a-lot-a-lot...
But now `Token` doesn't carry an owned `String`.
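
For anyone reading along, here is a rough sketch of the pattern, using made-up, heavily simplified types (the `Token`, `Tokenizer`, `next_word`, and `tokenize` below are illustrative assumptions, not the real definitions in this PR). The point is only that the returned token borrows from the `input` argument rather than from `self`, so the `&mut self` borrow ends as soon as each call returns:

```rust
// Hypothetical, simplified types for illustration; not numbat's real definitions.
#[derive(Debug, PartialEq)]
struct Token<'a> {
    lexeme: &'a str,
}

// The tokenizer does not store the input; it only tracks a byte offset.
struct Tokenizer {
    current: usize,
}

impl Tokenizer {
    // The returned token borrows from `input` (lifetime 'a), not from `self`,
    // so the mutable borrow of `self` ends when the call returns.
    fn next_word<'a>(&mut self, input: &'a str) -> Option<Token<'a>> {
        let rest = &input[self.current..];
        let trimmed = rest.trim_start();
        if trimmed.is_empty() {
            return None;
        }
        // Byte offset of the first non-whitespace character.
        let start = self.current + (rest.len() - trimmed.len());
        // Length in bytes of the run up to the next whitespace (or end of input).
        let len = trimmed.find(char::is_whitespace).unwrap_or(trimmed.len());
        self.current = start + len;
        Some(Token { lexeme: &input[start..start + len] })
    }
}

fn tokenize(input: &str) -> Vec<Token<'_>> {
    let mut tokenizer = Tokenizer { current: 0 };
    let mut tokens = Vec::new();
    // Every iteration mutably borrows `tokenizer` while earlier tokens are still
    // alive in `tokens`. If the tokenizer owned its input and the signature were
    // `fn next_word(&mut self) -> Option<Token<'_>>`, each token would keep
    // `tokenizer` borrowed and this loop would be rejected.
    while let Some(token) = tokenizer.next_word(input) {
        tokens.push(token);
    }
    tokens
}
```

The cost is the one described above: every method that needs the source text has to take it as an extra parameter.
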
Also shrunk `Tokenizer` by doing all math in terms of byte offsets into its input (using existing `SourceCodePosition` fields) instead of storing a separate `Vec<char>` with char indices.
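
A minimal sketch of what "math in terms of byte offsets" can look like, assuming a hypothetical `Cursor` type and `identifier_spans` helper (neither is part of numbat; the real `Tokenizer` and `SourceCodePosition` are laid out differently). The cursor walks the `&str` directly and advances by each character's UTF-8 width, so no `Vec<char>` copy of the input is needed:

```rust
// Hypothetical cursor for illustration; the names are assumptions,
// not numbat's actual `Tokenizer` or `SourceCodePosition` layout.
struct Cursor {
    byte: usize, // byte offset into the input, always on a char boundary
}

impl Cursor {
    // Look at the next character without consuming it.
    fn peek(&self, input: &str) -> Option<char> {
        input[self.byte..].chars().next()
    }

    // Consume one character, advancing by its UTF-8 width rather than by 1.
    fn advance(&mut self, input: &str) -> Option<char> {
        let c = self.peek(input)?;
        self.byte += c.len_utf8();
        Some(c)
    }
}

// Returns the (start, end) byte ranges of every run of alphabetic characters.
fn identifier_spans(input: &str) -> Vec<(usize, usize)> {
    let mut cursor = Cursor { byte: 0 };
    let mut spans = Vec::new();
    while let Some(c) = cursor.peek(input) {
        if c.is_alphabetic() {
            let start = cursor.byte;
            while cursor.peek(input).map_or(false, char::is_alphabetic) {
                cursor.advance(input);
            }
            spans.push((start, cursor.byte));
        } else {
            cursor.advance(input);
        }
    }
    spans
}
```

Because the offsets always land on char boundaries, a range such as `&input[start..end]` can be sliced out of the original source without copying.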

Also made `ForeignFunction.name` a `&'static str` instead of a `String`.

No new tests, but all existing tests pass

sharkdp · 2024-09-11T18:14:11Z

Really nice! As you saw in the TODO comment, I have been meaning to do this eventually.

I think this will also enable a lot of downstream optimizations if we push this even further (with things referring to the original source instead of cloning stuff).

> Also shrunk `Tokenizer` by doing all math in terms of byte offsets into its input (using existing `SourceCodePosition` fields) instead of storing a separate `Vec<char>` with char indices

Cool! I read somewhere that production-grade parsers typically only keep byte offsets around, instead of using something like `SourceCodePosition`. Only when an error is shown to the user do you do the actual work of computing line/position from the byte offset. This makes `Span`s much smaller (currently 32 bytes). And `Span`s are everywhere.

> So, the input is now provided as a shared ref argument to all the methods that used to refer to `&self.input` (now either a `&str` or a `&[Token]`)

Yeah, it's not great. It's maybe not too bad either, so I will just merge your PR as is. Thank you very much for this valuable contribution!

rben01 · 2024-09-23T20:30:34Z

> Cool! I read somewhere that production-grade parsers typically only keep byte offsets around, instead of using something like `SourceCodePosition`. Only when an error is shown to the user do you do the actual work of computing line/position from the byte offset. This makes `Span`s much smaller (currently 32 bytes). And `Span`s are everywhere.

Right, sounds like you could store just the byte offset in a `Span`, and separately store the cumsum of line lengths. Then to find where a byte offset goes, binary search to find the largest line length cumsum less than or equal to the byte offset: that's your line number. Position in the line is byte offset minus that cumsum.

This doesn't sound that bad, just annoying. Worth me giving it a shot? I suppose the question is where to store the cumsum of line lengths. Probably in the same place the lines themselves are stored? (I don't actually know where that is, but obviously it exists because it's used to print error messages.)
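
Roughly what I have in mind, as an untested-against-numbat sketch: `LineIndex`, `line_starts`, and `line_col` are names I made up here, and I'm assuming lines are split on `\n` only. The "cumsum of line lengths" is stored as the byte offset at which each line starts, and the lookup is a binary search over that vector:

```rust
// Hypothetical helper; the name and layout are assumptions, not numbat's API.
struct LineIndex {
    // Byte offset at which each line starts; line_starts[0] is always 0.
    line_starts: Vec<usize>,
}

impl LineIndex {
    // One pass over the source, recording the offset just past every '\n'.
    fn new(source: &str) -> Self {
        let mut line_starts = vec![0];
        for (offset, byte) in source.bytes().enumerate() {
            if byte == b'\n' {
                line_starts.push(offset + 1);
            }
        }
        LineIndex { line_starts }
    }

    // Returns a zero-based (line, byte column) pair for a byte offset.
    fn line_col(&self, byte_offset: usize) -> (usize, usize) {
        // Binary search for the number of line starts <= byte_offset.
        // This is at least 1 because line_starts[0] == 0.
        let line = self
            .line_starts
            .partition_point(|&start| start <= byte_offset)
            - 1;
        (line, byte_offset - self.line_starts[line])
    }
}
```

The column here is a byte column; turning it into a character column for display would need one more pass over just that line, and only on the error path.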

rben01 · 2024-09-25T04:05:24Z

Hey, it looks like only the removal of ctx.dimension_registry().clone() was merged, not the whole PR. Was this intentional? (Or am I misunderstanding GitHub’s UI? GitHub says my branch is 12 commits ahead of master.)

sharkdp · 2024-09-25T06:27:23Z

Your branch is properly integrated (check out the commit history of this repo). I rebased your branch on top of master instead of creating a merge commit. The rebase creates new commits which are not identical (and have a different hash) compared to your local commits. This is probably why you see "N commits ahead of..".

rben01 added 10 commits September 8, 2024 00:45

Added .DS_Store to gitignore

7da0683

Replaced String with &'static str in ForeignFunction

21a3b6f

Removed // todo comment (we did the todo)

ecc5ae1

Fixed semantically incorrect use of char::len_utf8 (although it was used on a newline and would've had no effect, it is semantically correct to keep the original `+= 1`)

e2a096d

Renamed arg for clarity

d1a645c

Removed pointless .into() on &'static str into itself

6d7aed2

Fixed a handful of clippy "needless & on &str" warnings

4efa7a6

Fixed inadvertently removed trailing whitespaces whose removal broke tests

47b95c7

Removed clone of ctx.dimension_registry() (requires an additional `self.context.lock()` as the lock must be dropped to allow `self` to be borrowed in the interim)

d0e1ad1

sharkdp merged commit a28608d into sharkdp:master Sep 11, 2024
15 checks passed

rben01 mentioned this pull request Sep 24, 2024

Shrink span #578

Merged

sharkdp mentioned this pull request Oct 2, 2024

Improve startup time (regression) #525

Open

BrewTestBot mentioned this pull request Oct 11, 2024

numbat 1.14.0 Homebrew/homebrew-core#193820

Merged
