Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced the String in Token with &'a str #559

Merged
merged 10 commits into from
Sep 11, 2024

Conversation

rben01
Copy link
Contributor

@rben01 rben01 commented Sep 9, 2024

Unfortunately due to borrow checker limitations, this required moving input fields out of both Parser and Tokenizer, as with the immutable borrow in place, there is no way to tell Rust that a mutable borrow won't touch the input. The underlying issue is that returning a Token<'_> that borrows from &self really trips up the borrow checker in a way that a non-borrowing Token doesn't.
So, the input is now provided as a shared ref argument to all the methods that used to refer to &self.input (now either a &str or a &[Token])
And there were a lot... a-lot-a-lot...
But now Token doesn't carry an owned String
Also shrunk Tokenizer by doing all math in terms of byte offsets into its input (using existing SourceCodePosition fields) instead of storing a separate Vec<char> with char indices

Also made ForeignFunction.name a &'static str instead of a String.

No new tests, but all existing tests pass

Unfortunately due to borrow checker limitations, this required moving `input` fields out of both `Parser` and `Tokenizer`, as with the immutable borrow in place, there is no way to tell Rust that a mutable borrow won't touch the input
So, the input is now an (immutable ref) argument to all the methods that used to refer to `&self.input`
And there were a lot...
But now `Token` doesn't carry an owned `String`
Also shrunk `Tokenizer` by doing all math in terms of byte offsets into its input (using existing `SourceCodePosition` fields) instead of storing a separate `Vec<char>` with char indices
… used on a newline and would've had no effect, it is semantically correct to keep the original `+= 1`)
…self.context.lock()` as the lock must be dropped to allow `self` to be borrowed in the interm)
@sharkdp
Copy link
Owner

sharkdp commented Sep 11, 2024

Really nice! As you saw in the TODO comment, I have been meaning to do this eventually.

I think this will also enable a lot of downstream optimizations if we push this even further (with things referring to the original source instead of cloning stuff).

Also shrunk Tokenizer by doing all math in terms of byte offsets into its input (using existing SourceCodePosition fields) instead of storing a separate Vec<char> with char indices

Cool! I read somewhere that production-grade parsers typically only keep byte offsets around, instead of using something like SourceCodePosition. And only if an error is shown to the user, then you do the actual work of computing line/position from the byte offset. This makes Spans much smaller (currently 32 byte). And Spans are everywhere.

So, the input is now provided as a shared ref argument to all the methods that used to refer to &self.input (now either a &str or a &[Token])

Yeah, it's not great. It's maybe not too bad either, so I will just merge your PR as is. Thank you very much for this valuable contribution!

@sharkdp sharkdp merged commit a28608d into sharkdp:master Sep 11, 2024
15 checks passed
@rben01
Copy link
Contributor Author

rben01 commented Sep 23, 2024

Cool! I read somewhere that production-grade parsers typically only keep byte offsets around, instead of using something like SourceCodePosition. And only if an error is shown to the user, then you do the actual work of computing line/position from the byte offset. This makes Spans much smaller (currently 32 byte). And Spans are everywhere.

Right, sounds like you could store just the byte offset in a Span, and separately store the cumsum of line lengths. Then to find where a byte offset goes, binary search to find the largest line length cumsum less than the byte offset — that's your line number. Position in the line is byte offset minus that cumsum.

This doesn't sound that bad, just annoying. Worth me giving it a shot? I suppose the questions is where to store the cumsum of line lengths. Probably in the same place the lines themselves are stored? (I don't actually know where that is, but obviously it exists because it's used to print error messages.)

@rben01 rben01 mentioned this pull request Sep 24, 2024
@rben01
Copy link
Contributor Author

rben01 commented Sep 25, 2024

Hey, it looks like only the removal of ctx.dimension_registry().clone() was merged, not the whole PR. Was this intentional? (Or am I misunderstanding GitHub’s UI? GitHub says my branch is 12 commits ahead of master.)

@sharkdp
Copy link
Owner

sharkdp commented Sep 25, 2024

Your branch is properly integrated (check out the commit history of this repo). I rebased your branch on top of master instead of creating a merge commit. The rebase creates new commits which are not identical (and have a different hash) compared to your local commits. This is probably why you see "N commits ahead of..".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants