-
-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for zero-copy parsing #9
Comments
Does this imply being able to do something like this? fn parse() -> impl Parser<char, &str, Error = ....> {} |
Yes, among other things. There are a few complications though: feeding lifetimes through is quite an intricate process, and it's going to require a fairly fundamental rework of the There are also complications with existing parsers: if One option I'm considering is a combinator that complements |
@zesterer Are you still looking for any help with the zero-copy implementation? I'd love to lend a helping hand. I just finished browsing the whole branch and it looks fantastic so far. Really interesting! |
@dark-fusion Definitely! I've been on a bit of a hiatus over the past few weeks (a mixture of burnout and a lot of work), but it's beginning to look like something that's of merge quality. I think it's probably time for me to write a list of things that need doing before merge (aside from waiting for GAT stabilisation). What sort of things were you interested in working on? |
I'm most interested in adding support for binary-encoded formats, however I also understand that it would involve quite a bit of work and may be best suited as an extension crate (at least for now). This was also already discussed in #48. While waiting for certain features to land on Rust's stable channel, I'm not sure where an extra set of hands would be most welcome. |
Hello hello :D I am looking to contribute more to chumsky, and am curious on if there are any open tasks that need to be worked on related to zero-copy parsing, and if so, what those might be! Thanks for your time! |
Hey, thanks for showing interest! There are a few things on the list, but it really depends on what you're interested in.
Any one or part of these would be really appreciated and would definitely help speed up the process of getting the branch ready for merge. I'd love to dedicated more time to it myself, but I'm going through quite a busy period in my life right now and have a bit less time to spare than I normally do :( |
Thanks for the list :D I will take a crack at |
Specs
IssueI am not sure what the exact issue is, but I don't know if I am up for any and all recommendations you have here, as I am kinda at a loss. I know you are busy, so there's no rush! Thanks for your time in advance. |
@Zij-IT The very long That aside, make sure to throw |
I mentioned #197 that I was able to get it to compile, but I figured I would show you some of the fruits of your labor :D Specs
Task
Results
I would say that is quite the achievement! I am curious how where there is still room for improvement and excited to see that number working its way down! |
That's quite a nice improvement! Not as much as I'd hoped though, unfortunately. I think there's still more to be done here. Are these just the numbers for the lexer alone, or the parser too? I've realised that the |
Its for both the lexer and the parser! Each of the 200ish files goes from |
Its me again, and with some fantastic news for me, and uncertain news for let power = atom
.clone()
.then(just(TokenKind::OpPow).to(BinaryOp::Power))
.repeated()
.then(atom.labelled("binary operand"))
.foldr(|(lhs, op), rhs| Expr::Binary {
lhs: Box::new(lhs),
rhs: Box::new(rhs),
op,
}); It parses EVERY test('fs chmod', function(expect)
local mode_async
local mode_sync
-- On Windows chmod is only able to manipulate read-only bit
-- TODO: test on windows
if is_windows then
mode_async = 256 --[[tonumber('0400', 8)]] -- read-only
mode_sync = 438 --[[tonumber('0600', 8)]] -- read-write
else
mode_async = 511 --[[tonumber('0777', 8)]]
mode_sync = 420 --[[tonumber('0644', 8)]]
end
local file1 = Path.join(__dirname, 'fixtures', 'a.lua')
local file2 = Path.join(__dirname, 'fixtures', 'a1.lua')
local function maskMode(mode, mask)
return bit.band(mode, mask or 511 --[[tonumber('0777',8)]])
end
fs.chmod(file1, mode_async, expect(function(err)
assert(not err)
if is_windows then
assert(maskMode(maskMode(fs.statSync(file1).mode), mode_async))
else
assert(mode_async == maskMode(fs.statSync(file1).mode))
end
-- TODO: accept mode in number
assert(fs.chmodSync(file1, mode_sync))
if is_windows then
assert(maskMode(maskMode(fs.statSync(file1).mode), mode_sync))
else
assert(mode_sync == maskMode(fs.statSync(file1).mode))
end
end))
fs.open(file2, 'a', tonumber('0666', 8), expect(function(err, fd)
assert(not err)
fs.fchmod(fd, mode_async, expect(function(err)
assert(not err)
if is_windows then
assert(maskMode(maskMode(fs.fstatSync(fd).mode), mode_async))
else
assert(mode_async == maskMode(fs.fstatSync(fd).mode))
end
-- TODO: accept mode in number
assert(fs.fchmodSync(fd, mode_sync))
if is_windows then
assert(maskMode(maskMode(fs.fstatSync(fd).mode), mode_sync))
else
assert(mode_sync == maskMode(fs.fstatSync(fd).mode))
end
fs.close(fd)
end))
end))
-- lchmod
if fs.lchmod then
local link = Path.join(__dirname, 'tmp', 'symbolic-link')
fs.unlinkSync(link)
fs.symlinkSync(file2, link)
fs.lchmod(link, mode_async, expect(function(err)
assert(not err)
p(fs.lstatSync(link).mode)
assert(mode_async == maskMode(fs.lstatSync(link).mode))
-- TODO: accept mode in number
fs.lchmodSync(link, string.format('%o', mode_sync))
assert(mode_sync == maskMode(fs.lstatSync(link).mode))
end))
end
end) There are of course as you know many ways to solve this problem; I just went ahead and used the |
Out of interest, how do just the lexing times compare? My suspicion is that lexing probably got a lot faster, but parsing didn't (because parsing already requires a lot of collecting into vectors, recursive data structures, etc. so there probably wasn't much that
Ah yes, this is quite a common pitfall. Chumsky doesn't have memoisation, so it'll eagerly attempt to parse both interpretations twice, which can rapidly become exponential! I'm quite surprised that |
I will give it a look when I can tomorrow, but I have only moved over the parser as of currently, so both lexers are using the non-zero-copy parser combinators, and are identical. Sorry for the confusion 😅 My branch naming scheme when working on my own stuff needs work. |
@Zij-IT Could you further elaborate? I do not really understand why this is the case. |
Here is the parser again to prevent you having to scroll up :D let power = atom
.clone()
.then(just(TokenKind::OpPow).to(BinaryOp::Power))
.repeated()
.then(atom.labelled("binary operand"))
.foldr(|(lhs, op), rhs| Expr::Binary {
lhs: Box::new(lhs),
rhs: Box::new(rhs),
op,
}); To move this onto more pseudocode and hopefully more understandable, here it is as EBNF
Now, the important thing here is that the left and right hand side of the
To accurate parse the above expression we have to repeat the
The problem is that in both steps 3 and 4 we have to parse For this tiny example, that is no issue. But in the Lua example I have above, some of the lines are nested in three lamda definitions, and because each expression will be parsed twice, every item within the third call will be parsed at a minimum 2^3 times. |
Heyyo :) Me again! Life's been hectic for the last couple months, but I am excited to start helping out again. Any issues you are currently hoping to get marked of the todo list for zero-copy that I can help with? |
Hey, hello again! I think a big one would be porting docs / text parsers over, and perhaps some of the newer parsers like I'm also considering something big (but fundamentally uncomplicated): switching out the |
If you can describe the tasks (or make a small example to replicate), I might try to play with it, trying to automate most of it with vim if it can help you 🤔 |
It's a relatively simple process. Everywhere you see |
This is now being developed in #82
The text was updated successfully, but these errors were encountered: