Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(input): Add span support #61

Merged
merged 1 commit into from
Dec 27, 2022
Merged

feat(input): Add span support #61

merged 1 commit into from
Dec 27, 2022

Conversation

epage
Copy link
Collaborator

@epage epage commented Dec 27, 2022

There are two main crates that provide location support for nom

  • nom-locate: wide adoption for long time
  • pori: more recent, only really being used by one application

While pori is less mature, I think it iterates quite well on nom-locate; so let's break this down.

How to track location

  • pori: tracks how much input has advanced
    • nom-locate: same as above but also tracks the line number
  • leave initial input unchanged and only change begin/end indices, doing unchecked slicing of the input to get the current input
    • Note: this is fairly similar to pom whose input is the Input + location
  • keep initial and current input and use offset
  • keep start pointer from initial and current input

For now, I'm going with (initial, input) because it is relatively easy to implement and the bookkeeping costs are paid when doing location lookups. The downside is it takes up more space, especially if Located isn't the inner-most wrapper type. We could reduce this somewhat by using <I as IntoOutput>::Output for initial. If we make more assumptions on inner types, we could just capture the start pointer and do math with that when needed.

The problem with nom-locate tracking line numbers is it pessimizes performance. Knowing the line number is more of a UI concern and should be deferred until there (if the UI will even ask for it).

How to capture a span

  • pori combinators that returns (usize, usize)
  • nom-locate use the wrapper type
    • This makes any recognize, take, etc capture a span which is convenient

I was originally attracted to nom-locates approach but the downside is that this makes these types more awkward to work with when not capturing a span. I decided to instead use Parser::span and Parser::with_span combinators, similar to recognize and with_recognized. poris combinators returned (usize, usize), meaning (start, length) which could be confused with (start, end) with "end" being inclusive or not. To make this clearer, I used Range<usize>. Now how it works is embedded in the type system.

While the approach in this commit is a new design, it iterates on previous designs with an effort to simplify them down to their most basic parts. I feel fairly confident that this is something we can commit API stability to.

There are two main crates that provide location support for nom
- `nom-locate`: wide adoption for long time
- `pori`: more recent, only really being used by one application

While `pori` is less mature, I think it iterates quite well on
`nom-locate`; so let's break this down.

How to track location
- `pori`: tracks how much `input` has advanced
  - `nom-locate`: same as above but also tracks the line number
- leave initial input unchanged and only change begin/end indices, doing
  unchecked slicing of the input to get the current input
- keep initial and current input and use offset
- keep start pointer from initial and current input

For now, I'm going with (initial, input) because it is relatively easy
to implement and the bookkeeping costs are paid when doing location
lookups.  The downside is it takes up more space, especially if
`Located` isn't the inner-most wrapper type.  We could reduce this
somewhat by using `<I as IntoOutput>::Output` for `initial`.  If we make
more assumptions on inner types, we could just capture the start pointer
and do math with that when needed.

The problem with nom-locate tracking line numbers is it
pessimizes performance.  Knowing the line number is more of a UI concern
and should be deferred until there (if the UI will even ask for it).

How to capture a span
- `pori` combinators that returns `(usize, usize)`
- `nom-locate` use the wrapper type
  - This makes any `recognize`, `take`, etc capture a span which is
    convenient

I was originally attracted to `nom-locate`s approach but the downside is
that this makes these types more awkward to work with when **not**
capturing a span.  I decided to instead use `Parser::span` and
`Parser::with_span` combinators, similar to `recognize` and
`with_recognized`.  `pori`s combinators returned `(usize, usize)`,
meaning (start, length) which could be confused with (start, end) with
"end" being inclusive or not.  To make this clearer, I used
`Range<usize>`.  Now how it works is embedded in the type system.

While the approach in this commit is a new design, it iterates on
previous designs with an effort to simplify them down to their most
basic parts.  I feel fairly confident that this is something we can
commit API stability to.
@coveralls
Copy link

Pull Request Test Coverage Report for Build 3789359607

  • 0 of 112 (0.0%) changed or added relevant lines in 3 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-2.01%) to 56.221%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/parser.rs 0 2 0.0%
src/combinator/mod.rs 0 14 0.0%
src/input.rs 0 96 0.0%
Files with Coverage Reduction New Missed Lines %
src/character/complete.rs 1 79.58%
Totals Coverage Status
Change from base Build 3789169739: -2.01%
Covered Lines: 1821
Relevant Lines: 3239

💛 - Coveralls

@epage
Copy link
Collaborator Author

epage commented Dec 30, 2022

Stateful<I, ()> has negligible performance impact.

Located<I> takes about a 30% parsing hit

@epage
Copy link
Collaborator Author

epage commented Dec 30, 2022

I just implemented pori style Located and it seems be 30% slower than what is implemented (#63)

I also tested Stateful with usize, (usize, usize), and (usize, usize, usize) and it mirrored the Located<I> performance and then continued to get slower from there.

@epage
Copy link
Collaborator Author

epage commented Dec 30, 2022

#64 shows that nom-locate is even slower.

@epage
Copy link
Collaborator Author

epage commented Dec 30, 2022

I suspect the only way to capture back some of the performance is if we switch from the input being returned to it being a &mut Input parameter. This would reduce the size of the return type which I suspect is the biggest slow down.

@epage
Copy link
Collaborator Author

epage commented Dec 30, 2022

Interesting, pom takes the approach of Parser inpuits being a (&str, usize) and the return type includes usize, so you are constantly indexing into the input but have to pass the original input around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants