Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add position to text pos #30

Open
faassen opened this issue Nov 14, 2024 · 3 comments
Open

add position to text pos #30

faassen opened this issue Nov 14, 2024 · 3 comments

Comments

@faassen
Copy link

faassen commented Nov 14, 2024

xmlparser errors return a TextPos, which Stream::gen_text_pos mentions is expensive to calculate. But I'd like to use a library like Ariadne or Miette to report parser errors. Both of these are offset based, i.e. a position.

I figured I'd just get the stream from the tokenizer when there's an error, using stream().pos(). I have to obtain the stream when there's an error, because stream() generates a copy, and I don't want a stale copy. But since Tokenizer is an iterator and is moved when I start iteration, I can't access it:

for token in tokenizer {
     match token {
        ...
       Err(e) => {
             let position = tokenizer.stream().pos(); // not allowed
        }
    }
}

The easiest way to fix this would be to simply retain the position in TextPos when it is calculated. Would you accept a PR that implements this?

For my use case, I'd actually prefer it if the TextPos calculation doesn't happen at all, as I don't need this to happen, but I don't see a way to do this without breaking backwards compatibilty.

Without this I think the only way forward is for me to write some really ugly code to calculate the position back from the text pos and the input as I don't see another way to obtain it.

@RazrFalcon
Copy link
Owner

You can simply use:

while let Some(token) = tokenizer.next() {
}

This way tokenizer will not be consumed.

@faassen
Copy link
Author

faassen commented Nov 15, 2024

Thanks, I should have realized that and I feel dumb now!

But after I changed that I realized that doing this in the error handler:

      let position = tokenizer.stream().pos();

is too late, as the stream already has moved on. I now have this:

let mut position;
loop {
   position = tokenizer.stream().pos();
   if let Some(token) = tokenizer.next() {
       match token {
            Ok(token) => ...,
            Err(e) => {
                 // use position here
             }
   }
}    

Obtaining a copy of the stream for each token so I can retain the position information just for the purposes of error handling means the error handling requirements have a performance impact on parsing overall, which isn't ideal but I can live with it.

I'm going to be bold and propose a design where Xot wouldn't incur the TextPos calculations but roxmltree can still have them:

  • xmlparser::Error only maintains position, and does no expensive TextPos calculation
  • in roxmltree, whenever an error with a TextPos needs to be produced, ctx is available so that ctx.doc.text could be used to construct TextPos.

Oh, after investigating all this I realized that xmlparser is an internal fork of roxmltree. Have they diverged or is the intent to keep the code identical? What are your plans for 'xmlparser' as a separate crate in general?

@RazrFalcon
Copy link
Owner

Why do you need tokenizer.stream().pos() to begin with, if each token has its position already?

As for any changes, I'm no longer maintaining this crate, as well as roxmltree. So no changes from me. It's up to the new maintainers to decide.

As for roxmltree, it has a modified fork of xmlparser which is a bit simpler and event/push based and not iterator/pull based like xmlparser. Which makes code simpler and faster.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants