
Expose lexer, as a StAX parser #1219

Closed
jk-jeon opened this issue Aug 29, 2018 · 4 comments
Labels
state: needs more info (the author of the issue needs to provide more details)

Comments

jk-jeon commented Aug 29, 2018

This is a feature request.

What I want to do is find the position of an entry in a JSON file, where the entry is described by a JSON pointer. I think nlohmann/json already contains sufficient features to do this, but unfortunately most of them are hidden (namespace detail) or inaccessible (private).

For example, I want to see which tokens make up a given JSON pointer, but the only exposed interface of json_pointer reconstructs the original string that was used to build the pointer. I think it would be nice if there were token-based interfaces (builder, accessors, etc.), because besides actually accessing entries through a JSON object, the tokens are what we are usually interested in when we deal with JSON pointers.
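
For illustration, a minimal tokenizer along these lines, following the RFC 6901 escaping rules ("~1" stands for "/", "~0" for "~"), might look like the sketch below. The helper name pointer_tokens is made up for this example, not part of the library:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: split a JSON pointer into its unescaped reference
// tokens per RFC 6901 ("~1" -> "/", "~0" -> "~").
std::vector<std::string> pointer_tokens(const std::string& ptr)
{
    std::vector<std::string> tokens;
    if (ptr.empty())
        return tokens;  // "" refers to the whole document
    if (ptr.front() != '/')
        throw std::invalid_argument("JSON pointer must start with '/'");

    std::string token;
    for (std::size_t i = 1; i < ptr.size(); ++i)
    {
        if (ptr[i] == '/')  // token boundary
        {
            tokens.push_back(token);
            token.clear();
        }
        else if (ptr[i] == '~')  // escape sequence
        {
            if (i + 1 < ptr.size() && ptr[i + 1] == '0')
                token += '~';
            else if (i + 1 < ptr.size() && ptr[i + 1] == '1')
                token += '/';
            else
                throw std::invalid_argument("invalid escape in JSON pointer");
            ++i;  // skip the escape digit
        }
        else
        {
            token += ptr[i];
        }
    }
    tokens.push_back(token);  // final token (possibly empty)
    return tokens;
}
```

For example, pointer_tokens("/foo/m~0n") yields {"foo", "m~n"}.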

Though not convenient, it was not very difficult to write my own tokenizer for JSON pointers. But how about JSON documents? To find the position of a specific entry inside a JSON document, I need to parse the document, fighting against all the details of the JSON standard: Unicode handling, error handling, and so on. I don't need the resulting DOM JSON object (indeed, I can't afford one, because the document is quite big), and the SAX parser doesn't help either, because it discards byte-position information except for error handling. (Personally, I prefer StAX to SAX, and I believe that's a common stance.)
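
To make the limitation concrete: in the SAX interface, only the parse_error callback receives a byte position; none of the value callbacks do. A bare-bones handler looks roughly like this (a sketch only; depending on the library version, further overrides such as binary() may be required):

```cpp
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

using json = nlohmann::json;

// A do-nothing SAX handler. Note that only parse_error receives a byte
// position; every value callback arrives without one.
struct locator : nlohmann::json_sax<json>
{
    bool null() override { return true; }
    bool boolean(bool) override { return true; }
    bool number_integer(number_integer_t) override { return true; }
    bool number_unsigned(number_unsigned_t) override { return true; }
    bool number_float(number_float_t, const string_t&) override { return true; }
    bool string(string_t&) override { return true; }
    bool start_object(std::size_t) override { return true; }
    bool key(string_t&) override { return true; }
    bool end_object() override { return true; }
    bool start_array(std::size_t) override { return true; }
    bool end_array() override { return true; }
    bool parse_error(std::size_t position, const std::string& /*last_token*/,
                     const json::exception& ex) override
    {
        std::cerr << "error at byte " << position << ": " << ex.what() << '\n';
        return false;  // abort parsing
    }
};

int main()
{
    locator handler;
    json::sax_parse(R"({"a": [1, 2, 3]})", &handler);
}
```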

It seems that the complicated jobs (Unicode handling, number parsing, BOMs, escape syntax, input adapters, etc.) are all done by nlohmann::detail::lexer. The role of a parser is to take on those complicated jobs itself and let us focus on the things that actually interest us. Besides the loss of byte-position information (which is vital if we want to work with a large JSON file directly, without scanning or rewriting the whole file every time), the SAX parser included in nlohmann/json, in its current form, is not sufficient for that goal. All it offers is the ability to decide what to do when something comes up, which forces us to implement a fairly complicated state machine and callback functions full of complicated branching.
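
What I have in mind is roughly the following pull loop. This is only a hypothetical sketch: it assumes the lexer were exposed as, say, json::lexer_t, keeping the scan() and get_position() members that detail::lexer already has, and treats get_position() as a byte offset into the input:

```cpp
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

// Hypothetical: json::lexer_t, its constructor, and its token_type alias do
// not exist today; they stand in for an exposed version of detail::lexer.
void list_strings(const std::string& input)
{
    nlohmann::json::lexer_t lex(input);                 // hypothetical ctor
    using token = nlohmann::json::lexer_t::token_type;  // hypothetical alias

    for (token t = lex.scan(); t != token::end_of_input; t = lex.scan())
    {
        if (t == token::parse_error)
        {
            std::cerr << "parse error near byte " << lex.get_position() << '\n';
            return;
        }
        if (t == token::value_string)
        {
            // the caller decides, token by token, whether the value matters
            std::cout << "string token ending near byte "
                      << lex.get_position() << '\n';
        }
    }
}
```

No state machine, no callbacks: control stays with the caller, and the byte position is available at every step.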

I've roughly looked at the implementation details of nlohmann::detail::lexer, and it seems to provide sufficient features for me to write my own parser. The implementation is high-quality, the interface is well-structured and conforms well to the standard, and the code is well-documented. I like it!
Why not refine some interfaces and then expose it outside the namespace detail?

nlohmann (Owner) commented Sep 8, 2018

So you basically just want to have public access to the lexer interface?

jk-jeon (Author) commented Sep 10, 2018

@nlohmann Yes. AFAIK, RapidJSON also exposes the component corresponding to the lexer.
But I also have several suggestions.

  1. Currently, lexer does not provide a reliable way of getting the starting position of the token it read most recently. I think the starting position would be valuable information.
    (At least for my purpose, it was indeed useful. Since it is not directly provided by lexer, I cached the previous return value of get_position() and used it as the starting position of the next token; a sketch of this workaround appears after this list. This is not exactly correct, though, because whitespace between the last token and the current token is then counted as part of the current token.)
    It seems that the location information provided by lexer (the get_position() member function) is currently used outside only as part of the information passed to SAX error handlers. I expect the starting position of the most recent token would be quite useful for someone who wants to implement a sort of "automatic recovery", though I have no experience implementing such a thing.

  2. Currently, number decoding is always done right after a number token is read. This may incur unnecessary overhead for someone who doesn't care about the value of that token, so it might be good if decoding were done only when explicitly requested. I expect this is not a simple request and would require large breaking changes in the interface of lexer, because the decoder tries to produce a floating-point number from a seemingly-integer token when the number is too large. Consequently, it is impossible to predetermine the token type among value_unsigned, value_integer, and value_float (or even parse_error, for excessively large numbers), so the way the token types are divided would have to be addressed somehow if the suggested lazy decoding were adopted. Late detection of errors would be a particularly annoying problem, I think. I'm not sure lazy decoding is worth pursuing, given that parsing speed has not been a primary concern of nlohmann/json, but I'd deeply appreciate it if you could spend some time seriously reviewing this issue. The same discussion might apply to string tokens, though I'm even less confident about those.
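
For concreteness, the caching workaround from point 1 might look like the following, under the same hypothetical exposed-lexer types as the sketch in my first comment:

```cpp
#include <nlohmann/json.hpp>
#include <cstddef>

// Cache the position at which the previous token ended and treat it as the
// start of the next one. As noted above, this over-approximates: whitespace
// preceding the token gets counted as part of it.
void scan_with_start_positions(nlohmann::json::lexer_t& lex)  // hypothetical
{
    using token = nlohmann::json::lexer_t::token_type;

    std::size_t token_start = 0;
    for (token t = lex.scan(); t != token::end_of_input; t = lex.scan())
    {
        const std::size_t token_end = lex.get_position();
        // token t (plus any leading whitespace) lies in [token_start, token_end)
        token_start = token_end;
    }
}
```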

Thanks!

nlohmann (Owner) commented

I think I would need to see examples to better understand this.

nlohmann added the "state: needs more info" label on Oct 23, 2018
nlohmann (Owner) commented

I need an example to understand what should be done.

nlohmann closed this as completed on Nov 7, 2018