Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Message parser should be able to support arbitrary whitespace such as '\n', '\t', '\r', and ' ' within and between messages #35

Closed
addievo opened this issue Sep 15, 2023 · 13 comments

Comments

@addievo
Copy link

addievo commented Sep 15, 2023

How do I configure the stream parser to be able to discard whitespace between JSON messages?

I am facing an issue during implementation of a RPC system, which requires usage of JSONParser to parse binary to JSON for transmitting thru RPC. The issue I am facing is that input streams could be separated by a variety of different whitespaces, and the current implementation of seperator in the lib only supports a single separator.

Assuming as such, we have to keep out input streams in the following manner :
{...message}{...message}.

However, to improve readability, we wish to be able to add whitespaces in between messages, in a similar manner as demonstrated below.

// Normal separation
{...message}{...message}{...message}

// Spaces
{...message} {...message}    {...message}

// New lines
{...message}
{...message}
{...message}

// Any combination
{...message}                                  {...message}
                       {...message}
{...message}
@addievo addievo changed the title Message parser can support arbitrary whitespace such as '\n' within and between RPC messages Message parser can support arbitrary whitespace such as '\n' within and between messages Sep 15, 2023
@addievo addievo changed the title Message parser can support arbitrary whitespace such as '\n' within and between messages Message parser should be able to support arbitrary whitespace such as '\n', '\t', '\r', and ' ' within and between messages Sep 15, 2023
@juanjoDiaz
Copy link
Owner

Hi @addievo,

There is a separator option that you can use to set whatever you want to be the separation between the JSON objects.
You could set the separator to /n, /n/n or whatever you want.

Would that work for you?

@addievo
Copy link
Author

addievo commented Sep 16, 2023

Hi @juanjoDiaz,

Thank you for your reply. While the separator option does exist, it appears limited to recognizing only one type of separator between JSON messages. In our use-case, we need the parser to accommodate a range of whitespace characters such as '\n', '\t', '\r', and ' ' as valid separators between messages. Is it possible to extend the functionality to support multiple types of separators?

@CMCDragonkai
Copy link

@juanjoDiaz is it possible to make separator option into a regex string? That way we could easily capture any kind of characters we expect to be whitespace?

@juanjoDiaz
Copy link
Owner

I think that regex is too flexible.
It will have a big performance impact for such a concrete use case.

I'm thinking of allowing an array of separators.
Since you can chain as many separators as you want, it will solve your case.

@juanjoDiaz
Copy link
Owner

I implemented this but the performance impact is too big for me to accept it.

So I need to think of a way to keep the current performance for people that is fine working with a single separator while allowing multiple separators for those that can accept the performance impact.

@CMCDragonkai
Copy link

CMCDragonkai commented Sep 20, 2023

Wouldn't the performance only be impacted if there were actual whitespace between the json strings? And also if I give an array of 1 character wouldn't that be the same as just specifying a single character?

And when you say we can chain whitespace, do you mean that if there is more than 1 whitespace character in the stream, it will auto-drop them all?

@addievo
Copy link
Author

addievo commented Sep 20, 2023

Hey @juanjoDiaz,

What about starting off with a quick check to see how many types separators are in the array? If it's just one, we could stick to the fast route we already have. If there's more, then we switch to the slower, but more flexible, multi-separator logic. This way, we get the best of both worlds without a big hit on performance. What do you think?

@CMCDragonkai
Copy link

Just to clarify the input stream may have 0 or more whitespace characters. The idea is for the stream parser to completely remove them all regardless of how many there is. So we cannot fix the number of expected whitespace separators like '\n\n'.

@juanjoDiaz
Copy link
Owner

Wouldn't the performance only be impacted if there were actual whitespace between the json strings? And also if I give an array of 1 character wouldn't that be the same as just specifying a single character?

During tokenization, we need to check for every character to see if it is a separator o no.
Checking a single character is very fast.
Checking a sequence of characters is a bit slower but still fast.
Checking against a list of possible characters or character sequences is also slower.

Considering that you might be streaming a million JSON object separated by a simple character, the overhead of checkign against the list piles up and becomes very noticeable.

And when you say we can chain whitespace, do you mean that if there is more than 1 whitespace character in the stream, it will auto-drop them all?

Yes. If your separator is "\n", any number of new lines will be skipped. (e.g.: \n\n\n\n).
There is a unit test specific for this.

The problem is with matching "\n" or "\s".

What about starting off with a quick check to see how many types separators are in the array? If it's just one, we could stick to the fast route we already have. If there's more, then we switch to the slower, but more flexible, multi-separator logic. This way, we get the best of both worlds without a big hit on performance. What do you think?

Yes. This is what I want to try next.

@addievo
Copy link
Author

addievo commented Sep 21, 2023

Hey @juanjoDiaz,

Thanks a bunch for your reply and assistance.
Looking forward to having the array based separator.

@KWLandry-acoustic
Copy link

Hi @juanjoDiaz ,
I think I might be running into something similar,
What about a 'Transform' callback where I could parse the chunk coming through in the read stream and 'transform' any whitespace or other characters as in chunk.replace(x,y) (or whatever would be appropriate for what's actually made available)
This places the performance hit on my code (with plenty of caveats in the documentation for this capability).
Of course, I would want access to the whole chunk or buffer about to be parsed not each character at a time.

@juanjoDiaz
Copy link
Owner

Hi @KWLandry-acoustic ,

You issue is not related.

I just realized that I'm an idiot and this feature was supported since the very beginning 🤦‍♂️

You just need to set the separator to the empty string '' and all whitespaces will be allowed as separator.

@KWLandry-acoustic
Copy link

@juanjoDiaz Excellent, thanks!
I do remember reading this somewhere, in the docs perhaps, but it did not translate, or actually I remember I ran into a problem with it and had to specify a separator,
Regardless, in any case, I'll set and test that,
Thanks,
KWL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants