Using () to delimit objects breaks auto-url-detectors #16

Open
wmertens opened this issue Mar 28, 2017 · 40 comments

Comments

@wmertens

If you embed a jsurl result in a URL as the last component, you get something like http://example.com/foo?q=~(a~'test), and if you paste that somewhere, there's a good chance that only the part of the URL up to, but not including, the final ) is recognized.

One option is adding a final ~. Would that fix it?
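
To make the failure mode concrete, here is a minimal sketch of a naive auto-linker. The function `detectUrl` and its trailing-punctuation rule are hypothetical; real detectors differ, but many trim closing punctuation in a similar way.

```javascript
// Hypothetical linkifier, for illustration only: grab a run of
// non-whitespace after http(s)://, then trim trailing punctuation
// that usually belongs to the surrounding sentence.
function detectUrl(text) {
  const match = text.match(/https?:\/\/\S+/);
  if (!match) return null;
  return match[0].replace(/[).,;:!?'"]+$/, "");
}

// The final ")" is treated as sentence punctuation and dropped.
console.log(detectUrl("see http://example.com/foo?q=~(a~'test)"));
// With a trailing "~", this detector keeps the whole URL.
console.log(detectUrl("see http://example.com/foo?q=~(a~'test)~"));
```

Under this assumed rule the first call loses the closing parenthesis, while the second survives because ~ is not in the trimmed set.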

@bjouhier
Member

I could implement this in a v2 but the problem is that a string produced by a v2 will fail to parse with a v1 parser. So far I have resisted making changes because I did not want to break protocols that use jsurl.

@wmertens
Author

wmertens commented Mar 28, 2017 via email

@bjouhier
Member

Our situation is different because our app has several components that interact with jsurl and it is more difficult to move them all at once (especially as our components are deployed on-premise). So we need to preserve interop.

But I'm not opposed to fixing the issues with a v2. We should solve all the pending issues at once (encoded quote and trailing ~) so that we don't have to move again later.

@wmertens
Author

wmertens commented Mar 29, 2017 via email

@bjouhier
Member

I was thinking about less invasive changes. I would like to keep the parentheses. If we add a ~ at the end, do we still have a problem with parentheses?

I want the encoded string to be unaltered by encodeURIComponent (this was a strong requirement for v1). This limits the character set to ASCII letters, ASCII digits, and - _ . ! ~ * ' ( ) (uriUnescaped in https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3), and I would restrict it even further and eliminate '. This rules out characters like = / |.
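
The constraint can be checked directly with the standard encodeURIComponent function. This is only a small sketch; the constant and variable names are made up for the example.

```javascript
// The punctuation that encodeURIComponent leaves alone (uriUnescaped marks).
const UNESCAPED_MARKS = "-_.!~*'()";
const sample = "ABCxyz019" + UNESCAPED_MARKS;

// Letters, digits and the marks above pass through unchanged.
console.log(encodeURIComponent(sample) === sample); // true

// A v1-style string from this thread is also left untouched.
const v1Example = "~(a~'test)";
console.log(encodeURIComponent(v1Example) === v1Example); // true

// Characters outside the set get percent-encoded, which is why = / | are ruled out.
console.log(encodeURIComponent("=/|")); // %3D%2F%7C
```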

So I'm proposing the following changes:

  • add a ~ at the end, to keep the auto-url-detectors happy. This trailing char can also be used to distinguish between v1 and v2
  • replace ' by !, to avoid browser encoding.
  • maybe a few special * escapes. I like *_ for space, maybe *- for $ (frequent in object keys because it is valid in js identifiers) but I would not go much further because gain is small and result quickly becomes cryptic.

@wmertens
Author

wmertens commented Mar 31, 2017 via email

@wmertens
Author

One more optimization: collapse repeating final ~ into a single ~, and, to grab a value, search until ~ or end of string. Then the standard example becomes _name~John_Doe~age~42~children~.Mary~Bill~

@wmertens wmertens mentioned this issue Apr 1, 2017
10 tasks
@bjouhier
Member

bjouhier commented Apr 1, 2017

Lots of good ideas here but I want to understand why you want to get rid of parentheses. Lots of URLs have parentheses, and parentheses are a good visual clue for nested substructures.

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

More detailed comments:

  • all values terminate with ~ [OK]
  • true, false, null become -T~, -F~, -N~ [OK]
  • numbers start with - (+ digit) or a digit and end with ~ [OK]
  • strings start with alpha or * (the only extra non-unreserved character
    we use) and terminate with ~ [OK]
    • strings internally get space replaced by _ (common and very
      readable), * by **, _ by *_, ~ by *-, % by *. and any others we like [OK for space; others need discussion]
    • I don't think we need *XX and *XXXX encoding, that will be done by
      uriencoding whenever actually needed. Lots of common characters can be
      replaced by *+single char [KO: jsurl shouldn't rely on a uriencoding pass]
    • Empty string is *~ [OK, clever]
  • objects start with _, arrays start with ., both terminate with ~ [I'd like to keep parens, at least around objects]
    • object keys are encoded as strings, so no starting * needed, only * escaping is done [OK]
      • [1, 2] becomes .1~2~~
      • {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes
        _a~fo*.o~*_test~**_hm**h*-m~5~.1~-T~~~

@bjouhier
Member

bjouhier commented Apr 1, 2017

When would parentheses get escaped? They are uriUnescaped (but ' was too) and I have never seen them being escaped.

@bjouhier
Member

bjouhier commented Apr 1, 2017

There is a problem with strings starting with a number. How do you encode "0"?

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

We could keep ! too. Then I'd rather do the following:

  • true, false, null become T~, F~, N~ (shorter, and leading - felt strange).
  • strings start with !.

*0~ feels like a hack. What about "20"? It cannot be *20~ as this would be space. Is it *2*0~? That would be bad for us because we are passing decimal values as strings to avoid precision problems with JS numbers.

@bjouhier
Member

bjouhier commented Apr 1, 2017

Parentheses are not uriReserved, they are uriUnescaped.

@wmertens
Author

wmertens commented Apr 1, 2017

So the code works because there are only a limited number of possible characters at the beginning of a value. All cases are in the if clauses at https://github.com/wmertens/jsurl/blob/4ffcdea624eb29070bd6c44510e438b46799e986/lib/jsurl2.js#L71 - I tried to optimize for stringified length. So strings only start with * (or !) if they are not unambiguously strings.

Parentheses are in section 2.2 "Reserved Characters" https://tools.ietf.org/html/rfc3986#section-2.2 - although Wikipedia says that means they can be used. I must say, if I paste ! $ & ' ( ) * + , ; = in the URL bar in Chrome, only ' gets escaped, and behind a # none get escaped.

How about starting objects with ( but still terminating with ~?

@wmertens
Author

wmertens commented Apr 1, 2017

I must say, I really like the _ for space, it makes embedded spaces easy to read.

As for the URI encoding, I was reasoning thusly:

  • you have no control over URI encoding, and if it happens anyway, why not let the fast native functions do it? It can recover from it in any case.
  • If you let native handle it, then embedded unicode is readable in the address bar
  • It frees up escaped address space for other purposes; I'd rather escape common encoded chars in 2 chars instead of 3.

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

And we could omit the leading ! for object keys if the key starts with alpha.

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

Point taken about the generic URL RFC. I was referring to the specs for JS URL handling functions: https://www.ecma-international.org/ecma-262/5.1/#sec-15.1.3. I care most about the JS functions because that's what JS guys use to encode/decode.

I like _ for embedded space too.

OK for leaving non-ASCII chars as is instead of encoding with **. More compact and more readable.

I'd like to have the closing parenthesis at the end of objects too. The whole point is to trade a bit of compactness (one extra char at the end - wtf) for readability. Without it, it is very difficult to see where the object ends.

I had misunderstood the leading * in strings. I thought that it was the start of an escape sequence.

What about prefixing T, F and N by ! instead of -? I find the "- followed by digit" vs. "- followed by letter" rule a bit too hacky.

@bjouhier
Member

bjouhier commented Apr 1, 2017

Note: with this, a non empty object looks like (<...>~)~ and a non empty array like .<...>~~. So we have an unambiguous end marker for objects ()~) and arrays (~~).

And then we could use _T, _F and _N because _ is not reserved for object start any more.

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

Summary of revised proposal:

  • all values terminate with ~
  • true, false, null become _T~, _F~, _N~
  • numbers start with - (+ digit) or a digit and end with ~
  • strings start with alpha or * (the only extra non-unreserved character
    we use) and terminate with ~
    • strings internally get space replaced by _ (common and very
      readable), * by **, _ by *_, ~ by *-, % by *..
    • I don't think we need *XX and *XXXX encoding, that will be done by
      uriencoding whenever actually needed.
    • Empty string is *~
  • objects start with ( and end with )~
  • arrays start with ., and end with ~
  • object keys are encoded as strings, so no starting * needed, only * escaping is done
    - [1, 2] becomes .1~2~~
    - {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes
    (a~fo*.o~*_test~**_hm**h*-m~5~.1~_T~~)~

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

What about having arrays start with ~ rather than . and end with ~? As they usually follow another value, it gives them a nice ~~<...>~~ symmetry.

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

I too was thinking of dropping the ~ after ). Only gotcha is the url-auto-detector issue that started this whole thing 😄.

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

Summarizing one more time:

  • all values terminate with ~ or )
  • true, false, null become _T~, _F~, _N~
  • numbers start with - (+ digit) or a digit and end with ~
  • strings start with alpha or * (the only extra non-unreserved character
    we use) and terminate with ~
    • strings internally get space replaced by _ (common and very
      readable), * by **, _ by *_, ~ by *-, % by *..
    • chars that need escaping are embedded as is. URI percent encoding will take care of them.
    • empty string is *~
  • objects start with ( and end with )
  • arrays start with ~, and end with ~
  • object keys are encoded as strings, so no starting * needed, only * escaping is done
    • [1, 2] becomes ~1~2~~
    • {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes
      (a~fo*.o~*_test~**_hm**h*-m~5~~1~_T~~)

Closing characters (~ and )) could be dropped at the very end? This would solve the original problem but then parentheses are unbalanced.

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

Good point. It also broke the test on leading ~ to distinguish v1 and v2.

I find . a bit too difficult to spot visually. Why not start arrays with ! then?

@wmertens
Author

wmertens commented Apr 1, 2017 via email

@bjouhier
Member

bjouhier commented Apr 1, 2017

Getting there. Here it comes:

  • all values terminate with ~ or )
  • true, false, null become _T~, _F~, _N~
  • numbers start with - (+ digit) or a digit and end with ~
  • strings start with alpha or * and terminate with ~
    • strings internally get space replaced by _, * by **, _ by *_, ~ by *-, % by *..
    • chars that need URL escaping are embedded as is. URI percent encoding will take care of them.
    • empty string is *~
  • objects start with ( and end with )
  • arrays start with !, and end with ~
  • object keys are encoded as strings, so no starting * needed, only * escaping is done
  • closing characters (~ and )) may be dropped at the very end.

Regarding closing characters, the rule is a "may": stringify has an option to control whether they are emitted or not. The parser has no such option and accepts input with or without them.
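
The rules above can be sketched as a small encoder. This is an illustration only, not the jsurl2 code: the names `stringifySketch`, `encodeValue`, `escapeString` and the `short` option are hypothetical, numbers are simply passed through String(), and "alpha" is assumed to mean ASCII letters.

```javascript
// Escapes applied inside strings and object keys, per the list above.
const ESCAPES = { " ": "_", "*": "**", "_": "*_", "~": "*-", "%": "*." };

function escapeString(s) {
  return s.replace(/[ *_~%]/g, (ch) => ESCAPES[ch]);
}

function encodeValue(v) {
  if (v === true) return "_T~";
  if (v === false) return "_F~";
  if (v === null) return "_N~";
  if (typeof v === "number") return String(v) + "~";
  if (typeof v === "string") {
    // A leading "*" marks strings that do not start with a letter
    // (this also yields "*~" for the empty string).
    const prefix = /^[A-Za-z]/.test(v) ? "" : "*";
    return prefix + escapeString(v) + "~";
  }
  if (Array.isArray(v)) return "!" + v.map(encodeValue).join("") + "~";
  if (typeof v === "object") {
    // Keys are escaped like strings but never get the leading "*".
    const body = Object.entries(v)
      .map(([key, val]) => escapeString(key) + "~" + encodeValue(val))
      .join("");
    return "(" + body + ")";
  }
  throw new TypeError("unsupported value type: " + typeof v);
}

// `short` drops the closing characters (~ and )) at the very end.
function stringifySketch(value, { short = false } = {}) {
  const out = encodeValue(value);
  return short ? out.replace(/[~)]+$/, "") : out;
}

console.log(stringifySketch([1, 2]));                  // !1~2~~
console.log(stringifySketch([1, 2], { short: true })); // !1~2
console.log(stringifySketch({ a: "fo%o", _test: "_hm*h~m" }));
// (a~fo*.o~*_test~**_hm**h*-m~)
```

Two caveats about this sketch: JavaScript enumerates integer-like keys such as "5" before other keys, so key order in the output can differ from the order in an object literal; and a raw ) inside a string is not escaped by these rules, so the `short` form can be ambiguous for a value that ends with one.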

Examples:

  • [1, 2] becomes !1~2~~ or !1~2
  • {"a": "fo%o", "_test": "_hm*h~m", "5": [1, true]} becomes
    (a~fo*.o~*_test~**_hm**h*-m~5~!1~_T~~) or (a~fo*.o~*_test~**_hm**h*-m~5~!1~_T
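
A parser that accepts both forms of the examples above could look roughly like this. It is a sketch under the rules as summarized in this comment, not the jsurl2 parser; `parseSketch` and its helpers are hypothetical names, and error handling for malformed input is left out.

```javascript
function parseSketch(text) {
  let pos = 0;
  const UNESCAPES = { "*": "*", "_": "_", "-": "~", ".": "%" };

  // Read up to the next "~" (or the end of input) and step past the "~".
  function readRaw() {
    const end = text.indexOf("~", pos);
    const stop = end === -1 ? text.length : end;
    const raw = text.slice(pos, stop);
    pos = stop + 1;
    return raw;
  }

  // Undo the string escapes: "*x" pairs first, then a bare "_" means space.
  function unescapeString(raw) {
    return raw.replace(/\*(.)|_/g, (m, code) =>
      m === "_" ? " " : (code in UNESCAPES ? UNESCAPES[code] : code));
  }

  function parseValue() {
    const c = text[pos];
    if (c === "(") {
      pos++;
      const obj = {};
      // Stop at ")" or at end of input (closing characters may be missing).
      while (pos < text.length && text[pos] !== ")") {
        const key = unescapeString(readRaw());
        obj[key] = parseValue();
      }
      pos++; // step past ")" when present
      return obj;
    }
    if (c === "!") {
      pos++;
      const arr = [];
      while (pos < text.length && text[pos] !== "~") arr.push(parseValue());
      pos++; // step past the closing "~" when present
      return arr;
    }
    const raw = readRaw();
    if (raw === "_T") return true;
    if (raw === "_F") return false;
    if (raw === "_N") return null;
    if (/^[-0-9]/.test(raw)) return Number(raw);
    // Strings: drop the optional leading "*" marker, then unescape.
    return unescapeString(raw.startsWith("*") ? raw.slice(1) : raw);
  }

  return parseValue();
}

console.log(parseSketch("!1~2~~")); // [ 1, 2 ]
console.log(parseSketch("!1~2"));   // [ 1, 2 ]
console.log(parseSketch("(a~fo*.o~*_test~**_hm**h*-m~5~!1~_T"));
```

The sketch treats ) as the end of an object only where a key is expected, so a key that itself starts with ) would be misread under these rules.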

@wmertens
Author

wmertens commented Apr 2, 2017

Alright, I implemented this, look at the tests to see the results. I had to also escape () to allow unambiguous parsing of ), which also allowed me to drop the last ~ in objects.

@wmertens
Author

wmertens commented Apr 2, 2017

I also made that shortening optional. I wonder if we should not leave a terminal ~ at all times, or maybe make that optional too.

I like how an object with booleans now looks like (doFoo~~withBar~~meep)~

@bjouhier
Member

bjouhier commented Apr 2, 2017

Cool. I'll take a look but only tomorrow. Thanks.

@wmertens
Author

wmertens commented Apr 2, 2017

Well, this was fun. I'm extremely happy to report that on my test object in Chrome at least, v2 now outperforms native JSON for both parsing and stringifying 😁

performance.html:15 JSON: 200000 parsed in 731ms, 0.003655ms/item
performance.html:23 JSON: 200000 stringified in 448ms, 0.00224ms/item
performance.html:32 v1: 200000 parsed in 1337ms, 0.006685ms/item
performance.html:40 v1: 200000 stringified in 934ms, 0.00467ms/item
performance.html:49 v2: 200000 parsed in 601ms, 0.003005ms/item
performance.html:57 v2: 200000 stringified in 403ms, 0.002015ms/item
