Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix encoder #2050

Closed
wants to merge 6 commits into from
Closed

Fix encoder #2050

wants to merge 6 commits into from

Conversation

Qiao-Jin
Copy link
Contributor

@Qiao-Jin Qiao-Jin commented Nov 5, 2020

@erikzhang
Copy link
Member

Unsafe?

@erikzhang
Copy link
Member

We need it to be deterministic and unambiguous. Otherwise, the consistency of each implementation cannot be maintained.

@Qiao-Jin
Copy link
Contributor Author

Qiao-Jin commented Nov 5, 2020

Unsafe?

Yes, with JavaScriptEncoder.UnsafeRelaxedJsonEscaping Property some special characters such as + will not be neglected during parsing. Its functionaity can be found here

@Qiao-Jin
Copy link
Contributor Author

Qiao-Jin commented Nov 5, 2020

We need it to be deterministic and unambiguous. Otherwise, the consistency of each implementation cannot be maintained.

Currently parsing a JObject is not deterministic, i.e. a "+" JString will be parsed to "\u002B"("\u002B" will still be parsed to "\u002B").

With this PR, "+" will be parsed to "+" and "\u002B" will also be parsed to "+".

(I assume we don't have cases where input strings includes "\u...."? If so we should recognize it as corresponding character though.)

@shargon
Copy link
Member

shargon commented Nov 5, 2020

@roman-khimov
Copy link
Contributor

roman-khimov commented Nov 5, 2020

It's not easy to tell all the side-effects of this change, but at the same time we don't have neo-project/neo-modules#375 problem at all even now:

$ curl -d '{ "jsonrpc": "2.0", "id": 5, "method": "invokefunction", "params": ["0x254b9decd76080ef368e7a6b0a065938dfbc31cf", "name", []] }' localhost:20332
{"id":5,"jsonrpc":"2.0","result":{"state":"FAULT","gasconsumed":"1007390","script":"EMAMBG5hbWUMFM8xvN84WQYKa3qONu+AYNfsnUslQWJ9W1I=","stack":[],"exception":"error encountered at instruction 30 (SYSCALL): failed to invoke syscall: contract not found"}}

(I don't have this contract, obviously, but the script generated has + inside) and this quick test also works fine for me:

--- a/pkg/core/interop/json/json_test.go
+++ b/pkg/core/interop/json/json_test.go
@@ -58,6 +58,13 @@ func TestSerialize(t *testing.T) {
        })
        t.Run("Deserialize", func(t *testing.T) {
                t.Run("Good", getTestFunc(deserializeID, []byte("42"), 42))
+               t.Run("HěnHǎo", getTestFunc(deserializeID, []byte("{\"测试\":   true}"),
+                       stackitem.NewMapWithValue([]stackitem.MapElement{
+                               {
+                                       Key:   stackitem.Make("测试"),
+                                       Value: stackitem.Make(true),
+                               },
+                       })))
                t.Run("Bad", getTestFunc(deserializeID, []byte("{]"), nil))
        })
 }

@Qiao-Jin
Copy link
Contributor Author

Qiao-Jin commented Nov 6, 2020

C# and Go are different in encoding: https://medium.com/swlh/utf-8-encoding-in-go-14b459564ccd

@Qiao-Jin
Copy link
Contributor Author

Qiao-Jin commented Nov 9, 2020

Ping

Copy link
Member

@erikzhang erikzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must completely determine the logic of the encoder, so as to ensure the consistency of all implementations. Otherwise, we have to create a custom encoder. Also, I don't like unsafe.

@roman-khimov
Copy link
Contributor

I'd rather merge it as is, because it solves the problem and the behavior is documented. It's "unsafety" is also documented well:

Use the unsafe encoder only when it's known that the client will be interpreting the resulting payload as UTF-8 encoded JSON. For example, you can use it if the server is sending the response header Content-Type: application/json; charset=utf-8. Never allow the raw UnsafeRelaxedJsonEscaping output to be emitted into an HTML page or a <script> element.

I think we're perfectly fine with this.

@erikzhang
Copy link
Member

Unlike the Default encoding (which only allows UnicodeRanges.BasicLatin), using this encoder instance allows UnicodeRanges.All to go through unescaped.

UnicodeRanges.All means,

Gets a range that consists of the entire Basic Multilingual Plane (BMP), from U+0000 to U+FFFF).

I don't know, is it safe for us to allow from U+0000 to U+FFFF to go through unescaped?

@shargon shargon self-requested a review November 12, 2020 08:33
@roman-khimov
Copy link
Contributor

I don't know, is it safe for us to allow from U+0000 to U+FFFF to go through unescaped?

It's going to be valid UTF-8 JSON in the end, so as long as

it's known that the client will be interpreting the resulting payload as UTF-8 encoded JSON

I don't think it's a problem.

@erikzhang
Copy link
Member

It's going to be valid UTF-8 JSON in the end, so as long as

Have you considered those control characters?

@roman-khimov
Copy link
Contributor

Mimicking JObject serialization (yes, 0x20 is not in control range):

using System;
using System.IO;
using System.Text.Encodings.Web;
using System.Text.Json;

public class Program
{
        public static void Main()
        {
                string controls = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20";
                MemoryStream ms = new MemoryStream();
                Utf8JsonWriter writer = new Utf8JsonWriter(ms, new JsonWriterOptions
            {
                SkipValidation = true,
                Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
            });
                writer.WriteStringValue(controls);
                writer.Flush();
                byte[] res = ms.ToArray();
                string result = System.Text.Encoding.UTF8.GetString(res);
                Console.WriteLine(result);
        }
}

Output:

"\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000B\f\r\u000E\u000F\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001A\u001B\u001C\u001D\u001E\u001F "

For the fun of it, Go version:

package main

import (
        "encoding/json"
        "fmt"
        "os"
)

func main() {
        var controls = "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20"
        b, err := json.Marshal(controls)
        if err != nil {
                fmt.Println(err)
                os.Exit(1)
        }
        fmt.Println(string(b))
}

Output:

"\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\u0008\t\n\u000b\u000c\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f "

That turns out to be a little less fun than was intended as upper/lower case and escapings differ a bit...

@Tommo-L
Copy link
Contributor

Tommo-L commented Nov 13, 2020

Here is the python result

import json
p = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f\x20'
print(json.dumps(p))

output:

\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000b\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f

It's the same result with C#.

@roman-khimov
Copy link
Contributor

It's the same result with C#.

Except it's not exactly the same. \u001F and \u001f are very different if you're to store the result of this into contract's storage and then compute state hash over it. While it's not a big deal for JObject (to serialize data for JSON API communication, for example), it's a problem for System.Json.Serialize, so we can introduce an additional constraint there and refuse to serialize data containing control characters except '\n' and '\t' (and '\r' maybe).

@roman-khimov
Copy link
Contributor

And to make it even more fun try replacing controls with \xff \xd0 (invalid UTF-8).

C#: "ÿ Ð"
Go: "\ufffd \ufffd"
Python 2: fails with UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Python 3: "\u00ff \u00d0"

@Qiao-Jin Qiao-Jin closed this Nov 26, 2020
@erikzhang erikzhang mentioned this pull request Feb 7, 2021
@Qiao-Jin
Copy link
Contributor Author

Qiao-Jin commented Feb 8, 2021

Open this again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JObject is encoded by unicode which cause ToJson() often shows \u002
5 participants