Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XDR strings may not be valid utf8 #2022

Closed
tamirms opened this issue Dec 6, 2019 · 13 comments
Closed

XDR strings may not be valid utf8 #2022

tamirms opened this issue Dec 6, 2019 · 13 comments

Comments

@tamirms
Copy link
Contributor

tamirms commented Dec 6, 2019

XDR strings are defined as byte sequences where the first byte is the length of the string and the remaining bytes are the contents of the string.

https://docs.oracle.com/cd/E18752_01/html/816-1435/xdrproto-31244.html#xdrproto-38

We interpret XDR strings as unicode in horizon, the golang sdk, and the Java sdk. However, not all sequences of bytes are valid unicode strings.

This transaction https://horizon-testnet.stellar.org/transactions/213d52180f5e74fc4e4bc2c86740a44f98a40ab31e6718cc032e95f9084ae535 contains a memo field which is not a valid unicode string.

The problematic memo causes this issue in the Java SDK: lightsail-network/java-stellar-sdk#257

Also, the memo included in the horizon response does not match the memo you get when parsing the xdr envelope using the golang libraries:

func TestMemo(t *testing.T) {
	envelopeXDR := "AAAAAM6jLgjKjuXxWkir4M7v0NqoOfODXcFnn6AGlP+d4RxAAAAAZAAIiE4AAAABAAAAAAAAAAEAAAAcyKMl+WDSzuttWkF2DvzKAkkEqeSZ4cZihjGJEAAAAAEAAAAAAAAAAQAAAAAgECmBaDwiRPE1z2vAE36J+45toU/ZxdvpR38tc0HvmgAAAAAAAAAAAJiWgAAAAAAAAAABneEcQAAAAECeXDKebJoAbST1T2AbDBui9K0TbSM8sfbhXUAZ2ROAoCRs5cG1pRvY+ityyPWFEKPd7+3qEupavkAZ/+L7/28G"
	envelope := xdr.TransactionEnvelope{}
	err := xdr.SafeUnmarshalBase64(envelopeXDR, &envelope)
	if err != nil {
		t.Fatal(err)
	}

	memo, _ := envelope.Tx.Memo.GetText()
	response, err := http.DefaultClient.Get(
		"https://horizon-testnet.stellar.org/transactions/213d52180f5e74fc4e4bc2c86740a44f98a40ab31e6718cc032e95f9084ae535",
	)
	if err != nil {
		t.Fatal(err)
	}
	defer response.Body.Close()
	bodyBytes, err := ioutil.ReadAll(response.Body)
	if err != nil {
		t.Fatal(err)
	}
	tx := protocol.Transaction{}
	err = json.Unmarshal(bodyBytes, &tx)
	if err != nil {
		t.Fatal(err)
	}
	// the test fails here
	if tx.Memo != memo {
		t.Fatalf("'%s' != '%s'", tx.Memo, memo)
	}
}

Another issue with this transaction is that if you take the envelope XDR and decode it using the java SDK, then encode the result back to XDR, you will have a different XDR value than what you started out with:

    @Test
    public void testRoundtrip() throws IOException {
        String txBody = "AAAAAM6jLgjKjuXxWkir4M7v0NqoOfODXcFnn6AGlP+d4RxAAAAAZAAIiE4AAAABAAAAAAAAAAEAAAAcyKMl+WDSzuttWkF2DvzKAkkEqeSZ4cZihjGJEAAAAAEAAAAAAAAAAQAAAAAgECmBaDwiRPE1z2vAE36J+45toU/ZxdvpR38tc0HvmgAAAAAAAAAAAJiWgAAAAAAAAAABneEcQAAAAECeXDKebJoAbST1T2AbDBui9K0TbSM8sfbhXUAZ2ROAoCRs5cG1pRvY+ityyPWFEKPd7+3qEupavkAZ/+L7/28G";
        BaseEncoding base64Encoding = BaseEncoding.base64();
        byte[] bytes = base64Encoding.decode(txBody);

        TransactionEnvelope transactionEnvelope = TransactionEnvelope.decode(new XdrDataInputStream(new ByteArrayInputStream(bytes)));
        ByteArrayOutputStream byteOutputStream = new ByteArrayOutputStream();

        transactionEnvelope.encode(new XdrDataOutputStream(byteOutputStream));
        String serialized = base64Encoding.encode(byteOutputStream.toByteArray());
        // fails here
        assertEquals(serialized, txBody);
    }

Any ideas on what we can to deal with this problem?

@MonsieurNicolas
Copy link
Contributor

only clients that try to render those strings should try to decode them as utf8. see stellar/stellar-core#2310 on background

@leighmcculloch
Copy link
Member

leighmcculloch commented Dec 6, 2019

The XDR specification actually requires text to be encoded as ASCII but it seems we normalized on UTF-8, and so have others, e.g. the davecgh/go-xdr package we forked.

@tamirms Do you know what in the Go SDK is causing the string to be changed? Go shouldn't make any changes to the bytes when converting a []byte to a string. Even though all strings in Go are UTF-8, the bytes should remain intact through casting in that language at least.

We can fix the Java SDK decoding and re-encoding error by not round-tripping the byte[] through a String. I made a suggestion about that here: lightsail-network/java-stellar-sdk#257 (comment).


Could you post a full example demonstrating the issue? I can't figure out what all the imports are, e.g. protocol. It would be helpful to use https://play.golang.org or https://repl.it.

@tomerweller
Copy link
Contributor

tomerweller commented Dec 7, 2019

So it's pretty clear that we've been accepting arbitrary bytes which means the MEMO_TEXT field meant MEMO_BINARY.

The two viable solutions that I see here are:

  1. Modify our XDR libs to represent Strings as byte-arrays and have special behavior for encoding/decoding un-renderable bytes (non utf8). This will probably have the least effect on consumers but is pretty off-spec for XDR and thus incompatible with third party XDR libs. (//UPDATE: according to dm other xdr implementations also allow arbitrary bytes in string types)

  2. Modify our core XDR definitions to match the current behavior and spec. This means migrating the type from an XDR String type to an XDR Variable Length Opaque type. This is completely backwards compatible on the wire but will break consumers of the generated code.

I prefer the latter and believe the damage can be contained as most consumers use SDK utilities wrt memos, not the generated code (not sure though). SDK utils can be fixed when the xdr code is regenerated to allow for an (almost) smooth backwards compatible experience.

@bartekn
Copy link
Contributor

bartekn commented Dec 9, 2019

This transaction https://horizon-testnet.stellar.org/transactions/213d52180f5e74fc4e4bc2c86740a44f98a40ab31e6718cc032e95f9084ae535 contains a memo field which is not a valid unicode string.

Horizon actually removes all invalid UTF-8 byte sequences during ingestion because Postgres does not support inserting non-UTF-8 strings in character fields: https://github.com/stellar-deprecated/horizon/blame/master/src/github.com/stellar/horizon/db2/core/transaction.go#L57

The PR changing this is here: stellar-deprecated/horizon#353. I think it was done to support ingestion of txs with invalid UTF-8 strings because Postgres didn't accept it. Obviously not the perfect solution but I think the only alternative was a schema change devs wanted to avoid.

@tamirms
Copy link
Contributor Author

tamirms commented Dec 9, 2019

@tomerweller I prefer solution 2. Solution 1 would require us to update https://github.com/stellar/xdrgen to fix how strings are encoded / decoded. Anything requiring us to update https://github.com/stellar/xdrgen is not ideal

@ire-and-curses
Copy link
Member

I also prefer solution 2. In the Go SDK, XDR string memos are abstracted:

type Memo struct {
	Type    MemoType
	Text    *string `xdrmaxsize:"28"`
	Id      *Uint64
	Hash    *Hash
	RetHash *Hash
}

So I think clients will be insulated from the change.

@tomerweller
Copy link
Contributor

Horizon actually removes all invalid UTF-8 byte sequences during ingestion because Postgres does not support inserting non-UTF-8 strings

@bartekn This sounds pretty bad. It means that Memo ingestion is lossy.

@ire-and-curses
Copy link
Member

ire-and-curses commented Dec 9, 2019

@bartekn This sounds pretty bad. It means that Memo ingestion is lossy.

Well, it means that for the (presumably small) subset of non-UTF8 memos, submitting something not technically supported by either the XDR spec or current SDK implementations leads to Horizon displaying them in a mangled way. The original data presumably lives untouched on the blockchain. I don't love it, but I'm also not sure it's very important?

@leighmcculloch
Copy link
Member

Anything requiring us to update https://github.com/stellar/xdrgen is not ideal

@tamirms Why are we avoiding changing xdrgen? Changing the generated code in the Java SDK means any future time we regenerate the code we lose these changes. RE: lightsail-network/java-stellar-sdk#259 (comment)

@tamirms
Copy link
Contributor Author

tamirms commented Dec 9, 2019

@leighmcculloch did you see tomer's comment?

Modify our core XDR definitions to match the current behavior and spec. This means migrating the type from an XDR String type to an XDR Variable Length Opaque type. This is completely backwards compatible on the wire but will break consumers of the generated code.

@leighmcculloch
Copy link
Member

leighmcculloch commented Dec 9, 2019

Just a misunderstanding on my part, resolved here: lightsail-network/java-stellar-sdk#259 (comment).

@stanford-scs
Copy link
Contributor

I'm strongly opposed to modifying the XDR to convert all strings to opaque, as this throws away valuable semantic information about the protocol, and will make debugging harder.

While technically RFC4506 says a string is ASCII bytes, I have never seen an implementation that enforced this. Moreover, NFS2 and NFS3 allow non-ASCII characters in filenames represented as strings, so we have a pretty solid precedent here. I have seen implementations that do not work with 0-valued bytes in strings, even though NUL is a valid ASCII character, just because C uses NUL-terminated strings, but this is clearly not correct. I think the current behavior is acceptable, namely to consider a string to be an array of bytes without specifying the particular encoding.

The problems people are experiencing are really with language-specific representations of XDR, and should be fixed in the compiler and/or runtime.

tamirms added a commit to lightsail-network/java-stellar-sdk that referenced this issue Dec 13, 2019
The Java SDK assumes that all memo texts must be valid UTF 8 strings. However, that assumption is not valid. It turns out that any sequence of bytes is allowed to appear in the memo field (see stellar/go#2022 ).

Therefore, the correct representation for the memo field is a byte array.
@tamirms
Copy link
Contributor Author

tamirms commented Jan 17, 2020

Closing this issue because we opted not to change the type of all XDR string fields to opaque. Instead, the Java XDR generator has been modified to represent XDR strings as byte arrays: stellar/xdrgen#48

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants