Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The vanilla client will allow non-utf8 strings #39

Closed
C4K3 opened this issue Mar 2, 2019 · 8 comments
Closed

The vanilla client will allow non-utf8 strings #39

C4K3 opened this issue Mar 2, 2019 · 8 comments

Comments

@C4K3
Copy link

C4K3 commented Mar 2, 2019

I'm dealing with a player .dat file containing a book with non-utf8 content, generated by the vanilla client.

I don't know when it's possible, but apparently the vanilla client may allow non-utf8 data.

It'd be nice to somehow allow parsing this anyway.

@atheriel
Copy link
Collaborator

atheriel commented Mar 5, 2019

That is... an extremely annoying feature. Is it possible that you could parse it as a byte array? Or does it need to masquerade as a string?

@DrHenchman
Copy link

This is because NBT uses Modifed UTF8, not standard UTF-8 for encoding strings. You can find details of it on the following links:

You can think of this as encoding a string using UTF16 and then each u16 is then encoded individually into UTF8. On top of this, null characters (U+0000) are encoded as two bytes (C0 80). This results in no null bytes in the encoded form.

There are some crates out there for modified UTF8, however I can't speak for the quality of them. I would also be happy to provide a functioning implementation directly to this crate however I am fairly new to Rust so it may not be the most efficient implementation.

@atheriel
Copy link
Collaborator

atheriel commented Jun 9, 2019

Thank you for an excellent summary, @DrHenchman. In my view, there a couple ways we could try to fix this, some of which you are referring to indirectly:

  1. Turn all strings in the API into byte vectors and require that users translate them into UTF-8 String types manually.
  2. Turn all strings in the API into a new, modified-UTF-8-supporting type; or
  3. Modify deserialization internally to handle the two-byte null encoding and parse the remainder into valid Rust strings (which must be UTF-8).

Of these, (3) is the only one that does not make the ergonomics of using the package significantly more painful. It would also be difficult to actually implement and may have fairly large performance implications.

But I do not think we should go down this path, for a few reasons.

The actual NBT specification explicitly gives the encoding of strings as "UTF-8", so it is Minecraft itself breaking the rules here. Of course, Minecraft is the implementation most users care about, but I have a strong suspicion that other existing NBT parsers/tools follow the spec and do not handle this case correctly, either. I looked at a few other popular parsers to confirm this.

Plus, I'm not sure we actually want to handle "string" data that explicitly embeds a null byte anyway. I cannot think of a legitimate reason player data would include it, and suspect that it is due to corruption, bizarre input error, or malice. Surely the correct thing to in @C4K3's case is remove the null byte from the input file.

As a result, I am going to close this as "won't fix" for now, although I am not 100% opposed to reversing this position if we encounter related problems in the future.

@DrHenchman
Copy link

Null bytes is one example of something which doesn't work and I totally get your justification for not handling it. However, there are plenty of other characters which will encode differently. If you don't modify the handling of null bytes, but still wish to be compatible with Minecraft https://en.wikipedia.org/wiki/CESU-8 is essentially what this is. The wiki page gives a great example of how this works.

Under the example section on this wiki page it actually gives a good example of how this differs. This is essentially for unicode characters which would normally be encoded as 4-bytes long in modified UTF-8 are encoded as 6-bytes.

Perhaps I misunderstand what this library is for. You indicated that it is a strict implementation of an NBT specification. I was under the impression it was created to decode Minecraft save files so if in doubt, between that Mojang has documented as the specification (I wouldn't be surprised if Mojang documented it incorrectly) and what Minecraft does, what Minecraft does is correct.

Of course, you are more than welcome to do whatever you wish with your crate.

@atheriel
Copy link
Collaborator

I completely misread the wiki article on modified UTF-8. I had thought it only treated null bytes as a special case, but I see now that it seems to treat all code points above 0xFFFF (the "basic multilingual plane") differently as well.

Incidentally, this means the real cause of @C4K3's strange input file is likely legitimate text outside the basic multilingual plane (perhaps in an obscure ancient African writing system), and not random null bytes as I had assumed.

Perhaps I misunderstand what this library is for. You indicated that it is a strict implementation of an NBT specification. I was under the impression it was created to decode Minecraft save files

To clarify: I was mostly complaining about the fact that they violated the original spec, not offering it up as a reason to avoiding fixing these problems. (And I remain extremely annoyed that Minecraft does this. The Unicode Consortium itself strongly discourages the use of CESU-8/MUTF-8, and yet Java uses it anyway.) But you are right to argue that making an effort to address this edge case is the correct thing to do. It looks like there is a nice crate for handling modified UTF-8 already. We should be able to incorporate it into the internal parser without too much trouble.

Honestly, I'm a bit surprised this hasn't surfaced more often with other popular NBT tools (including nbted, written by @C4K3), none of which seem to handle anything other than pure UTF-8.

@MightyPork
Copy link

MightyPork commented Jun 16, 2019

I looked at the cesu8 crate and tested it a bit, and it seems to be reasonably well made.
There's some interesting cases though:

  • ASCII and low UTF-8 are parsed efficiently using the std::str::from_utf8, with CESU-8 as a fallback.

As a consequence:

  • NUL byte is accepted in the input, alongside C0 80, both becoming \x00
  • Proper UTF-8 is accepted normally, alongside the CESU-8 surrogate encoding

This seems reasonable for NBT, as it will accept even improperly encoded strings from other tools (or NBT encoded by an older version of this crate)

Encoding always goes through the CESU-8 encoder, so \x00 becomes C0 80 and supplementary codepoints use the CESU-8 surrogate encoding. This should be compatible with Minecraft, but it's a change from the previous version of this crate.

If this is what you want and no work started yet, i'd be willing to try to make a PR for this.

@atheriel
Copy link
Collaborator

Yes, I think it makes sense to be permissive on input and correct on output. I've implemented the change in a small patch, 6da900b, and it seems to work correctly. Thanks for all of your contributions to this discussion.

@C4K3 Does your file decode correctly with the master branch?

@C4K3
Copy link
Author

C4K3 commented Jun 17, 2019

Yep, seems to have done it. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants