-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The vanilla client will allow non-utf8 strings #39
Comments
That is... an extremely annoying feature. Is it possible that you could parse it as a byte array? Or does it need to masquerade as a string? |
This is because NBT uses Modifed UTF8, not standard UTF-8 for encoding strings. You can find details of it on the following links:
You can think of this as encoding a string using UTF16 and then each u16 is then encoded individually into UTF8. On top of this, null characters (U+0000) are encoded as two bytes (C0 80). This results in no null bytes in the encoded form. There are some crates out there for modified UTF8, however I can't speak for the quality of them. I would also be happy to provide a functioning implementation directly to this crate however I am fairly new to Rust so it may not be the most efficient implementation. |
Thank you for an excellent summary, @DrHenchman. In my view, there a couple ways we could try to fix this, some of which you are referring to indirectly:
Of these, (3) is the only one that does not make the ergonomics of using the package significantly more painful. It would also be difficult to actually implement and may have fairly large performance implications. But I do not think we should go down this path, for a few reasons. The actual NBT specification explicitly gives the encoding of strings as "UTF-8", so it is Minecraft itself breaking the rules here. Of course, Minecraft is the implementation most users care about, but I have a strong suspicion that other existing NBT parsers/tools follow the spec and do not handle this case correctly, either. I looked at a few other popular parsers to confirm this. Plus, I'm not sure we actually want to handle "string" data that explicitly embeds a null byte anyway. I cannot think of a legitimate reason player data would include it, and suspect that it is due to corruption, bizarre input error, or malice. Surely the correct thing to in @C4K3's case is remove the null byte from the input file. As a result, I am going to close this as "won't fix" for now, although I am not 100% opposed to reversing this position if we encounter related problems in the future. |
Null bytes is one example of something which doesn't work and I totally get your justification for not handling it. However, there are plenty of other characters which will encode differently. If you don't modify the handling of null bytes, but still wish to be compatible with Minecraft https://en.wikipedia.org/wiki/CESU-8 is essentially what this is. The wiki page gives a great example of how this works. Under the example section on this wiki page it actually gives a good example of how this differs. This is essentially for unicode characters which would normally be encoded as 4-bytes long in modified UTF-8 are encoded as 6-bytes. Perhaps I misunderstand what this library is for. You indicated that it is a strict implementation of an NBT specification. I was under the impression it was created to decode Minecraft save files so if in doubt, between that Mojang has documented as the specification (I wouldn't be surprised if Mojang documented it incorrectly) and what Minecraft does, what Minecraft does is correct. Of course, you are more than welcome to do whatever you wish with your crate. |
I completely misread the wiki article on modified UTF-8. I had thought it only treated null bytes as a special case, but I see now that it seems to treat all code points above Incidentally, this means the real cause of @C4K3's strange input file is likely legitimate text outside the basic multilingual plane (perhaps in an obscure ancient African writing system), and not random null bytes as I had assumed.
To clarify: I was mostly complaining about the fact that they violated the original spec, not offering it up as a reason to avoiding fixing these problems. (And I remain extremely annoyed that Minecraft does this. The Unicode Consortium itself strongly discourages the use of CESU-8/MUTF-8, and yet Java uses it anyway.) But you are right to argue that making an effort to address this edge case is the correct thing to do. It looks like there is a nice crate for handling modified UTF-8 already. We should be able to incorporate it into the internal parser without too much trouble. Honestly, I'm a bit surprised this hasn't surfaced more often with other popular NBT tools (including |
I looked at the
As a consequence:
This seems reasonable for NBT, as it will accept even improperly encoded strings from other tools (or NBT encoded by an older version of this crate) Encoding always goes through the CESU-8 encoder, so \x00 becomes C0 80 and supplementary codepoints use the CESU-8 surrogate encoding. This should be compatible with Minecraft, but it's a change from the previous version of this crate. If this is what you want and no work started yet, i'd be willing to try to make a PR for this. |
Yep, seems to have done it. Thanks! |
I'm dealing with a player .dat file containing a book with non-utf8 content, generated by the vanilla client.
I don't know when it's possible, but apparently the vanilla client may allow non-utf8 data.
It'd be nice to somehow allow parsing this anyway.
The text was updated successfully, but these errors were encountered: