String DSL should support valid UTF-8 #54
Comments
I'm not quite sure what you mean by "I get back invalid UTF-8". As far as I understand it, Java uses UTF-16 to encode strings internally, and you would specify a charset only when translating to bytes.
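To make the UTF-16 versus UTF-8 distinction concrete, here is a small sketch (the class name `Utf16VsUtf8` is made up for illustration). It builds a string from a single code point outside the Basic Multilingual Plane and compares its length in UTF-16 code units, in code points, and in encoded bytes:

```java
import java.nio.charset.StandardCharsets;

class Utf16VsUtf8 {
    // One code point outside the BMP (U+1F600), built without relying on source-file encoding.
    static final String SAMPLE = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        // In memory the string is a sequence of UTF-16 code units: a surrogate pair here.
        System.out.println("UTF-16 code units: " + SAMPLE.length());
        // The pair still represents a single Unicode code point.
        System.out.println("code points:       " + SAMPLE.codePointCount(0, SAMPLE.length()));
        // A charset only comes into play when converting to bytes.
        System.out.println("UTF-8 bytes:       " + SAMPLE.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("UTF-16BE bytes:    " + SAMPLE.getBytes(StandardCharsets.UTF_16BE).length);
    }
}
```

A generator restricted to the BMP only ever produces single `char` values, so any surrogate it emits is unpaired.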
Sorry for not replying for a long time. I have a lot of use cases which deal with serialization, so I want to make sure UTF-8 strings are serialized and deserialized without loss; there is a large assumption in most of my code that the original string is valid UTF-8. What I find when I use the code above is that deserializing the string returns a different value, so the two strings are no longer Looking up the UTF-8 code points, I see the max defined UTF-8 value is 99k but StringDSL defines 65k. I could totally be reading everything wrong (I use UTF-8, I don't know the spec at all =D), but that would imply to me that I should always get back UTF-8 chars; yet for some reason the string comes back as invalid UTF-8 and the My common use case is to deal with
I dug a bit deeper into the problem and checked for which code points converting to bytes and back does not produce the same chars. The smallest one I found was 0xD800, which is the beginning of an area where Unicode currently has no defined characters (see https://unicode-table.com). So maybe a better approach than providing a specialised UTF-8 generator would be to (optionally) filter out all code points that have no defined character in Unicode, e.g. like that
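A rough sketch of what such a filter could look like in Java (this is not the commenter's original snippet; `CodePointFilter`, `isRoundTrippable`, `roundTrips`, and `countBmpFailures` are hypothetical names). One assumption worth flagging: `Character.isDefined` on its own may not exclude the surrogate block, since 0xD800 to 0xDFFF is a range listed in the Unicode data, so the sketch rejects surrogates explicitly as well:

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.IntStream;

class CodePointFilter {
    // Hypothetical predicate: keep code points that are defined and are not surrogates.
    static boolean isRoundTrippable(int codePoint) {
        boolean surrogate = codePoint >= Character.MIN_SURROGATE
                && codePoint <= Character.MAX_SURROGATE;
        return Character.isDefined(codePoint) && !surrogate;
    }

    // True if the string is unchanged after String -> UTF-8 bytes -> String.
    static boolean roundTrips(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return s.equals(new String(utf8, StandardCharsets.UTF_8));
    }

    // Counts BMP code points that pass the filter but still fail the round trip.
    static long countBmpFailures() {
        return IntStream.rangeClosed(0x0000, 0xFFFF)
                .filter(CodePointFilter::isRoundTrippable)
                .filter(cp -> !roundTrips(new String(Character.toChars(cp))))
                .count();
    }

    public static void main(String[] args) {
        String loneSurrogate = new String(Character.toChars(0xD800));
        System.out.println("0xD800 round-trips: " + roundTrips(loneSurrogate));
        System.out.println("filtered BMP code points that fail: " + countBmpFailures());
    }
}
```

A predicate of this shape could be applied wherever the generator picks code points, so that only values that survive encoding are ever produced.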
I find that if I use the `basicMultilingualPlaneAlphabet` from the string DSL, I get back invalid UTF-8; to generate a UTF-8 gen I have the following in my code

This conversion to bytes and back will drop all non-valid code points.
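A minimal sketch of such a bytes-and-back conversion, assuming plain JDK calls (this is not the original code from the issue; `Utf8Normalizer` and `throughUtf8` are hypothetical names). Note that, per the `String.getBytes(Charset)` documentation, the JDK replaces unmappable characters with the charset's replacement bytes rather than removing them, so "drop" here means the invalid code point no longer appears in the result:

```java
import java.nio.charset.StandardCharsets;

class Utf8Normalizer {
    // Encode to UTF-8 and decode again; unpaired surrogates are substituted by the JDK.
    static String throughUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String valid = "h\u00e9llo";       // well-formed text is left untouched
        String withLone = "a\uD800b";      // contains an unpaired high surrogate

        System.out.println(valid.equals(throughUtf8(valid)));
        System.out.println(withLone.equals(throughUtf8(withLone)));

        // After one pass the string is stable: a second pass changes nothing.
        String once = throughUtf8(withLone);
        System.out.println(once.equals(throughUtf8(once)));
    }
}
```

Mapping a string generator through a function like `throughUtf8` would yield only strings that serialize and deserialize to the same value, at the cost of altering some of the generated inputs.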