How to know if a string was parsed as utf-8? #406

Closed
pboettch opened this issue Dec 29, 2016 · 14 comments

Comments

@pboettch
Contributor

pboettch commented Dec 29, 2016

For my schema-validator I needed to check the length of a string value. std::string::length() gives the byte count, which is not the character count if the string contains multibyte UTF-8 sequences.

I wrote my own function, which works for ASCII and UTF-8.

Could I do it differently? Should nlohmann::json somehow inform me (with a method) that a Unicode string has been parsed?

@nlohmann
Owner

I do not really understand the issue. Can you please provide an example where std::basic_string::length is not sufficient?

@pboettch
Contributor Author

Of course. This code

std::cerr << "string: " << instance << ", "
          << "length: " << instance.get<std::string>().length() << ", "
          << "size: " << instance.get<std::string>().size() << ", "
          << "utf8-size: " << utf8_length(instance) << "\n";

gives

string: "💩💩", length: 8, size: 8, utf8-size: 2

on

"data":"\uD83D\uDCA9\uD83D\uDCA9"

The validator expects 2. I know this is not a JSON-HPP issue so I'm unsure who to blame ;-) .
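
As background for why the byte count is 8: each `\uD83D\uDCA9` escape is a UTF-16 surrogate pair that denotes the single code point U+1F4A9, which takes four bytes in UTF-8. A minimal sketch of that arithmetic follows. `combine_surrogates` and `encode_utf8` are hypothetical helper names for illustration, not the library's implementation, and both assume valid input.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: combines a UTF-16 surrogate pair into one code point.
// Assumes hi is in [0xD800, 0xDBFF] and lo is in [0xDC00, 0xDFFF].
std::uint32_t combine_surrogates(std::uint32_t hi, std::uint32_t lo)
{
    return 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
}

// Hypothetical helper: encodes one code point as UTF-8.
// Assumes cp is a valid Unicode scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));          // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With these helpers, `combine_surrogates(0xD83D, 0xDCA9)` yields 0x1F4A9, and encoding that code point produces the four bytes F0 9F 92 A9, so two of them occupy eight bytes in a std::string.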

@nlohmann
Owner

I see. I wonder if your function is actually correct in counting the UTF-8 characters - is it really so simple?
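
For well-formed UTF-8, counting code points can be quite short, because every code point starts with exactly one byte that is not a continuation byte (`10xxxxxx`). The poster's function is not shown in the thread, so the following is only a sketch of one common approach. The name `count_utf8_code_points` is an assumption, the function performs no validation, and it counts code points, not user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 encoded string by
// counting every byte that is not a continuation byte (10xxxxxx).
// Assumes well-formed UTF-8; malformed input yields a meaningless count.
std::size_t count_utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Applied to the bytes of "💩💩" (F0 9F 92 A9 twice), this returns 2, while size() on the same string returns 8.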

@nlohmann
Owner

From my point of view, I think this counting issue is out of scope of this library. Though a "count UTF-8 character" function is handy, I fear that it may bloat the API.

@nlohmann nlohmann added the state: please discuss label Dec 30, 2016
@pboettch
Contributor Author

pboettch commented Jan 3, 2017

This library silently parses Unicode escapes and UTF-8 strings into a std::string. Thus, one should never use size() or length() (== byte count) to check the string length, but rather a function similar to the one I'm using. Always.

A method (bool is_utf8()) could indicate whether this is a UTF-8-string or not. This information could then be used to check the size in a correct manner.

Maybe explaining it in the documentation is enough.

@nlohmann
Owner

nlohmann commented Jan 3, 2017

I don't quite understand: JSON is defined to use Unicode (though this library only supports UTF-8), so I would not know what is_utf8 could return other than true. I understand that you'd like either a proper character/glyph/whatever count (which std::string::size() will not be able to provide) or at least a bool contains_multibyte_encoded_codepoints() function.

Am I wrong?
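
Such a check could look roughly like the following sketch. It is not part of the library. It borrows the function name floated above and assumes the input is valid UTF-8, in which case any byte with the high bit set belongs to a multibyte sequence.

```cpp
#include <algorithm>
#include <string>

// Hypothetical free function, not an nlohmann::json API.
// In valid UTF-8, ASCII code points are single bytes below 0x80 and every
// byte of a multibyte sequence is 0x80 or above.
bool contains_multibyte_encoded_codepoints(const std::string& s)
{
    return std::any_of(s.begin(), s.end(),
                       [](unsigned char c) { return c >= 0x80; });
}
```

When this returns false, size() equals the number of code points. When it returns true, the byte count and the code point count differ.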

@pboettch
Contributor Author

pboettch commented Jan 3, 2017

I'd like to know which counting method I need to apply based on what and how it has been parsed into the std::string.

The utf-8-counting method works, but needs to be located on the user-side.

How can future users be kept from falling into the same trap as I did? How many users really need the real character count and are not aware of multibyte-encoding problems?

@jaredgrubb
Contributor

std::string has no concept of encoding. You can put UTF8, ISO8859-1, UCS2, UTF32, or whatever you like into a std::string. You have to keep track of the encoding external to the string (or, better, just assume UTF8 everywhere and convert from/to it at the "boundaries" of your program). If your program has to handle data and doesn't know what the encoding is, there are algorithms that can try to guess, but they're not foolproof and you're in scary territory at that point. There are very few cases where you should be unsure of what you're getting -- a text editor or web browser is a good legitimate example, but there are many bad ones, and you should never guess without giving the user a UI to confirm what you've done.

I don't think adding Unicode tools to a JSON library is helpful. It's a slippery slope (for example, counting code points can include or not include the "combining" modifiers like ◌ͤ, and then there is handling surrogates, collation, normalization, locales, etc). There are entire C++ libraries for Unicode handling because it's hard, and if you need them, you should use them -- even for "simple" UTF8.
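
The combining-mark point can be illustrated with a small sketch: the same user-perceived character "é" has a precomposed form (U+00E9) and a decomposed form (U+0065 followed by the combining acute accent U+0301), so both the byte count and the code point count depend on normalization. The names `precomposed`, `decomposed`, and `count_code_points` are assumptions for illustration, and the counter assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// The same visible character, encoded two ways.
const std::string precomposed = "\xC3\xA9";   // U+00E9
const std::string decomposed  = "e\xCC\x81";  // U+0065 + U+0301

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes (10xxxxxx). Assumes well-formed input; does not group combining marks.
std::size_t count_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Here `precomposed` is 2 bytes and 1 code point, while `decomposed` is 3 bytes and 2 code points, although both render as one character. Getting the user-perceived count requires grapheme cluster segmentation, which is what dedicated Unicode libraries provide.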

@nlohmann
Owner

nlohmann commented Jan 3, 2017

I agree with @jaredgrubb. All the library can do is to document that it in fact stores strings as UTF-8 and the user has an interface to the stored bytes as std::string. Anything beyond this (i.e., providing a string type with a nice Unicode-friendly interface) is out of scope of a JSON library.

@pboettch
Contributor Author

pboettch commented Jan 4, 2017

Coming back to my original question: How to know if a string was parsed as utf-8? The answer is: you don't, but you should assume that within this library a std::string value is always UTF-8 encoded, possibly with multibyte sequences, and take the necessary precautions.

@nlohmann
Owner

nlohmann commented Jan 4, 2017

So it's a documentation issue?

@nlohmann nlohmann added documentation and removed state: please discuss labels Jan 4, 2017
@nlohmann
Owner

nlohmann commented Jan 4, 2017

I shall add notes to the documentation about the encoding of the stored strings.

@nlohmann nlohmann added this to the Release 2.0.11 milestone Jan 4, 2017
@nlohmann nlohmann self-assigned this Jan 4, 2017
@nlohmann nlohmann closed this as completed Jan 4, 2017
@nlohmann nlohmann modified the milestones: Release 2.0.11, Release 2.1.0 Jan 24, 2017