Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 invalid characters are not always ignored when dumping with error_handler_t::ignore #4552

Open
2 tasks
gentooise opened this issue Dec 17, 2024 · 9 comments · May be fixed by #4555
Open
2 tasks

UTF-8 invalid characters are not always ignored when dumping with error_handler_t::ignore #4552

gentooise opened this issue Dec 17, 2024 · 9 comments · May be fixed by #4555
Assignees
Labels
confirmed documentation kind: bug solution: proposed fix a fix for the issue has been proposed and waits for confirmation

Comments

@gentooise
Copy link

Description

According to this: https://json.nlohmann.me/api/basic_json/dump/#parameters , when passing error_handler_t::ignore to dump() function, invalid UTF-8 characters should be ignored and copied as-is into the final string.

However, I'm debugging the following minimal code:

    std::string test = "test\334\005";
    nlohmann::json node{};
    node["test"] = test;
    auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

and the final test_dump string contains test\005 (byte \334 is gone).

image

Is this expected? Am I missing something?

Reproduction steps

Just try to run/debug the following:

    std::string test = "test\334\005";
    nlohmann::json node{};
    node["test"] = test;
    auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

Expected vs. actual results

Actual: test_dump contains test\005
Expected: test_dump contains test\334\005

Minimal code example

std::string test = "test\334\005";
nlohmann::json node{};
node["test"] = test;
auto test_dump = node.dump(-1, ' ', false, nlohmann::json::error_handler_t::ignore);

Error messages

No response

Compiler and operating system

gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924

Library version

3.11.2

Validation

@nlohmann
Copy link
Owner

Yes, there seems to be a logic error. \334 is written to the string buffer, but then overwritten by \u0005. The reason for this is line

bytes = bytes_after_last_accept;

but bytes_after_last_accept was not incremented after reading \334.

@jordan-hoang
Copy link
Contributor

Interesting, I'll take a look perhaps.

@nlohmann
Copy link
Owner

I had another look at the issue and this needs some discussion:

  • Right now, the error_handler_t::ignore option has few test cases, but these in connection with the current implementation all ignore invalid UTF-8 sequences and do not follow the documentation "all bytes are copied to the output unchanged".
  • Therefore, I think fixing the semantics of the error_handler_t::ignore function to meet that sentence from the documentation would be very surprising for existing implementations as after the fix, suddenly more bytes appear in the output.
  • Instead, I would propose adding a new enumerator, maybe error_handler_t::copy or error_handler_t::keep, that implements an approach that meets the sentence in the documentation and actually preserves input even in case of UTF-8 errors.

What do you think?

@nlohmann nlohmann added the state: please discuss please discuss the issue or vote for your favorite option label Dec 18, 2024
nlohmann added a commit that referenced this issue Dec 18, 2024
@jordan-hoang
Copy link
Contributor

I think adding a new enumerator would be the best in this case then. Fixing the semantics of error_handler_t::ignore would make anyone who currently depends on it's current behavior unhappy. and we can probably update the docs to make it more accurate.

@t-b
Copy link
Contributor

t-b commented Dec 18, 2024

Yes I also think adding a new enumerator which preserves invalid UTF8 makes sense.

@nlohmann nlohmann self-assigned this Dec 18, 2024
@nlohmann
Copy link
Owner

Thanks for the input. Any preference on the name? error_handler_t::keep?

@t-b
Copy link
Contributor

t-b commented Dec 19, 2024

Sounds good!

@jordan-hoang
Copy link
Contributor

Yeah keep sounds good.

@nlohmann
Copy link
Owner

@gentooise @t-b @jordan-hoang Please take a look at #4555.

@nlohmann nlohmann added solution: proposed fix a fix for the issue has been proposed and waits for confirmation and removed state: please discuss please discuss the issue or vote for your favorite option labels Dec 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed documentation kind: bug solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants