Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add keep parameter to dump to copy invalid UTF-8 bytes as-is #4555

Draft
wants to merge 16 commits into
base: develop
Choose a base branch
from

Conversation

nlohmann
Copy link
Owner

@nlohmann nlohmann commented Dec 18, 2024

Add an enumerator keep to error_handler_t to allow to keep the input as-is in case of UTF-8 errors.

Fixes #4552

Copy link

🔴 Amalgamation check failed! 🔴

The source code has not been amalgamated. @nlohmann
Please read and follow the Contribution Guidelines.

@coveralls
Copy link

coveralls commented Dec 18, 2024

Coverage Status

coverage: 99.639%. remained the same
when pulling 4ab98c3 on issue4552-ignore
into 663058e on develop.

nlohmann and others added 7 commits December 18, 2024 17:46
* 🐛 set parents after insert call

* 🚨 fix warning
* Add implementation to retrieve start and end positions of json during parse

* Add more unit tests and add start/stop parsing for arrays

* Add raw value for all types

* Add more tests and fix compiler warning

* Amalgamate

* Fix CLang GCC warnings

* Fix error in build

* Style using astyle 3.1

* Fix whitespace changes

* revert

* more whitespace reverts

* Address PR comments

* Fix failing issues

* More whitespace reverts

* Address remaining PR comments

* Address comments

* Switch to using custom base class instead of default basic_json

* Adding a basic using for a json using the new base class. Also address PR comments and fix CI failures

* Address decltype comments

* Diagnostic positions macro (#4)

Co-authored-by: Sush Shringarputale <sushring@linux.microsoft.com>

* Fix missed include deletion

* Add docs and address other PR comments (#5)

* Add docs and address other PR comments

---------

Co-authored-by: Sush Shringarputale <sushring@linux.microsoft.com>

* Address new PR comments and fix CI tests for documentation

* Update documentation based on feedback (#6)

---------

Co-authored-by: Sush Shringarputale <sushring@linux.microsoft.com>

* Address std::size_t and other comments

* Fix new CI issues

* Fix lcov

* Improve lcov case with update to handle_diagnostic_positions call for discarded values

* Fix indentation of LCOV_EXCL_STOP comments

* fix amalgamation astyle issue

---------

Co-authored-by: Sush Shringarputale <sushring@linux.microsoft.com>
@github-actions github-actions bot added L and removed M labels Dec 22, 2024
@nlohmann nlohmann changed the title WIP for #4552 Add keep parameter to dump to copy invalid UTF-8 bytes as-is Dec 22, 2024
@nlohmann nlohmann marked this pull request as ready for review December 22, 2024 13:21
@nlohmann nlohmann added the review needed It would be great if someone could review the proposed changes. label Dec 22, 2024
@jordan-hoang
Copy link
Contributor

It looks good to me. I thought you would do json_string === s_kept, but you took it's substring at position 1 instead, I guess dump gives you an extra character in the beginning or something?

@nlohmann
Copy link
Owner Author

Yes, dump adds quotes.

docs/mkdocs/docs/api/basic_json/dump.md Outdated Show resolved Hide resolved
// copy string as-is if error handler is set to keep, and we don't want to ensure ASCII
if (error_handler == error_handler_t::keep && !ensure_ascii)
{
o->write_characters(s.data(), s.size());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for me to understand, how would this behave exactly? If there is a 0×dc byte for example, it will be escaped as \334 octal string (or \xdc hex, or similar)?

I think the important thing is to not break the json format.

And also what about other UTF-8 accepted chars? Like \b or \t handled below: how will they be dumped in this case?

I have limited access these days (from mobile), and I don't know exactly the purpose of ensure_ascii and its default value. If you could provide some hint it would be helpful.

Thank you

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're right! Just copying the input to the output is wrong here, because valid characters like LF that must be escaped to \n would not be escaped and the resulting JSON would be invalid. Thanks for noting. I will fix this.

Co-authored-by: gentooise <and.genuise@gmail.com>
@nlohmann nlohmann removed the review needed It would be great if someone could review the proposed changes. label Dec 23, 2024
@nlohmann nlohmann marked this pull request as draft December 23, 2024 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF-8 invalid characters are not always ignored when dumping with error_handler_t::ignore
6 participants