Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CAMEL-21199] Camel-jackson not properly marshalling 4-byte characters #15515

Closed
wants to merge 1 commit into from

Conversation

rnetuka
Copy link
Contributor

@rnetuka rnetuka commented Sep 11, 2024

Issue: https://issues.apache.org/jira/browse/CAMEL-21199

Jackson doesn't handle 4-byte characters well. Marshalling a 4-byte Japanese kanji character results in two UTF-16 escapes to be written instead of the character itself. While this is ok for emoji an such, it's not for natural languages.

Jackson issue: FasterXML/jackson-core#223

Reproducer:

from("file:data?file-name=input.txt&noop=true")
    .log("${body}")
    .unmarshal().json(JsonLibrary.Jackson)
    .log("${body[0]['name']}")
    .marshal().json(JsonLibrary.Jackson, true)
    .log("${body}");

with the file input.txt containing:
[{"name": "システム𩸽"}]

Solution:
While it's an error in Jackson-core, it could be avoided by forcing Jackson to use WriterBasedJsonGenerator instead of UTF8JsonGenerator (see the original Jackson issue reproducer in the topmost comment)

Copy link
Contributor

🌟 Thank you for your contribution to the Apache Camel project! 🌟

🤖 CI automation will test this PR automatically.

🐫 Apache Camel Committers, please review the following items:

  • First-time contributors require MANUAL approval for the GitHub Actions to run

  • You can use the command /component-test (camel-)component-name1 (camel-)component-name2.. to request a test from the test bot.

  • You can label PRs using build-all, build-dependents, skip-tests and test-dependents to fine-tune the checks executed by this PR.

  • Build and test logs are available in the Summary page. Only Apache Camel committers have access to the summary.

  • ⚠️ Be careful when sharing logs. Review their contents before sharing them publicly.

@davsclaus davsclaus marked this pull request as draft September 11, 2024 11:04
@davsclaus
Copy link
Contributor

This should be fixed in jackson instead of workaround in camel that can affect others.

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 11, 2024

@davsclaus I'll try to provide a fix for Jackson, but if you see the history, it's open from 2015 and there's not much effort to fix it from their side. Several PRs were even rejected as too dangerous for Jackson-core :(

@davsclaus
Copy link
Contributor

Yes and if its dangerous for jackson, then why is it not dangerous for camel ?

What you can/should do is to create a JIRA ticket, and add a new option to the dataformat, where you can configure this with a boolean (default false) and then only use this new code if enabled. Then existing is not affected at all.

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 11, 2024

Yes and if its dangerous for jackson, then why is it not dangerous for camel ?

It was claimed dangerous by Jackson authors to mess with UTF8JsonGenerator (which containts the error, but is also fragile). This solution forces Jackson to use WriterBasedJsonGenerator inside, which seems to work fine.

What you can/should do is to create a JIRA ticket, and add a new option to the dataformat, where you can configure this with a boolean (default false) and then only use this new code if enabled. Then existing is not affected at all.

Seems like an overkill to me

@davsclaus
Copy link
Contributor

Yes and if its dangerous for jackson, then why is it not dangerous for camel ?

It was claimed dangerous for Jackson to dabble with UTF8JsonGenerator which is erronous. This solution forces Jackson to use WriterBasedJsonGenerator inside, which seems to work fine.

What you can/should do is to create a JIRA ticket, and add a new option to the dataformat, where you can configure this with a boolean (default false) and then only use this new code if enabled. Then existing is not affected at all.

Seems like an overkill to me, but if you like, I can surely add the option

If its so non dangerous why has jackson not fixed it already. Please respect the large existing user-base of Camel that dont want hacks in commonly used components.

Also we need unit tests to go with that thanks.

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 11, 2024

If its so non dangerous why has jackson not fixed it already

Don't ask me. Ask them :)

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 11, 2024

@davsclaus I agree that it should be fixed in Jackson. Sadly, it hasn't been fixed for a long time and they don't consider it much of an issue. This hack as you call it just tries to overcome the error by invoking different string-writing mechanism (on their side) and things should be same as soon as calling Jackson marshal method ends.

I see no point in providing an option. It's hardly an option for Camel users. It's either fixed, or not.

Also we need unit tests to go with that thanks.

It cannot be unit-tested. It's integration issue.

If you don't want this fix (which I can understand), feel free to close it. I'll then try to create a fix for Jackson itself. But I doubt they will accept it...

@davsclaus
Copy link
Contributor

@davsclaus I agree that it should be fixed in Jackson. Sadly, it hasn't been fixed for a long time and they don't consider it much of an issue. This hack as you call it just tries to overcome the error by invoking different string-writing mechanism (on their side) and things should be same as soon as calling Jackson marshal method ends.

I see no point in providing an option. It's hardly an option for Camel users. It's either fixed, or not.

I am sorry but we cannot risk affecting everyone else with some code change that Jackson itself does have out of the box, and its been outstanding since 2015. Would such a change cause unforseen changes to existing users, performance degration or what else.

Are there other reported tickets at Jackson than this single instance?

Also we need unit tests to go with that thanks.

It cannot be unit-tested. It's integration issue.

What we ask is that there is a junit test class in the PR that tests this code change.

If you don't want this fix (which I can understand), feel free to close it. I'll then try to create a fix for Jackson itself. But I doubt they will accept it...

Maybe take more time to investigate if there are other reports on Jackson. But in the mean time there is only a solution you can do and that is on the Camel side. But that requires more work which you then must step up and do.

And it would require a JIRA ticket in the Apache Camel JIRA at
https://issues.apache.org/jira/projects/CAMEL/

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 11, 2024

I am sorry but we cannot risk affecting everyone else with some code change that Jackson itself does have out of the box, and its been outstanding since 2015. Would such a change cause unforseen changes to existing users, performance degration or what else.

I'm not sure if we talk about the same thing here. Jackson offers two string generators - UTF8JsonGenerator (which we are using at the moment and which contains the error) and WriterBasedJsonGenerator (which seems to handle 4-byte characters fine). I don't see switching from one to another risky for Camel users, since they should behave exactly the same (not speaking about the error itself).

@rnetuka rnetuka force-pushed the csb-4781 branch 4 times, most recently from 5ffcae6 to 0bc57cc Compare September 11, 2024 12:22
@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 11, 2024

@davsclaus I've added a test for the fix. Also, the Camel JIRA issue has been created.

Copy link
Contributor

@orpiske orpiske left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely need to close the OutputStreamWriter ... waiting for clarifications.

@orpiske
Copy link
Contributor

orpiske commented Sep 11, 2024

BTW, I am only reviewing the technicalities here. @davsclaus points are a lot more important than my notes.

@rnetuka rnetuka changed the title Fix for Jackson marshalling 4-byte characters [CAMEL-21199] Camel-jackson not properly marshalling 4-byte characters Sep 11, 2024
@rnetuka rnetuka requested a review from orpiske September 11, 2024 13:44
@rnetuka rnetuka requested a review from davsclaus September 11, 2024 14:05
@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 11, 2024

@orpiske That's fine. Thak you very much for your review!

@davsclaus
Copy link
Contributor

We use unicode escape to have foregin symbols in the camel source code.

See for example
https://github.com/apache/camel/blob/main/core/camel-core/src/test/java/org/apache/camel/component/file/FileConvertBodyToUTF8Test.java

Can you see if you can replace those to escaped values in this PR unit test.

@Test
public void testMarshal4ByteCharacter() throws Exception {
Map<String, Object> in = new HashMap<>();
in.put("name", "システム\uD867\uDE3D");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use unicode escape to have foregin symbols in the camel source code.

See for example
https://github.com/apache/camel/blob/main/core/camel-core/src/test/java/org/apache/camel/component/file/FileConvertBodyToUTF8Test.java

Can you see if you can replace those to escaped values in this PR unit test.

@davsclaus
Copy link
Contributor

Added option useWriter in this PR
#15538

@davsclaus
Copy link
Contributor

On main branch we have the new option useWriter that can be enabled for this use-cases.

@davsclaus
Copy link
Contributor

@rnetuka what is the status of this ?

@rnetuka
Copy link
Contributor Author

rnetuka commented Sep 18, 2024

PR for Jackson issue 223 has been merged: FasterXML/jackson-core#1335. There will be a new Jackson option to combine utf-16 surrogate characters into a one unicode char in their UTF8JsonGenerator.

I'm closing this PR and will create a replacement with the new option being supported.

@rnetuka rnetuka closed this Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants