Correct UTF-8 processing across whole infrastructure #144
Comments
I did run a few UTF-8 tests in …
This one looks a lot like your attempts in test144.scxml; does it pass on your platform?
Still passes as … Otherwise I cannot see how I would debug this. My assumption is that I can use std::string, as it merely serves as a container for the character encoding, and ECMAScript uses UCS-2 internally anyway?
If you replace both UTF …
There is some kind of implicit conversion from the UTF-8 encoded scxml into windows-1250 chars passed into the JS context. That's the 0xFC byte seen in the screenshot; it matches the windows-1250 mapping according to the wiki. Now, when changing the scxml file's encoding to windows-1250 (both at the byte level and in the preamble, i.e. …), subsequent attempts to get the data's content via …

Even if changing the document's encoding to single-byte windows-1250 somehow helped, it would not be good enough due to the limited character coverage of such an encoding. For instance, such an encoding could not store both the ü character and a character from the Greek alphabet at the same time.

I don't know the internals of macOS, specifically its default internal encoding. Maybe even LLVM plays some role in the difference we both see. Could you run your test in debug mode and inspect the bytes being passed into the JSC DM when processing the ü character? Maybe they are proper UTF-8 bytes on your platform.
I too think that std::string can serve as a universal container for bytes, even UTF-8. ECMAScript's internal encoding should be UTF-16, big endian presumably. The conversion should be done by the …

However, I'm not sure whether there is enough assurance that the bytes in the std::strings being passed into the JSC DM are really, without any doubt, UTF-8. The Xerces methods used in the DOM construction during init seem insufficient for such assurance on all platforms.
One last discovery from the Win platform: when setting a variable's value to ü via the JSC unicode UTF-16 notation …
I will try to force the Win7 installation in my VM to a German codepage and reproduce.
Codepage 1250 is not supported by the xercesc parser by default. You can try to build xercesc with ICU and see whether this makes a difference, or you can use UTF-8 or UTF-16 to author SCXML documents. However, codepage 1252 is supposedly supported without ICU, and I'll try to see whether there is something missing to support it.
The XML encoded in CP 1250 was only for a test. Normally I encode all XMLs in UTF-8 with BOM and with …

Doing so, the debugger says that Xerces internally stores chars from this UTF-8 source file in UTF-16. After all, Xerces's internal char processing uses the type XMLCh, evaluated as wchar_t on WIN32. That's good; we don't lose any char coverage in the UTF-8 <-> UTF-16 conversion IMO.

The char cropping (UTF-8/16 -> single-byte ANSI CP) seems to take place in the uscxml DOM. There is a method … The symbol CP_ACP in its parameters stands for "The system default Windows ANSI code page. This value can be different on different computers, even on the same network. It can be changed on the same computer, leading to stored data becoming irrecoverably corrupted. This value is only intended for temporary use and permanent storage should use UTF-16 or UTF-8 if possible."

I did not inspect the DOM class code fully. Besides _localForm there is also a _unicodeForm variable. What's the purpose of this duality? Is _unicodeForm intended for manipulation on the Xerces level only? Perhaps the value of _localForm could be set to its UTF-8 form by a custom conversion method, effectively bypassing Xerces's usage of WideCharToMultiByte with CP_ACP, at least on the WIN32 platform.
I did add some tests for …
OK, I added tests for …
I tried to run the following tests.
All of them fail on my system. In all cases the character is passed as a single 0xFC byte into the datamodel. I really think the whole issue is related to Xerces somehow: when running the test application test-utf8, where there is no XML parsing (data is passed into the DM via literals in …

I'll try to play a little with a Xerces build with ICU. I'm already using ICU for the JavaScriptCore build, so there would be no extra dependency for me.
As for the tests, it seems to be a good idea to extend their test strings to a form containing characters from more than one single-byte codepage. The reason is that the OS can provide a single-byte codepage covering the ü char, but the same CP cannot simultaneously cover, for example, the Arabic char ش (from CP WINDOWS-1256). A comparison of strings like üش would thus be more conclusive for proving that UTF-8 is really used throughout processing.
I added considerably more complicated encoding tests. Please tell me whether they pass with your locale.
The new tests all passed; more precisely, the comparisons evaluated positively. Unfortunately, the unicode values not covered by the default (Windows locale dependent) ANSI codepage got malformed during processing. For example, all hiragana characters were stored in the DOM as the default codepage character "?". This is normal behaviour of the Windows function …

After building Xerces with …

However, after some research into Xerces's ICU usage, I found that the default ICU converter set by Xerces uses the platform-specific default codepage as the conversion target. This is set in Xerces's method …

Could you debug the unicode test in your Win VM and inspect the bytes used for storage of the hiragana chars? I doubt they are stored in UTF-8; they are probably malformed into default CP chars before the comparison.

I'm not sure how to proceed now regarding universal uscxml unicode handling that works correctly on all platforms. The ICU way works for me, but it involves a non-standard build of Xerces; moreover, verification of correct behaviour on other platforms would be necessary. My previous attempt to use the std::codecvt template on Windows works too, but there may be some unresolved issues like compatibility with the C89 DM.
Maybe you could clarify the correct usage of ICU with the xerces-c maintainers? I can't imagine that you'd have to edit their source, as xerces-c is as old as XML and this particular implementation is used on many platforms and in many programming languages. So there ought to be more to it.
I am going to close this. As far as I can see, there is no issue if you write your XML documents in one of the various encodings supported by Xerces-C by default. |
Hi, the problem exists and the issue should be reopened.

Steps to reproduce:

Conditions: Windows (any version), codepage 1251

Actions:

Results:
```cpp
char* tmp = XERCESC_NS::XMLString::transcode(toTranscode);
_localForm = std::string(tmp);
```

After conversion we have … Later we perform:

```cpp
JSStringRef scriptJS = JSStringCreateWithUTF8CString(expr.c_str());
JSValueRef exception = NULL;
JSValueRef result = JSEvaluateScript(_ctx, scriptJS, NULL, NULL, 0, &exception);
```

And we are getting a hard exception here because we have a different size.

Possible solutions:
1. …
2. Execute expressions immediately
Setting the value of a JSC datamodel variable to "ü" and calling the JSC DM equality operator <log expr="variable == 'ü'"/> yields a null value for the logger. When processing the SCXML DOM during interpreter initialization, these expressions are stored in the platform's native one-byte ANSI encoding, at least on Windows. When the expressions are later evaluated using the method JSCDataModel::evalAsValue, and more precisely JSStringCreateWithUTF8CString, they fail to transfer into a JSString because the source encoding is not UTF-8. The JSC library throws an exception in this case.
I somehow managed to solve this problem by modifying the methods X(const XMLCh* const toTranscode) and X(const std::string& fromTranscode) in DOM.h with WIN32-specific behaviour: storing the content of the variable _localForm always as UTF-8 chars and _unicodeForm always as UTF-16 big endian. After doing so, I can successfully run the aforementioned DM evaluations and some other things. Please see the changes in the file DOM.h in my fork.
This whole thing gets more complicated when interoperability with the bindings is needed. I partially solved this for the csharp bindings; it can be found in this patch.
A test is also ready, as a UTF-8 (with BOM) encoded scxml file. But you already made one yourself.
A related issue that seems insoluble is correct UTF-8 output to std::cout from the StdOutLogger on Windows. But this appears to be caused by quirks of the Windows console.