-
-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
QR Text and Binary Encodings #62
Comments
@kousu The reason scan_png garbles the header is that it does If you want binary data, I believe the best way to get it is to call For guessing encoding, as far as we don't consider binary data (which is always available as metadata), the current behavior still works well. What do you think? |
@kousu Also on your last paragraph, you mention an issue in If wchar_t is 32 bits as on Unix, we handle surrogates, thus there is no issue. On systems where wchar_t is only 16 bits, we have no choice but rely on client to interpret wchar_t buffer as utf16 (and it works well on Windows). We could replace all surrogate pairs by a replacement char, but that means it theses pairs will not show as they should on systems that support utf16 (like Windows). |
Is this still an issue or can we close it? |
Uhh sorry the notification from github got lost in my inbox last year. I assume this is still an issue if nothing has been done for it. I think it's good to hew to the specs if you can. I realize UTF and QR are two semi-incompatible specs for encoding non-ASCII characters so there's going to be some friction either way. But I'm not using zxing this year so I don't have a horse in the race anymore. |
Thanks for the feedback. |
With the latest additions you can now call |
There is something weird going on with text encodings. I've spent half a day trying to read the QR spec and the ECI spec to make sense of it and I'm a bit lost, so I'm not surprised it is difficult to implement correctly, but I know there's something off with TextDecoder::Append.
There's a comment in the code, copied verbatim from the old fork
This bug is about this comment and the confusion in the QR spec around this issue.
If I use
qrencode
like this to encode a binary file:one that includes embedded nulls all over the place:
And I upload the pieces to your demo, the reader succeeds:
But
scan_png
garbles the header:I can't tell if the QR Spec supports binary or just text encoded as binary. It's obvious to me that they were thinking mainly about textual data, but maybe they allowed others too? My experiment demonstrates that you can encode binary, and why shouldn't you be able to? QR includes length headers and their marketting even explicitly advertises a binary mode ("Numeric, Alphanumeric, Binary, Kanji").
Tracing shows that QRDecoder::DecodeByteSegment(), in the absence of an explicit ECI being set, tries to guess
https://github.com/nu-book/zxing-cpp/blob/549e2e8e4b492c9752adff296d4a44c6cd876693/core/src/qrcode/QRDecoder.cpp#L151-L165
and that will fall back to ISO8859-1:
https://github.com/nu-book/zxing-cpp/blob/549e2e8e4b492c9752adff296d4a44c6cd876693/core/src/TextDecoder.cpp#L503-L506
which explicitly flags the next step to just copy without interpretation:
https://github.com/nu-book/zxing-cpp/blob/549e2e8e4b492c9752adff296d4a44c6cd876693/core/src/TextDecoder.cpp#L227-L240
The rest of TextDecoder appears to store the output in directly as unicode (that's what wstrings are?). I guess ISO8859-1 must be a strict subset of unicode, so it doesn't need rewriting, and therefore can double, by accident as binary mode, so long as you later on know to interpret the
wstring
as bytes.The QR Spec says
I interpret this to mean in
TextDecoder::Append()
, theCharacterSet::Unknown
case should be aliased toCharacterSet::Shift_JIS
and inDecodeByteSegment()
, onlyCharacterSet::Unknown
orCharacterSet::Shift_JIS
should callTextDecoder::Append()
, but that in the case of "8-bit Byte mode" Shift-JIS should actually trigger JIS8. And otherwise, the bytes should be passed through undecoded. I don't know where other character sets are supposed to be allowed through. This all sounds insane, and I need help interpreting what is going on.If that's the case, TextDecoder::Append()
That comment in DecodeByteSegment that I quoted is not correct: section 6.4.5 doesn't exist in the QR spec; section 8.3.4 and 8.4.4 is where "8-bit Byte mode" is described.
I suppose we could just, as a community, decide that binary mode is CharacterSet::Unknown. afaict that's what
qrencode
has already done.If that's the case so,
DecodeByteSegment()
should be changed to notGuessEncoding()
but just copy the data directly to the output.There's this other annoying issue that QR Codes can come in mixed modes, with text in different character sets in the same code, but binary is not text.
The old fork's solution to handling multiple character sets was to coerce everything to UTF-8 as they went:
https://github.com/glassechidna/zxing-cpp/blob/e0e40ddec63f38405aca5c8c1ff60b85ec8b1f10/core/src/zxing/qrcode/decoder/DecodedBitStreamParser.cpp#L70-L74
(which ends badly when iconv chokes on non-textual data)
Yours is to coerce everything to unencoded unicode, calling it text.
But binary isn't text at all! It's something else. This type confusion is probably reason
scan_png
failed.Related, several functions like
https://github.com/nu-book/zxing-cpp/blob/549e2e8e4b492c9752adff296d4a44c6cd876693/core/src/TextUtfEncoding.cpp#L268
conflate fixed-width UCS-2 (in
uint16_t*
) with variable-width UTF-16. For 99% of all test cases in practice, this is correct, for now, because no one ever writes unicode text that forces utf-16 to overflow into its variable width encoding. But it's not correct, and it's a lurking bug that will bite years down the road.Thanks for reading. I know this was a long, windy, confusing bug report. Encoding issues are the worst thing.
Refs:
The text was updated successfully, but these errors were encountered: