QR Text and Binary Encodings #62

kousu · 2019-03-31T21:35:23Z

There is something weird going on with text encodings. I've spent half a day trying to read the QR spec and the ECI spec to make sense of it and I'm a bit lost, so I'm not surprised it is difficult to implement correctly, but I know there's something off with TextDecoder::Append.

There's a comment in the code, copied verbatim from the old fork:

The spec isn't clear on this mode; see
section 6.4.5: t does not say which encoding to assuming
upon decoding. I have seen ISO-8859-1 used as well as
Shift_JIS -- without anything like an ECI designator to
give a hint.

This bug is about this comment and the confusion in the QR spec around this issue.

If I use qrencode like this to encode a binary file:

curl "https://sampleswap.org//samples-ghost/DRUM%20LOOPS%20and%20BREAKS/161%20to%20180%20bpm/128\[kb\]161_amenvar3.aif.mp3" | qrencode -S -v 20 -8 -o /tmp/amen.png

one that includes embedded nulls all over the place:

$ curl "https://sampleswap.org//samples-ghost/DRUM%20LOOPS%20and%20BREAKS/161%20to%20180%20bpm/128\[kb\]161_amenvar3.aif.mp3" | xxd
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8427  100  8427    0     0  42467      0 --:--:-- --:--:-- --:--:-- 42560
00000000: fff3 80c4 0000 0000 0000 0000 0058 696e  .............Xin
00000010: 6700 0000 0f00 0000 3c00 0020 eb00 040d  g.......<.. ....
00000020: 0d11 1115 1919 1d1d 2225 252a 2a2d 3131  ........"%%**-11
00000030: 3535 383c 3c40 4046 4b4b 4f4f 5356 565a  558<<@@FKKOOSVVZ
00000040: 5a5e 6868 7272 7579 797d 7d81 8484 8888  Z^hhrruyy}}.....
00000050: 8c90 9094 9498 9b9b 9f9f a3a7 a7ac acb0  ................
00000060: b5b5 b9b9 bdc1 c1c6 c6ca cdcd d1d1 d5d9  ................
00000070: d9dd dde2 e5e5 e9e9 edf1 f1f5 f5f9 ffff  ................
00000080: ff00 0000 394c 414d 4533 2e39 3972 0269  ....9LAME3.99r.i
00000090: 0000 0000 2e2e 0000 1428 2404 1f42 0000  .........($..B..
000000a0: 2800 0020 eb38 55ff ec00 0000 0000 0000  (.. .8U.........
...

And I upload the pieces to your demo, the reader succeeds:

But scan_png garbles the header:

$ ./scan_png /tmp/amen-01.png | xxd
00000000: 5465 7874 3a20 2020 2020 c3bf c3b3 c280  Text:     ......
00000010: c384 0000 0000 0000 0000 0058 696e 6700  ...........Xing.
00000020: 0000 0f00 0000 3c00 0020 c3ab 0004 0d0d  ......<.. ......
00000030: 1111 1519 191d 1d22 2525 2a2a 2d31 3135  ......."%%**-115
00000040: 3538 3c3c 4040 464b 4b4f 4f53 5656 5a5a  58<<@@FKKOOSVVZZ
00000050: 5e68 6872 7275 7979 7d7d c281 c284 c284  ^hhrruyy}}......
00000060: c288 c288 c28c c290 c290 c294 c294 c298  ................
00000070: c29b c29b c29f c29f c2a3 c2a7 c2a7 c2ac  ................
00000080: c2ac c2b0 c2b5 c2b5 c2b9 c2b9 c2bd c381  ................
00000090: c381 c386 c386 c38a c38d c38d c391 c391  ................
000000a0: c395 c399 c399 c39d c39d c3a2 c3a5 c3a5  ................
000000b0: c3a9 c3a9 c3ad c3b1 c3b1 c3b5 c3b5 c3b9  ................

I can't tell if the QR Spec supports binary or just text encoded as binary. It's obvious to me that they were thinking mainly about textual data, but maybe they allowed other data too? My experiment demonstrates that you can encode binary, and why shouldn't you be able to? QR includes length headers, and their marketing even explicitly advertises a binary mode ("Numeric, Alphanumeric, Binary, Kanji").

Tracing shows that QRDecoder::DecodeByteSegment(), in the absence of an explicit ECI being set, tries to guess

https://github.com/nu-book/zxing-cpp/blob/549e2e8e4b492c9752adff296d4a44c6cd876693/core/src/qrcode/QRDecoder.cpp#L151-L165

and that will fall back to ISO8859-1:

https://github.com/nu-book/zxing-cpp/blob/549e2e8e4b492c9752adff296d4a44c6cd876693/core/src/TextDecoder.cpp#L503-L506

which explicitly flags the next step to just copy without interpretation:

https://github.com/nu-book/zxing-cpp/blob/549e2e8e4b492c9752adff296d4a44c6cd876693/core/src/TextDecoder.cpp#L227-L240

The rest of TextDecoder appears to store the output directly as Unicode code points (that's what wstrings are?). I guess ISO8859-1 must be a strict subset of Unicode, so it doesn't need rewriting, and therefore it can double, by accident, as binary mode, so long as you know later on to interpret the wstring as bytes.
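To make that guess concrete, here is a minimal sketch of the round trip I have in mind. The helper names latin1ToWide and wideToBytes are made up for illustration and are not zxing-cpp functions; the only assumption is that "decoding as ISO8859-1" means widening each byte to the code point with the same value.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of zxing-cpp): decode bytes as ISO-8859-1 by
// widening each byte to the code point with the same numeric value.
std::wstring latin1ToWide(const std::vector<uint8_t>& bytes)
{
    std::wstring out;
    out.reserve(bytes.size());
    for (uint8_t b : bytes)
        out.push_back(static_cast<wchar_t>(b));
    return out;
}

// Hypothetical helper (not part of zxing-cpp): narrow each element back to a
// byte. This is only lossless if every element is <= 0xFF, i.e. if the string
// really was produced by latin1ToWide() and never re-encoded in between.
std::vector<uint8_t> wideToBytes(const std::wstring& text)
{
    std::vector<uint8_t> out;
    out.reserve(text.size());
    for (wchar_t c : text)
        out.push_back(static_cast<uint8_t>(c & 0xFF));
    return out;
}
```

Calling wideToBytes(latin1ToWide(data)) gives back the original bytes, embedded nulls included. The round trip breaks as soon as something in between treats the wstring as text and re-encodes it.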

The QR Spec says

8.3.1: The default interpretation for QR Code is ECI 000020 representing the JIS8 and Shift JIS character sets.
8.4.4: In [8-bit Byte Mode], one 8 bit codeword directly represents the JIS8 character [...].
In ECIs other than the default ECI, it represents an 8-bit byte value directly.

I interpret this to mean that in TextDecoder::Append(), the CharacterSet::Unknown case should be aliased to CharacterSet::Shift_JIS, and that in DecodeByteSegment(), only CharacterSet::Unknown or CharacterSet::Shift_JIS should call TextDecoder::Append(), but that in the case of "8-bit Byte mode", Shift-JIS should actually trigger JIS8. Otherwise, the bytes should be passed through undecoded. I don't know where other character sets are supposed to be allowed through. This all sounds insane, and I need help interpreting what is going on.

If that's the case, TextDecoder::Append()

That comment in DecodeByteSegment that I quoted is not correct: section 6.4.5 doesn't exist in the QR spec; sections 8.3.4 and 8.4.4 are where "8-bit Byte mode" is described.

I suppose we could just, as a community, decide that binary mode is CharacterSet::Unknown. afaict that's what qrencode has already done.

If so, DecodeByteSegment() should be changed to not call GuessEncoding() but just copy the data directly to the output.

There's this other annoying issue that QR Codes can come in mixed modes, with text in different character sets in the same code, but binary is not text.

The old fork's solution to handling multiple character sets was to coerce everything to UTF-8 as they went:

https://github.com/glassechidna/zxing-cpp/blob/e0e40ddec63f38405aca5c8c1ff60b85ec8b1f10/core/src/zxing/qrcode/decoder/DecodedBitStreamParser.cpp#L70-L74

(which ends badly when iconv chokes on non-textual data)

Yours is to coerce everything to unencoded unicode, calling it text.

But binary isn't text at all! It's something else. This type confusion is probably the reason scan_png failed.

Related, several functions like

https://github.com/nu-book/zxing-cpp/blob/549e2e8e4b492c9752adff296d4a44c6cd876693/core/src/TextUtfEncoding.cpp#L268

conflate fixed-width UCS-2 (in uint16_t*) with variable-width UTF-16. For 99% of all test cases in practice, this is correct, for now, because no one ever writes unicode text that forces utf-16 to overflow into its variable width encoding. But it's not correct, and it's a lurking bug that will bite years down the road.
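To illustrate the difference, here is a minimal sketch of a UTF-16 decoder that does handle the variable-width case. The function name utf16ToCodePoints is hypothetical and is not the code in TextUtfEncoding.cpp. Replacing unpaired surrogates with U+FFFD is my own choice for the sketch, not something the library is known to do.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from zxing-cpp): decode UTF-16 code units into
// Unicode code points. A high surrogate (0xD800-0xDBFF) followed by a low
// surrogate (0xDC00-0xDFFF) combines into one code point above U+FFFF.
// Unpaired surrogates become U+FFFD (an assumption made for this sketch).
std::vector<uint32_t> utf16ToCodePoints(const std::vector<uint16_t>& units)
{
    std::vector<uint32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        uint32_t u = units[i];
        bool isHigh = u >= 0xD800 && u <= 0xDBFF;
        bool isLow = u >= 0xDC00 && u <= 0xDFFF;
        if (isHigh && i + 1 < units.size()
            && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            uint32_t lo = units[++i]; // consume the low surrogate as well
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00));
        } else if (isHigh || isLow) {
            out.push_back(0xFFFD); // unpaired surrogate
        } else {
            out.push_back(u); // BMP character: one unit, one code point
        }
    }
    return out;
}
```

Code that treats the buffer as UCS-2 would see the pair 0xD83D 0xDE00 as two separate characters, whereas a UTF-16 decoder sees the single code point U+1F600.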

Thanks for reading. I know this was a long, windy, confusing bug report. Encoding issues are the worst thing.

Refs:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/


huycn · 2019-05-02T02:28:52Z

@kousu The reason scan_png garbles the header is that it calls TextUtfEncoding::ToUtf8() before printing the text result. I emphasize text because scan_png only gets the text result, and in lines 151 to 165 of QRDecoder.cpp we try to get text out of binary data.
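As a rough illustration of that effect (a sketch only, not the actual TextUtfEncoding::ToUtf8() implementation; the name latin1BytesToUtf8 is made up), treating each raw byte as a code point and then encoding it as UTF-8 turns every byte at or above 0x80 into two bytes:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not from zxing-cpp): interpret each raw byte as the
// Unicode code point with the same value (ISO-8859-1), then emit UTF-8.
// Code points below 0x80 stay one byte; 0x80-0xFF become two bytes.
std::string latin1BytesToUtf8(const std::vector<uint8_t>& bytes)
{
    std::string out;
    for (uint8_t cp : bytes) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));   // lead byte
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F))); // continuation
        }
    }
    return out;
}
```

Fed the first four bytes of the file, ff f3 80 c4, this yields c3 bf c3 b3 c2 80 c3 84, which matches what follows "Text:" in the xxd dump of the scan_png output above.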

If you want binary data, I believe the best way to get it is to call result.metadata().getByteArrayList(ResultMetadata::BYTE_SEGMENTS) to get a list of byte sequences and concatenate them.
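The concatenation step could look roughly like the sketch below. The ByteArray alias here is a stand-in defined for the example, and concatSegments is a made-up helper. The exact return type of getByteArrayList() is not shown in this thread, so adapt the types to whatever the call really returns.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in type for this sketch; the real list element type may differ.
using ByteArray = std::vector<uint8_t>;

// Hypothetical helper: join all byte segments, in order, into one buffer.
ByteArray concatSegments(const std::vector<ByteArray>& segments)
{
    std::size_t total = 0;
    for (const auto& s : segments)
        total += s.size();

    ByteArray out;
    out.reserve(total); // one allocation up front
    for (const auto& s : segments)
        out.insert(out.end(), s.begin(), s.end());
    return out;
}
```

The result is the raw payload with no text decoding applied, so embedded nulls and bytes above 0x7F come through unchanged.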

As for guessing the encoding: as long as we don't consider binary data (which is always available as metadata), the current behavior still works well. What do you think?

huycn · 2019-05-02T02:44:48Z

@kousu Also, in your last paragraph you mention an issue in TextUtfEncoding::AppendUtf16(), but I still don't see it.

If wchar_t is 32 bits, as on Unix, we handle surrogates, so there is no issue. On systems where wchar_t is only 16 bits, we have no choice but to rely on the client to interpret the wchar_t buffer as UTF-16 (and it works well on Windows). We could replace all surrogate pairs with a replacement char, but that means these pairs will not show as they should on systems that support UTF-16 (like Windows).

axxel · 2020-09-18T08:43:42Z

Is this still an issue or can we close it?

kousu · 2020-09-22T06:38:59Z

Uhh sorry the notification from github got lost in my inbox last year.

I assume this is still an issue if nothing has been done for it. I think it's good to hew to the specs if you can. I realize UTF and QR are two semi-incompatible specs for encoding non-ASCII characters so there's going to be some friction either way. But I'm not using zxing this year so I don't have a horse in the race anymore.

axxel · 2020-09-22T07:04:06Z

Thanks for the feedback.

axxel · 2022-05-21T17:04:26Z

With the latest additions you can now call ZXingReader -binary <some-file.png> and have it output the unaltered binary content of all found symbols (tested only for QRcode and DataMatrix at the moment). I can't test it with this particular input, since the linked mp3 file is not available anymore.

This was referenced Mar 31, 2019

Add possibility to get barcode-details in python lubo/zxinglight#4

Open

Enable binary decoding. glassechidna/zxing-cpp#80

Open

axxel closed this as completed Sep 22, 2020

axxel mentioned this issue May 13, 2022

How to improve binary data support? (Community feedback requested) #334

Closed
