UTF8 data corruption read from stdin (erroneous substitution by U+FFFD) #2259

ChCyrill · 2021-01-31T10:17:32Z

Describe the bug
jq replaces some valid UTF-8 characters read from stdin with U+FFFD (both internally and in its output).

To Reproduce
There is no U+FFFD in the stdin:
cat test_raw.txt | grep '�'
But there are U+FFFD characters in the output:
cat test_raw.txt | jq --raw-input --raw-output '.' | grep '�'

test_raw.txt

As far as I understand, the 'Ё' characters in the example file can be replaced with any multi-byte character and the issue will still be reproducible.
To reproduce the issue, the string must be longer than jq's read buffer size (4096).
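The attached file is not shown inline, so here is a minimal sketch of how a comparable input could be generated. It is a hypothetical stand-in, not the original test_raw.txt. It assumes that a single leading ASCII byte followed by several thousand two-byte characters is enough for some read boundary to fall inside a character. Whether it actually triggers the bug depends on jq's real read size.

```shell
# Hypothetical stand-in for the attached test_raw.txt (not the reporter's file).
# Layout: one ASCII byte, then 5000 copies of 'Ё' (U+0401, UTF-8 bytes D0 81,
# written here as octal \320\201), then a newline.
# Total size: 1 + 2*5000 + 1 = 10002 bytes, well past 4096.
{
  printf 'a'
  printf '\320\201%.0s' $(seq 1 5000)   # %.0s consumes each argument without printing it
  printf '\n'
} > test_raw.txt

# Show the size in bytes.
wc -c < test_raw.txt
```

The leading ASCII byte shifts every 'Ё' to an odd byte offset. Without it, a file made only of two-byte characters would line up exactly with an even-sized chunk boundary.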

Expected behavior
jq should preserve strings unchanged.
In this case, it should produce the same result as the command below, which reads from a file instead of stdin:
jq --null-input --raw-output '$f | .' --rawfile f test_raw.txt | grep '�'

Environment:

Arch Linux
jq version 1.6

Additional notes

Looks like the same issue was fixed for reading from files: e84d171

As far as I can tell, the erroneous replacement happens in https://github.com/stedolan/jq/blob/master/src/util.c:
value = jv_string_concat(value, jv_string_sized(state->buf, state->buf_valid_len));
state->buf[0] = '\0';
state->buf_valid_len = 0;

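Probably the jq_util_read_more function reads only part of a UTF-8 character into the buffer: when the input is cut at the 4096-byte buffer size, the last character in the buffer can be left incomplete. Each buffer is then presumably converted to a string on its own, so the partial byte sequence is treated as invalid.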
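The suspected split can be shown on the byte level without jq. This sketch assumes chunks of exactly 4096 bytes, which may not match jq's actual read size. The data is hypothetical: one ASCII byte followed by 'Ё' repeated. It prints the last byte of the first chunk and the first byte of the second.

```shell
# Hypothetical data: one ASCII byte followed by 3000 copies of 'Ё' (bytes D0 81),
# so every 'Ё' starts at an odd byte offset.
data=$(mktemp)
{ printf 'a'; printf '\320\201%.0s' $(seq 1 3000); } > "$data"

# Last byte of the first 4096-byte chunk (assumed chunk size).
last_of_first=$(head -c 4096 "$data" | tail -c 1 | od -An -tx1 | tr -d ' \n')
# First byte of what would be the next chunk.
first_of_next=$(tail -c +4097 "$data" | head -c 1 | od -An -tx1 | tr -d ' \n')

echo "chunk 1 ends with:   $last_of_first"
echo "chunk 2 starts with: $first_of_next"
rm -f "$data"
```

Here the first chunk ends with the lead byte of a 'Ё' and the next chunk begins with its continuation byte. Neither piece is valid UTF-8 on its own, although the concatenation is.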

Maxdamantus mentioned this issue May 30, 2021

Support binary strings, preserve UTF-8 and UTF-16 errors #2314

Open

itchyny added the bug label Jun 3, 2023
