Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix to_hex function handling of non-ascii characters #672

Merged
merged 1 commit into from
May 16, 2023

Conversation

Rogach
Copy link
Contributor

@Rogach Rogach commented May 15, 2023

to_hex function was expecting string as an input type, which caused the value to be CastFn'ed to string,
which in turn resulted in raw bytes being cast to []rune (in makeDecodeValueOut). This operation replaces invalid UTF-8 bytes with 0xFFFD, which then were passed on to the hex encoder, resulting in incorrect output.

Example:

$ head -c 8 /dev/urandom > random.bin
$ cat random.bin | xxd -p
fb59d12f2f557dbd
$ cat random.bin | gzip > random.bin.gz
$ cat random.bin.gz | fq '.uncompressed | tohex' -r
efbfbd59efbfbd2f2f557defbfbd

This patch fixes the problem by expecting any as an input type for to_hex function, which allows it to correctly read raw bytes of the input data.

@wader
Copy link
Owner

wader commented May 15, 2023

Thanks. Yes that seems wrong and i guess also to_base64 and possibly some other are wrong? Looking at the tests for this i seem i only test the case when having a binary directly and not via a raw "decode value".

One tricky thing is that fq adds a "binary" type to jq but has to be careful to not expose it to "normal jq" , ex type for a binary is string and it tries to behave as a string unless you use some special functions like tobytes, if not all kind of weirdness can happen. So i think you change it correct but i will have to do some meditation if it might affect some other things.

This also reminded me that the bits_format option should probably support hex.

@Rogach
Copy link
Contributor Author

Rogach commented May 15, 2023

i guess also to_base64 and possibly some other are wrong?

I'll look into to_base64 and other functions, expect a PR tomorrow (quick test shows that to_base64 is indeed also affected).

One tricky thing is that fq adds a "binary" type to jq but has to be careful to not expose it to "normal jq" , ex type for a binary is string and it tries to behave as a string unless you use some special functions like tobytes, if not all kind of weirdness can happen.

Yes, I noticed that it might be a bit brittle. Another possible approach would be to add a first-class binary type to gojq (since you are maintaining your own fork anyway), but that might be a big undertaking.

@wader
Copy link
Owner

wader commented May 15, 2023

i guess also to_base64 and possibly some other are wrong?

I'll look into to_base64 and other functions, expect a PR tomorrow (quick test shows that to_base64 is indeed also affected).

Great, no hurry 👍

One tricky thing is that fq adds a "binary" type to jq but has to be careful to not expose it to "normal jq" , ex type for a binary is string and it tries to behave as a string unless you use some special functions like tobytes, if not all kind of weirdness can happen.

Yes, I noticed that it might be a bit brittle. Another possible approach would be to add a first-class binary type to gojq (since you are maintaining your own fork anyway), but that might be a big undertaking.

The gojq fork fq uses extends it with a JQValue interface so that you can add new types (with some limitations) or reimplement existing types (that is how decode value work). And early versions of fq actually was a bit more liberal with exposing binary and decode value types etc but it made existing jq code, like the jq standard library, have bugs or behave in unintuitive ways. But yes there might be better ways or doing things than what fq currently do, so please experiment if you feel like it.

to_hex and to_base64 functions were expecting `string` as an input type,
which caused the value to be CastFn'ed to string,
which in turn resulted in raw bytes being cast to []rune (in makeDecodeValueOut).
This operation replaces invalid UTF-8 bytes with 0xFFFD, which then were
passed on to the hex/base64 encoders, resulting in incorrect output.

This patch fixes it by expecting `any` as an input type,
which allows the function to correctly read raw bytes of the input data.
@Rogach
Copy link
Contributor Author

Rogach commented May 16, 2023

I decided to add changes for to_base64 into this PR, because they are very similar and make sense as a logical unit (please tell me if you prefer two separate PRs instead).

I haven't found any other functions that are affected by the same problem. to_urlencode and to_urlpath also read json string, but there it seems to be correct - URL-encoding is most commonly used to encode valid strings, and URL-encoding invalid UTF8 will probably result in receiving systems handling it incorrectly.

@wader
Copy link
Owner

wader commented May 16, 2023

Ok! one commit is fine and makes sense too me. I did a quick check and didn't find anything else also.

Thanks for good commit message and for adding tests

@wader wader merged commit aec2635 into wader:master May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants