Add cryptographic hash functions (SHA1 and SHA2 family, MD5, etc.) #1116

Open
cipriancraciun opened this issue Mar 21, 2016 · 12 comments

@cipriancraciun

It would be useful to have some cryptographic hash functions as builtins (for example the SHA-1 and SHA-2 families and MD5), and perhaps some HMAC functions as well.

The main use cases for such a feature would be, for example:

  • generating document identifiers when jq is used as a pre-processor for loading JSON object streams into a document-oriented database like CouchDB, MongoDB, etc.;
  • consistent hashing (i.e. buckets), as in group_by(.key | sha1 | .[:2]).
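
The bucketing use case can be sketched outside jq. This is a minimal Python illustration, not jq code: `bucket_of` and `group_by_bucket` are hypothetical helper names, and it assumes keys are strings hashed as UTF-8.

```python
import hashlib
from collections import defaultdict

def bucket_of(key, prefix_len=2):
    """Return the first prefix_len hex characters of SHA-1(key)."""
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:prefix_len]

def group_by_bucket(records, prefix_len=2):
    """Group records (dicts with a "key" field) by the hash prefix of their key."""
    buckets = defaultdict(list)
    for record in records:
        buckets[bucket_of(record["key"], prefix_len)].append(record)
    return dict(buckets)

# Records with the same key always land in the same bucket.
records = [{"key": "alice"}, {"key": "bob"}, {"key": "alice"}]
groups = group_by_bucket(records)
```

With a two-character hex prefix there are at most 256 buckets; a longer prefix gives more, smaller buckets.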

I have quickly hacked together a proof of concept based on the latest release (1.5): https://github.com/cipriancraciun/jq/tree/patches/sha1 ; see also the diff at the link below:
a5b5cbe...cipriancraciun:patches/sha1

If this is deemed useful I could provide the implementation for all these based on OpenSSL or GnuTLS.

@wtlangford
Contributor

I like the idea. @nicowilliams, what say you?

Unfortunately, it would mean either compiling in OpenSSL (or similar) as a hard dependency or making it an optional feature at compile time. Either works, I suppose.

Of course, we could decide we finally want executable library loading (dlopen, LoadLibrary), and implement that. But dlopen and friends make me very sad.

@jb55

jb55 commented Feb 15, 2018

I was looking to do this today. I have an array of objects that I wanted to fingerprint. Something like:

$ echo '[{"x": {"a": "a"}}, {"x": {"b": 3}}, {"x": {"c": "c"}}]' | \
    jq '.[] |= . + {xhashed: .x | tostring}'

[
  {
    "x": {
      "a": "a"
    },
    "xhashed": "{\"a\":\"a\"}"
  },
  {
    "x": {
      "b": 3
    },
    "xhashed": "{\"b\":3}"
  },
  {
    "x": {
      "c": "c"
    },
    "xhashed": "{\"c\":\"c\"}"
  }
]

where tostring would be replaced by sha1 or sha256.

One thing to keep in mind is that you would need to canonicalize the object representation into some standard form before hashing. https://github.com/substack/json-stable-stringify comes to mind.
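
Why canonicalization matters can be shown with a small sketch. This is Python rather than jq, and `canonical_json` and `fingerprint` are hypothetical names; it assumes that sorted keys plus compact separators are an acceptable canonical form, which is not a full canonicalization scheme (for instance, 1 and 1.0 still serialize differently).

```python
import hashlib
import json

def canonical_json(value):
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=True)

def fingerprint(value):
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_json(value).encode("utf-8")).hexdigest()

# Same object, different key order in the input text.
a = json.loads('{"a": 1, "b": 2}')
b = json.loads('{"b": 2, "a": 1}')

print(canonical_json(b))                  # {"a":1,"b":2}
print(fingerprint(a) == fingerprint(b))   # True
```

Without the sorted-keys step, two semantically equal objects could produce different digests depending on how each was written.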

If we're worried about pulling in dependencies, we could just use a micro-lib from clibs or ccan such as https://github.com/jb55/sha256.c

@jb55

jb55 commented Feb 15, 2018

I also noticed that when you call tostring on an object, even with the --sort-keys (-S) option, it still produces a string with unsorted keys. Bug?

@cipriancraciun
Author

Since opening this topic I've created a new branch with new "extensions", as can be seen in the following examples and tests:

https://github.com/cipriancraciun/jq/tree/patches/extensions/src/_extensions/_examples
https://github.com/cipriancraciun/jq/tree/patches/extensions/src/_extensions/_tests

The branch is at:

https://github.com/cipriancraciun/jq/tree/patches/extensions

Now to answer @jb55's question: my crypto functions (MD5 and the SHA family) come in two variants, as seen in:

https://github.com/cipriancraciun/jq/blob/patches/extensions/src/_extensions/jqe_crypto_builtins.h
https://github.com/cipriancraciun/jq/blob/patches/extensions/src/_extensions/jqe_crypto.c

crypto_sha256 accepts any JSON value, calls jv_dump_string (which in the end calls jv_dump_term) with JV_PRINT_ASCII | JV_PRINT_SORTED, and then hashes the result; thus, if jv_dump_term is implemented correctly, it should canonicalize the input.

crypto_sha256_ll, by contrast, takes an extra argument that says whether the value is expected to be a string already, or whether it should be transformed into a string as the previous function does.

See the following example:

jq '{
        value : .,
        value_hash : . | crypto_sha256,
        tostring : . | tostring,
        tostring_hash : . | tostring | crypto_sha256_ll (false),
    }' <<'EOS'
{"a" : 1, "b" : 2}
{"b" : 2, "a" : 1}
"a"
["a"]
EOS
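
The distinction between hashing a JSON value and hashing a raw string can be illustrated outside jq. The following Python sketch does not call the extension functions; `hash_value` and `hash_raw_string` are hypothetical names, and the serialized form (compact JSON with sorted keys) is an assumption about what a dump of the value looks like.

```python
import hashlib
import json

def hash_value(value):
    """Hash the JSON serialization of a value (strings keep their quotes)."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def hash_raw_string(text):
    """Hash the characters of a string as-is, with no JSON encoding."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The JSON string "a" serializes to three characters: quote, a, quote.
print(hash_value("a") == hash_raw_string('"a"'))  # True
print(hash_value("a") == hash_raw_string("a"))    # False
```

So a string input yields different digests depending on whether it is hashed as a JSON value or as raw text, which is one reason to offer both variants.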

@pkoppstein
Contributor

jq allows a single name to be used for different definitions with different arities. Why not just md5 and sha256?

Btw, the prefix . | can be dropped in each case.

@cipriancraciun
Author

> Why not just md5 and sha256?

First, I didn't want to pollute the namespace, because some day these functions might be introduced in jq itself.

And why two differently named functions (i.e. crypto_md5 and crypto_md5_ll)? Mainly because I followed this pattern in the other "extensions", where there was a big difference between the "usual" and "low-level" ones.

> Btw, the prefix . | can be dropped in each case.

I know, but I like to include it for the sake of completeness, because (from a functional-language point of view) it is equivalent to function(argument). Otherwise, writing just field : function looks to me as if the field were initialized with the function itself or with some constant.

@nicowilliams
Contributor

Yes, we should do this. Questions:

  • what hash (and MAC) function implementations should we use?

    One obvious answer is: use the code from various IETF RFCs. This won't be very optimized (at all, really), and may have timing side channels to worry about. (Speaking of timing side channels, we should have a constant-time string comparison function. In any case, we should not recommend jq for cryptographic security applications.)

  • should we have an option to use OpenSSL?

    Probably, but we should also have an option to use statically-linked, in-tree implementations wherever possible, and not by including an entire copy of OpenSSL in-tree as we did for Oniguruma.
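
The constant-time comparison point can be illustrated with a short sketch. It is written in Python using the standard `hmac` module and is not a proposal for jq's implementation; `sign` and `verify` are hypothetical helper names.

```python
import hashlib
import hmac

def sign(key, message):
    """Return the hex HMAC-SHA256 tag of message under key (both bytes)."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key, message, tag):
    """Check a tag without leaking, via timing, where a mismatch occurs."""
    expected = sign(key, message)
    # compare_digest takes time independent of the position of the first
    # differing character, unlike a plain == on strings.
    return hmac.compare_digest(expected, tag)

key = b"example-key"  # placeholder key for illustration only
tag = sign(key, b"payload")
print(verify(key, b"payload", tag))    # True
print(verify(key, b"tampered", tag))   # False
```

An ordinary string comparison may return as soon as it finds a difference, which is the timing side channel a dedicated comparison function avoids.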

Also, we're going to need a base64 decoder. And, really, we need a binary "type" -- basically pretending that binary data is actually an array of small numbers (0..255, naturally). We don't want to be base64 coding all the time.
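
The idea of treating binary data as an array of small numbers can be sketched as follows. This is Python for illustration only, with hypothetical helper names; how jq would actually represent binary values is not decided here.

```python
import base64

def base64_to_byte_array(text):
    """Decode base64 text into a list of integers in the range 0..255."""
    return list(base64.b64decode(text))

def byte_array_to_base64(values):
    """Encode a list of integers in the range 0..255 back to base64 text."""
    return base64.b64encode(bytes(values)).decode("ascii")

print(base64_to_byte_array("aGVsbG8="))                    # [104, 101, 108, 108, 111]
print(byte_array_to_base64([104, 101, 108, 108, 111]))     # aGVsbG8=
```

With such a representation, a hash function could consume or produce byte arrays directly, and base64 would only be needed at the edges.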

@rmetzler

For base64 decoder see #47

This works for me, but for my use case I need base64decode + md5 (calculating fingerprints of public ssh keys).
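
For reference, the base64decode + md5 combination for SSH key fingerprints looks roughly like this outside jq. A Python sketch with a hypothetical `md5_fingerprint` helper; it assumes the usual one-line public key layout (type, base64 blob, optional comment) and the legacy colon-separated MD5 fingerprint format. The key below is a placeholder, not a real key.

```python
import base64
import hashlib

def md5_fingerprint(public_key_line):
    """MD5 of the base64-decoded key blob, as colon-separated hex pairs."""
    blob_b64 = public_key_line.split()[1]  # second field is the key blob
    digest = hashlib.md5(base64.b64decode(blob_b64)).hexdigest()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

# Placeholder line: the "blob" is just base64 of b"hello".
print(md5_fingerprint("ssh-rsa aGVsbG8= user@example"))
# 5d:41:40:2a:bc:4b:2a:76:b9:71:9d:91:10:17:c5:92
```

The hash has to be taken over the decoded bytes, not over the base64 text, which is why a decoder (or a binary type) is a prerequisite.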

@jonathanwiesel

This looks like a great idea. Is it still planned?

@meticulo3366

meticulo3366 commented Nov 15, 2022

Can we try to implement this? :) We have @base64 today, which is pretty useful; some kind of md5 hash would be amazing.

Documentation and examples for @base64 are at https://stedolan.github.io/jq/manual/#example71

@wader
Member

wader commented Nov 15, 2022

Hi, fq has support for some hash functions if you want to play around:

$ echo -n hello | fq -rRs 'tomd5|tohex'
5d41402abc4b2a76b9719d911017c592
$ echo -n hello | md5
5d41402abc4b2a76b9719d911017c592

(You also need tohex because tomd5 returns a binary type.)

@tianon

tianon commented Dec 3, 2024

I think #1931 is also related -- the use case I'm looking at is that I've got a JSON document with some base64 data in it and a sha256 digest of that (potentially binary/non-UTF8) data, and I'd love a way to validate the base64 matches the checksum without having to round trip outside jq (because every round trip out of jq is ~expensive).

So for my use case, I'd need either a solution to #1931 or direct-base64-consuming variants of the checksum functions. 😞

(For more about the use case, see the data+digest fields of https://github.com/opencontainers/image-spec/blob/v1.1.0/descriptor.md)
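
The check described above can be sketched outside jq as follows. This is a Python illustration with a hypothetical `data_matches_digest` helper; it assumes a descriptor-like object with a base64 `data` field and a `digest` field of the form `sha256:<hex>`, and it only handles sha256.

```python
import base64
import hashlib

def data_matches_digest(descriptor):
    """Return True if the decoded data hashes to the stated sha256 digest."""
    algorithm, _, expected_hex = descriptor["digest"].partition(":")
    if algorithm != "sha256":
        raise ValueError("unsupported digest algorithm: " + algorithm)
    raw = base64.b64decode(descriptor["data"])  # may be non-UTF-8 bytes
    return hashlib.sha256(raw).hexdigest() == expected_hex

# Build a consistent example from arbitrary binary bytes.
payload = b"\x00\xff binary payload"
descriptor = {
    "data": base64.b64encode(payload).decode("ascii"),
    "digest": "sha256:" + hashlib.sha256(payload).hexdigest(),
}
print(data_matches_digest(descriptor))  # True
```

The hash is computed over the decoded bytes, so the payload never has to be a valid UTF-8 string; that is the step jq currently cannot express without leaving the process.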
