dlopen(), random, I/O, eval, and co-expressions (co-routines) #1843

Open
wants to merge 38 commits into
base: master

nicowilliams
Contributor

@nicowilliams nicowilliams commented Feb 25, 2019

This PR adds:

  • access to the system RNG on *nix (via /dev/urandom) and Windows (via CNG)
  • dynamically-loadable C-coded builtin support for modules
  • support for "builtin modules" named "jq/..."
  • support for "file handles"
  • a builtin "jq/io" module that has more advanced I/O facilities using file handles (the user must use the -I/--io command-line option to enable this)
  • a builtin "jq/proc" module that has bindings for popen(3) and system(3) (the user must use the -I/--io command-line option to enable this)
  • unwind(protect) -- basically a finally
  • fixes "try/catch catches more than it should" (#1859) and "ER: parsing of try ... catch ..." (#1608), and introduces a try-catch-finally as well
  • eval
  • co-expressions (co-routines)

See README.plugins.md for details on how to write C-coded functions for jq modules.

Windows builds, but who knows if it works.

TBD:

  • add support for some sort of authorization policy for I/O (sandbox)
  • code review

@coveralls

coveralls commented Feb 25, 2019

Coverage Status

Coverage decreased (-15.5%) to 68.633% when pulling f339f1d on nicowilliams:dlopen into e7d3798 on stedolan:master.

@nicowilliams nicowilliams changed the title WIP: add dlopen support Add dlopen support Feb 26, 2019
@nicowilliams nicowilliams changed the title Add dlopen support WIP: Add dlopen support Feb 26, 2019
@nicowilliams
Contributor Author

nicowilliams commented Mar 5, 2019

I just wrote a little tool to extract function names and types from DWARF dumps, scripts/dwarf2h. I'm using it to generate the typedefs, v-tables, macros, and such for the plugin system.

We probably don't want to run this from the build, as it would add a dependency on llvm-6.0, though maybe in a clang build we could do it. Besides, we want to have a struct per-ABI number... So generating this will have to be a manual step to be run whenever a jv or jq function is added that plugins should be able to call.

@nicowilliams nicowilliams force-pushed the dlopen branch 4 times, most recently from 935d382 to bc4e6a4 Compare March 6, 2019 20:16
@nicowilliams nicowilliams changed the title WIP: Add dlopen support Add dlopen support Mar 6, 2019
@nicowilliams nicowilliams force-pushed the dlopen branch 3 times, most recently from 68c0890 to a781366 Compare March 7, 2019 05:15
@nicowilliams
Contributor Author

@elric1 here's dlopen() functionality for jq. @muhmuhten's work made this easier.

@muhmuhten maybe you'd like to take a look? Now we're not far from being able to have builtin modules.

@joelpurra I notice you have a jq-coded PRNG. Maybe now you can have a proper RNG. tests/modules/somod/ has a trivial function to read random numbers from /dev/urandom, and it wouldn't be hard to make this better and work on Windows, and then make a builtin proper.

@nicowilliams
Contributor Author

This needs a way to say: don't export the C-coded builtins. Then we could actually have something like C-coded builtins that provide file I/O with file handles and all. The jq-coded wrappers can use try/catch to correctly cleanup on scope exit. We could have something like:

def appendfile(output):
    _openappend as $fh
  | try _writefh($fh; output) catch (_closefh($fh),error);

Here the file handle does not even leak, so this is very idiomatic. The value of $fh need not even be unpredictable, since _writefh wouldn't be exposed, but if file handle values are unpredictable then we can export writefh and have something of a with_open_files:

def with_open_files(body):
    [.[]|_open] as $fhs
  | $fhs
  | try body catch (($fhs[] | _closefh),error);

  [{name:"foo",write:true},{name:"bar",append:true}]
| with_open_files(. as $fhs | "hello world" | (writefh($fhs[0]), writefh($fhs[1])))

@nicowilliams nicowilliams force-pushed the dlopen branch 2 times, most recently from f49f803 to 2ffa9d6 Compare March 7, 2019 06:27
@nicowilliams
Contributor Author

And now we can haz private C-coded builtins.

@muhmuhten
Contributor

Huh. I would've expected private C-coded functions by tossing the corresponding C module on the libs list separately from the module's jq code. Then you'd always bind the jq code to the importer, you'd always bind the C module to the jq part, and optionally also bind the C functions to the importer based on the modulemeta.

Granted, that gets less export granularity, but pulls the export-or-not check up two levels. Not actually sure whether that's an improvement...

Another thought would be to add a significant "exported" key to the modulemeta and throw out non-exported top-level functions in bind_block_referenced where it sort of thematically belongs. The exported flag on cfunctions can be automatically added to that, perhaps.

@nicowilliams
Contributor Author

@muhmuhten Interesting idea. That would complicate the linker code a bit, since I'd have to add a pseudo-library with a name that cannot be imported in the middle of loading the shared object. The granularity, I think, is nice. I'll think about it.

@muhmuhten
Contributor

muhmuhten commented Mar 8, 2019

Third one sounded nice in my head but the real benefits out of it (unexported jq functions, building a list of exported functions for modulemeta to use) probably can't be realized until the builtins are a real module. I was looking into resolving library linking not being able to drop unuseds by deferring all binding until the end but that's a bit stalled on wrapping my head around import-as.

dynamically-loaded modules as a solution for the I/O problem looks pretty neat, actually.

(Seems like the easiest such interface to implement would basically be explicitly throwing around file descriptors, though, or reinventing them by passing around indices into a C-side array to make FILE pointers json-representable. That's not necessarily a bad thing if we can write multiple modules with the same high-level interface though.)

@nicowilliams
Contributor Author

Right, I'd want to use an object like so as a file handle: {kind:..., index:N, verifier:...}. The idea is that it should not be possible to guess a file handle value. The index would be an index into an array -- just like a file descriptor, or, in some cases, actually a file descriptor.
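A minimal sketch of such a handle table, in Python rather than the PR's C, to show how the verifier makes a handle unguessable even though the index is predictable. The names HandleTable, register, lookup, and close are hypothetical and do not come from the PR.

```python
import hmac
import secrets


class HandleTable:
    """Maps JSON-representable handles to open resources (hypothetical design)."""

    def __init__(self):
        # index -> (kind, verifier, resource), or None once the handle is closed
        self._slots = []

    def register(self, kind, resource):
        verifier = secrets.token_hex(16)  # 128 random bits, hex-encoded
        self._slots.append((kind, verifier, resource))
        return {"kind": kind, "index": len(self._slots) - 1, "verifier": verifier}

    def lookup(self, handle):
        index = handle.get("index")
        if not isinstance(index, int) or not 0 <= index < len(self._slots):
            raise ValueError("invalid handle")
        slot = self._slots[index]
        if slot is None:
            raise ValueError("invalid handle")
        kind, verifier, resource = slot
        # Constant-time comparison, so the verifier cannot be probed byte by byte.
        supplied = str(handle.get("verifier")).encode("utf-8")
        if handle.get("kind") != kind or not hmac.compare_digest(
            supplied, verifier.encode("utf-8")
        ):
            raise ValueError("invalid handle")
        return resource

    def close(self, handle):
        self.lookup(handle)  # validates the handle first
        self._slots[handle["index"]] = None
```

For co-routines, one such table would be shared by the related interpreter states instead of living in a global.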

For co-routines the handle namespace would have to be shared by all the related jq_states, but not global.

@nicowilliams
Contributor Author

Actually, I too dislike the new bindflag I added. I think instead I'll also add a hidden field to inst, and then mark all non-exported functions as hidden after the block_bind_self() step, and block_bind_subblock_inner() would be changed to ignore binders marked hidden.

(We could also not include non-exported functions from a library's defs blocks, but then we'd have to count references to insts via bound_by so that we don't leak in block_free(). I'd rather not do this.)

@nicowilliams nicowilliams force-pushed the dlopen branch 2 times, most recently from fb024cc to 46fb339 Compare March 9, 2019 20:16
@nicowilliams
Contributor Author

I should point out that this has become a collaboration. @leonid-s-usov isn't just reviewing the code, but producing fixes. For example, in his branch he took my one-opcode-bytecoded-function-inliner and generalized it to support inlining of larger functions and even of functions that have params! That is truly fantastic! (Though, obviously, because of the 16-bit instruction count limit, this has to be used sparingly, and will be.)

@nicowilliams
Contributor Author

So the interesting case that UNWINDING (which will get renamed) creates is that we now will have this sort of thing:

lhs | try(stuff | protect(write_footer) | more_stuff; error_handler) | outside

and we want write_footer to run when backtracking normally, when more_stuff raises, and when outside raises, but we don't want error_handler to run if only outside raises. And we also have to consider the possibility that write_footer raises an error itself.

So, PROTECT and protect/1 will re-raise the error they catch (if they catch one) as long as the protect handler does not itself raise an error. And if the protect handler does raise an error, then PROTECT/protect/1 will let that one take over.
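The error-propagation rule described here happens to match Python's try/finally, so a rough model can be written in a few lines. This is only an analogy under that assumption: it covers the error paths, not jq's normal backtracking, and run_protected is a made-up name.

```python
def run_protected(body, handler):
    """Run body(), then always run handler(), mimicking the described protect/1."""
    try:
        return body()
    finally:
        # If handler() raises, its exception replaces any error from body().
        # If it returns normally, an error from body() keeps propagating.
        handler()


log = []


def failing_body():
    raise RuntimeError("more_stuff failed")


def write_footer():
    log.append("footer written")


def broken_footer():
    raise ValueError("footer failed")


# Case 1: the handler succeeds, so the original error is re-raised.
try:
    run_protected(failing_body, write_footer)
except RuntimeError as exc:
    log.append(str(exc))

# Case 2: the handler raises, so its error takes over.
try:
    run_protected(failing_body, broken_footer)
except ValueError as exc:
    log.append(str(exc))
```

After both cases run, log holds the footer message, then the body's error, then the handler's error, in that order.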

leonid-s-usov and others added 6 commits January 29, 2020 22:21
when considering to add RET_JQ at the end of a block.
currently two instructions backtrack: BACKTRACK and TAIL_OUT
In case a path gets deleted, we should iterate arrays backwards in
path/1 context.
@bb010g

bb010g commented Apr 9, 2022

Could the non-I/O parts of this PR be split out to another PR that would be easier to review & get merged?

@thaliaarchi
Contributor

What's blocking this PR? Does it have unfinished features? Does it have difficult merge conflicts? If it's too large to review, I could take a stab at splitting it into smaller, more focused PRs, via careful git rebase (I've done this kind of thing often).

I'd also suggest an execve API matching the system API. Then the argument parsing could be done by jq and passed raw without shell parsing.
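To illustrate why an argument-vector API avoids shell parsing, here is a small Python sketch using subprocess.run with a list, which hands the arguments to the child without a shell in between. It shows the idea only; it is not the proposed jq builtin, and run_argv is a hypothetical helper name.

```python
import subprocess
import sys


def run_argv(argv):
    """Run a program from an argument vector, with no shell in between."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


# The last argument is full of shell metacharacters. Because no shell parses
# it, the child receives it as one literal string.
child_program = "import sys; print(sys.argv[1])"
output = run_argv([sys.executable, "-c", child_program, "a; echo b | $(c)"])
```

With popen(3) or system(3) the same string would be interpreted by the shell, so the caller would have to quote it correctly first.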

@nicowilliams
Contributor Author

> What's blocking this PR? Does it have unfinished features? Does it have difficult merge conflicts? If it's too large to review, I could take a stab at splitting it into smaller, more focused PRs, via careful git rebase (I've done this kind of thing often).

Mainly I think we need a careful review of the design (i.e., the signatures and semantics of the new builtins) and the code too, of course. And I need the energy to finish it, or someone who has that energy to step in.

> I'd also suggest an execve API matching the system API. Then the argument parsing could be done by jq and passed raw without shell parsing.

Oh yes, that's on my list of wants, and all of posix_spawn(), because I have this silly idea that one could build a shell in jq, and move all the jq command-line options parsing into jq code.

@thaliaarchi
Contributor

random, randombytes/1, and randomstring/1 seem to be functionally separated from the rest. They don't depend on the IO permission system and seem self-contained. Could that be split off? With your go-ahead, I'd be interested in taking that on.

I'm well familiar with the internals of MT19937, the PRNG used by Python and many others. Although it's not cryptographically secure, I have context from its API design.

random/0 produces a random int with 51 bits of precision. IEEE-754 (and thus JavaScript, JSON, and jq) float64 has 53 bits of precision, so the mask remainder bits should be changed from 0x7 to 0x1f. I assume those two missing bits were a typo. Since jq doesn't actually deal with ints, I think this should be renamed to randomint/0.
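The bit arithmetic can be checked with a short sketch. It assumes the random value is built from a 7-byte big-endian buffer whose top byte is masked, which is a guess at the layout and not taken from the PR's code; mask_top_byte is a hypothetical helper.

```python
def mask_top_byte(raw, top_mask):
    """Interpret 7 bytes as a big-endian integer after masking the top byte."""
    masked = bytes([raw[0] & top_mask]) + raw[1:]
    return int.from_bytes(masked, "big")


all_ones = b"\xff" * 7

# 0x07 keeps 3 bits of the top byte: 3 + 6 * 8 = 51 bits in total.
max_with_0x07 = mask_top_byte(all_ones, 0x07)

# 0x1f keeps 5 bits of the top byte: 5 + 6 * 8 = 53 bits in total.
max_with_0x1f = mask_top_byte(all_ones, 0x1F)

# Every 53-bit integer survives a round trip through float64.
round_trips = int(float(max_with_0x1f)) == max_with_0x1f
```

Under that assumption the 0x1f mask yields values up to 2^53 - 1, the largest range in which every integer is still exactly representable as a float64.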

randombytes/1 looks fine. If jq arrays can have a pre-allocated capacity, that would make it faster, though.

randomstring/1 seems dubious to me. It tries to treat each codepoint as its own random unit, but its generated codepoints are in the wrong range. It takes two bytes as a uint16 from the random buffer and encodes them as a codepoint. This generates codepoints in the range U+0000–U+FFFF. It should really generate in the range U+0000–U+D7FF and U+E000–U+10FFFF, to exclude surrogate halves and include codepoints outside the Basic Multilingual Plane.
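One way to draw uniformly from exactly that range is to pick an index among the 1,112,064 Unicode scalar values and skip over the surrogate block. The sketch below does this in Python; random_scalar and randstring are illustrative names here, not the PR's C implementation.

```python
import secrets

SURROGATE_START = 0xD800
SURROGATE_COUNT = 0x800  # U+D800 through U+DFFF
SCALAR_COUNT = 0x110000 - SURROGATE_COUNT  # 1,112,064 scalar values


def random_scalar():
    """Uniform Unicode scalar value: U+0000-U+D7FF or U+E000-U+10FFFF."""
    n = secrets.randbelow(SCALAR_COUNT)
    # Indices at or past the surrogate block are shifted above it.
    return n + SURROGATE_COUNT if n >= SURROGATE_START else n


def randstring(length):
    """String of `length` random scalar values; always encodable as UTF-8."""
    return "".join(chr(random_scalar()) for _ in range(length))


sample = randstring(1000)
encoded = sample.encode("utf-8")  # would raise if a surrogate slipped in
```

The result can still contain control characters, U+0000, and unassigned codepoints, so whether such a string is useful is a separate question.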

I think the API should be:

  • randfloat: float in the range [0, 1) with 53 bits of precision
  • randint: non-negative integer in the range [0, 2^53) (the current random/0 with fixes)
  • randint(max): non-negative integer in the range [0, max)
  • randint(min; max): integer in the range [min, max)
  • randstring(len): string of random valid UTF-8 codepoints (the current randomstring/1 with fixes)
  • randbytes(len): array of random integers in the range [0, 256) (the current randombytes/1)
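The integer and byte parts of this proposed API could be sketched as follows, using rejection sampling so that randint(max) has no modulo bias. This is a Python model under the assumption that the source is OS randomness (os.urandom here); the function names are adapted from the list above because Python cannot overload by arity.

```python
import os

LIMIT_53 = 2**53


def randbits53():
    """Non-negative integer in [0, 2**53) from 7 bytes of OS randomness."""
    return int.from_bytes(os.urandom(7), "big") >> 3  # 56 bits down to 53


def randint_below(limit):
    """Uniform integer in [0, limit), modelling randint(max)."""
    if not 0 < limit <= LIMIT_53:
        raise ValueError("limit must be in (0, 2**53]")
    # Largest multiple of limit that fits; draws at or above it are rejected
    # so that every residue is equally likely.
    cutoff = (LIMIT_53 // limit) * limit
    while True:
        x = randbits53()
        if x < cutoff:
            return x % limit


def randint_range(lo, hi):
    """Uniform integer in [lo, hi), modelling randint(min; max)."""
    return lo + randint_below(hi - lo)


def randbytes(length):
    """List of random integers in [0, 256), modelling randbytes(len)."""
    return list(os.urandom(length))
```

The rejection loop terminates quickly in practice, since more than half of all draws fall below the cutoff for any valid limit.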

I think that randfloat would actually be the focal point of the random API, not randint, since jq is float-first for numbers. For reference, Python implements this in random.random, originally sourced from genrand_res53 in mt19937ar. (Note that they combine the uint32 halves with double arithmetic to avoid uint64, since the code is old, but we can just use uint64.)
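Both constructions can be written side by side. The first function follows the arithmetic of genrand_res53 (27 bits from one 32-bit word, 26 from another); the second takes the top 53 bits of a single 64-bit word. The Python names are illustrative, and the randomness source here is os.urandom instead of MT19937.

```python
import os
import struct


def res53_from_halves(hi32, lo32):
    """genrand_res53-style: combine two uint32 values using double arithmetic."""
    a = hi32 >> 5  # keep the top 27 bits
    b = lo32 >> 6  # keep the top 26 bits
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)  # 2**26 and 2**53


def res53_from_u64(x):
    """uint64-style: keep the top 53 bits and scale into [0, 1)."""
    return (x >> 11) * 2.0**-53


def randfloat():
    """Float in [0, 1) with 53 bits of precision from OS randomness."""
    (x,) = struct.unpack(">Q", os.urandom(8))
    return res53_from_u64(x)
```

The two functions pick different bits from their inputs, so they do not map the same raw bytes to the same float, but each yields one of 2^53 equally spaced values in [0, 1).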

Would it make sense to have a random API for jq's big decimals? How would that work with gojq?

With it using /dev/urandom, it's more like Go crypto/rand, but I hope to get a nice API like Go math/rand.

Outside of random, I think there's plenty of room to improve the UTF-8 and byte APIs. I've been thinking about this in other contexts, so I have lots of thoughts here. If that would be welcome, I could open an issue.

If splitting random off into a separate PR and polishing it would be welcome, I'd be willing.

@thaliaarchi
Contributor

Besides random, the rest seems tightly intertwined. eval defers to coeval, which uses coexpressions and requires IO permission. I didn't review the plugin system as closely, so I haven't determined how it's connected.

Why is eval defined in terms of coeval? I see that COEVAL creates a new jq instance, which I would assume to be fairly expensive. Since I assume eval wouldn't need to be concurrent, could it parse the filter expression in the current environment?

Would coexpressions be useful broadly outside working with file handles? It might be able to be isolated from the IO changes. With the amount of new syntax it introduces, I think it would benefit from its own PR, to be able to discuss its syntax and semantics.

@nicowilliams
Contributor Author

> Besides random, the rest seems tightly intertwined. eval defers to coeval, which uses coexpressions and requires IO permission. I didn't review the plugin system as closely, so I haven't determined how it's connected.

Pretty much. I suppose eval shouldn't need permission.

One thing I've wondered is whether we should try to do a Haskell IO monad like thing where only the main program can "do I/O", and all modules that want to do I/O need to get utility closures from the main program. But... in jq that would just be very unwieldy.

> Why is eval defined in terms of coeval? I see that COEVAL creates a new jq instance, which I would assume to be fairly expensive. Since I assume eval wouldn't need to be concurrent, could it parse the filter expression in the current environment?

To eval we need to compile and interpret the program. I took a short-cut and simply re-used the existing compiler and VM machinery, and so eval just... runs that machinery for the given program. And as it happens that's also the easiest way to get co-routines implemented, so the two share this.

> Would coexpressions be useful broadly outside working with file handles? It might be able to be isolated from the IO changes. With the amount of new syntax it introduces, I think it would benefit from its own PR, to be able to discuss its syntax and semantics.

Any time you want breadth-first recursive traversal you'll need coexpressions. Long ago when I used Icon I rarely used coexpressions, so they might not be that necessary most of the time.
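As a rough illustration of what suspending and resuming a generator buys you, the Python sketch below interleaves several generators one output at a time, which is the kind of scheduling a breadth-first traversal over generators builds on. Python generators stand in for co-expressions here; round_robin is a hypothetical helper and says nothing about the PR's actual syntax.

```python
from collections import deque


def round_robin(*generators):
    """Yield one value from each generator in turn until all are exhausted."""
    queue = deque(iter(g) for g in generators)
    while queue:
        gen = queue.popleft()
        try:
            value = next(gen)  # resume this generator for exactly one output
        except StopIteration:
            continue  # exhausted: drop it from the rotation
        yield value
        queue.append(gen)  # suspend it and move on to the next one


interleaved = list(round_robin(iter([1, 2, 3]), iter("ab")))
```

Plain backtracking would instead run the first generator to exhaustion before the second produced anything.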

@nicowilliams
Contributor Author

nicowilliams commented Feb 8, 2024

Also, consider some options for implementing eval:

  1. write an interpreter in jq (jqjq style)
  2. compile the given program as usual but link it into the currently running program so as to reuse the existing VM
  3. compile the given program as usual and run it in a new VM
  4. a complete re-write that compiles to native code or something

(1) would be too slow.
(2) is reasonable, but since I was already doing (3) to make co-routines possible, I went with (3).
(4) is a great idea for someone with the time and energy to take it on.

@nicowilliams
Contributor Author

> random/0 produces a random int with 51 bits of precision. IEEE-754 (and thus JavaScript, JSON, and jq) float64 has 53 bits of precision, so the mask remainder bits should be changed from 0x7 to 0x1f. I assume those two missing bits were a typo. Since jq doesn't actually deal with ints, I think this should be renamed to randomint/0.

Well, jq 1.7 does have something of a bignum feature, so we could make this better. I agree that it should probably be named randomnum/0 or randomint/0.

@nicowilliams
Contributor Author

> randombytes/1 looks fine. If jq arrays can have a pre-allocated capacity, that would make it faster, though.

It would be good to finish the binary support branch and make randombytes/1 output binary.

> randomstring/1 seems dubious to me. It tries to treat each codepoint as its own random unit, but its generated codepoints are in the wrong range. It takes two bytes as a uint16 from the random buffer and encodes them as a codepoint. This generates codepoints in the range U+0000–U+FFFF. It should really generate in the range U+0000–U+D7FF and U+E000–U+10FFFF, to exclude surrogate halves and include codepoints outside the Basic Multilingual Plane.

I agree. I should remove it completely.

@nicowilliams
Contributor Author

> Outside of random, I think there's plenty of room to improve the UTF-8 and byte APIs. I've been thinking about this in other contexts, so I have lots of thoughts here. If that would be welcome, I could open an issue.

I've a branch that adds binary support :) fq-style.

Successfully merging this pull request may close these issues.

try/catch catches more than it should