
Multithreaded support #787

Closed
weaversa opened this issue Jun 27, 2020 · 11 comments
Labels
feature request: Asking for new or improved functionality
question: Not a task, but rather a question or discussion topic

Comments

@weaversa
Collaborator

Is there anything holding back a multithreaded Cryptol interpreter? I think about this every time I write a map.

@robdockins added the "feature request" and "question" labels Jun 29, 2020
@robdockins
Contributor

I guess it wouldn't actually be that hard to add some multithreaded support to the interpreter. I'm slightly worried that we might discover some race conditions in the interpreter, but given the pure nature of Cryptol itself, we might actually be in good shape.

I think the bigger question is: will there be enough inherent parallelism in the algorithms we care about to make it worthwhile? Research in automatic parallelism for functional languages has been fairly disappointing, as I recall. We might be able to add a parallel map primitive to let the user be more explicit about where to fork parallel tasks... for purely data-parallel applications this might be worthwhile.
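For concreteness, such a primitive could look exactly like map at the type level; here is a sketch of the signature and a toy use site, using the name parmap that the rest of this thread settles on (the declaration form is illustrative):

```cryptol
// Hypothetical primitive: same type as `map`, but each element may be
// evaluated in a separate parallel task.
primitive parmap : {a, b, n} (fin n) => (a -> b) -> [n]a -> [n]b

// A purely data-parallel use site: every element is independent,
// so all ten squarings can proceed concurrently.
squares : [10][8]
squares = parmap (\x -> x * x) [1 .. 10]
```

Because Cryptol is pure, the result is guaranteed to be identical to the sequential map; only the evaluation strategy changes.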

@weaversa
Collaborator Author

weaversa commented Jul 1, 2020

I do have something specific in mind: building the Merkle tree for post-quantum (PQ) Merkle signature schemes is compute-intensive, but trivially parallelizable with the parallel map primitive you suggest.
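To sketch why this parallelizes so well: each level of a Merkle tree hashes adjacent pairs of nodes, and every pair is independent, so computing one level is exactly a parallel map over the paired-up nodes. Assuming a hypothetical compression function h over two w-bit children:

```cryptol
// One Merkle-tree level: hash adjacent pairs of nodes, all independently.
// `h` is a hypothetical compression function combining two w-bit children.
merkleLevel : {n, w} (fin n, fin w) => ([2 * w] -> [w]) -> [2 * n][w] -> [n][w]
merkleLevel h nodes = parmap h (map join (groupBy`{2} nodes))
```

Iterating merkleLevel from the leaves up to a single root gives the whole tree, with full data parallelism at every level.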

@robdockins
Contributor

I was curious what it would take to do this; it turns out it isn't super hard (at least for the concrete evaluator). Some very initial, but encouraging, results:

$ time $CRYPTOL examples/AES.cry -c 'sum (map (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])'
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m14.903s
user    0m14.754s
sys     0m0.136s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N1
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m15.926s
user    0m15.774s
sys     0m0.139s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N4
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m4.537s
user    0m16.920s
sys     0m0.278s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N8
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m4.064s
user    0m28.482s
sys     0m0.823s

Speedup is clearly less than linear for this function (roughly 3.3x on 4 cores and 3.7x on 8, against the sequential baseline), but is pretty substantial nonetheless.

@weaversa
Collaborator Author

weaversa commented Jul 1, 2020

I wonder if some of the time is spent computing the key schedule. It only needs to be computed once here, and can then be reused for each of the 101 runs.
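One way to test that hypothesis, sketched with hypothetical names (examples/AES.cry may not export its key-expansion internals, so expandKey and encryptWithSchedule below stand in for whatever the module actually calls them): hoist the key schedule into a where-binding so it is computed once and shared by every encryption.

```cryptol
// Sketch with hypothetical names: compute the key schedule once,
// then share it across all 101 parallel encryptions.
sumEncShared = sum (parmap (\x -> encryptWithSchedule ks x) [0 .. 100])
  where ks = expandKey zero   // hypothetical key-expansion entry point
```

If the timings change noticeably, the key schedule was being recomputed per element in the original benchmark.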

Your parmap is great!

Now, for the second part: can you do, say, a million encrypts, or does Cryptol run out of memory?

The final part would be, obviously (@atomb), for the AES here to be a native function.

@robdockins
Contributor

I tried something a bit more modest first: 10000 encrypts. This has been slowly growing memory usage and the machine is getting sluggish, so that's not necessarily great. I guess we'd probably be better off using a more explicit thread pool rather than just spawning all the threads and letting GHC's runtime deal with it.

@robdockins
Contributor

A little more playing around and I discovered that moving to a more map-reduce style significantly improved the memory usage.

import AES

sumEnc2 xs = sum (map sumSegment ys)
  where
  ys = groupBy`{100} xs
  sumSegment y = sum (parmap (\x -> aesEncrypt (x,zero)) y)

This version seems to run in basically constant space and fills up as many cores as I give it. Hypothesis: reducing partial results early allows the memory consumed by the parallel tasks to be collected, whereas building one big sequence keeps all of it live until the final sum. We might be able to improve the internals somehow to release memory once a value is computed, but it isn't immediately clear how.

@weaversa
Collaborator Author

weaversa commented Jul 2, 2020

A good challenge would be to build a Merkle tree. It could be that parmap (with your trick) is perfect, but that Cryptol values simply use too much memory to represent such a tree for reasonable parameters. So that may be another ticket.

@robdockins
Contributor

For posterity, here are the simple examples I've been experimenting with:

import AES

mapReduce :
  {chunkSize, a, b, n}
  (fin n, fin chunkSize, chunkSize >= 1) =>
  (a -> b) -> (b -> b -> b) -> b -> [n]a -> b
mapReduce f g u xs = foldl g z zs
  where
  // Reduce one segment with the combining operator g.
  red : {m} (fin m) => [m]b -> b
  red = foldl g u

  y   : [ n % chunkSize ] a              // leftover prefix
  ys  : [ n - n % chunkSize ] a          // evenly divisible remainder
  yss : [ n / chunkSize ][chunkSize] a   // remainder split into chunks

  (y, ys) = splitAt`{ n % chunkSize } xs
  yss = groupBy`{chunkSize} ys

  z  = red (parmap f y)                  // map-reduce the leftover prefix
  zs = map (\x -> red (parmap f x)) yss  // map-reduce each chunk in turn


sumEnc xs = sum (parmap (\x -> (aesEncrypt (x, zero))) xs)

sumEnc2 xs = sum (map sumSegment ys)
  where
  ys = groupBy`{100} xs
  sumSegment y = sum (parmap (\x -> aesEncrypt (x,zero)) y)

sumEnc3 = mapReduce`{100} (\x -> aesEncrypt (x,zero)) (+) 0

sumEnc has rather poor memory usage as the number of elements scales. After some time, processor usage would periodically stall and the machine would swap for a while before saturating the cores again; I interpret these as major garbage collection events. I eventually had to cancel largish jobs using this method before they finished.

This seems to be fixed in both sumEnc2 and sumEnc3, which appear to run in constant memory. Both were able to run successfully with an input argument of [1 .. 100000], which takes several hours. The job kept all 4 cores I allocated saturated the entire time, GC pauses appeared to be minimal, and the machine did not have swapping issues.

@robdockins
Contributor

The parmap primitive is now included in master as of #792, which should allow some additional experimentation.

@robdockins
Contributor

In #868, I've updated parmap to reduce its arguments to normal form (NF) rather than just to weak head normal form (WHNF) and used a strictly-accumulating version of sum. With this done, the performance difference between the naive sumEnc above and the trickier mapReduce version basically disappears (mapReduce still has a minor advantage, but it is quite small).

@robdockins
Contributor

With #867 and #868 merged, I think we can consider this closed.
