
Multithreaded support #787

Closed
weaversa opened this issue Jun 27, 2020 · 11 comments
Labels
feature request: Asking for new or improved functionality
question: Not a task, but rather a question or discussion topic

Comments

@weaversa
Collaborator

Is there anything holding back a multithreaded Cryptol interpreter? I think about this every time I write a map.

@robdockins added the "feature request" and "question" labels Jun 29, 2020
@robdockins
Contributor

I guess it wouldn't actually be that hard to add some multithreaded support to the interpreter. I'm slightly worried that we might discover some race conditions in the interpreter, but given the pure nature of Cryptol itself, we might actually be in good shape.

I think the bigger question is: will there be enough inherent parallelism in the algorithms we care about to make it worthwhile? Research in automatic parallelism for functional languages has been fairly disappointing, as I recall. We might be able to add a parallel map primitive to let the user be more explicit about where to fork parallel tasks... for purely data-parallel applications this might be worthwhile.
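For concreteness, such a primitive could look exactly like map at the type level; here is a sketch of the signature and a toy use site, using the name parmap that the rest of this thread settles on (the declaration form is illustrative):

```cryptol
// Hypothetical primitive: same type as `map`, but each element may be
// evaluated in a separate parallel task.
primitive parmap : {a, b, n} (fin n) => (a -> b) -> [n]a -> [n]b

// A purely data-parallel use site: every element is independent,
// so all ten squarings can proceed concurrently.
squares : [10][8]
squares = parmap (\x -> x * x) [1 .. 10]
```

Because Cryptol is pure, the result is guaranteed to be identical to the sequential map; only the evaluation strategy changes.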

@weaversa
Collaborator Author

weaversa commented Jul 1, 2020

I do have something specific in mind: building the Merkle tree for post-quantum (PQ) Merkle signature schemes is compute-intensive, but trivially parallelizable with the parallel map primitive you suggest.
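To sketch why this parallelizes so well: each level of a Merkle tree hashes adjacent pairs of nodes, and every pair is independent, so computing one level is exactly a parallel map over the paired-up nodes. Assuming a hypothetical compression function h over two w-bit children:

```cryptol
// One Merkle-tree level: hash adjacent pairs of nodes, all independently.
// `h` is a hypothetical compression function combining two w-bit children.
merkleLevel : {n, w} (fin n, fin w) => ([2 * w] -> [w]) -> [2 * n][w] -> [n][w]
merkleLevel h nodes = parmap h (map join (groupBy`{2} nodes))
```

Iterating merkleLevel from the leaves up to a single root gives the whole tree, with full data parallelism at every level.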

@robdockins
Contributor

I was curious what it would take to do this; it turns out it isn't super hard (at least for the concrete evaluator). Some very initial, but encouraging, results:

$ time $CRYPTOL examples/AES.cry -c 'sum (map (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])'
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m14.903s
user    0m14.754s
sys     0m0.136s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N1
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m15.926s
user    0m15.774s
sys     0m0.139s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N4
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m4.537s
user    0m16.920s
sys     0m0.278s

$ time $CRYPTOL examples/AES.cry -c 'sum (parmap (\x -> (aesEncrypt (x, zero))) [ 0 .. 100 ])' +RTS -N8
Loading module Cryptol
Loading module AES
0x8362b33433e62b81f9a5217bbbb486f9

real    0m4.064s
user    0m28.482s
sys     0m0.823s

Speedup is clearly less than linear for this function (roughly 3.3x on 4 cores and 3.7x on 8, against the sequential baseline), but is pretty substantial nonetheless.

@weaversa
Collaborator Author

weaversa commented Jul 1, 2020

I wonder if some of the time is spent computing the key schedule. It only needs to be computed once here, and can then be reused for each of the 101 runs.
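One way to test that hypothesis, sketched with hypothetical names (examples/AES.cry may not export its key-expansion internals, so expandKey and encryptWithSchedule below stand in for whatever the module actually calls them): hoist the key schedule into a where-binding so it is computed once and shared by every encryption.

```cryptol
// Sketch with hypothetical names: compute the key schedule once,
// then share it across all 101 parallel encryptions.
sumEncShared = sum (parmap (\x -> encryptWithSchedule ks x) [0 .. 100])
  where ks = expandKey zero   // hypothetical key-expansion entry point
```

If the timings change noticeably, the key schedule was being recomputed per element in the original benchmark.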

Your parmap is great!

Now, for the second part: can you do, say, a million encrypts, or does Cryptol run out of memory?

The final part would be, obviously (@atomb), for the AES here to be a native function.

@robdockins
Contributor

I tried something a bit more modest first: 10000 encrypts. This has been slowly growing memory usage and the machine is getting sluggish, so that's not necessarily great. I guess we'd probably be better off using a more explicit thread pool rather than just spawning all the threads and letting GHC's runtime deal with it.

@robdockins
Contributor

A little more playing around and I discovered that moving to a more map-reduce style significantly improved the memory usage.

import AES

sumEnc2 xs = sum (map sumSegment ys)
  where
  ys = groupBy`{100} xs
  sumSegment y = sum (parmap (\x -> aesEncrypt (x,zero)) y)

This version seems to run in basically constant space and fills up as many cores as I give it. Hypothesis: reducing partial results early allows the memory consumed by the parallel tasks to be collected, whereas building one big sequence keeps all of it live until the final sum. We might be able to improve the internals somehow to release memory once a value is computed, but it isn't immediately clear how.

@weaversa
Collaborator Author

weaversa commented Jul 2, 2020

A good challenge would be to build a Merkle tree. It could be that parmap (with your trick) is perfect, but that Cryptol values simply use too much memory to represent such a tree for reasonable parameters. So that may be another ticket.

@robdockins
Contributor

For posterity, here are the simple examples I've been experimenting with:

import AES

mapReduce :
  {chunkSize, a, b, n}
  (fin n, fin chunkSize, chunkSize >= 1) =>
  (a -> b) -> (b -> b -> b) -> b -> [n]a -> b
mapReduce f g u xs = foldl g z zs
  where
  // Reduce one segment with the combining operator g.
  red : {m} (fin m) => [m]b -> b
  red = foldl g u

  y   : [ n % chunkSize ] a              // leftover prefix
  ys  : [ n - n % chunkSize ] a          // evenly divisible remainder
  yss : [ n / chunkSize ][chunkSize] a   // remainder split into chunks

  (y, ys) = splitAt`{ n % chunkSize } xs
  yss = groupBy`{chunkSize} ys

  z  = red (parmap f y)                  // map-reduce the leftover prefix
  zs = map (\x -> red (parmap f x)) yss  // map-reduce each chunk in turn


sumEnc xs = sum (parmap (\x -> (aesEncrypt (x, zero))) xs)

sumEnc2 xs = sum (map sumSegment ys)
  where
  ys = groupBy`{100} xs
  sumSegment y = sum (parmap (\x -> aesEncrypt (x,zero)) y)

sumEnc3 = mapReduce`{100} (\x -> aesEncrypt (x,zero)) (+) 0

sumEnc has rather poor memory usage as the number of elements scales. After some time, processor usage would periodically stall and the machine would swap for a while before saturating the cores again; I interpret these as major garbage collection events. I eventually had to cancel largish jobs using this method before they finished.

This seems to be fixed in both sumEnc2 and sumEnc3, which appear to run in constant memory. Both were able to run successfully with an input argument of [1 .. 100000], which takes several hours. The job kept all 4 cores I allocated saturated the entire time, GC pauses appeared to be minimal, and the machine did not have swapping issues.

@robdockins
Contributor

The parmap primitive is now included in master as of #792, which should allow some additional experimentation.

@robdockins
Contributor

In #868, I've updated parmap to reduce its arguments to normal form (NF) rather than just to weak head normal form (WHNF) and used a strictly-accumulating version of sum. With this done, the performance difference between the naive sumEnc above and the trickier mapReduce version basically disappears (mapReduce still has a minor advantage, but it is quite small).

@robdockins
Contributor

With #867 and #868 merged, I think we can consider this closed.
