I was reading about evaluation strategies for lambda calculus on Wikipedia, and one caught my eye: “call-by-future”. The idea behind this strategy is that you evaluate the function and its argument in parallel. It’s like lazy evaluation, except you start evaluating earlier. Call by future semantics are non-strict, so I figured I could coerce Haskell to doing it.

Template Haskell to the rescue again! I wrote a module which has this function:

parApp :: (a -> b) -> (a -> b) parApp f x = x `par` f x

`par` is a special function which says to start evaluating its first argument and then return its second argument. Then it walks the syntax tree, replacing applications with applications of parApp, like:

fib 0 = 0 fib 1 = 1 fib n = fib (n-1) + fib (n-2)

Becomes:

fib 0 = 0 fib 1 = 1 fib n = parApp (parApp (+) (parApp fib (parApp (parApp (-) n) 1)) (parApp (parApp (-) n) 2))

Pretty, ain’t it? `:-p`

For the above program (`test.hs`) computing `fib 40`, when compiled with `-threaded -O2` I get the following results on a dual quad-core machine:

# command time speedup incr.speedup ./test +RTS -N1 # 20.508s 1.00 N/A ./test +RTS -N2 # 18.445s 1.11 0.56 ./test +RTS -N3 # 15.944s 1.29 0.77 ./test +RTS -N4 # 12.690s 1.58 0.94 ./test +RTS -N5 # 11.305s 1.81 0.90 ./test +RTS -N6 # 9.663s 2.12 0.97 ./test +RTS -N7 # 8.964s 2.29 0.92 ./test +RTS -N8 # 8.541s 2.40 0.92

The number after `-N` is the number of hardware threads used, the speedup is the ratio of the original speed against the multicore speed (in an ideal world this would match the number of hardware threads), and the incremental speedup is *(t _{n-1}N_{n-1})/(t_{n}N_{n})*, i.e. the fraction of what we gained

*over the previous simulation*versus what we should have in an ideal world. As long as this is near one, our time is decreasing linearly. As we can see, we pay a lot of overhead mostly at 2 and 3 processors, and after that there is little additional overhead. There is too little data here to see what the large-scale trend is, though.

Suppose that the incremental speedup goes to a constant *p < 1*. Then *t _{n+1} = n/(n+1) * t_{n}/p*. Doing a little algebra, we see that the time levels off at

*n = p/(1-p)*. So, for example, if

*p*were 0.92 as it looks like it’s going to be, then

*n = 11.5*. That is, adding a 12th processor is not going to gain you anything. The long term goal for parallelization is to get solutions where

*p*is very close to 1, because processors are going to start being cheap like hard drives, and I actually wouldn’t surprised if in 24 years, 64-cores were common (calculated using Moore’s law).

So, 8 processors, 2.4x speedup. We could certainly ask for better. But it ain’t bad considering you didn’t have to know anything about your program at all to get it there :-).