There was a recent question on haskell-cafe about how to parallelize a simple “rule engine”. I spent a while investigating it, and the solution is both pretty and fast. My response on haskell-cafe came out really messy because gmail is silly, so here it is, cleaned up, in literate Haskell:

> {-# LANGUAGE RankNTypes #-}
>
> import qualified Data.MemoCombinators as Memo
> import qualified Data.Set as Set
> import Control.Parallel (par)
> import qualified Control.Parallel.Strategies as Par
> import Data.Monoid (Monoid(..))
> import Control.Monad.State
> import qualified Data.DList as DList

First, I want to capture the idea of a generative set like you’re doing. GenSet is like a set, with the constructor “genset x xs” which says “if x is in the set, then so are xs”.

I’ll represent it as a stateful computation of the list of things in the set, threading the set of things we’ve seen so far. It’s redundant information, but sets can’t be consumed lazily, thus the list (the set will follow along lazily :-).

Remember that State s a is just a function s -> (a, s). So we’re taking the set of things we’ve seen so far, and returning the new elements added along with the set unioned with those elements.
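To make the threading concrete, here’s a small standalone sketch (not part of the article’s program; `addNew` is a made-up name for illustration) that walks a list while threading a “seen” set through State, keeping only the first occurrence of each element:

```haskell
import Control.Monad.State
import qualified Data.Set as Set

-- Thread a set of already-seen elements through the traversal,
-- returning each element only the first time it appears.
addNew :: Ord a => [a] -> State (Set.Set a) [a]
addNew [] = return []
addNew (x:xs) = do
    seen <- gets (x `Set.member`)
    if seen
        then addNew xs
        else modify (Set.insert x) >> fmap (x:) (addNew xs)

main :: IO ()
main = print (evalState (addNew [1,2,1,3,2,4::Int]) Set.empty)
-- prints [1,2,3,4]
```

Running the computation with `evalState … Set.empty` discards the final set and keeps just the list, which is exactly what `toList` below does for GenSet.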

> newtype GenSet a
>     = GenSet (State (Set.Set a) (DList.DList a))
>
> genset :: (Ord a) => a -> GenSet a -> GenSet a
> genset x (GenSet f) = GenSet $ do
>     seen <- gets (x `Set.member`)
>     if seen
>         then return mempty
>         else fmap (DList.cons x) $
>                  modify (Set.insert x) >> f
>
> toList :: GenSet a -> [a]
> toList (GenSet f) = DList.toList $ evalState f Set.empty

GenSet is a monoid, where mappend is just union.

> instance (Ord a) => Monoid (GenSet a) where
>     mempty = GenSet (return mempty)
>     mappend (GenSet a) (GenSet b) =
>         GenSet (liftM2 mappend a b)

Okay, so that’s how we avoid exponential behavior when traversing the tree. We can now just toss around GenSets like they’re sets and everything will be peachy.
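To see a GenSet in action on its own, here’s a standalone sketch of the same idea, simplified to use plain lists instead of DList and a `gsUnion` function in place of the Monoid instance; the `descendants` rule is made up for illustration:

```haskell
import Control.Monad.State
import qualified Data.Set as Set

-- Simplified GenSet: plain lists instead of DList.
newtype GenSet a = GenSet (State (Set.Set a) [a])

genset :: Ord a => a -> GenSet a -> GenSet a
genset x (GenSet f) = GenSet $ do
    seen <- gets (x `Set.member`)
    if seen
        then return []                               -- already generated: contribute nothing
        else fmap (x :) (modify (Set.insert x) >> f) -- record x, then generate its consequences

gsEmpty :: GenSet a
gsEmpty = GenSet (return [])

gsUnion :: GenSet a -> GenSet a -> GenSet a
gsUnion (GenSet a) (GenSet b) = GenSet (liftM2 (++) a b)

toList :: GenSet a -> [a]
toList (GenSet f) = evalState f Set.empty

-- A made-up generative rule: n is in the set, and so are n+1 and n+2, up to 5.
descendants :: Int -> GenSet Int
descendants n
    | n > 5     = gsEmpty
    | otherwise = genset n (descendants (n + 1) `gsUnion` descendants (n + 2))

main :: IO ()
main = print (toList (descendants 0))  -- prints [0,1,2,3,4,5]
```

Even though `descendants` branches exponentially, the threaded seen-set cuts every revisited subtree off at its root, so each element is generated exactly once.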

Here’s the heart of the algorithm: the reduce function. To avoid recomputation of rules, we could just memoize the rule function. But we’ll do something a little more clever. The function we’ll memoize (“parf”) first sparks a thread computing its *last* child. Because the search is depth-first, it will typically be a while until we get to the last one, so we benefit from the spark (you don’t want to spark a thread computing something you’re about to compute anyway).
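For reference, Data.MemoCombinators exposes memoizers of type `(a -> r) -> (a -> r)`. Here’s a base-only sketch of that shape for non-negative Ints, backed by a lazy infinite list; it’s a stand-in for the package’s `Memo.integral`, which uses a more efficient structure:

```haskell
-- Memo-combinator shape: take an open-recursive function, return a
-- memoized version. The lazy table is shared across all calls.
memoNat :: (Int -> r) -> (Int -> r)
memoNat f = (table !!)
  where table = map f [0..]

-- Classic example: tie the knot through the memo table.
fib :: Int -> Integer
fib = memoNat fib'
  where
    fib' 0 = 0
    fib' 1 = 1
    fib' n = fib (n - 1) + fib (n - 2)

main :: IO ()
main = print (fib 50)  -- prints 12586269025, instantly
```

The recursive calls in `fib'` go back through the memoized `fib`, so each value is computed once; naive `fib 50` would take ages.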

> reduce :: (Ord a) => Memo.Memo a -> (a -> [a]) -> a -> [a]
> reduce memo f x = toList (makeSet x)
>     where
>     makeSet x = genset x . mconcat . map makeSet . f' $ x
>     f' = memo parf
>     parf a = let ch = f a in
>              ch `seq` (f' (last ch) `par` ch)

The ch `seq` is there so that the evaluation of ch and last ch aren’t competing with each other.
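The sparking pattern in parf can be seen in isolation with a toy child function, using `par` from GHC.Conc in base, the same primitive that Control.Parallel re-exports; `children` and the deliberately slow `fib` are made up for illustration:

```haskell
import GHC.Conc (par)

-- Deliberately slow Fibonacci, standing in for expensive child work.
fib :: Int -> Integer
fib n | n < 2     = fromIntegral n
      | otherwise = fib (n - 1) + fib (n - 2)

-- The parf pattern: force the child list to WHNF first (ch `seq`), so the
-- spark isn't competing with the caller to build the list itself, then
-- spark work on the *last* child while returning ch for the depth-first
-- traversal to walk front-first. (In the article's reduce, the sparked
-- call goes through the memo table, so its result is shared with the
-- traversal that later reaches it.)
children :: Int -> [Int]
children n =
    let ch = [n + 1, n + 2, n + 3]
    in ch `seq` (fib (last ch) `par` ch)

main :: IO ()
main = print (sum (map fib (children 10)))  -- prints 466
```

Compiled with -threaded and run with +RTS -N2, the sparked `fib (last ch)` can run on another capability while the caller is still busy near the front of the list; without -threaded it simply fizzles, and the program is still correct.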

Your example had a few problems. You said the rule was supposed to be expensive, but yours was cheap. Also, [x-1,x-2,x-3] are all very near each other, so it’s hard to go do unrelated stuff. I made a fake expensive function before computing the neighbors, and tossed around some prime numbers to scatter the space more.

> rule :: Int -> [Int]
> rule n = expensive `seq`
>          [next 311 4, next 109 577, next 919 353]
>     where
>     next x y = (x * n + y) `mod` 5000
>     expensive = sum [1..50*n]
>
> main :: IO ()
> main = do
>     let r = reduce Memo.integral rule 1
>     print (length r)

The results are quite promising:

% ghc --make -O2 rules2 -threaded
% time ./rules2
5000
./rules2  13.25s user 0.08s system 99% cpu 13.396 total
% time ./rules2 +RTS -N2
5000
./rules2 +RTS -N2  12.52s user 0.30s system 159% cpu 8.015 total

That’s a 40% decrease in running time! Woot! I’d love to see what it does on a machine with more than 2 cores.

RE: What it does with more than 2 cores

time ./rules2 +RTS -N1

real    0m10.462s
user    0m10.345s
sys     0m0.120s

time ./rules2 +RTS -N2

real    0m5.455s
user    0m9.249s
sys     0m0.388s

time ./rules2 +RTS -N4

real    0m3.870s
user    0m10.477s
sys     0m0.344s

-N3 and -N4 were roughly the same.

With my current working version of GHC, and a couple of flags, I can do a bit better (this is a slower machine, but has 8 cores). It starts to flatten out around 5 cores:

time ./par-rewrite +RTS -N1

5000

17.34s real 17.25s user 0.09s system 99% ./par-rewrite +RTS -N1

time ./par-rewrite +RTS -N4 -qg0 -qb

5000

5.01s real 19.85s user 0.06s system 397% ./par-rewrite +RTS -N4 -qg0 -qb

time ./par-rewrite +RTS -N8 -qg0 -qb

5000

3.93s real 30.21s user 0.12s system 772% ./par-rewrite +RTS -N8 -qg0 -qb

That’s a 4.4-fold speedup, not bad. The flags tell the RTS to do parallel GC in generation 0, but not to do load-balancing: load-balancing is often bad for cache locality in parallel programs.