Parallel Rewrite System

There was a recent question on Haskell-cafe about how to parallelize a simple “rule engine”. I spent a while investigating it, and the solution is both pretty and fast. My response on haskell-cafe came out really messy, because gmail is silly, so here it is again, cleaned up, in literate Haskell:

> {-# LANGUAGE RankNTypes #-}
> 
> import qualified Data.MemoCombinators as Memo
> import qualified Data.Set as Set
> import Control.Parallel (par)
> import qualified Control.Parallel.Strategies as Par
> import Data.Monoid (Monoid(..))
> import Control.Monad.State
> import qualified Data.DList as DList
> 

First, I want to capture the idea of a generative set like you’re doing. GenSet is like a set, with the constructor “genset x xs”, which says “if x is in the set, then so is everything in xs”.

I’ll represent it as a stateful computation of the list of things in the set, threading the set of things we’ve seen so far. It’s redundant information, but sets can’t be consumed lazily, thus the list (the set will follow along lazily :-).

Remember that State s a is just the function (s -> (a, s)). So we’re taking the set of things we’ve seen so far, and returning the new elements added and the set unioned with those elements.

> newtype GenSet a 
>       = GenSet (State (Set.Set a) (DList.DList a))
> 
> genset :: (Ord a) => a -> GenSet a -> GenSet a
> genset x (GenSet f) = GenSet $ do
>     seen <- gets (x `Set.member`)
>     if seen
>         then return mempty
>         else fmap (DList.cons x) $ 
>                    modify (Set.insert x) >> f
> 
> toList :: GenSet a -> [a]
> toList (GenSet f) = DList.toList $ evalState f Set.empty

GenSet is a monoid, where mappend is just union.

> instance (Ord a) => Monoid (GenSet a) where
>     mempty = GenSet (return mempty)
>     mappend (GenSet a) (GenSet b) = 
>                  GenSet (liftM2 mappend a b)
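
(A note for modern compilers: since GHC 8.4, Monoid has Semigroup as a superclass, so today you’d also need an instance along these lines:)

> -- Needed on GHC >= 8.4, where Semigroup is a superclass of Monoid.
> instance (Ord a) => Semigroup (GenSet a) where
>     GenSet a <> GenSet b = GenSet (liftM2 mappend a b)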

Okay, so that’s how we avoid exponential behavior when traversing the tree. We can now just toss around GenSets like they’re sets and everything will be peachy.
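
For intuition, here’s a tiny example (a hypothetical evens, just for illustration, not part of the algorithm): each even number generates the next, up to a bound.

> -- Illustration only (hypothetical helper): membership of n implies
> -- membership of n+2, up to 10.
> evens :: Int -> GenSet Int
> evens n
>     | n >= 10   = genset n mempty
>     | otherwise = genset n (evens (n + 2))
> 
> -- toList (evens 0) == [0,2,4,6,8,10]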

Here’s the heart of the algorithm: the reduce function. To avoid recomputation of rules, we could just memoize the rule function. But we’ll do something a little more clever. The function we’ll memoize (“parf”) first sparks the computation of its *last* child. Because the search is depth-first, it will typically be a while before we get to the last one, so we benefit from the spark (you don’t want to spark something you’re about to compute anyway).

> reduce :: (Ord a) => Memo.Memo a -> (a -> [a]) -> a -> [a]
> reduce memo f x = toList (makeSet x)
>     where
>     makeSet x = genset x . mconcat . map makeSet . f' $ x
>     f' = memo parf
>     parf a = let ch = f a in 
>              ch `seq` (f' (last ch) `par` ch)

The ch `seq` forces ch before sparking, so that the evaluation of ch and last ch aren’t competing with each other.
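
(If a `par` b is unfamiliar: it sparks the evaluation of a in parallel and returns b. The usual par/seq idiom looks like this standalone sketch, with a hypothetical parSum that isn’t part of the rewrite system:)

> -- Standalone sketch (hypothetical helper, not part of the algorithm):
> -- spark the left half, evaluate the right half locally, then combine.
> parSum :: [Int] -> Int
> parSum xs = left `par` (right `seq` (left + right))
>     where
>     (as, bs) = splitAt (length xs `div` 2) xs
>     left  = sum as
>     right = sum bs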

Your example had a few problems. You said the rule was supposed to be expensive, but yours was cheap. Also, [x-1,x-2,x-3] are all very near each other, so it’s hard to go off and do unrelated work. I made a fake expensive function that runs before computing the neighbors, and tossed in some prime numbers to scatter the space more.

> rule :: Int -> [Int]
> rule n = expensive `seq` 
>            [next 311 4, next 109 577, next 919 353]
>     where
>     next x y = (x * n + y) `mod` 5000
>     expensive = sum [1..50*n]
> 
> main :: IO ()
> main = do
>     let r = reduce Memo.integral rule 1
>     print (length r)

The results are quite promising:

% ghc --make -O2 rules2 -threaded
% time ./rules2
5000
./rules2  13.25s user 0.08s system 99% cpu 13.396 total
% time ./rules2 +RTS -N2
5000
./rules2 +RTS -N2  12.52s user 0.30s system 159% cpu 8.015 total

That’s a 40% decrease in running time! Woot! I’d love to see what it does on a machine with more than 2 cores.


2 thoughts on “Parallel Rewrite System”

  1. RE: What it does with more than 2 cores

    time ./rules2 +RTS -N1
    real 0m10.462s user 0m10.345s sys 0m0.120s

    time ./rules2 +RTS -N2
    real 0m5.455s user 0m9.249s sys 0m0.388s

    time ./rules2 +RTS -N4
    real 0m3.870s user 0m10.477s sys 0m0.344s

    N3 and N4 were roughly the same.

  2. With my current working version of GHC, and a couple of flags, I can do a bit better (this is a slower machine, but has 8 cores). It starts to flatten out around 5 cores:

    time ./par-rewrite +RTS -N1
    5000
    17.34s real 17.25s user 0.09s system 99% ./par-rewrite +RTS -N1

    time ./par-rewrite +RTS -N4 -qg0 -qb
    5000
    5.01s real 19.85s user 0.06s system 397% ./par-rewrite +RTS -N4 -qg0 -qb

    time ./par-rewrite +RTS -N8 -qg0 -qb
    5000
    3.93s real 30.21s user 0.12s system 772% ./par-rewrite +RTS -N8 -qg0 -qb

    That’s a 4.4-fold speedup – not bad. The flags tell the runtime to do parallel GC in generation 0 but not to do load-balancing: load-balancing is often bad for cache locality in parallel programs.
