# Emphasizing Specialization

I have briefly—or perhaps not so briefly, my interests being unpredictable beasts—shifted the focus of my research to lazy specialization. This topic was discussed extensively, but not comprehensively, in Michael Jonathan Thyer’s dissertation, Lazy Specialization, and seems to have received little attention otherwise. This is unfortunate. I consider it a very important topic.

Lazy specialization is the idea that a functional program be incrementally optimized during its execution, in tandem with partial application. For example, consider the code:

```    let g = f x in (g y, g y')
```

A lazy specializer would evaluate g as far as it could using only information about x, and then apply the resulting optimized function to y and y’.

If most of f’s work could be done using x alone, you could save a lot of time this way. That is the obvious appeal. The drawback on the other side is that the evaluation machinery of a lazy specializer is more involved than a whnf evaluator, so the whole system takes a speed hit as a result.

But so far as I can tell, the only such machines in existence are in a high-level, symbolic, proof-of-concept form, in which the hit is several orders of magnitude. Where’s the research getting down and dirty with this technology, seeing if we can bring it within a factor of 10 or less? Let me argue why we would want to pursue this seriously:

Here is a quote by Heinrich Apfelmus on laziness, at the end of his article on lazy sorting:

Well, it’s highly unlikely that algorithms get faster by introducing laziness. I mean, lazy evaluation means to evaluate only those things that are really needed and any good algorithm will be formulated in a way such that the unnecessary things have already been stripped off. But laziness allows to simplify and compose algorithms. Sometimes, seemingly different algorithms turn out to be two sides of the same coin when formulated with lazy evaluation. Isn’t it great that finding the k-th minimum is not only an adaption of quicksort but can readily be obtained from it by composing it with (!! k)?

I’d say that data structures are the sink for most of the creativity of lazy functional programmers. We spend lots of energy getting a data structure to represent exactly the things we need, sharing what needs to be shared. The key point about data structures is that they can be decomposed; you can peel off the root of a tree and leave yourself with only one of the branches, and the other branch will be garbage collected.

A lazy specializer promotes full-blown functions to the level of data structures. Functions can now be decomposed (by composing!) in the same way, sharing important pieces and forgetting unimportant ones. Observe this stream implementation:

```    type Stream a = Nat -> a
cons x xs z = case z of { Zero -> x; Succ z' -> xs z' }
tail s z = s (Succ z)
```

Where Nat is the standard lazy natural type (constructors Zero and Succ). What happens when we evaluate tail (cons 1 (cons 2 R)), for some stream R (I don’t want to bother creating an infinite stream). I’ll use the HNF strategy that I discussed previously. Let’s watch:

```   tail (cons 1 (cons 2 R))
(\s z. s (z+1)) (cons 1 (cons 2 R))
(\z. cons 1 (cons 2 R) (Succ z))
(\z. (\z'. case z' of { Zero -> 1; Succ z' -> cons 2 R z' }) (Succ z))
(\z. case Succ z of { Zero -> 1; Succ z' -> cons 2 R z' })
(\z. cons 2 R z)
...
```

Where did the 1 go? Gone, collected by the garbage collector, just as if we had written Stream in the more conventional way. GHC would have held on to the 1 forever. Nat -> a behaves exactly like the Stream data structure. The structure of Nat has induced a structure for Nat -> a. That is a beautiful form of data structure composability.

Working in a language that uses a lazy specializer also comes with a very important mental advantage: abstracting a pattern using a higher-order function always has constant-time overhead. I.e. you will never make your program run faster by unfolding abstractions. This encourages the more composable higher-order style, providing a snappy comeback to any speed nazi. It also removes the encouragement to fiddle with the details of a function for more speed: beta-equivalent formulations will all turn into the same code in the end, so clarity is the only remaining variable, and can be maximized.

Ideas like GHC’s {-# RULES #-} are no longer fickle compiler constructions, but can now be made trustworthy optimizations. If the left side of a rule is in a certain form (a restricted sort of normal form), then its evaluation can be baked into the interpreter and will be applied as aggressively as possible. It is invariant to fiddling around with higher-order presentations and whatnot; they will be folded away and the rule can be seen. (But I’m not convinced RULES are necessary anymore; perhaps you can simply express the rules you want as a functional construction?)

The most convincing property of lazy specializers to me was the primary bench test for Thyer’s dissertation: the “tower of interpreters” test. This is the property that, if you write an interpreter for a language L, and in language L write an interpreter for M, and in M write an interpreter for N, etc., all the layers will be squashed away and you will end up with code that runs just as fast as it did in the base language (with a constant time overhead for eliminating the layers). You can design a language with all the bells and whistles you want, give it a semantics, and you may immediately use without a significant penalty. This is a huge abstraction win.

But there are a fair number of open problems in lazy specialization, more than just getting faster specialization machines. The explosion problem is tricky, in which, say, fix gradually becomes specialized to things like \f. f (f (f (f (f (f (f (fix f))))))); not useful, just eats up memory. The leap from complete laziness to optimal evaluation needs to be understood better. But in particular, I think we need a real LS to program with for a while, just to see what its dynamics are and to bring out its real-world problems.

I leave you with an addendum (which I snipped from the main argument), explaining my perceived connection between LS and FRP, my personal holy grail. All of my research links back to FRP somehow:

Finally, as if you didn’t know this was coming, there is a connection between lazy specializers and FRP, the holy grail of my functional programming life. It is a particular manifestation of the first point I made, designing streams as functions. I figure, all I have to do is design a data structure for time, with the right domain structure, perhaps something like:

```   data Time = Zero | Next Double Time
```

And then define our friends:

```   type Behavior a = Time -> a
type Future a = (Time, a)
```

The implementation types exactly equal to their desired semantics, the most beautiful implementation I could ask for. Evaluation for a behavior b has a structure like:

```   sample0 = b Zero
b1 = b . Next 0.033
sample1 = b1 Zero
b2 = b1 . Next 0.033
sample2 = b2 Zero
...
```

Each successive b is optimized, the unneeded bits at the front thrown off just as in the Stream example. The reason this can’t be encoded in regular Haskell is because the algebra for FRP includes a value of type Future (Behavior a), and we can’t see inside the Future because it is a functor, and thus arbitrary functions separate us from the enclosed Behavior, about which we must selectively forget things. However, the lazy specializer has no trouble looking through those functions, and will modify the behavior anyway, ridding us of the dreaded space leak.

# On the By functions

Here’s a quick little note that has been bugging me for a while, and nobody wants to talk about it right now on IRC.

I think the By functions:

```sortBy :: (a -> a -> Ordering) -> [a] -> [a]
maximumBy :: (a -> a -> Ordering) -> [a] -> a
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
nubBy :: (a -> a -> Bool) -> [a] -> [a]
```

Etc. should be replaced by On functions:

```sortOn :: (Ord b) => (a -> b) -> [a] -> [a]
maximumOn :: (Ord b) => (a -> b) -> [a] -> a
groupOn :: (Eq b) => (a -> b) -> [a] -> [[a]]
nubOn :: (Eq b) => (a -> b) -> [a] -> [a]
```

My argument is: the functions provided to sortBy etc. have some preconditions. sortBy is not well-defined (or oughtn’t be) for functions which are not linear ordering functions; nubBy is shouldn’t be well-defined (to the dismay of some 31337s) for functions which do not encode an equivalence relation. But the folklore is that functions are typically “as total as possible”, so if it wants a function of some type, all I have to do is conjure a function of that type and my output will be something reasonable in terms of that function.

On the other hand, the folklore of typeclasses is that they typically come with laws. You need to prove — or at least think you know how to prove — some laws when you make a type an instance of a typeclass. The On functions use this obligation to encode their precondition. They are easier to use in a verified setting, too; there are a bunch of standard instances of Eq and Ord for which the laws are known to hold; map your data on to that in any way you like and the precondition is guaranteed.

Thoughts?