I have briefly—or perhaps not so briefly, my interests being unpredictable beasts—shifted the focus of my research to lazy specialization. This topic was discussed extensively, but not comprehensively, in Michael Jonathan Thyer’s dissertation, Lazy Specialization, and seems to have received little attention otherwise. This is unfortunate. I consider it a very important topic.

Lazy specialization is the idea that a functional program be incrementally optimized during its execution, in tandem with partial application. For example, consider the code:

let g = f x in (g y, g y')

A lazy specializer would evaluate g as far as it could using only information about x, and then apply the resulting optimized function to y and y’.

If most of f’s work could be done using x alone, you could save a lot of time this way. That is the obvious appeal. The drawback on the other side is that the evaluation machinery of a lazy specializer is more involved than a whnf evaluator, so the whole system takes a speed hit as a result.

But so far as I can tell, the only such machines in existence are in a high-level, symbolic, proof-of-concept form, in which the hit is several orders of magnitude. Where’s the research getting down and dirty with this technology, seeing if we can bring it within a factor of 10 or less? Let me argue why we would want to pursue this seriously:

Here is a quote by Heinrich Apfelmus on laziness, at the end of his article on lazy sorting:

Well, it’s highly unlikely that algorithms get faster by introducing laziness. I mean, lazy evaluation means to evaluate only those things that are really needed and any good algorithm will be formulated in a way such that the unnecessary things have already been stripped off. But laziness allows to simplify and compose algorithms. Sometimes, seemingly different algorithms turn out to be two sides of the same coin when formulated with lazy evaluation. Isn’t it great that finding the k-th minimum is not only an adaption of quicksort but can readily be obtained from it by

composingit with(!! k)?

I’d say that data structures are the sink for most of the creativity of lazy functional programmers. We spend lots of energy getting a data structure to represent exactly the things we need, sharing what needs to be shared. The key point about data structures is that they can be decomposed; you can peel off the root of a tree and leave yourself with only one of the branches, and the other branch will be garbage collected.

A lazy specializer promotes full-blown functions to the level of data structures. Functions can now be decomposed (by composing!) in the same way, sharing important pieces and forgetting unimportant ones. Observe this stream implementation:

type Stream a = Nat -> a cons x xs z = case z of { Zero -> x; Succ z' -> xs z' } tail s z = s (Succ z)

Where Nat is the standard lazy natural type (constructors Zero and Succ). What happens when we evaluate `tail (cons 1 (cons 2 R))`, for some stream R (I don’t want to bother creating an infinite stream). I’ll use the HNF strategy that I discussed previously. Let’s watch:

tail (cons 1 (cons 2 R)) (\s z. s (z+1)) (cons 1 (cons 2 R)) (\z. cons 1 (cons 2 R) (Succ z)) (\z. (\z'. case z' of { Zero -> 1; Succ z' -> cons 2 R z' }) (Succ z)) (\z. case Succ z of { Zero -> 1; Succ z' -> cons 2 R z' }) (\z. cons 2 R z) ...

Where did the 1 go? Gone, collected by the garbage collector, just as if we had written Stream in the more conventional way. GHC would have held on to the 1 forever. Nat -> a behaves *exactly* like the Stream data structure. The structure of Nat has *induced* a structure for Nat -> a. That is a beautiful form of data structure composability.

Working in a language that uses a lazy specializer also comes with a very important mental advantage: abstracting a pattern using a higher-order function always has constant-time overhead. I.e. you will never make your program run faster by unfolding abstractions. This encourages the more composable higher-order style, providing a snappy comeback to any speed nazi. It also removes the encouragement to fiddle with the details of a function for more speed: beta-equivalent formulations will all turn into the same code in the end, so clarity is the only remaining variable, and can be maximized.

Ideas like GHC’s `{-# RULES #-}` are no longer fickle compiler constructions, but can now be made trustworthy optimizations. If the left side of a rule is in a certain form (a restricted sort of normal form), then its evaluation can be baked into the interpreter and will be applied as aggressively as possible. It is invariant to fiddling around with higher-order presentations and whatnot; they will be folded away and the rule can be seen. (But I’m not convinced RULES are necessary anymore; perhaps you can simply express the rules you want as a functional construction?)

The most convincing property of lazy specializers to me was the primary bench test for Thyer’s dissertation: the “tower of interpreters” test. This is the property that, if you write an interpreter for a language L, and in language L write an interpreter for M, and in M write an interpreter for N, etc., all the layers will be squashed away and you will end up with code that runs just as fast as it did in the base language (with a *constant time* overhead for eliminating the layers). You can design a language with all the bells and whistles you want, give it a semantics, and you may immediately use without a significant penalty. This is a *huge* abstraction win.

But there are a fair number of open problems in lazy specialization, more than just getting faster specialization machines. The explosion problem is tricky, in which, say, `fix` gradually becomes specialized to things like `\f. f (f (f (f (f (f (f (fix f)))))))`; not useful, just eats up memory. The leap from complete laziness to optimal evaluation needs to be understood better. But in particular, I think we need a real LS to program with for a while, just to see what its dynamics are and to bring out its real-world problems.

I leave you with an addendum (which I snipped from the main argument), explaining my perceived connection between LS and FRP, my personal holy grail. All of my research links back to FRP *somehow*:

Finally, as if you didn’t know this was coming, there is a connection between lazy specializers and FRP, the holy grail of my functional programming life. It is a particular manifestation of the first point I made, designing streams as functions. I figure, all I have to do is design a data structure for time, with the right domain structure, perhaps something like:

data Time = Zero | Next Double Time

And then define our friends:

type Behavior a = Time -> a type Future a = (Time, a)

The implementation types *exactly equal* to their desired semantics, the most beautiful implementation I could ask for. Evaluation for a behavior *b* has a structure like:

sample0 = b Zero b1 = b . Next 0.033 sample1 = b1 Zero b2 = b1 . Next 0.033 sample2 = b2 Zero ...

Each successive b is optimized, the unneeded bits at the front thrown off just as in the Stream example. The reason this can’t be encoded in regular Haskell is because the algebra for FRP includes a value of type `Future (Behavior a)`, and we can’t see inside the Future because it is a functor, and thus arbitrary functions separate us from the enclosed Behavior, about which we must selectively forget things. However, the lazy specializer has no trouble looking through those functions, and will modify the behavior anyway, ridding us of the dreaded space leak.