# Emphasizing Specialization

I have briefly—or perhaps not so briefly, my interests being unpredictable beasts—shifted the focus of my research to lazy specialization. This topic was discussed extensively, but not comprehensively, in Michael Jonathan Thyer’s dissertation, Lazy Specialization, and seems to have received little attention otherwise. This is unfortunate. I consider it a very important topic.

Lazy specialization is the idea that a functional program be incrementally optimized during its execution, in tandem with partial application. For example, consider the code:

```    let g = f x in (g y, g y')
```

A lazy specializer would evaluate g as far as it could using only information about x, and then apply the resulting optimized function to y and y’.

If most of f’s work could be done using x alone, you could save a lot of time this way. That is the obvious appeal. The drawback on the other side is that the evaluation machinery of a lazy specializer is more involved than a whnf evaluator, so the whole system takes a speed hit as a result.

But so far as I can tell, the only such machines in existence are in a high-level, symbolic, proof-of-concept form, in which the hit is several orders of magnitude. Where’s the research getting down and dirty with this technology, seeing if we can bring it within a factor of 10 or less? Let me argue why we would want to pursue this seriously:

Here is a quote by Heinrich Apfelmus on laziness, at the end of his article on lazy sorting:

Well, it’s highly unlikely that algorithms get faster by introducing laziness. I mean, lazy evaluation means to evaluate only those things that are really needed and any good algorithm will be formulated in a way such that the unnecessary things have already been stripped off. But laziness allows to simplify and compose algorithms. Sometimes, seemingly different algorithms turn out to be two sides of the same coin when formulated with lazy evaluation. Isn’t it great that finding the k-th minimum is not only an adaption of quicksort but can readily be obtained from it by composing it with (!! k)?

I’d say that data structures are the sink for most of the creativity of lazy functional programmers. We spend lots of energy getting a data structure to represent exactly the things we need, sharing what needs to be shared. The key point about data structures is that they can be decomposed; you can peel off the root of a tree and leave yourself with only one of the branches, and the other branch will be garbage collected.

A lazy specializer promotes full-blown functions to the level of data structures. Functions can now be decomposed (by composing!) in the same way, sharing important pieces and forgetting unimportant ones. Observe this stream implementation:

```    type Stream a = Nat -> a
cons x xs z = case z of { Zero -> x; Succ z' -> xs z' }
tail s z = s (Succ z)
```

Where Nat is the standard lazy natural type (constructors Zero and Succ). What happens when we evaluate tail (cons 1 (cons 2 R)), for some stream R (I don’t want to bother creating an infinite stream). I’ll use the HNF strategy that I discussed previously. Let’s watch:

```   tail (cons 1 (cons 2 R))
(\s z. s (z+1)) (cons 1 (cons 2 R))
(\z. cons 1 (cons 2 R) (Succ z))
(\z. (\z'. case z' of { Zero -> 1; Succ z' -> cons 2 R z' }) (Succ z))
(\z. case Succ z of { Zero -> 1; Succ z' -> cons 2 R z' })
(\z. cons 2 R z)
...
```

Where did the 1 go? Gone, collected by the garbage collector, just as if we had written Stream in the more conventional way. GHC would have held on to the 1 forever. Nat -> a behaves exactly like the Stream data structure. The structure of Nat has induced a structure for Nat -> a. That is a beautiful form of data structure composability.

Working in a language that uses a lazy specializer also comes with a very important mental advantage: abstracting a pattern using a higher-order function always has constant-time overhead. I.e. you will never make your program run faster by unfolding abstractions. This encourages the more composable higher-order style, providing a snappy comeback to any speed nazi. It also removes the encouragement to fiddle with the details of a function for more speed: beta-equivalent formulations will all turn into the same code in the end, so clarity is the only remaining variable, and can be maximized.

Ideas like GHC’s {-# RULES #-} are no longer fickle compiler constructions, but can now be made trustworthy optimizations. If the left side of a rule is in a certain form (a restricted sort of normal form), then its evaluation can be baked into the interpreter and will be applied as aggressively as possible. It is invariant to fiddling around with higher-order presentations and whatnot; they will be folded away and the rule can be seen. (But I’m not convinced RULES are necessary anymore; perhaps you can simply express the rules you want as a functional construction?)

The most convincing property of lazy specializers to me was the primary bench test for Thyer’s dissertation: the “tower of interpreters” test. This is the property that, if you write an interpreter for a language L, and in language L write an interpreter for M, and in M write an interpreter for N, etc., all the layers will be squashed away and you will end up with code that runs just as fast as it did in the base language (with a constant time overhead for eliminating the layers). You can design a language with all the bells and whistles you want, give it a semantics, and you may immediately use without a significant penalty. This is a huge abstraction win.

But there are a fair number of open problems in lazy specialization, more than just getting faster specialization machines. The explosion problem is tricky, in which, say, fix gradually becomes specialized to things like \f. f (f (f (f (f (f (f (fix f))))))); not useful, just eats up memory. The leap from complete laziness to optimal evaluation needs to be understood better. But in particular, I think we need a real LS to program with for a while, just to see what its dynamics are and to bring out its real-world problems.

I leave you with an addendum (which I snipped from the main argument), explaining my perceived connection between LS and FRP, my personal holy grail. All of my research links back to FRP somehow:

Finally, as if you didn’t know this was coming, there is a connection between lazy specializers and FRP, the holy grail of my functional programming life. It is a particular manifestation of the first point I made, designing streams as functions. I figure, all I have to do is design a data structure for time, with the right domain structure, perhaps something like:

```   data Time = Zero | Next Double Time
```

And then define our friends:

```   type Behavior a = Time -> a
type Future a = (Time, a)
```

The implementation types exactly equal to their desired semantics, the most beautiful implementation I could ask for. Evaluation for a behavior b has a structure like:

```   sample0 = b Zero
b1 = b . Next 0.033
sample1 = b1 Zero
b2 = b1 . Next 0.033
sample2 = b2 Zero
...
```

Each successive b is optimized, the unneeded bits at the front thrown off just as in the Stream example. The reason this can’t be encoded in regular Haskell is because the algebra for FRP includes a value of type Future (Behavior a), and we can’t see inside the Future because it is a functor, and thus arbitrary functions separate us from the enclosed Behavior, about which we must selectively forget things. However, the lazy specializer has no trouble looking through those functions, and will modify the behavior anyway, ridding us of the dreaded space leak.

## 18 thoughts on “Emphasizing Specialization”

1. Josef Svenningsson says:

Thanks for bringing up this subject! I wasn’t aware of lazy specialization before but it looks very interesting. I will make sure to read the thesis soon.

2. Wow, Luke! This post inspires the heck out of me.

I’ve long been bothered that *functional* programming penalizes programming with functions, in contrast with data structures. Data structure accesses are automatically memoized, while function “accesses” are recomputed unless one adds explicit memoization. I’m not sure I was aware that the same preferential treatment holds for GC, as you point out here. Now, however, I realize that this bias has influenced me away from function representations toward data structure representations for FRP, *especially* with “Push-Pull Functional Reactive Programming”.

I’m thrilled about the direction you’re pointing out for removing the penalties for using *functions* much more consistently in *functional* programming.

3. augustss says:

Interesting, since yesterday I wrote a specializer.
To cope with the cost of specialization you have to manually indicate the points in a function definition where you want to specialize. So when given some initial number of arguments of a function a symbolic evaluator is invoked and then code is generated for the special function taking the remaining arguments.

For instance, the function “f x = let x2=x*x in \ y -> y+x2” will take one argument and generate specialized code for the \ y -> …
So (f 5) will generate code that adds 25 to its argument.

This is all done using LLVM for generating code for all the stages.

And I’m doing this for work, so I can’t share.

4. Ping Zhu and Siau Cheng Khoo’s work on library specialization suggests a way to specialize on the fly without generating any new machine instructions on the fly. It would be great if applying a curried function partially automatically invoked specialization.

5. augustss says:

You can’t do good specialization if you don’t generate machine code at runtime.

6. augustss: That may be true for some notion of “good”, so my question is, do you need good specialization to get the benefits of lazy specialization that Luke described?

7. augustss says:

If you’re happy with interpretative performance then runtime CG is not needed.

8. Paul Liu says:

This particular area seems to have received little attention from the main stream language researchers. I think the major problem is lack of motivation. I’m glad that you find Thyer’s work inspiring, and you are now writing about it, passionately.

I’ve implemented an optimal evaluator in Haskell based on Lambdascope (complete with a graphical user interface) two years ago when I was working on the space leak problem in FRP. So I’ll just share some random notes I have in this topic if you don’t mind.

It’s interesting to note that Thyer’s thesis and Sinot’s Natural Semantics for Complete Laziness paper approached the same problem from two different angles: one by delaying substitution, and the other by sharing evaluation context.

The difference between complete laziness and optimal is not very clear except that both Thyer and Sinot gave a few examples showing that they are not the same.

In terms of implementation, the overhead of lazy specialization (or complete laziness), and optimal evaluation is not well understood either. As we know it, optimal evaluation performs the least number of beta reductions, but in doing so, it must maintain a complete knowledge of sharing, which sometimes could outweigh the advantage it brings.

Interaction Nets (IN) [1] is often used as a means to encode various evaluation strategies including optimal, but so far I’m not aware of any encoding of complete laziness in IN.

There is also a difference between compile time and runtime specialization. I believe the former is better understood, and in practice, most people are already comfortable with compile time specializations (and hence lack motivations to pursue the latter).

I point this out because I wonder if complete laziness (which is runtime specialization) is of any use for compile time. Not sure why, but I feel that it could be more general than other tricks including lambda lifting, deforestation, and various fusion techniques.

[1]: see related works from Ian Mackie, Jorge Sousa Pinto, and Franciois-regis Sinot.

9. Luke says:

@Conal: Thanks! Yes, I’m finding that remembering and forgetting are fundamentally connected. Basically, if you see enough of some structure not to recompute it, you can see enough to know what it doesn’t care about.

@augustss: There is a big difference between lazy specialization and traditional specialization. In particular, the static/dynamic markings are unnecessary, because it’s OK to generate infinite terms. The halting problem is relativized into the generated code, so it is safe to treat all arguments as static.

I’m not so much interested in the “raw performance” associated with traditional specialization, but more the large-scale properties like constant-cost abstraction, deep GC, and the occasional exponential improvements.

@Paul: Thanks for the links! I checked out Sinot’s paper; I find his semantics a lot easier to swallow than Thyer’s. The memo table thing really bothered me. I was half way to re-inventing Sinot’s idea while trying to implement complete laziness avoiding memo tables. I think I’ve got a good picture of what a completely lazy virtual machine would look like, so I might be off to the prototyping bench.

10. Nobody says:

Interesting. I think it would have been helpful to include the types for cons and tail. I worked on it a bit and got

cons :: a -> Stream a -> Stream a
tail :: Stream a -> Stream a

11. Interesting bits and pieces.

In particular, what I’ve always wanted was the ability to provide a magic primitive like ‘optimize :: a -> a’ that could play the role of the lazy specializer, and let you sprinkle it throughout your code as desired.

On the subject of forgetting, I found an old paper on using free theorems to improve garbage collection.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.1254

I’ve been curious to see if that approach could be used to plug many of the space leaks intrinsic to FRP at this time, but have no good way to concoct a test.

12. axel says:

I’m no expert at this, but the following might be interesting:

Noam Zeilberger’s PhD thesis, “On the Logical Basis of Evaluation Order and Pattern Matching” (http://www.cs.cmu.edu/~noam/thesis.pdf), “aims to give a fresh take on the proofs-as-programs analogy—one which better accounts for features of modern programming languages—by starting from a different logical foundation. Inspired by Andreoli’s focusing proofs for linear logic, we explain how to axiomatize certain canonical forms of logical reasoning through a notion of pattern. Propositions come with an intrinsic polarity, based on whether they are defined by patterns of proof, or by patterns of refutation. Applying the analogy, we then obtain a programming language with built-in support for pattern-matching, in which evaluation order is explicitly reflected at the level of types—and hence can be controlled locally, rather than being an ad hoc, global policy decision.”

Could different kinds of laziness (and/or related information) be encoded in the type of the functions? Would this reduce the overhead?

13. cmcq says:

How about adding decomposition explicitly? Any sum type would come with a map that can unwrap a case statement. So for streams you’d have:

uncaseStream :: (Stream a -> c) -> (a -> c, Stream a -> c)

And uncaseStream (cons x xs) = (x,xs) operationally – unlike “cons x xs . Cons”, uncase would let you take the xs without keeping a reference to the x.

14. cmcq says:

That should be:
uncaseStream :: (Nat -> a) -> (Zero -> a, Nat -> a)
And “cons x xs . Succ = xs”

15. Jake McArthur says:

There is an interesting difference between Thyer’s and Sinot’s semantics. Sinot includes this example to demonstrate why completely lazy evaluation is not optimal evaluation:

(λx.x I)(λy.∆ (y I)) where I = λw.w and ∆ = λz.z z

Sinot notes that the semantics in the paper will perform the reduction ∆ (y I) → (y I) (y I), although the optimal thing to do is to wait until y has been instantiated. Thyer’s semantics, however, addresses this with the memo tables. The reduction will still be performed, but it will cause no harm. When we later substitute for y, it will preserve the sharing of the two occurrences of (y I), so there is not actually any duplication.

Perhaps disappointingly, this means the memo tables still serve some good.