# Emphasizing Specialization

I have briefly—or perhaps not so briefly, my interests being unpredictable beasts—shifted the focus of my research to lazy specialization. This topic was discussed extensively, but not comprehensively, in Michael Jonathan Thyer’s dissertation, Lazy Specialization, and seems to have received little attention otherwise. This is unfortunate. I consider it a very important topic.

Lazy specialization is the idea that a functional program be incrementally optimized during its execution, in tandem with partial application. For example, consider the code:

```    let g = f x in (g y, g y')
```

A lazy specializer would evaluate g as far as it could using only information about x, and then apply the resulting optimized function to y and y’.

If most of f’s work could be done using x alone, you could save a lot of time this way. That is the obvious appeal. The drawback on the other side is that the evaluation machinery of a lazy specializer is more involved than a whnf evaluator, so the whole system takes a speed hit as a result.

But so far as I can tell, the only such machines in existence are in a high-level, symbolic, proof-of-concept form, in which the hit is several orders of magnitude. Where’s the research getting down and dirty with this technology, seeing if we can bring it within a factor of 10 or less? Let me argue why we would want to pursue this seriously:

Here is a quote by Heinrich Apfelmus on laziness, at the end of his article on lazy sorting:

Well, it’s highly unlikely that algorithms get faster by introducing laziness. I mean, lazy evaluation means to evaluate only those things that are really needed and any good algorithm will be formulated in a way such that the unnecessary things have already been stripped off. But laziness allows to simplify and compose algorithms. Sometimes, seemingly different algorithms turn out to be two sides of the same coin when formulated with lazy evaluation. Isn’t it great that finding the k-th minimum is not only an adaption of quicksort but can readily be obtained from it by composing it with (!! k)?

I’d say that data structures are the sink for most of the creativity of lazy functional programmers. We spend lots of energy getting a data structure to represent exactly the things we need, sharing what needs to be shared. The key point about data structures is that they can be decomposed; you can peel off the root of a tree and leave yourself with only one of the branches, and the other branch will be garbage collected.

A lazy specializer promotes full-blown functions to the level of data structures. Functions can now be decomposed (by composing!) in the same way, sharing important pieces and forgetting unimportant ones. Observe this stream implementation:

```    type Stream a = Nat -> a
cons x xs z = case z of { Zero -> x; Succ z' -> xs z' }
tail s z = s (Succ z)
```

Where Nat is the standard lazy natural type (constructors Zero and Succ). What happens when we evaluate tail (cons 1 (cons 2 R)), for some stream R (I don’t want to bother creating an infinite stream). I’ll use the HNF strategy that I discussed previously. Let’s watch:

```   tail (cons 1 (cons 2 R))
(\s z. s (z+1)) (cons 1 (cons 2 R))
(\z. cons 1 (cons 2 R) (Succ z))
(\z. (\z'. case z' of { Zero -> 1; Succ z' -> cons 2 R z' }) (Succ z))
(\z. case Succ z of { Zero -> 1; Succ z' -> cons 2 R z' })
(\z. cons 2 R z)
...
```

Where did the 1 go? Gone, collected by the garbage collector, just as if we had written Stream in the more conventional way. GHC would have held on to the 1 forever. Nat -> a behaves exactly like the Stream data structure. The structure of Nat has induced a structure for Nat -> a. That is a beautiful form of data structure composability.

Working in a language that uses a lazy specializer also comes with a very important mental advantage: abstracting a pattern using a higher-order function always has constant-time overhead. I.e. you will never make your program run faster by unfolding abstractions. This encourages the more composable higher-order style, providing a snappy comeback to any speed nazi. It also removes the encouragement to fiddle with the details of a function for more speed: beta-equivalent formulations will all turn into the same code in the end, so clarity is the only remaining variable, and can be maximized.

Ideas like GHC’s {-# RULES #-} are no longer fickle compiler constructions, but can now be made trustworthy optimizations. If the left side of a rule is in a certain form (a restricted sort of normal form), then its evaluation can be baked into the interpreter and will be applied as aggressively as possible. It is invariant to fiddling around with higher-order presentations and whatnot; they will be folded away and the rule can be seen. (But I’m not convinced RULES are necessary anymore; perhaps you can simply express the rules you want as a functional construction?)

The most convincing property of lazy specializers to me was the primary bench test for Thyer’s dissertation: the “tower of interpreters” test. This is the property that, if you write an interpreter for a language L, and in language L write an interpreter for M, and in M write an interpreter for N, etc., all the layers will be squashed away and you will end up with code that runs just as fast as it did in the base language (with a constant time overhead for eliminating the layers). You can design a language with all the bells and whistles you want, give it a semantics, and you may immediately use without a significant penalty. This is a huge abstraction win.

But there are a fair number of open problems in lazy specialization, more than just getting faster specialization machines. The explosion problem is tricky, in which, say, fix gradually becomes specialized to things like \f. f (f (f (f (f (f (f (fix f))))))); not useful, just eats up memory. The leap from complete laziness to optimal evaluation needs to be understood better. But in particular, I think we need a real LS to program with for a while, just to see what its dynamics are and to bring out its real-world problems.

I leave you with an addendum (which I snipped from the main argument), explaining my perceived connection between LS and FRP, my personal holy grail. All of my research links back to FRP somehow:

Finally, as if you didn’t know this was coming, there is a connection between lazy specializers and FRP, the holy grail of my functional programming life. It is a particular manifestation of the first point I made, designing streams as functions. I figure, all I have to do is design a data structure for time, with the right domain structure, perhaps something like:

```   data Time = Zero | Next Double Time
```

And then define our friends:

```   type Behavior a = Time -> a
type Future a = (Time, a)
```

The implementation types exactly equal to their desired semantics, the most beautiful implementation I could ask for. Evaluation for a behavior b has a structure like:

```   sample0 = b Zero
b1 = b . Next 0.033
sample1 = b1 Zero
b2 = b1 . Next 0.033
sample2 = b2 Zero
...
```

Each successive b is optimized, the unneeded bits at the front thrown off just as in the Stream example. The reason this can’t be encoded in regular Haskell is because the algebra for FRP includes a value of type Future (Behavior a), and we can’t see inside the Future because it is a functor, and thus arbitrary functions separate us from the enclosed Behavior, about which we must selectively forget things. However, the lazy specializer has no trouble looking through those functions, and will modify the behavior anyway, ridding us of the dreaded space leak.

# On the By functions

Here’s a quick little note that has been bugging me for a while, and nobody wants to talk about it right now on IRC.

I think the By functions:

```sortBy :: (a -> a -> Ordering) -> [a] -> [a]
maximumBy :: (a -> a -> Ordering) -> [a] -> a
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
nubBy :: (a -> a -> Bool) -> [a] -> [a]
```

Etc. should be replaced by On functions:

```sortOn :: (Ord b) => (a -> b) -> [a] -> [a]
maximumOn :: (Ord b) => (a -> b) -> [a] -> a
groupOn :: (Eq b) => (a -> b) -> [a] -> [[a]]
nubOn :: (Eq b) => (a -> b) -> [a] -> [a]
```

My argument is: the functions provided to sortBy etc. have some preconditions. sortBy is not well-defined (or oughtn’t be) for functions which are not linear ordering functions; nubBy is shouldn’t be well-defined (to the dismay of some 31337s) for functions which do not encode an equivalence relation. But the folklore is that functions are typically “as total as possible”, so if it wants a function of some type, all I have to do is conjure a function of that type and my output will be something reasonable in terms of that function.

On the other hand, the folklore of typeclasses is that they typically come with laws. You need to prove — or at least think you know how to prove — some laws when you make a type an instance of a typeclass. The On functions use this obligation to encode their precondition. They are easier to use in a verified setting, too; there are a bunch of standard instances of Eq and Ord for which the laws are known to hold; map your data on to that in any way you like and the precondition is guaranteed.

Thoughts?

# My Premonition

Last summer I got the opportunity to work briefly with Conal Elliott in Belgium. The two weeks I spent there are still making an impact on me; immediately I resonated with his semantic design idea. But the more important ideas were the ones I picked up subconsciously, and I’m just now becoming able to put them in words.

(1) If extreme idealism isn’t working out for you, maybe it is because you are not wholeheartedly following your own ideals.

(2) You must be the change you wish to see in the world (– Mahatma Gandhi). As applied to software: design software as if it were the beautiful paradise you want it to be, then build pieces of the scaffolding back to the status quo.

Both of these were key insights leading me to undertake Dana.

At the time I was desperately trying to create an FRP framework for Haskell, trying in terms of IO and STM and playing messy tricks with the operational semantics of thunks. It felt like cheating. I had the desire to throw away IO altogether in my efforts, just to see where that led me (this forced me to put aside the way I thought an FRP interface should look). This was an instance of (1).

But in Haskell, IO would always be beneath it all. Your functional program always becomes an imperative program at the end of the day. Maybe this dissonance is what is causing the endless FRP difficulties. Maybe we can’t really know what pure FRP is until we have pure turtles all the way down — until there is no impedance mismatch between the functional and the imperative. This set me on researching the idea of a purely functional operating system, which is still the slogan by which I name Dana.

But now Dana is more than that. Now, it is my (2). In this post, I describe how I see the world of software in 30 years. This isn’t about what kind of devices computers will be embedded in or how they are connected; this is about the process of developing software.

Code and program are the same; any fragment of code may be used interactively. Complex programs are just compositions of simpler ones, and composition of programs is a fundamental concept to both users and programmers. Thus power users cannot help but to be programmers.

Computer science has become a strict subfield of applied mathematics. Creating complex software is primarily modeling: the fundamental task in software development is to create a mathematical model of your problem. The model is encoded directly, and then software scaffolding is built around it, verified to the model. Reasonable shops do not dare ship code that hasn’t been verified to correspond to their model.

Development tools — programs that talk about programs — are astoundingly rich and expressive. Simply enter a property about your code. The tool goes off on a model search to try to refute the property. Meanwhile, in a way that my mind cannot yet comprehend, the programmer communicates the “essence” of why the property should be true, and the tool fills in all the details of the proof. The property is added to a database of facts, which everyone can immediately see and use.

Provably correct code is what people do, because the tools have made it so easy that it is an obvious engineering choice (this ease does not come without a resculpting of our brains). Gargantuan test suites are replaced by a few hours of a developer with a proving tool (an exponential increase in productivity came from the fact that theorems compose, but tests don’t).

User interfaces are mostly an artistic task, quite different from the rigorous visualization that modelers do. All objects have a few obvious ways to endow them with primitive user interfaces (see Eros). Then making a nice user interface is a matter of rearranging them, composing inputs and outputs, and drawing/animating things (under the assumption that computers are still monitor, keyboard, and mouse driven — make the appropriate generalizations).

The shops that will be renowned by power users are those who have a great program with a great interface, but they also ship components to the underlying logic. This is trivial for them, because of the ubiquitous separation between interface and logic. Some are still irrationally afraid of giving away their trade secrets by exposing this, so there are still lame, interface-only programs. But these lame programs will not have the chance to participate in the miracles of engineering the decade will bring, a combiatorical build up of the the work of millions of developers, coming together to create the Star Trek computer — or something else, far more advanced.

Computation is an information science, and is subject to the exponential explosion that comes with it. This Utopian vision hinges on correct composition of everything, from programs to facts.

Right now, software is primarily non-compositional. We have libraries and languages, which are built up of smaller libraries and languages. But end products are not designed to be composed; they are mostly designed for the end user only. But the end user is where the money is, thus where a great deal of the effort in the industry is going. Moreover, for advanced programs (say, Maya), the end user is also where a lot of the skill and talent is.

We have to work at this. There is little small-scale economic incentive to make your program composable; you just give your competitors the ability to use your software without you being able to use theirs. And it’s hard for idealistic young CEOs to try such a thing, since the state of software development means you have to put in a lot of work both to make your program composable and to compose other people’s programs. But these things are not fundamentally hard, they are just hard using the languages and methods of today.

Dana is my own first step in this direction. It aims to be an environment which forces everything to be composable and verifiable. This comes at the expense of mental laziness: you will have to seriously rewire your brain to work in it. I don’t believe that the enlightenment can come without this rewiring. I’m desperately trying to rewire my own, to even consider the choices that will be obvious to the generation ahead.

# It is never safe to cheat

Anyone who has spent time trying to implement an FRP library knows the unsafePerformIO story. You may use unsafePerformIO as long as you ensure that the result maintains purely functional semantics. It’s possible to create impure values with unsafePerformIO. It is up to you to “prove” that you have created a pure one. Seems like a decent trade-off.

I put “prove” in quotes for a reason. If you’re doing something nontrivial (i.e. you’re not just using unsafePerformIO . return), you need an operational semantics for IO to prove this. But that’s not all! You are probably depending on some external state inside the unsafePerformIO, which depends on the time and order in which thunks are evaluated. But thunks aren’t part of the operational semantics of IO, they are part of the operational semantics of pure values in Haskell — something we quite explicitly do not have. So you need not only to embrace the ill-definedness of IO, but in fact tie yourself down to a particular operational interpretation of Haskell!

Let’s say I write an HNF evaluator for Haskell. Your unsafePerformIO magic will probably not work on this style of evaluator because the meaning of thunks — and the way they are executed — is quite different in this style.

There are more invariants on a Haskell function than purity and referential transparency. We can of course only implement computable functions. They have to be monotone and continuous. And they might be other things, as well, which someone one day will come along and prove by leveraging properties of the type system, exposed primivites, etc. (see ST for prior art). But they have not accounted for your magic, so their analysis does not include any program using your library.

We are pure functonal programmers. We have chosen a language which vastly restricts what we are allowed to do, because we understand the benefits we reap as a result. However, pretending to understand when we are allowed to cheat only buys us the benefits we know about now, but precludes future benefits from work in the field. By using unsafePerformIO — even in a safe way (or so you think) — you avert the exponential growth of our field.

If you find you can’t express something you feel you should be able to, I suggest one of two things: (1) look deeper until you find an incidental limitation of the language, and attempt to solve it at the language level, or (2) look deeper until you understand why you actually shouldn’t have been able to do that, revealing the truth from behind the curtain of zealous ignorance. In my experience, (2) is much more often successful.

On a more practical note, most of said limitations are about performance, which is not in the semantics’ domain of discourse. It makes sense that our languages wouldn’t be good at such things. Instead of introducing a hack, why not push the field forward and think about what a language which can talk about such things would look like? Each time you run into a limitation, you have a new use case, and thus a new perspective on the problem.

Until that problem is solved, though, your library users might have to pay the price of not having as elegant an interface. But by restricting yourself thusly, you are protecting yourself from your own ignorance, at least knowing that what you have made should, in fact, be makeable.

In summary: It is never safe to cheat.

# Lazy Partial Evaluation

Inspired by Dan Piponi’s latest post, I have been looking into partial evaluation. In particular, I thought that a language which emphasizes currying really ought be good at partial evaluation. In this post I describe some ideas regarding partial evaluation in functional languages, and later sketch a partial evaluation machine I devised.

Supercombinator reduction machines, like GHC, do a limited form of partial evaluation. I.e. when you compile a program to supercombinators, you are optimizing it for specialization from left to right. So if f is defined in pointful form, “let a = f x in (a z, a z)” might be a better program than “(f x z, f x z)”. This is a nice property of combinator reduction. Unfortunately, it doesn’t generalize: “let a = flip f x in (a y, a y)” will never be better than “(f y x, f y x)”, because functions only specialize from left to right. I conjecture that this missing feature is more important than we realize.

Mogensen gives a very elegant partial evaluator in pure lambda calculus, which optimize as expected with the Futamura projections (see Dan’s post). This partial evaluator works on higher order abstract syntax, taking and returning descriptions of terms rather than the terms themselves. Essentially all it is is (very simple) machinery describing how to evaluate under a lambda.

The system in that paper takes many precautions to avoid unfolding dynamic arguments, because otherwise the partial evaluator might not terminate. Apparently he is not well-versed in our Haskell School of the Infinite, because the evaluator is compositional. So what he means by “not terminate” is “return an infinite program”. But an infinite program is fine if you interpret/compile it lazily!

In fact, I believe (I am intuiting — I have done no formal checking) that the simplest-minded of lazy partial evaluators is perfect: it maximally unfolds its static arguments, there is no need for the type inference machinery in that paper, and it will have the same termination properties as the program. I attribute the ease of this task with the built-in metacircularity of lambda calculus.

Cool as a self-embedded partial evaluator is, to do efficient partial evaluation you need to keep quotations of your programs everywhere, then compile them before you actually use them. Lambda calculus is hinting at something more: that simply by applying one of several arguments to a curried function, you are specializing it. Wouldn’t it be great if every time you did that, the program were maximally specialized automatically?

### A partial evaluation reduction strategy

It turns out that such an ambitious wish is nothing more than an evaluation order for the lambda calculus. Admittedly, it’s not a very common one. You try to reduce the leftmost application, even under lambdas. We would also like to combine this with call-by-need, so when an argument is duplicated it is only reduced once.

Here’s an example program I’ve been working with, with the standard definitions of the combinators Ix = x and Kxy = x.

```  flip (\x y. y I x) K K
```

It’s not much, but it gets the point across. Let’s look at it in call-by-name:

```[1]  (\f a b. f b a) (\x y. y I x) K K
[2]  (\a b. (\x y. y I x) b a) K K
[3]  (\b. (\x y. y I x) b K) K
[4]  (\x y. y I x) K K
[5]  (\y. y I K) K
[6]  K I K
[6'] (\m n. m) I K
[7] (\n. I) K
[8]  I
```

Notice by the step [4] that we have lost the structure of flip (\x y. y I x) K, so any work we do from then on we will have to redo on subsequent applications of that function. Contrast this with the partial evaluation strategy:

```[1]  (\f a b. f b a) (\x y. y I x) K K
[2]  (\a b. (\x y. y I x) b a) K K
[3]  (\a b. (\y. y I b) a) K K
[4]  (\a b. a I b) K K
[5]  (\b. K I b) K
[5'] (\b. (\m n. m) I b) K
[6]  (\b. (\n. I) b) K
[7]  (\b. I) K
[8]  I
```

We got the function all the way down to a constant function before it was finally applied.

One thing that’s interesting to notice about this strategy is that it seems stricter than call-by-name. That is, if you have a nonterminating term N, then reducing the application (\x y. Ny) z will loop, whereas it won’t in CBN. However, recall that in the domain theory, (\x. ⊥) = ⊥. The only thing you can do to a function to observe it is apply it, and whenever you apply this function you will loop. So you are bound to loop anyway if you are evaluating this application.

### The lazy partial evaluation machine (sketch)

Here is a sketch of an efficient low-level machine for this evaluation strategy. It is simple stack machine code (“which I pull out of my bum for now, but I don’t think an algorithm to generate it will be any trouble”). The only tricky bit about it is that pieces are sometimes removed from the middle of the stack, so it can’t necessarily be represented linearly in memory. A singly linked representation should work (I realize this costs performance).

The values in this language look like [v1,v2,…] { instr1 ; instr2 ; … }. They are closures, where the vn are pushed on to the stack in that order before the instructions are executed. Values can also be “indirections”, which point to an absolute position in the stack. Indirections represent a logical reordering of the stack, and are used to model bound variables. When indirections are executed, they remove themselves and execute (and remove) the thing they point to. The instructions are as follows.

```  pop n      -- delete the entry at position n
dup n      -- push the entry at position n on the top of the stack
(other standard stack ops here)
abs n      -- push an indirection pointing to position n
-- (skipping other indirections) on top of the stack
exec n     -- pop and execute the closure at position n
closure [n1,n2,...] { instr1 ; instr2 }
-- remove positions n1,n2,... and add them
-- to the closure with the given code, and push it
```

dup is the only instruction which duplicates a value; this is where laziness will be encoded.

Let’s look at the code for each piece of our example program:

```    I: [] { exec 0 }
K: [] { pop 1; exec 0 }
```

These are pretty straightforward. They both receive their arguments on the stack (the first argument in position 0, and downwards from there), reduce and continue execution. Recall the code is what happens when a value is forced, so every value ends with an exec to continue the execution.

```    (\x y. y I x): [] { push I; exec 2 }
```

This one might be a little tricker. The stack comes in like this (position 0 on the left): x y, After push I, it looks like “I x y”, so y’s incoming stack will be “I x” just as it wants. Finally, the interesting one:

```   (\f a b. f b a): [] { abs 2; exec 2 }
```

abs 2 pushes an indirection to the argument b onto the stack before calling f. This is how evaluation is pulled under binders; when an indirection is forced, it reduces the application at the appropriate level, and all below it. I am still not totally sure when to introduce an abs; my guess is you do it whenever a function would otherwise reorder its arguments. An example may demonstrate what I mean (but perhaps not; I haven’t totally wrapped my head around it myself).

Here’s an execution trace for the example above, with the stack growing to the left. I start with the instruction pointer on the first function and the three arguments on the stack. The stack shown is the state before the instruction on each row. An identifier followed by a colon marks a position in the stack for indirections:

 1 abs 1 (\x y. y I x) K K 2 abs 3 a (\x y. y I x) a:K K 3 exec 2 b a (\x y. y I x) a:K b:K 4 closure I b a a:K b:K 5 exec 2 I b a a:K b:K 6 pop 1 I b b:K 7 exec 0 I

When an indirection is executed, as in step 5, that is evaluation under a lambda.

This machine still doesn’t support laziness (though it didn’t matter in this example). We can achieve laziness by allocating a thunk when we dup. To evaluate the thunk, we put a mark on the stack. Whenever an instruction tries to reach beyond the mark, we capture the current stack and instruction pointer and jam it in a closure, then write that closure to the thunk. Indirections get replaced by their offsets; i.e. the “abs” commands that would create them. After we have done that, remove the mark (point it to where it was previously) and continue where we left off.

There you have it: my nifty partial evaluation machine. I’m reasonably confident that it’s correct, but I’m still not totally happy with the implementation of indirections — mostly the fact that you have to skip other indirections when you are pushing them. I wonder if there is a better way to get the same effect.

# Dana (actual) progress

I have some very exciting news! I wrote some actual code in the Dana repository. It is the Certificate module. That is, it’s an abstract data type Proof, such that only valid proofs in IΞ can be constructed.

The certificate module (IXi.Term and IXi.Proof together) is about 280 lines, which isn’t fantastically small, but isn’t too bad. This is especially considering that I don’t expect it to grow—it is perfect, modulo bugs.

Right now, Proof is just a proof checker function, but it’s designed so that I could swap it out with some serializable format, or even (crosses fingers) a dependent certified type.

One interesting development in this implementation is the new conversion certificate for lambda calculus with De Bruijn notation. That is, objects of type Conversion represent a valid βη conversion between terms. Previously, I hadn’t been satisfied with my solution: the certificate was implemented as a pair of convertible terms. This led to far too many equality comparisons of terms when combining and using certificates, which is both inefficient and I suspect would be hard to prove things about. Also, it required you to know too much about the goal you were proving, bloating the proof terms and tangling them with the theorems they were trying to prove.

The hardest part is β expansion, for example turning X into (\x.\y.x) X (\x.x). That was the reason for the pair representation: to do beta expansion, you just did a beta reduction and reversed it.

The new implementation instead implements conversions as partial functions. I.e. you give a conversion a source term, and it gives you an equivalent term (or says it couldn’t convert). This means I had to separately model beta reduction and beta expansion, because you can’t easily reverse a conversion. However, the solution is quite clean. I chose a basis of expansion combinators, which can be composed to form any expansion. They are:

```    Identity : A → (\x. x) A
Constant : A → (\x. A) B    [x not free in A]
Apply : (\x. A) C ((\x. B) C) → (\x. A B) C
Lambda : \y. (\x. A) B → (\x. \y. A) B   [y not free in B]
```

This is in addition to the other combinators, which are needed to make this basis complete. They include β reduction, η expansion/contraction, and ways to focus a conversion on a subexpression. The key is that each combinator is correct by inspection, so we can be confident that the conversion algebra is sound.

I chose these combinators by thinking about what would be needed to construct the inverse conversion from bringing a term to normal form. If you’re familiar with SKI factorization, the process is pretty similar. Whenever you reduce an application (\x. A) B, you look at the structure of A and “push” B in by one level, applying one of these combinators. For example:

 Term Conversion (\f. \y. f (f y)) (\x. x) Lambda \y. (\f. f (f y)) (\x. x) inLambda Apply \y. (\f. f) (\x. x) ((\f. f y) (\x. x)) inLambda (inLeft Identity) \y. (\x. x) ((\f. f y) (\x. x)) inLambda Identity \y. (\f. f y) (\x. x) inLambda Apply \y. (\f. f) (\x. x) ((\f. y) (\x. x)) inLambda (inLeft Identity) \y. (\x. x) ((\f. y) (\x. x)) inLambda Identity \y. (\f. y) (\x. x) inLambda (Constant (\x. x)) \y. y

The reverse composition of the conversions on the right will bring us from \y. y to (\f. \y. f (f y)) (\x. x).

But isn’t it an awful pain to write all those combinators when proving things? Of course not! I make a computer do it for me. I have a little algorithm which takes two terms and computes a conversion between them, by bringing them both to normal form, and using the forward conversions one way and the inverse conversions the other way. Of course, if I give it terms which have no normal form it won’t halt, but the idea is that these terms are static: I use dummy terms to explain the conversion I want, and then apply the conversion I got back to the real terms (which may have subterms without normal forms).

So I say: get me from (\x y. A x) A I to (\x. x x) A, where “A” and “I” are just strings, and then I apply the conversion I got back to, say, (\x. \y. WW x) (WW) (\x. x), where WW has no normal form. The conversion still succeeds.

The certificate pattern shines here: my constructors are easy to verify, then I have a fairly involved algorithm for constructing certificates that is easy to use, which is guaranteed (at least one sort of) correct by construction.

So that’s fun stuff.

Proofs are still pretty tedious, however. My next step is to make some smart “tactic” combinators (which of course generate the underlying certificates) to make proofs easier. It shouldn’t take too long to make it at least tolerable. Then I’ll build up a library of certified infrastructure necessary for typechecking `Haskell--`, and finally write the compiler to complete the bootstrap. There are plenty of dragons to be slain along the way.

# Certificate Design Pattern

When working the latest incarnation of my System IG compiler, I used a thingy which I now realize ought to be characterized as a design pattern. It substantially changed the way I was thinking about the code, which is what makes it interesting.

Summary: separate an algorithm into certificate constructors and a search algorithm.

A large class of algorithms can be considered, in some way, as search algorithms. It is given a problem and searches for a solution to that problem. For example, typically you wouldn’t phrase the quadratic formula as a search algorithm, but it is—it’s just a very smart, fast one. It is given a,b, and c and searches for a solution to the equation ax2 + bx + c = 0.

The certificate design pattern separates the algorithm into two modules: the certificate module and the algorithm. The certificate module provides constructors for solutions to the problem. For each correct solution, it is possible to construct a certificate, and it is impossible to construct a certificate for an incorrect solution. The certificate module for the quadratic formula algorithm might look like this:

```module Certificate (Certificate, certify, solution) where

data Certificate = Certificate Double Double Double Double

certify :: Double -> Double -> Double -> Double -> Maybe Certificate
certify a b c x | a*x^2 + b*x + c == 0 = Just (Certificate a b c x)
| otherwise            = Nothing

solution :: Certificate -> (Double,Double,Double,Double)
solution (Certificate a b c x) = (a,b,c,x)
```

There is only one way to construct a Certificate, and that is to pass it a solution to the quadratic equation. If it is not actually a solution, a certificate cannot be constructed for it. This module is very easy to verify. The algorithm module is obvious:

```module Algorithm (solve) where
import Certificate
import Data.Maybe (fromJust)

solve :: Double -> Double -> Double -> Certificate
solve a b c = fromJust \$ certify a b c ((-b + sqrt (b^2 - 4*a*c)) / (2*a))

```

Here, we use the quadratic formula and construct a certificate of its correctness. If we made a typo in the formula, then certify would return Nothing and we would get an error when we fromJust it (an error is justified in this case, rather than an exception, because we made a mistake when programming — it’s like an assert).

The client to the algorithm gets a certificate back from solve, and can extract its solution. All the information needed to verify that the certificate is a correct certificate for the given problem should be provided. For example, if Certificate had only contained x instead of a,b,c,x, then we could have implemented solve like:

```solve a b c = certify 0 0 0 0

```

Because that is a valid solution, but we have not solved the problem. The client needs to be able to inspect that a,b,c match the input values. Maximally untrusting client code might look like this:

```unsafeSolve a b c =
let (a',b',c',x) = solution (solve a b c) in assert (a == a' && b == b' && c == c') x
where
assert True x = x
assert False _ = error "Assertion failed"

```

Here we can give any function whatsoever for solve, and we will never report an incorrect answer (replacing the incorrectness with a runtime error).

This is certainly overkill for this example, but in the System IG compiler it makes a lot of sense. I have a small set of rules which form well-typed programs, and have put in much effort to prove this set of rules is consistent and complete. But I want to experiment with different interfaces, different inference algorithms, different optimizations, etc.

So my Certificate implements combinators for each of the rules in my system, and all the different algorithms plug into that set of rules. So whenever I write a typechecker algorithm, if it finds a solution, the solution is correct by construction. This gives me a lot of freedom to play with different techniques.

Verification rules can be more involved than this single function that constructs a certificate. In the System IG compiler, there are 12 construction rules, most of them taking other certificates as arguments (which would make them certificate “combinators”). I’ll show an example of more complex certificate constructors later.

What is interesting about this pattern, aside from the added correctness and verification guarantees, is that is changed the way I thought while I was implementing the algorithm. Instead of being master of the computer, and telling it what to do, it was more like a puzzle I had to solve. In some ways it was harder, but I attribute that to redistributing the workload; it’s harder because I am forced to write code that is correct from the get-go, instead of accidentally introducing bugs and thinking I’m done.

The other interesting mental change was that it often guided my solution. I would look at the certificate I’m trying to create, and see which constructors could create it. This gave me an idea of the information I was after. This information is the information necessary to convince the client that my solution is correct; I cannot proceed without it.

Theoretically, the algorithm part could be completely generic. It might just do a generic search algorithm like Dijkstra. If it finds a certificate, then it has solved your problem correctly. Solutions for free! (But this will not be practical in most cases — it might not yield a correct algorithm by other criteria, such as “always halts”).

Here’s an example of a more complex certificate. The domain is SK combinator calculus, and a Conversion is a certificate that holds two terms. If a Conversion can be constructed, then the two terms are convertible.

```module Conversion ( Term(..), Conversion
, convId, convCompose, convFlip
, convS, convK, convApp)
where

infixl 9 :*
data Term = S | K | Term :* Term   deriving (Eq)
data Conversion = Term :<-> Term

convTerms (a :<-> b) = (a,b)

convId t = t :<-> t

convCompose (a :<-> b) (b' :<-> c)
| b == b' = Just \$ a :<-> c
| otherwise = Nothing

convFlip (a :<-> b) = b :<-> a

convS (S :* x :* y :* z) = Just \$ (S :* x :* y :* z)  :<->  (x :* z :* (y :* z))
convS _ = Nothing

convK (K :* x :* y) = Just \$ (K :* x :* y)  :<->  x
convK _ = Nothing

convApp (a :<-> b) (c :<-> d) = (a :* c) :<->  (b :* d)

```

The export list is key. If we had exported the (:<->) constructor, then it would be possible to create invalid conversions. The correctness of a certificate module is all about what it doesn’t export.

I’m wondering what the best way to present this as an object-oriented pattern is, so I can insert it into popular CS folklore (assuming it’s not already there ;-).

# Dependent types are ridiculously easy

After an inquiry about a combinatory calculus for dependent types, somebody in #haskell — unfortunately I don’t remember who — recommended I look at Illative Combinatory Logic. After studying the paper a bit, all I can say is Wow! It implements a (conjecturally) complete logical system in but three simple typing rules, together with the untyped lambda calculus conversion rules. I was able to write a dependent typechecker in 100 lines.

The specific system I’m working with in the paper is called system IG. The paper also introduces three weaker systems, which are also very nice and worth understanding.

As an example, here is a derivation of the identity function and its type. This is a backward proof, so each step is implied by the step after it. But reading it this way makes the algorithm clear.

```    |- G L (\A. G A (\_. A)) (\A x. x)
L a |- (\A. G A (\_. A)) a ((\A x. x) a)     (Gi)
L a |- G a (\_. a) ((\A x. x) a)             (beta)
L a |- G a (\_. a) (\x. x)                   (beta)
L a, a b |- (\_. a) b ((\x. x) b)            (Gi)
L a, a b |- a ((\x. x) b)                    (beta)
L a, a b |- a b                              (beta)
```

The trick, which isn’t so much a trick as a beautiful deep connection I gather (although I don’t fully grok it), is that typing propositions are just applications. So if A is a type, then A b is the proposition “b has type A”. L is the type of types, and G is essentially Π (the product type constructor). So, for example, the fourth line in the above proof would be written as follows in a conventional system:

```    a : Type  |-  (\x. x) : a -> a
```

I need to extend it with finite types and boxes, but I really don’t want to because it’s so pretty as is! :-)

# Dana update: Core execution language, dependent combinators

It’s been a little while since I gave an update about Dana, so here it is.

There has been very little code progress in the last week. Instead, because of the self-embedding issues I talked about last time, I have been exploring the use of a combinator calculus as an intermediate stage for typechecking. I’m on to something, though nothing is terribly concrete yet. The combinator basis I’m using is:

```    Sxyz = xz(yz)
Ktxy = x    -- with the constraint (y:t)
Bxyz = x(yz)
Cxyz = xzy
Itx = x   -- with the constraint (x:t)
```

Along with primitives for Pi, Arrow (which is always expressible as Pi, but this keeps term size down), and Type.

The point of the constraints is to provide type annotations. I added a type annotation to every combinator which “erases” a lambda (rather than just pushing it downward) during factorization, because when a lambda is erased so is its annotation, so this puts it back.

My goal is for the typechecker not to use any supply of unique names.

Checking is not totally straightforward. Notice how terms like S have no unique most general type; they would have to take some type arguments to do that. So my current lead is that the type of a combinator term t is either a “real” type, or it’s a function which takes a term x and returns the type of t x (or another such function, etc.). This algorithm will be partial; i.e. there will be cases where I require an argument to have an actual type rather than a function. This is because attempting the “total” algorithm is undecidable, so it’s partial anyway.

The other thing I’ve been thinking about is a simple, composable “RTS language”, which does explicit memory management in some form, in which I could implement a garbage collector, etc. This has not been as fruitful as my combinator explorations. I’ve been looking at various forms of linear lambda calculus, which has been neat and enlightening — for example, there is a sense in which all computable functions are differentiable using LLC, which I might blog about soon — but unfortunately it does not look very promising as an RTS language. Any suggestions?

And today I started thinking about what I will need to write my higher-level languages, in particular PureHaskell (Haskell98 sans IO and FFI). It’s pretty annoying that I can’t use any Haskell libraries which use tricky GHC extensions, because those kill my bootstrap (i.e., my PureHaskell compiler needs to be in PureHaskell). There are a lot of libraries that use extensions. For once I wish the flagship compiler weren’t so capable!

# Parallel Rewrite System

There was a recent question on Haskell-cafe about how to parallelize a simple “rule engine”. I spent a while investigating it, and the solution is both pretty and fast. And my response on haskell-cafe is really messy because gmail is silly. So here is my response, cleaned up, in literate Haskell:

```> {-# LANGUAGE RankNTypes #-}
>
> import qualified Data.MemoCombinators as Memo
> import qualified Data.Set as Set
> import Control.Parallel (par)
> import qualified Control.Parallel.Strategies as Par
> import Data.Monoid (Monoid(..))
> import qualified Data.DList as DList
>
```

First, I want to capture the idea of a generative set like you’re doing. GenSet is like a set, with the constructor “genset x xs” which says “if x is in the set, then so are xs”.

I’ll represent it as a stateful computation of the list of things in the set, threading the set of things we’ve seen so far. It’s redundant information, but sets can’t be consumed lazily, thus the list (the set will follow along lazily :-).

Remember that State s a is just the function (s -> (s,a)). So we’re taking the set of things we’ve seen so far, and returning the new elements added and the set unioned with those elements.

```> newtype GenSet a
>       = GenSet (State (Set.Set a) (DList.DList a))
>
> genset :: (Ord a) => a -> GenSet a -> GenSet a
> genset x (GenSet f) = GenSet \$ do
>     seen <- gets (x `Set.member`)
>     if seen
>         then return mempty
>         else fmap (DList.cons x) \$
>                    modify (Set.insert x) >> f
>
> toList :: GenSet a -> [a]
> toList (GenSet f) = DList.toList \$ evalState f Set.empty
```

GenSet is a monoid, where mappend is just union.

```> instance (Ord a) => Monoid (GenSet a) where
>     mempty = GenSet (return mempty)
>     mappend (GenSet a) (GenSet b) =
>                  GenSet (liftM2 mappend a b)
```

Okay, so that’s how we avoid exponential behavior when traversing the tree. We can now just toss around GenSets like they’re sets and everything will be peachy.

Here’s the heart of the algorithm: the reduce function. To avoid recomputation of rules, we could just memoize the rule function. But we’ll do something a little more clever. The function we’ll memoize (“parf”) first sparks a thread computing its *last* child. Because the search is depth-first, it will typically be a while until we get to the last one, so we benefit from the spark (you don’t want to spark a thread computing something you’re about to compute anyway).

```> reduce :: (Ord a) => Memo.Memo a -> (a -> [a]) -> a -> [a]
> reduce memo f x = toList (makeSet x)
>     where
>     makeSet x = genset x . mconcat . map makeSet . f' \$ x
>     f' = memo parf
>     parf a = let ch = f a in
>              ch `seq` (f' (last ch) `par` ch)
```

The ch `seq` is there so that the evaluation of ch and last ch aren’t competing with each other.

Your example had a few problems. You said the rule was supposed to be expensive, but yours was cheap. Also, [x-1,x-2,x-3] are all very near each other, so it’s hard to go do unrelated stuff. I made a fake expensive function before computing the neighbors, and tossed around some prime numbers to scatter the space more.

```> rule :: Int -> [Int]
> rule n = expensive `seq`
>            [next 311 4, next 109 577, next 919 353]
>     where
>     next x y = (x * n + y) `mod` 5000
>     expensive = sum [1..50*n]
>
> main :: IO ()
> main = do
>     let r = reduce Memo.integral rule 1
>     print (length r)
```

The results are quite promising:

```% ghc --make -O2 rules2 -threaded
% time ./rules2
5000
./rules2  13.25s user 0.08s system 99% cpu 13.396 total
% time ./rules2 +RTS -N2
5000
./rules2 +RTS -N2  12.52s user 0.30s system 159% cpu 8.015 total
```

That’s 40% decrease in running time! Woot! I’d love to see what it does on a machine with more than 2 cores.