# My Premonition

Last summer I got the opportunity to work briefly with Conal Elliott in Belgium. The two weeks I spent there are still making an impact on me; immediately I resonated with his semantic design idea. But the more important ideas were the ones I picked up subconsciously, and I’m just now becoming able to put them in words.

(1) If extreme idealism isn’t working out for you, maybe it is because you are not wholeheartedly following your own ideals.

(2) You must be the change you wish to see in the world (– Mahatma Gandhi). As applied to software: design software as if it were the beautiful paradise you want it to be, then build pieces of the scaffolding back to the status quo.

Both of these were key insights leading me to undertake Dana.

At the time I was desperately trying to create an FRP framework for Haskell, trying in terms of IO and STM and playing messy tricks with the operational semantics of thunks. It felt like cheating. I had the desire to throw away IO altogether in my efforts, just to see where that led me (this forced me to put aside the way I thought an FRP interface should look). This was an instance of (1).

But in Haskell, IO would always be beneath it all. Your functional program always becomes an imperative program at the end of the day. Maybe this dissonance is what is causing the endless FRP difficulties. Maybe we can’t really know what pure FRP is until we have pure turtles all the way down — until there is no impedance mismatch between the functional and the imperative. This set me on researching the idea of a purely functional operating system, which is still the slogan by which I name Dana.

But now Dana is more than that. Now, it is my (2). In this post, I describe how I see the world of software in 30 years. This isn’t about what kind of devices computers will be embedded in or how they are connected; this is about the process of developing software.

Code and program are the same; any fragment of code may be used interactively. Complex programs are just compositions of simpler ones, and composition of programs is a fundamental concept to both users and programmers. Thus power users cannot help but to be programmers.

Computer science has become a strict subfield of applied mathematics. Creating complex software is primarily modeling: the fundamental task in software development is to create a mathematical model of your problem. The model is encoded directly, and then software scaffolding is built around it, verified to the model. Reasonable shops do not dare ship code that hasn’t been verified to correspond to their model.

Development tools — programs that talk about programs — are astoundingly rich and expressive. Simply enter a property about your code. The tool goes off on a model search to try to refute the property. Meanwhile, in a way that my mind cannot yet comprehend, the programmer communicates the “essence” of why the property should be true, and the tool fills in all the details of the proof. The property is added to a database of facts, which everyone can immediately see and use.

Provably correct code is what people do, because the tools have made it so easy that it is an obvious engineering choice (this ease does not come without a resculpting of our brains). Gargantuan test suites are replaced by a few hours of a developer with a proving tool (an exponential increase in productivity came from the fact that theorems compose, but tests don’t).

User interfaces are mostly an artistic task, quite different from the rigorous visualization that modelers do. All objects have a few obvious ways to endow them with primitive user interfaces (see Eros). Then making a nice user interface is a matter of rearranging them, composing inputs and outputs, and drawing/animating things (under the assumption that computers are still monitor, keyboard, and mouse driven — make the appropriate generalizations).

The shops that will be renowned by power users are those who have a great program with a great interface, but they also ship components to the underlying logic. This is trivial for them, because of the ubiquitous separation between interface and logic. Some are still irrationally afraid of giving away their trade secrets by exposing this, so there are still lame, interface-only programs. But these lame programs will not have the chance to participate in the miracles of engineering the decade will bring, a combiatorical build up of the the work of millions of developers, coming together to create the Star Trek computer — or something else, far more advanced.

Computation is an information science, and is subject to the exponential explosion that comes with it. This Utopian vision hinges on correct composition of everything, from programs to facts.

Right now, software is primarily non-compositional. We have libraries and languages, which are built up of smaller libraries and languages. But end products are not designed to be composed; they are mostly designed for the end user only. But the end user is where the money is, thus where a great deal of the effort in the industry is going. Moreover, for advanced programs (say, Maya), the end user is also where a lot of the skill and talent is.

We have to work at this. There is little small-scale economic incentive to make your program composable; you just give your competitors the ability to use your software without you being able to use theirs. And it’s hard for idealistic young CEOs to try such a thing, since the state of software development means you have to put in a lot of work both to make your program composable and to compose other people’s programs. But these things are not fundamentally hard, they are just hard using the languages and methods of today.

Dana is my own first step in this direction. It aims to be an environment which forces everything to be composable and verifiable. This comes at the expense of mental laziness: you will have to seriously rewire your brain to work in it. I don’t believe that the enlightenment can come without this rewiring. I’m desperately trying to rewire my own, to even consider the choices that will be obvious to the generation ahead.

Is your brain going to be ready when the revolution comes?

# The role of a core calculus

A lot of software engineering can be described as the art of skillful procrastination. When we’re designing some software, we would like to put off as many decisions as possible as long as possible, while still making progress toward the goal. This procrastination leads to good abstraction, because any code we wrote before we made said decision is independent of that decision, which means the decision can be changed later, leading to “maintainability”.

I would describe the philosophy of the Dana project as: make the right decision or keep thinking. It is a philosophy that only a project without a time limit can afford. This constraint forces reconsideration of the very fundamentals of computer science, since I believe much of our modern infrastructure is built on wrong decisions. I don’t ever expect Dana to be finished; instead, it is a unifying project within which some right decisions may be found.

When Dana becomes something runnable, it will look something like an hourglass: a large body of purely functional code on top, all represented eventually upon an extremely simple core calculus, below which layers of complexity are again added to adapt to existing operating systems and software. The layers above the core are only incidentally related to the layers below it, and I hope by studying whatever the eventual form of the upper layers, we can find ways to rip out the needless complexity below and replace it with something more fitting.

This is all very abstract, and it is hardly clearer in my mind than it is in this picture. However, I’m getting a good sense for what that skinny core is and how it relates to the rest of the system. That is what I describe for the rest of this article.

Despite what we pretend to believe, computer programs are more than their denotational semantics. Hard as we try, there will never be a system in which λx. x and λx. log2(2x) are indistinguishable (over, say, the naturals) due to execution performance. The finiteness of memory and non-instantaneousness of computations are things we will always have to deal with. Using a system which ignores these issues as a fundamental backbone would be a wrong decision.

However, Dana’s mission is to be a purely functional operating system. A key feature of reasoning pure functional programming is that of extensionality: if two functions give the same outputs for the same inputs, they are the same function. So here, λx. x and λx. log2(2x) must be the same function. To throw that away would be to give into just another imprecise, operational notion of computer programs.

At first sight, these two points directly contradict each other. How do we reconcile them?

I rather loosely used the term “same” above. What I mean by that word is that, if two things are the same, then you can safely substitute one for the other in all contexts and not observe a difference. Again there is imprecision in this definition: what do I mean by “observe a difference”? And therein lies our solution.

The core calculus is not something which is executed—not seriously, at least. Instead, it’s a way to define your observations. We come back to our two function “λx. x” and “λx. log2(2x)”—mind the quotes. These are represented in our core calculus not as the functions themselves, but as terms in an abstract syntax tree, in which they are obviously not the same. Call them f and g, respectively. Elsewhere in the core, we might define functions on syntax trees, such as Meaning and Performance. Then we can prove that Meaning(f) = Meaning(g) but Performance(f) ≠ Performance(g). These functions make explicit the varying parameters that cause the two implementations to vary.

This syntax tree might then be passed along to an interpreter, which executes it, respecting the Meaning and Performance functions. This can be proven if the interpreter, too, is implemented on top of the core calculus. But eventually something must leave the system to get to the drivers, and we can’t use the core calculus to prove its correctness anymore. Many methods can be used to convince ourselves of the correctness of the driver, but we can’t do it within Dana—an unfortunate fundamental necessity.

The point is, though, that everything above the core can be proven correct, and it is all coherent: the code we are proving things about is the very same code we’re running, and we don’t have to worry about translation errors. If we can trust the drivers, then we can trust the code.

# Recursive Types in IΞ

In the last IΞ post, I introduced the calculus and sketched the construction of some standard mathematical objects. In this post, I will dive a little deeper and construct of all the positive recursive types. The proof will be in an informal style (in particular, omitting the H constraints), but I have little doubt that the proof will go through.

Only a superficial familiarity with IΞ is needed to understand this proof; I adopt a set-theoretic style notation, so the proof should be approachable. Here’s a quick refresher of the primitives of IΞ.

• Membership is application, so when I write $x \in A$, this is translated into IΞ as $A x$. Thus sets, unary predicates, and types are all the same thing.
• Because membership and application are identified, universal quantification and the subset relation are also. $\Xi A B$ means “A is a subset of B”, or “x in A implies x in B”. In particular, the pattern $\Xi A (\lambda a. P a)$ can be interpreted as “for every a in A, P a”.
• L is the set of all sets (whose existence does not lead to a contradiction in this system!). I will give its definition at the end of the article.
• Other symbols have their usual interpretation, and I’ll include an appendix which gives them all precise definitions at the end.

Definition: A function $F : L \mapsto L$ is called monotone if $A \subseteq B \Rightarrow F A \subseteq F B$.

Intuitively, the monotone functions correspond roughly to the functors in Haskell; they use their parameter in a positive way (appear to the left of an even number of arrows).

Definition: The type recursion combinator μ is defined as: $x \in \mu F = \forall P\!\in\!L.\, F P \subseteq P \Rightarrow x \in P$.

We are allowed to define things simply by giving a condition for membership like this. Formally, this definition starts out: $\mu = \lambda F. \lambda x. \Xi L (\lambda P. \dots)$

This definition intuitively says, x is in μ F if x is in every type closed under F. We will see that this definition corresponds to a type recursion combinator.

Lemma 1: If F is monotone, then $F (\mu F) \subseteq \mu F$.

Proof. Given $x \in F (\mu F)$; show $x \in \mu F$. Expanding the definition of μ above:

Given $P \in L, F P \subseteq P$; show $x \in P$.

Observe $\mu F \subseteq P$: Suppose $y \in \mu F$, show $y \in P$. Since $y \in \mu F$, we have $\forall P'\!\in\!L.\, F P' \subseteq P' \Rightarrow y \in P'$ by definition of μ. Letting $P' = P$, we have $F P \subseteq P$ from above, so $y \in P$.

Therefore, $x \in F (\mu F) \subseteq F P \subseteq P$ (by the monotonicity of F and $\mu F \subseteq P$). QED.

Now for the easy direction:

Lemma 2: If F is monotone, then $\mu F \subseteq F (\mu F)$.

Proof. Given $x \in \mu F$; show $x \in F (\mu F)$.

By the definition of μ, we have $\forall P\!\in\!L.\, F P \subseteq P \Rightarrow x \in P$. Let $P = F (\mu F)$. We have $F P = F (F (\mu F)) \subseteq F (\mu F) = P$ by monotonicity of F and Lemma 1, therefore $x \in P = F (\mu F)$. QED.

Which leads us to the recursion equation.

Theorem: If F is monotone, $\mu F = F (\mu F)$. (Proof trivial by the two lemmas)

I’m using set equality here, i.e. mutual containment, which is probably a bit weaker than Leibniz equality. It is obvious from the definition that this fixed point is minimal.

Monotonicity is easy to prove for any of the standard positive types (products, coproducts, functions). So we can model a good variety of Haskell data types already; however these are standard inductive (least fixed point) types, no infinite structures allowed. I’m still working on the encoding and analogous proofs for ν (greatest fixed point), which is closer to Haskell.

Hopefully I’ll be able to port many libraries (maybe after a few totality annotations) without having to worry about partiality. But there are more pressing matters, such as finishing my interactive proof assistant for IΞ.

Definitions

• $\text{True} = \Xi H H$, the True proposition.
• $K x y = x$, the constant function.
• $U = K \text{True}$, the set of all objects.
• $(f \circ g) x = f (g x)$, function composition.
• $(a \mapsto b) f = \Xi a (b \circ f)$, the set of functions from a to b.
• $L = U \mapsto H$, the set of all sets.
• $a \Rightarrow b = \Xi (Ka) (Kb)$, implication.
• $\Lambda f x = \Xi L (\lambda t. f t x)$, universal quantification of types (like forall in Haskell)
• $a \times b = \Lambda (\lambda r. (a \mapsto b \mapsto r) \mapsto r)$, product type.
• $a \oplus b = \Lambda (\lambda r. (a \mapsto r) \mapsto (b \mapsto r) \mapsto r)$, coproduct type.

# It is never safe to cheat

Anyone who has spent time trying to implement an FRP library knows the unsafePerformIO story. You may use unsafePerformIO as long as you ensure that the result maintains purely functional semantics. It’s possible to create impure values with unsafePerformIO. It is up to you to “prove” that you have created a pure one. Seems like a decent trade-off.

I put “prove” in quotes for a reason. If you’re doing something nontrivial (i.e. you’re not just using unsafePerformIO . return), you need an operational semantics for IO to prove this. But that’s not all! You are probably depending on some external state inside the unsafePerformIO, which depends on the time and order in which thunks are evaluated. But thunks aren’t part of the operational semantics of IO, they are part of the operational semantics of pure values in Haskell — something we quite explicitly do not have. So you need not only to embrace the ill-definedness of IO, but in fact tie yourself down to a particular operational interpretation of Haskell!

Let’s say I write an HNF evaluator for Haskell. Your unsafePerformIO magic will probably not work on this style of evaluator because the meaning of thunks — and the way they are executed — is quite different in this style.

There are more invariants on a Haskell function than purity and referential transparency. We can of course only implement computable functions. They have to be monotone and continuous. And they might be other things, as well, which someone one day will come along and prove by leveraging properties of the type system, exposed primivites, etc. (see ST for prior art). But they have not accounted for your magic, so their analysis does not include any program using your library.

We are pure functonal programmers. We have chosen a language which vastly restricts what we are allowed to do, because we understand the benefits we reap as a result. However, pretending to understand when we are allowed to cheat only buys us the benefits we know about now, but precludes future benefits from work in the field. By using unsafePerformIO — even in a safe way (or so you think) — you avert the exponential growth of our field.

If you find you can’t express something you feel you should be able to, I suggest one of two things: (1) look deeper until you find an incidental limitation of the language, and attempt to solve it at the language level, or (2) look deeper until you understand why you actually shouldn’t have been able to do that, revealing the truth from behind the curtain of zealous ignorance. In my experience, (2) is much more often successful.

On a more practical note, most of said limitations are about performance, which is not in the semantics’ domain of discourse. It makes sense that our languages wouldn’t be good at such things. Instead of introducing a hack, why not push the field forward and think about what a language which can talk about such things would look like? Each time you run into a limitation, you have a new use case, and thus a new perspective on the problem.

Until that problem is solved, though, your library users might have to pay the price of not having as elegant an interface. But by restricting yourself thusly, you are protecting yourself from your own ignorance, at least knowing that what you have made should, in fact, be makeable.

In summary: It is never safe to cheat.

# Lazy Partial Evaluation

Inspired by Dan Piponi’s latest post, I have been looking into partial evaluation. In particular, I thought that a language which emphasizes currying really ought be good at partial evaluation. In this post I describe some ideas regarding partial evaluation in functional languages, and later sketch a partial evaluation machine I devised.

Supercombinator reduction machines, like GHC, do a limited form of partial evaluation. I.e. when you compile a program to supercombinators, you are optimizing it for specialization from left to right. So if f is defined in pointful form, “let a = f x in (a z, a z)” might be a better program than “(f x z, f x z)”. This is a nice property of combinator reduction. Unfortunately, it doesn’t generalize: “let a = flip f x in (a y, a y)” will never be better than “(f y x, f y x)”, because functions only specialize from left to right. I conjecture that this missing feature is more important than we realize.

Mogensen gives a very elegant partial evaluator in pure lambda calculus, which optimize as expected with the Futamura projections (see Dan’s post). This partial evaluator works on higher order abstract syntax, taking and returning descriptions of terms rather than the terms themselves. Essentially all it is is (very simple) machinery describing how to evaluate under a lambda.

The system in that paper takes many precautions to avoid unfolding dynamic arguments, because otherwise the partial evaluator might not terminate. Apparently he is not well-versed in our Haskell School of the Infinite, because the evaluator is compositional. So what he means by “not terminate” is “return an infinite program”. But an infinite program is fine if you interpret/compile it lazily!

In fact, I believe (I am intuiting — I have done no formal checking) that the simplest-minded of lazy partial evaluators is perfect: it maximally unfolds its static arguments, there is no need for the type inference machinery in that paper, and it will have the same termination properties as the program. I attribute the ease of this task with the built-in metacircularity of lambda calculus.

Cool as a self-embedded partial evaluator is, to do efficient partial evaluation you need to keep quotations of your programs everywhere, then compile them before you actually use them. Lambda calculus is hinting at something more: that simply by applying one of several arguments to a curried function, you are specializing it. Wouldn’t it be great if every time you did that, the program were maximally specialized automatically?

### A partial evaluation reduction strategy

It turns out that such an ambitious wish is nothing more than an evaluation order for the lambda calculus. Admittedly, it’s not a very common one. You try to reduce the leftmost application, even under lambdas. We would also like to combine this with call-by-need, so when an argument is duplicated it is only reduced once.

Here’s an example program I’ve been working with, with the standard definitions of the combinators Ix = x and Kxy = x.

  flip (\x y. y I x) K K


It’s not much, but it gets the point across. Let’s look at it in call-by-name:

[1]  (\f a b. f b a) (\x y. y I x) K K
[2]  (\a b. (\x y. y I x) b a) K K
[3]  (\b. (\x y. y I x) b K) K
[4]  (\x y. y I x) K K
[5]  (\y. y I K) K
[6]  K I K
[6'] (\m n. m) I K
[7] (\n. I) K
[8]  I


Notice by the step [4] that we have lost the structure of flip (\x y. y I x) K, so any work we do from then on we will have to redo on subsequent applications of that function. Contrast this with the partial evaluation strategy:

[1]  (\f a b. f b a) (\x y. y I x) K K
[2]  (\a b. (\x y. y I x) b a) K K
[3]  (\a b. (\y. y I b) a) K K
[4]  (\a b. a I b) K K
[5]  (\b. K I b) K
[5'] (\b. (\m n. m) I b) K
[6]  (\b. (\n. I) b) K
[7]  (\b. I) K
[8]  I


We got the function all the way down to a constant function before it was finally applied.

One thing that’s interesting to notice about this strategy is that it seems stricter than call-by-name. That is, if you have a nonterminating term N, then reducing the application (\x y. Ny) z will loop, whereas it won’t in CBN. However, recall that in the domain theory, (\x. ⊥) = ⊥. The only thing you can do to a function to observe it is apply it, and whenever you apply this function you will loop. So you are bound to loop anyway if you are evaluating this application.

### The lazy partial evaluation machine (sketch)

Here is a sketch of an efficient low-level machine for this evaluation strategy. It is simple stack machine code (“which I pull out of my bum for now, but I don’t think an algorithm to generate it will be any trouble”). The only tricky bit about it is that pieces are sometimes removed from the middle of the stack, so it can’t necessarily be represented linearly in memory. A singly linked representation should work (I realize this costs performance).

The values in this language look like [v1,v2,…] { instr1 ; instr2 ; … }. They are closures, where the vn are pushed on to the stack in that order before the instructions are executed. Values can also be “indirections”, which point to an absolute position in the stack. Indirections represent a logical reordering of the stack, and are used to model bound variables. When indirections are executed, they remove themselves and execute (and remove) the thing they point to. The instructions are as follows.

  pop n      -- delete the entry at position n
dup n      -- push the entry at position n on the top of the stack
(other standard stack ops here)
abs n      -- push an indirection pointing to position n
-- (skipping other indirections) on top of the stack
exec n     -- pop and execute the closure at position n
closure [n1,n2,...] { instr1 ; instr2 }
-- remove positions n1,n2,... and add them
-- to the closure with the given code, and push it


dup is the only instruction which duplicates a value; this is where laziness will be encoded.

Let’s look at the code for each piece of our example program:

    I: [] { exec 0 }
K: [] { pop 1; exec 0 }


These are pretty straightforward. They both receive their arguments on the stack (the first argument in position 0, and downwards from there), reduce and continue execution. Recall the code is what happens when a value is forced, so every value ends with an exec to continue the execution.

    (\x y. y I x): [] { push I; exec 2 }


This one might be a little tricker. The stack comes in like this (position 0 on the left): x y, After push I, it looks like “I x y”, so y’s incoming stack will be “I x” just as it wants. Finally, the interesting one:

   (\f a b. f b a): [] { abs 2; exec 2 }


abs 2 pushes an indirection to the argument b onto the stack before calling f. This is how evaluation is pulled under binders; when an indirection is forced, it reduces the application at the appropriate level, and all below it. I am still not totally sure when to introduce an abs; my guess is you do it whenever a function would otherwise reorder its arguments. An example may demonstrate what I mean (but perhaps not; I haven’t totally wrapped my head around it myself).

Here’s an execution trace for the example above, with the stack growing to the left. I start with the instruction pointer on the first function and the three arguments on the stack. The stack shown is the state before the instruction on each row. An identifier followed by a colon marks a position in the stack for indirections:

 1 abs 1 (\x y. y I x) K K 2 abs 3 a (\x y. y I x) a:K K 3 exec 2 b a (\x y. y I x) a:K b:K 4 closure I b a a:K b:K 5 exec 2 I b a a:K b:K 6 pop 1 I b b:K 7 exec 0 I

When an indirection is executed, as in step 5, that is evaluation under a lambda.

This machine still doesn’t support laziness (though it didn’t matter in this example). We can achieve laziness by allocating a thunk when we dup. To evaluate the thunk, we put a mark on the stack. Whenever an instruction tries to reach beyond the mark, we capture the current stack and instruction pointer and jam it in a closure, then write that closure to the thunk. Indirections get replaced by their offsets; i.e. the “abs” commands that would create them. After we have done that, remove the mark (point it to where it was previously) and continue where we left off.

There you have it: my nifty partial evaluation machine. I’m reasonably confident that it’s correct, but I’m still not totally happy with the implementation of indirections — mostly the fact that you have to skip other indirections when you are pushing them. I wonder if there is a better way to get the same effect.

# Dana (actual) progress

I have some very exciting news! I wrote some actual code in the Dana repository. It is the Certificate module. That is, it’s an abstract data type Proof, such that only valid proofs in IΞ can be constructed.

The certificate module (IXi.Term and IXi.Proof together) is about 280 lines, which isn’t fantastically small, but isn’t too bad. This is especially considering that I don’t expect it to grow—it is perfect, modulo bugs.

Right now, Proof is just a proof checker function, but it’s designed so that I could swap it out with some serializable format, or even (crosses fingers) a dependent certified type.

One interesting development in this implementation is the new conversion certificate for lambda calculus with De Bruijn notation. That is, objects of type Conversion represent a valid βη conversion between terms. Previously, I hadn’t been satisfied with my solution: the certificate was implemented as a pair of convertible terms. This led to far too many equality comparisons of terms when combining and using certificates, which is both inefficient and I suspect would be hard to prove things about. Also, it required you to know too much about the goal you were proving, bloating the proof terms and tangling them with the theorems they were trying to prove.

The hardest part is β expansion, for example turning X into (\x.\y.x) X (\x.x). That was the reason for the pair representation: to do beta expansion, you just did a beta reduction and reversed it.

The new implementation instead implements conversions as partial functions. I.e. you give a conversion a source term, and it gives you an equivalent term (or says it couldn’t convert). This means I had to separately model beta reduction and beta expansion, because you can’t easily reverse a conversion. However, the solution is quite clean. I chose a basis of expansion combinators, which can be composed to form any expansion. They are:

    Identity : A → (\x. x) A
Constant : A → (\x. A) B    [x not free in A]
Apply : (\x. A) C ((\x. B) C) → (\x. A B) C
Lambda : \y. (\x. A) B → (\x. \y. A) B   [y not free in B]


This is in addition to the other combinators, which are needed to make this basis complete. They include β reduction, η expansion/contraction, and ways to focus a conversion on a subexpression. The key is that each combinator is correct by inspection, so we can be confident that the conversion algebra is sound.

I chose these combinators by thinking about what would be needed to construct the inverse conversion from bringing a term to normal form. If you’re familiar with SKI factorization, the process is pretty similar. Whenever you reduce an application (\x. A) B, you look at the structure of A and “push” B in by one level, applying one of these combinators. For example:

 Term Conversion (\f. \y. f (f y)) (\x. x) Lambda \y. (\f. f (f y)) (\x. x) inLambda Apply \y. (\f. f) (\x. x) ((\f. f y) (\x. x)) inLambda (inLeft Identity) \y. (\x. x) ((\f. f y) (\x. x)) inLambda Identity \y. (\f. f y) (\x. x) inLambda Apply \y. (\f. f) (\x. x) ((\f. y) (\x. x)) inLambda (inLeft Identity) \y. (\x. x) ((\f. y) (\x. x)) inLambda Identity \y. (\f. y) (\x. x) inLambda (Constant (\x. x)) \y. y

The reverse composition of the conversions on the right will bring us from \y. y to (\f. \y. f (f y)) (\x. x).

But isn’t it an awful pain to write all those combinators when proving things? Of course not! I make a computer do it for me. I have a little algorithm which takes two terms and computes a conversion between them, by bringing them both to normal form, and using the forward conversions one way and the inverse conversions the other way. Of course, if I give it terms which have no normal form it won’t halt, but the idea is that these terms are static: I use dummy terms to explain the conversion I want, and then apply the conversion I got back to the real terms (which may have subterms without normal forms).

So I say: get me from (\x y. A x) A I to (\x. x x) A, where “A” and “I” are just strings, and then I apply the conversion I got back to, say, (\x. \y. WW x) (WW) (\x. x), where WW has no normal form. The conversion still succeeds.

The certificate pattern shines here: my constructors are easy to verify, then I have a fairly involved algorithm for constructing certificates that is easy to use, which is guaranteed (at least one sort of) correct by construction.

So that’s fun stuff.

Proofs are still pretty tedious, however. My next step is to make some smart “tactic” combinators (which of course generate the underlying certificates) to make proofs easier. It shouldn’t take too long to make it at least tolerable. Then I’ll build up a library of certified infrastructure necessary for typechecking Haskell--, and finally write the compiler to complete the bootstrap. There are plenty of dragons to be slain along the way.

# Some Constructions in IΞ

Over the past couple months, I have been attempting to find a language to use as a core calculus for Dana, as anyone who follows this blog knows. I have been slowly converging on IΞ for its elegance and simplicity. It turns out to be very powerful, capable of modeling many structures while avoiding many of the paradoxes typically involved with those structures. In this post, I give an exposition of IΞ and construct some common structures.

The main idea is to use lambda calculus for all of our symbol manipulation needs, adding a constant Ξ for quantification. “Types” are unary predicates, so membership is application. For example, “Nat 0” is a proposition saying that 0 has type Nat (where 0 and Nat are given previous definitions somewhere).

ΞXY means “forall x, if X x, then Y x”, or, if you think of predicates as sets, simply “X is a subset of Y”. So if we have types Nat and Bool, we can say that f has type Nat → Bool with “ΞNat(\x. Bool (f x))”, read “forall x in Nat, f x is in Bool”. Very direct, is it not?

### System I

Interpreting these meanings directly, we arrive at Curry’s System I, whose rules follow. A proposition (sequent) has the form “Γ |- X”, which has the interpretation “assuming Γ, X is provable”.

Γ |- X   ⇐  X ∈ Γ
Γ |- Y   ⇐  Γ |- X  ;  X is βη convertible with Y
Γ |- YZ  ⇐  Γ |- XZ  ;  Γ |- ΞXY
Γ |- ΞXY ⇐  Γ,Xa |- Ya  ;  a does not occur free in X,Y,Γ


I have stated the rules as “to prove X, you need to prove Y”, because that’s kinda how my brain works. Take a moment to internalize them. They are obvious given the above intuitions, and shouldn’t be hard to read.

On top of this, we can build propositional calculus. Define “K = \x. \y. x”, and write “X ⇒ Y” as shorthand for “Ξ(KX)(KY)”. This system has the properties you would expect of a simple propositional calculus.

Sadly, this system is inconsistent. We can prove any proposition X:

  Let Y = (\x. x x ⇒ X) (\x. x x ⇒ X)
Observe that Y = Y ⇒ X.
[1] Y |- Y
[2] Y |- Y ⇒ X
[3] Y |- X      -- modus ponens on [1],[2]
[4] |- Y ⇒ X
[5] |- Y
[6] |- X        -- modus ponens on [4],[5]


Martin Bunder had a brilliant idea to block this paradox, and also many others, which brings us to the object of my infatuation:

### System IΞ

The crucial step above was [3]-[4], where we abstracted over the infinite proposition “((… ⇒ X) ⇒ X) ⇒ X”. The way we will block this is to only allow abstraction over finite propositions. Introduce a new symbol, H, such that HX means “X is a finite proposition” (or simply “X is a proposition”). We will derive finiteness from the finiteness of proofs: to prove HX, we first have to prove H of each component of X. Our system becomes (the new additions are in bold):

Γ |- X      ⇐  X ∈ Γ
Γ |- Y      ⇐  Γ |- X  ;  X is βη convertible with Y
Γ |- YZ     ⇐  Γ |- XZ  ;  Γ |- ΞXY
Γ |- ΞXY    ⇐  Γ |- H(Xa)  ;  Γ,Xa |- Ya     ;  a does not occur free in X,Y,Γ
Γ |- H(ΞXY) ⇐  Γ |- H(Xa)  ;  Γ,Xa |- H(Ya)  ;  a does not occur free in X,Y,Γ
Γ |- H(HX)


The final rule is an axiom, simply saying that “X is a proposition” is always a proposition.

### Constructing the Naturals

Now I will embark on constructing the type of naturals. Since types are predicates, I also need to come up with a representation for naturals. It turns out that it doesn’t matter what representation I use, as long as it has zero and a 1-1 successor function. For the sake of discussion, let’s use the Church encoding.

  0 = \f. \x. x
S n = \f. \x. f (n f x)


So a natural is an iteration function. For example, the number 3 iterates a function 3 times on its argument: 3 f x = f (f (f x)).

Coming up with a way to classify all of these, but not any others (such as infinity f x = f (f (f (f (f ...))))), was quite a challenge. You might try to classify these function on the property that they are an iteration function, but any sensible method of doing that ends up including infinity. I began thinking that IΞ was not strong enough, and looking for ways to enrich it by adding more axioms.

Fortunately, no new axioms are necessary! The encoding is obvious in retrospect. What is the first thing a mathematician thinks when you talk about the naturals: induction! Let’s define a natural as any object which you can do induction (with 0 and S) over.

To make this readable, we need to introduce a few more symbols:

    f . g = \x. f (g x)   -- composition
A → B = \f. ΞA(B . f)   -- the type of functions from A to B
True = ΞHH    -- the true proposition
U = K True    -- the type of all objects (mnemonic: universe)
L = U → H  -- the type of predicates/types


Now, for the naturals:

    Nat x = ΞL(\p. p 0  ⇒  ΞU(\n. p n ⇒ p (S n))  ⇒  p x)


Reading this in English: “x is a Natural if, for all predicates p, if p 0, and p n implies p (S n), then p x”. In other words, x is a natural if you can do induction up to it.

Exercise: prove |- Nat 0 and |- (Nat → Nat) S. Note that this can be done independent of our definitions of 0 and S.

### More Inductive Constructions

Using this scheme, we can construct all sorts of things. For example, equality:

    Eq x y = ΞL(\p. p x ⇒ p y)


Lists:

    nil = \n. \c. n
cons x xs = \n. \c. c x xs
List a x = ΞL(\p. p nil  ⇒  ΞU(\ys. p ys ⇒ Ξa(\y. p (cons y ys)))  ⇒  p x)


There is a classic paradox involving the inductive type Set = Set → H, which is definable using this scheme:

    Set x = ΞL(\p. Ξ(p → H)p ⇒ p x)


However, as I tried to prove the inconsistency, I was blocked by the H rules. This gives me hope.

### Coinductive Constructions

It is also possible to construct coinductive types. Here are the “conaturals”, the naturals with infinity. We can’t use the constructor definition anymore; coinductive types are all about projections. So the conaturals have only one projection, onto a disjunction. So to eliminate a conatural n, you pass it what to do if n = 0, and what to do with n’ if n = S n’. For example, to check if a conatural is zero, you can use n True (K False).

   CoNat x = ΞL(\p. Ξp(\y. y True p) ⇒ p x)


In English: x is a conatural if for all predicates p, if p y implies y True p, then p x. y True p can be interpreted as y = 0 or y = S n and p n.

Isn’t IΞ cool? Simple, obvious, but powerful enough to model a wide class of data (and codata). The thing I like best about it is that it is untyped at its core. Functions are just pure lambda calculus functions that we are “later” proving properties about. Type erasure comes for free (well, sortof: encoding dependent types into this system will end up passing types as parameters at runtime, even though they are never used).

### Future Concerns

Bunder proved his version of IΞ consistent and complete. But the rules I gave, and that I have been using, are not his, and in fact are more powerful than his. This makes it possible that my rules are inconsistent. My system can prove LL, while his cannot. This worries me, because LL in the typical interpretation means “Type : Type”, which gives rise to the Burali-Forti paradox. However, even though I was able to encode the infrastructure for Russell’s paradox, I was unable to complete the proof because of the H rules. Maybe the same thing will happen?

I’m spending a bit of time trying to understand the Burali-Forti paradox that gives rise to the inconsistency in Girard’s System U, so that I can try to carry it out in IΞ. If IΞ does turn out to be inconsistent, I am not worried. A few universe restrictions here and there (i.e. leave U unspecified, rather than defining as K T) should do the trick, at the expense of some convenience and beauty.

Also, since I intend to compile Haskell to IΞ, I need to talk about partiality somehow (eg. prove that if a function terminates, it has this type). This has been giving me trouble, but in a good way. I found that I don’t really understand how bottoms behave, and how to talk about them without implying their semantics. I’m confident there is a way, though, maybe after adding a few axioms as a compromise. But I need to think hard and understand bottoms better, so I know what ought to be true, before I try to prove them or add them as axioms.

# Might as well face it, I’m addicted to logic

I’m trying really hard not to become a logician. Like my obsession with FRP, it would be very interesting and educational. But my FRP fancy came from a desire to make games more easily, and I have since lost interest in that endeavor, studying FRP for its own sake. Now I am trying to change the world with Dana, and getting caught up in the beauty and unity of different logical systems.

This happened when trying to choose a core calculus for Dana. I am now furiously interested in Martin Bunder’s work on combinatory logic (btw, if anybody has a copy of his PhD thesis, “a one axiom set theory based on combinatory logic”, please let me know). System IΞ — or rather, systems nearby it — strike me as amazingly beautiful. It is based on an untyped lambda calculus, in which you can prove things about untyped functions (which is a way of endowing them with types). For example, to say that f has type A → B, you say:

Ξ A (B ∘ f)

In English: for all x in A, f x is in B.

However, the core logic really isn’t that important; I’ve only been focusing on it because it’s interesting. In fact, a cool thing about Dana is that there is very little dependency between its parts. But I would really like to start making something rather than researching. How come math is so fascinating?

Anyway, I am not sure that IΞ is strong enough. Assuming a “big enough” universe, I’ve been able to construct an equality predicate (the construction is essentially “the smallest reflexive relation”). But I have had little success in constructing any inductive types, such as the naturals. That’s why I want to read Bunder’s thesis — to get ideas.

Not just system IΞ, but logic in general, is fascinating me. Large cardinals in set theory, universe levels in CIC, and related “stratification” ideas abound and unify to create some intuitive notion of truth. In some sense, truth is the strongest consistent such unverse — however there is provably no such thing. In what system should we then work? Is it essential that picking a system of axioms in which to work will always be a part of mathematics? How do you consolidate results which assume different axioms?

That is actually my current goal for Dana’s core. I think the core calculus will be very weak, and you add axioms as you need more (in line with a quote from Dr. Scott himself: “If you want more you have to assume more”). Such axioms will have the same pattern as e.g. the IO monad in Haskell; your assumptions bubble their way to the top. However, it’s a much richer system than “IO or not IO”; you know exactly what you are assuming to run any piece of code. If there is a top level “user OS”, its assumptions will be vast (or maybe there’s some clever way to incrementally add them?).

Anyway, if the itch to make something irritates me so, I can assume I have a strong core logic — whatever it is — and start building things at a higher level. It’s emotionally difficult for me to do so, because I like to feel like I am on a strong foundation (isn’t that the whole point of Dana, after all?).

# System IG Semantics

I don’t have much time, so here’s just a quick jot. I don’t think System IG can prove this rather obvious proposition:

    |- GL(\a. Ga(Ka)) (\a x. x)


Or translated to more conventional notation: (\a x. x) : (a : Type) -> (a -> a), the type of the polymorphic identity function. In order to prove this, we need the axiom LL (conventionally, Type : Type). But that axiom is inconsistent. So I’ve tweaked the rules a bit to allow the above to be proven, without allowing LL. It’s possible that these tweaks are still consistent; they don’t admit Girard’s paradox, at least.

The rules I’m modifying can be found on page 8 of the above paper, in the IG box. My tweak is, changing rule Gi from:

    ?, Xx |- Yx(Zx)  ;  ? |- LX  ;  x not in FV(?,X,Y,Z)   =>   ? |- GXYZ


To:

    ?, Xx |- Yx(Zx)  ;  ? |- L(GXY)  ; x not in FV(?,X,Y,Z)  =>  ? |- GXYZ


I’m saying that you need to prove that GXY (conventionally, (x:X) -> Y x) is a type before you can say something has that type. Without more modifications, this is equivalent to the original. However, now I add two new rules:

    ? |- L(GL(KL))
? |- GL(KL)p  =>  ? |- L(GLp)


These say: (Type -> Type) : Type, and if p : Type -> Type then ((x:Type) -> p x) : Type. The latter one is the one I’ve heard is still consistent, but you can’t really do anything with it without the former. So, it needs further investigation.

Trying to prove the consistency of this new system, I need a semantics for terms. If I can model these semantics in CIC (the calculus used by Coq), then if CIC is consistent, so is this system. My idea for semantics is as follows:

Bare lambda terms don’t really have semantics; they are modeled by their syntax. But typing proofs have the semantics of whatever they are proving as well-typed. If I use [-] as the meaning function, then the semantics of [Lx] is a type (because it is a proof that x is a type). Formally:

    [Lx] : Type
[Gxyz] : (a:[Lx]) -> [Gx(KL)y] a


Simplified, the latter means that the semantics of a proof that x has type y -> z is a function from y to z. (Rather, it’s the dependent version of that). These semantics motivate the axioms I am adding.

Okay, that’s all the time I have!

# Certificate Design Pattern

When working the latest incarnation of my System IG compiler, I used a thingy which I now realize ought to be characterized as a design pattern. It substantially changed the way I was thinking about the code, which is what makes it interesting.

Summary: separate an algorithm into certificate constructors and a search algorithm.

A large class of algorithms can be considered, in some way, as search algorithms. It is given a problem and searches for a solution to that problem. For example, typically you wouldn’t phrase the quadratic formula as a search algorithm, but it is—it’s just a very smart, fast one. It is given a,b, and c and searches for a solution to the equation ax2 + bx + c = 0.

The certificate design pattern separates the algorithm into two modules: the certificate module and the algorithm. The certificate module provides constructors for solutions to the problem. For each correct solution, it is possible to construct a certificate, and it is impossible to construct a certificate for an incorrect solution. The certificate module for the quadratic formula algorithm might look like this:

module Certificate (Certificate, certify, solution) where

data Certificate = Certificate Double Double Double Double

certify :: Double -> Double -> Double -> Double -> Maybe Certificate
certify a b c x | a*x^2 + b*x + c == 0 = Just (Certificate a b c x)
| otherwise            = Nothing

solution :: Certificate -> (Double,Double,Double,Double)
solution (Certificate a b c x) = (a,b,c,x)


There is only one way to construct a Certificate, and that is to pass it a solution to the quadratic equation. If it is not actually a solution, a certificate cannot be constructed for it. This module is very easy to verify. The algorithm module is obvious:

module Algorithm (solve) where
import Certificate
import Data.Maybe (fromJust)

solve :: Double -> Double -> Double -> Certificate
solve a b c = fromJust $certify a b c ((-b + sqrt (b^2 - 4*a*c)) / (2*a))  Here, we use the quadratic formula and construct a certificate of its correctness. If we made a typo in the formula, then certify would return Nothing and we would get an error when we fromJust it (an error is justified in this case, rather than an exception, because we made a mistake when programming — it’s like an assert). The client to the algorithm gets a certificate back from solve, and can extract its solution. All the information needed to verify that the certificate is a correct certificate for the given problem should be provided. For example, if Certificate had only contained x instead of a,b,c,x, then we could have implemented solve like: solve a b c = certify 0 0 0 0  Because that is a valid solution, but we have not solved the problem. The client needs to be able to inspect that a,b,c match the input values. Maximally untrusting client code might look like this: unsafeSolve a b c = let (a',b',c',x) = solution (solve a b c) in assert (a == a' && b == b' && c == c') x where assert True x = x assert False _ = error "Assertion failed"  Here we can give any function whatsoever for solve, and we will never report an incorrect answer (replacing the incorrectness with a runtime error). This is certainly overkill for this example, but in the System IG compiler it makes a lot of sense. I have a small set of rules which form well-typed programs, and have put in much effort to prove this set of rules is consistent and complete. But I want to experiment with different interfaces, different inference algorithms, different optimizations, etc. So my Certificate implements combinators for each of the rules in my system, and all the different algorithms plug into that set of rules. So whenever I write a typechecker algorithm, if it finds a solution, the solution is correct by construction. This gives me a lot of freedom to play with different techniques. Verification rules can be more involved than this single function that constructs a certificate. In the System IG compiler, there are 12 construction rules, most of them taking other certificates as arguments (which would make them certificate “combinators”). I’ll show an example of more complex certificate constructors later. What is interesting about this pattern, aside from the added correctness and verification guarantees, is that is changed the way I thought while I was implementing the algorithm. Instead of being master of the computer, and telling it what to do, it was more like a puzzle I had to solve. In some ways it was harder, but I attribute that to redistributing the workload; it’s harder because I am forced to write code that is correct from the get-go, instead of accidentally introducing bugs and thinking I’m done. The other interesting mental change was that it often guided my solution. I would look at the certificate I’m trying to create, and see which constructors could create it. This gave me an idea of the information I was after. This information is the information necessary to convince the client that my solution is correct; I cannot proceed without it. Theoretically, the algorithm part could be completely generic. It might just do a generic search algorithm like Dijkstra. If it finds a certificate, then it has solved your problem correctly. Solutions for free! (But this will not be practical in most cases — it might not yield a correct algorithm by other criteria, such as “always halts”). Here’s an example of a more complex certificate. The domain is SK combinator calculus, and a Conversion is a certificate that holds two terms. If a Conversion can be constructed, then the two terms are convertible. module Conversion ( Term(..), Conversion , convId, convCompose, convFlip , convS, convK, convApp) where infixl 9 :* data Term = S | K | Term :* Term deriving (Eq) data Conversion = Term :<-> Term convTerms (a :<-> b) = (a,b) convId t = t :<-> t convCompose (a :<-> b) (b' :<-> c) | b == b' = Just$ a :<-> c
| otherwise = Nothing

convFlip (a :<-> b) = b :<-> a

convS (S :* x :* y :* z) = Just $(S :* x :* y :* z) :<-> (x :* z :* (y :* z)) convS _ = Nothing convK (K :* x :* y) = Just$ (K :* x :* y)  :<->  x
convK _ = Nothing

convApp (a :<-> b) (c :<-> d) = (a :* c) :<->  (b :* d)



The export list is key. If we had exported the (:<->) constructor, then it would be possible to create invalid conversions. The correctness of a certificate module is all about what it doesn’t export.

I’m wondering what the best way to present this as an object-oriented pattern is, so I can insert it into popular CS folklore (assuming it’s not already there ;-).