Perl 6 Rules: Elementary Compositionality and More Contrived Vocabulary

Larry Wall made a change to Synopsis 5, section “Longest-token matching” splitting the alternation operator | in two: | and ||. The idea is that || works just like the old one: try to match the left any way you can, and if that fails, then try the right. This makes sense to me, and the double-or name makes sense too (that’s how it works in perl code). On the other hand, the single | operator has some wildly strange semantics:

So for regex matching purposes we define we define token patterns as those patterns containing no whitespace that can be matched without side-effects or self-reference.

To that end, every regex in Perl 6 is required to be able to distinguish its “pure” patterns from its actions, and return its list of initial token patterns (transitively including the token patterns of any subrule called by the “pure” part of that regex, but not including any subrule more than once, since that would involve self reference). A logical alternation using | then takes two or more of these lists and dispatches to the alternative that matches the longest token prefix. This may or may not be the alternative that comes first lexically.

He’s going for two things here: raw speed and dwimming on the longest-token rule. I think he hit the first one pretty much smack on, and missed the second one, shooting himself in the ankle instead.

This longest token rule violates something which I just made up now called elementary compositionality. Even though it is nothing more than a figment of my mind at the moment, I believe it to be very important. It guarantees that you won’t get into trouble when you try to refactor; and let’s face it, refactoring is what modern software engineering is all about.

Let me try to define in very precise terms what elementary compositionality is. First, let’s define the language of a rule to be the set of all strings that it matches exactly. That is, $a ε Lang(rule) iff $a ~~ /^<rule>$/. Then, to borrow a term from model theory, two rules are elementarily equivalent if they have exactly the same language. Finally, elementary compositionality is a property of a matching strategy where: if a, a', b, and c are rules, where b is elementarily equivalent to c, and a' is obtained by replacing every occurence of b with c in a, then a is elementarily equivalent to a'.

Stated less formally: if you replace every call to a rule with a call to an elementarily equivalent rule, nothing changes.

Now for the pragmatics. Perl 6 rules are very rich, and we can’t expect elementary compositionality everywhere. For example, the <commit> directive violates it, and it can be violated if you rename your captures (another thing that bugs me: there is no encapsulation in rules, but that is a discussion for another day). But those are predictable somewhat. I mean, you added <commit> precisely to change large-scale matching semantics: that’s what it’s for. Also, people are accessing your $<reslut>, so if you call it your $<mother> instead, people aren’t going to be able to access it. The renaming thing is also fairly refactor-friendly, since you can rename and rebind captures as you wish. However, what is unpredictable (and also extremely hard to remedy) is the following: say you have a rule which matches any IPv4 address:

    token byte { <[0-1]> <digit>**{2} | 2 <[0-4]> <digit> | 25<[0-5]> }
    token ip { <byte> \. <byte> \. <byte> \. <byte> }

You realize what an ugly and unclear way to write byte that is, so you rewrite it:

    token byte { (<digit>**{3}) { fail unless $1 < 256 } }

Ahh, much nicer. Except one problem. The rule:

    rule literal { <float> | <ip> }

has just changed semantics! On the input 123.456.789.012, literal used to match the entire string as an IP, but now it matches 123.456 only as a float (then probably fails due to ratcheting when the grammar doesn’t accept .789 as the next thing).

Now, unfortunately, your little refactor has just disabled an optimization. The early, ugly byte was expressed as a regular expression, so it could be combined with ip the other common prefixes of literal (and sibling rules) into a DFA. The new, pretty byte cannot be compiled that way. I think this is one of the reasons Larry went with the “automatically detect tokens” approach. But I don’t think this is sufficient reason to violate elementary compositionality everywhere (to drive home the point: run this example in reverse, and in trying to optimize your grammar, you have also changed semantics).

The solution I have in mind at the moment, and I admit it is not totally optimal, is to have the user specify what he thinks the tokens are, so that the compiler can really match the longest token, not the longest “token-looking-ish thing”. The reason that is not optimal is because sometimes it is nice to inline syntax into rules without having to go through the token abstraction, and we’re not really sure which ones of those to call tokens. Maybe we could make up a simple heuristic, such as all constant strings are considered tokens (at the rule level, not the transitive closure level), which isn’t too different from the first version of the | / || separation. But the point is that I should be able to take control from Perl’s dwimmery, say “no, you’re wrong”, and ensure that I can refactor without fear from then on.

About these ads

3 thoughts on “Perl 6 Rules: Elementary Compositionality and More Contrived Vocabulary

  1. To make it a bit easier to write composable tokens, assertions are now considered invisible to longest-token processing, so if you write your assertion above as <?{ $0 < 256 }> it should still compose properly. Bare closures are still “showstoppers” though.

  2. On the input 123.456.789.012, literal used to match the entire string as an IP

    I don’t see why. Wouldn’t the early, ugly byte also disallow 456?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s