On the By functions

Here’s a quick little note that has been bugging me for a while, and nobody wants to talk about it right now on IRC.

I think the By functions:

sortBy :: (a -> a -> Ordering) -> [a] -> [a]
maximumBy :: (a -> a -> Ordering) -> [a] -> a
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
nubBy :: (a -> a -> Bool) -> [a] -> [a]

Etc. should be replaced by On functions:

sortOn :: (Ord b) => (a -> b) -> [a] -> [a]
maximumOn :: (Ord b) => (a -> b) -> [a] -> a
groupOn :: (Eq b) => (a -> b) -> [a] -> [[a]]
nubOn :: (Eq b) => (a -> b) -> [a] -> [a]

My argument is: the functions provided to sortBy etc. have some preconditions. sortBy is not well-defined (or oughtn’t be) for functions which are not linear ordering functions; nubBy is shouldn’t be well-defined (to the dismay of some 31337s) for functions which do not encode an equivalence relation. But the folklore is that functions are typically “as total as possible”, so if it wants a function of some type, all I have to do is conjure a function of that type and my output will be something reasonable in terms of that function.

On the other hand, the folklore of typeclasses is that they typically come with laws. You need to prove — or at least think you know how to prove — some laws when you make a type an instance of a typeclass. The On functions use this obligation to encode their precondition. They are easier to use in a verified setting, too; there are a bunch of standard instances of Eq and Ord for which the laws are known to hold; map your data on to that in any way you like and the precondition is guaranteed.

Thoughts?

About these ads

21 thoughts on “On the By functions

  1. It’s easy to define sortOn in terms of sortBy, etc, so I prefer sortBy. I do take your point about preconditions, but it is actually quite useful sometimes to abuse the fact that the preconditions don’t need to be met. I suppose the proper way to deal with this is to define what the weaker preconditions are. Failing this though, I’d rather have the option to join in the abuse rather than be forced to define my own sort.

  2. I don’t know about the social dynamics of your suggestion, but this does seem a lot like the ‘cmp’ vs ‘key’ parameters to Python’s sorted() built-in. I think that I find ‘key’ more useful than ‘cmp’, but I use them both.

  3. I think the main argument for the By functions is that the function arguments can be built at runtime, while the class-based On approach is more static.

    Imagine you’ve got a GUI application that allows the user to sort a table by several columns in a particular order, with both ascending and descending options on each column. With sortBy, that’s just (comparing name `mappend` flip (comparing age) `mappend` …).

    The code required to achieve this with sortOn is quite involved.

  4. A nice trick I’ve read somewhere on a haskell blog (I’ve forgotten where exactly, unfortunately):

    on :: (a -> a-> b) -> (a -> c) -> (c -> c -> b)
    on f g a b = f (g a) (g b)
    

    This leads to the almost-as-nice

    sortOn f = sortBy (compare `on` f)
    
  5. Well, the obvious limitation of one Ord instance per type b springs to mind. I would hate to have to define a new type and Ord instance for each different sort order needed. FWIW, I think having xxxOn functions *in addition to* the already existing xxxBy functions would still be useful for the simple reason that it’s probably the common case.

  6. Reply to Spencer:

    > Imagine you’ve got a GUI application that allows the user to
    > sort a table by several columns in a particular order, with both
    > ascending and descending options on each column. With sortBy,
    > that’s just (comparing name `mappend` flip (comparing age)
    > `mappend` …).

    Easy with sortOn as well:

    sortOn (name &&& (Reverse . age) &&& …)

    which uses

    newtype Reverse a = Reverse a
    instance Ord a => Ord (Reverse x) where
    compare (Reverse a) (Reverse b) = compare b a

    and the existing Ord instance for pairs.

  7. One thing to keep in mind is that ‘nubBy’ is a pretty funny name for a function. ‘nubOn’ is not nearly as funny.

  8. This does sound like the cmp vs key debate in Python. Using key (or sortOn) is most straightforwardly implemented with the decorate-sort-undecorate scheme, which avoids some calls to “by”‘s comparison function but allocates a bunch of extra storage. It’s not so clear in Python how to do a complicated comparison with “key” other than by defining a new class and allocating a lot of instances and taking method-call overhead. E.g., say you are sorting trees, and the comparison function involves first comparing the root nodes, then recursively descending both trees if the roots are equal. I guess in Haskell you can define a new type with an Ord instance like Bardur says, which is a little bit messy but doesn’t take as big a performance hit as the Python approach.

    Overall I agree with Bardur that it would be best to support both interfaces.

  9. I think that while the -On functions are something which should be included in the library (probably along with something that optimises for that case by avoiding recomputation of the things being compared), the -By functions have more general uses.

    I advocate removal of the “preconditions” for groupBy and nubBy. They have specifiable and natural behaviour for non-equivalence-relations, and that behaviour should just be specified.

    In particular, nubBy f xs constructs the list of elements of xs such that if x and y are in the list with x occurring before y, then f x y is False, and such that the sequence of indices of selected elements is lexicographically minimal.

    groupBy f xs constructs a list of groups of elements of xs such that concat (groupBy f xs) == xs, and if x is the head of some group, and y any other element of that group, then f x y is True, and such that the sequence of lengths of the groups is lexicographically maximal.

    I often have found it convenient to rely on this behaviour. The behaviour of nubBy in particular is a natural sort of general sieving algorithm. I think it’s extremely annoying that nubBy got broken in recent versions of GHC (hopefully this will get fixed in the next release). The parameters to the function got swapped so that they are the reverse of the order that the values occur in the list.

    The behaviour of groupBy is ideal for example in constructing a tree from a sort of preorder traversal list. Given a function which determines the desired depths of the elements and compares those, groupBy will construct exactly the groups of elements which should be converted to subtrees.

    The sortBy and maximumBy functions have the nice property that functions of type a -> a -> Ordering are an instance of Monoid, and so comparison functions can be composed nicely with mappend, where the second comparator is used to refine the comparison in cases where the first gives EQ. Mind you, a similar effect can often be achieved using tuples with the sortOn functions, supposing that appropriate types to project to for comparison can be found in the first place.

  10. For what it’s worth I agree with Luke. If a function of sortBy’s signature is really desired, then it ought in principle to be well-defined for more than just total orders. In which case it should be called something that its properly expresses its behavior in all cases, which I don’t think sortBy does. Readable code, IMO, pretty much requires that the purpose to which each function is put be a special case of the purpose expressed by its name.

  11. That’s interesting: all these arguments are fine, but seem to be using Reverse on their cost function wrt. mine.

    Most of you guys are saying that sortBy etc. are more general than sortOn etc. This is a fact, and it is precisely why I’m advocating sortOn. sortBy is too general — it allows us, as Astrolabe says, to abuse the preconditions. Haskell’s isn’t a culture, as far as I observed, to value such things.

    Antranig Basman told me a story about a C++ project he was working on, in which he passed a almost-but-not-quite ordering relation to a standard library function. The program would run flawlessly for hours, and then suddenly crash and overwrite the stack. The bug took a full week’s work to track down. It turns out that the algorithm was relying on that ordering relation in a very clever and essential way, and when that precondition was violated it caused the algorithm to enter an “impossible condition” causing the crash.

    It has happened in the past that profiling reveals a better implementation of sortBy than was previously used. Relying on the “degenerate behavior” will either (1) break your code when such a change happens, or (2) prevent such a change for fear of breaking people’s code (like yours). I would rather have the entire community benefit from the speed increase…

    Cale, you have a good point, though I would be more comfortable if the generalized versions were separated into distinct concepts. Eg. rename nubBy to sieve, and formalize nubOn merely and precisely as a way to remove duplicate elements.

  12. I like the fact that the -On versions of the functions can usually be implemented with a
    “Schwartzian transform” to improve their performance for expensive transformations. So for that I whole heartedly support the addition of the -On functions to our standard repetoire.

    As an interesting aside, you can still derive the existing By behavior from the On behavior by defining instances of Eq and Ord that violate the implied laws for those type classes. Two approaches come to mind. One approach without overhead uses reflection to make sure that you are consistently partially applying the same function:

    module By where

    import Data.List (sort, sortBy)
    import Data.Reflection

    – feel free to replace with the Schwartzian transform
    sortOn :: Ord b ⇒ (a → b) → [a] → [a]
    sortOn f = sortBy (compare `on` f) where on f g a b = g a `f` g b

    newtype EqBy s a = EqBy a
    instance (s `Reflects` (a → a → Bool)) ⇒ Eq (EqBy s a) where EqBy a ≡ EqBy b = reflect (undefined :: s) a b

    newtype OrdBy s a = OrdBy a
    instance (s `Reflects` (a → a → Ordering)) ⇒ Eq (OrdBy s a) where a ≡ b = compare a b ≡ EQ
    instance (s `Reflects` (a → a → Ordering)) ⇒ Ord (OrdBy s a) where OrdBy a `compare` OrdBy b = reflect (undefined :: s) a b

    sortBy1 :: (a → a → Ordering) → [a] → [a]
    sortBy1 f as = reify f (λs → sortOn (OrdBy `withOrd` s) as) where
    withOrd :: (a → OrdBy s a) → s → a → OrdBy s a
    withOrd = const

    – And then there is the more portable, permissive, and readily readable (but more expensive) approach:

    data EqBy2 a = EqBy2 (a → Bool) a
    instance Eq (EqBy2 a) where EqBy2 f _ ≡ EqBy2 _ x = f x

    data OrdBy2 a = OrdBy2 (a → Ordering) a
    instance Eq (OrdBy2 a) where a ≡ b = compare a b ≡ EQ
    instance Ord (OrdBy2 a) where OrdBy2 f _ `compare` OrdBy2 _ x = f x

    sortBy2 :: (a → a → Ordering) → [a] → [a]
    sortBy2 f = sortOn (λx → OrdBy2 (f x) x)

    I find it hard to advocate those approaches, however.

    In the end I think adding the ‘On’ methods is a good idea, if only because they provide concrete performance benefits and I have at least reinvented a Schwartzian sortOn on a couple of occasions, but I can’t see removing the By functions. They are entrenched in untold quantities of user code scattered throughout hackage.

  13. The only thing I would be concerned about is whether it’s still possible to express the following extremely useful idiom:

    sortBy (comparing f `mappend` comparing g)

    but it seems an equivalent version in terms of sortOn is easily obtained:

    sortOn (f &&& g)

    and… dare I say it… it’s prettier than the first option!

  14. Glad to see someone thinking the same. For the sort of things I do (program derivation), I did find the *On variants more natural to me than the *By counterparts. For instance, some typical examples of program derivation I show to people involve returning lists of maximum length. In such situation it is handy to just say “maximumOn length” rather than having to define a comparison function.

    Further, the *On family have some nice properties I use often, e.g.

    maximumOn f . map g = g . maximumOn (f . g)

    for total functions.

    Of course one can define *On in terms of *By. But the point is that I use the former much more than the latter that I’d like to have them in some library.

    However, I don’t do much “practical” programming, so I am not sure that others would share my experience. :)

  15. Note, the function
    sortOn :: (Ord b) => (a -> b) -> [a] -> [a]
    is already available and is called GHC.Exts.sortWith.

  16. What I’d quite like would be an “equating” function which does the same thing as “comparing” (which is (compare `on`)). I’ll grant that ((==) `on`) has roughly the same keystroke count, but “equating” is certainly more expressive and less noisy.

  17. Note that there is a second argument for sortOn to replace sortBy, namely that the majority of uses of the latter are actually of the form

    sortBy (comparing f)

    for some f.

  18. @Luke

    On cultural debates: there’s semantic shear going on here, so y’all’re arguing at cross purposes. You view the “prerequisites” as the primary definition of these functions, and want the implementation and type to enforce them; this is the Haskell way. Cale views the implementation for what it is, taking the “prerequisites” as an epiphenomenon of constraining the more general function, and wants to make use of the generalization; this too is the Haskell way. In terms of culture, neither invariant-based nor generalization-based programming is “more right”, or “more Haskell”. Without invariants, equational reasoning goes out the window; without generalization the worlds of monads and MPTCs wouldn’t have been found. Both of these traits are essential to the Haskell community.

    As for the current debate, I agree with most of the arguments on both sides. The restricted versions make equational reasoning and optimization easier, but they run headlong into the one-typeclass-per-type problem, etc. The generalized versions are quite interesting in their own right, but these interpretations aren’t documented in the modules and so they must be re-discovered by people like Cale and broken by minor GHC patches. So long as the documentation is updated, I see no reason not to offer both families.

    Perhaps sortBy is a bit strange, but sortOn == sortBy . comparing == sortBy . on compare; and all the other pieces there do make quite a lot of sense for generalizing/optimizing sorting. Certainly nubBy and groupBy are quite sane and should be kept, whatever they’re named. While we’re changing things, I’d like to see a general implementation of decorate-sort-undecorate/Schwartzian-transform provided which has the same type and semantics as the normal generic sorting algorithm. DSU isn’t always an optimization because of the bookkeeping, but people shouldn’t have to reinvent the idea over and again, like they do.

  19. “there are a bunch of standard instances of Eq and Ord for which the laws are known to hold”…

    unfortunately there are also standard instances for which the laws are known NOT to hold, like Float and Double!

  20. I think the proposal is a clever one. Most of the arguments that are not in favor seem to miss the implicit spirit of your motivation.

    sortOn will reject more erroneous programs much the same way as sortBy rejects more incorrect programs than an imperative sortIO:

    sortIO :: (a -> a -> IO Ordering) -> [a] -> IO [a]

    I think it is part of the job of a library designer to discover a nice balance between rejecting erroneous programs while making it simple to write useful ones.

    Delegating the proof to an obligation of a pre-existing type class is an interesting idea.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s