I have recently become fascinated with the concept of maximum entropy distributions, so I went back and read Dan Piponi’s post on negative probabilities and went link surfing from there. Something sparked and I wondered what kind of connection there is between the two. A little experimenting in Mathematica later and I’m on to something curious.

First, a little background. E.T. Jaynes argues (so I have heard, I have not read the original) that if you have a set of constraints on a set of random variables and you would like a probability distribution over those variables, you should choose the distribution that has the most information entropy, as this is the “least biased” distribution.

The entropy of a distribution $p$ is defined as: $H(p) = -\sum_i p_i \log p_i$.

I am using Dan’s example, and I will quickly recapitulate the situation. You have a machine that produces boxes of ordered pairs of bits. It is possible to look at only one bit of the pair at a time, say each bit is in its own little box. You do an experiment where you look at all the first bits of the boxes, and it always comes out 1. You do a second experiment where you look at the second bit of the boxes, and it, too always comes out 1.

Now, most reasonable people would draw the conclusion that the machine only produces boxes containing “1,1”. However, if we wholeheartedly believe in Jaynes’s principle, we have to look deeper before drawing a conclusion like that.

The 4 probabilities we are interested in correspond to “0,0”, “0,1”, “1,0”, “1,1”. I will write them as 4-vectors in that order. So an equal chance of getting any combination is written as 1/4 <1,1,1,1>.

For the distribution <a,b,c,d>, our constraints are: a+b+c+d = 1 (claiming our basis is complete), c+d = 1 (the first bit is always 1), b+d = 1 (the second bit is always 1).

The “reasonable” distribution is <0,0,0,1>, which indeed satisfies these constraints. The entropy of this distribution is 0 (taking x log x = 0 when x = 0); of course, there is no uncertainty here. But are there more distributions which satisfy the constraints?
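As a quick check of that claim, here is a minimal Python sketch of the entropy formula using the 0 log 0 = 0 convention. The function name `entropy` and the choice of natural logarithms are my own assumptions for illustration; the post does not fix a base.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero entries (0 * log 0 = 0).

    Only meant for nonnegative entries: math.log raises ValueError
    on a negative argument.
    """
    return sum(-x * math.log(x) for x in p if x != 0)

certain = [0, 0, 0, 1]                 # the "reasonable" distribution
uniform = [0.25, 0.25, 0.25, 0.25]     # 1/4 <1,1,1,1>

print(entropy(certain))   # no uncertainty at all
print(entropy(uniform))   # equals log 4, the largest value for 4 outcomes
```

The uniform case is included only for comparison; it does not satisfy the constraints above.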

Well, if you require all the probabilities to be nonnegative, then no, that is the maximal entropy one, because it is the only one that satisfies the constraints. But let’s be open-minded and lift that requirement.

We have to talk about what the entropy of a negative probability is, because log isn’t defined there. The real part is perfectly well defined, and the imaginary part is multi-valued with period 2π. I’m not experienced enough with this stuff to make the right decision, so I’m blindly taking the real part for now and pretending the imaginary part is 0, since there’s really no reasonable “magnitude” it could be.

Whew, okay, almost to the fun stuff. We have four variables and three constraints, so we have only 1 degree of freedom, which is a lot easier to analyze than 4. We can express the distribution with only that one degree *d* as:

<d-1, 1-d, 1-d, d>
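To see where this comes from: subtracting c+d = 1 from a+b+c+d = 1 gives a+b = 0, and subtracting b+d = 1 gives a+c = 0, so b = c = -a and a = d-1. The following Python sketch spot-checks the family against the three constraints using exact rational arithmetic. The helper names `family` and `satisfies_constraints` and the sample grid are illustrative choices, not something from the original experiment.

```python
from fractions import Fraction

def family(d):
    """The one-parameter family <d-1, 1-d, 1-d, d>, ordered 00, 01, 10, 11."""
    return (d - 1, 1 - d, 1 - d, d)

def satisfies_constraints(p):
    a, b, c, d = p
    # total probability, first bit always 1, second bit always 1
    return a + b + c + d == 1 and c + d == 1 and b + d == 1

# Sample d from -1 to 2 in steps of 1/4 (an arbitrary grid).
samples = [Fraction(n, 4) for n in range(-4, 9)]

print(all(satisfies_constraints(family(d)) for d in samples))

# Only d = 1 keeps every entry nonnegative, recovering <0,0,0,1>.
nonnegative = [d for d in samples if min(family(d)) >= 0]
print(nonnegative)
```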

And here is a plot of the real part of the entropy as a function of *d*:

It achieves a maximum at d = 1/2, the distribution <-1/2, 1/2, 1/2, 1/2>, the same one Dan gave. In some sense, after observing that the first box is always 1 and, separately, that the second box is always 1, it is *too biased* to conclude that the output is always “1,1”.
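The maximum can be reproduced without Mathematica. Since the real part of log x for negative x is log |x|, the real part of the entropy along the family simplifies to -(1-d) log|1-d| - d log|d|, which on [0, 1] is just the entropy of a biased coin. Below is a rough Python sketch that scans a grid of d values; the function name `real_entropy`, the grid range, and the use of natural logs are assumptions made for illustration.

```python
import math

def real_entropy(d):
    """Real part of -sum(x log x) over <d-1, 1-d, 1-d, d>.

    Uses Re(log x) = log|x| for negative x and 0 * log 0 = 0.
    """
    p = (d - 1, 1 - d, 1 - d, d)
    return -sum(x * math.log(abs(x)) for x in p if x != 0)

# Scan d from -2 to 3 in steps of 0.001 (an arbitrary window).
grid = [i / 1000 for i in range(-2000, 3001)]
best = max(grid, key=real_entropy)

print(best, real_entropy(best))   # maximum at d = 0.5 with value log 2
```

Outside [0, 1] the real part falls off on both sides, so widening the window should not move the maximum.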

I would like to patch up the “real part” hack in this argument. But more so, these exotic probability theories aren’t really doing it for me. I would like to understand what kinds of systems give rise to them (and how that means you must interpret probability). My current line of questioning: is the assumption that probabilities are always greater than 0 connected to the assumption that objects have an intensional identity?

I would love to hear comments about this!

E.T. Jaynes himself would probably reject negative probabilities as nonsense, for he derives the rules of probability as we know them from more basic principles. See also Laplace’s model of common sense on his (posthumous) homepage.

Also, it’s not clear what maximizing the entropy means once it takes complex values. Just taking the real part somehow doesn’t cut it, not to mention that it’s not clear whether the current formula is the right generalization to negative numbers.

Last but not least, feel free to ignore my skeptical remarks. :-)

I can’t say too much about the negative probability interpretation of quantum, but there is a different(?) thing which is called “quantum probability”…

If we start from just Bayesian probability theory, we can interpret the random variables geometrically as axes of a high-dimensional space. The universal set of elements becomes a vector space spanned by eigenvectors; events become subspaces (rather than subsets); and, a state vector assigns probability to events (rather than a function).

But one of the assumptions lurking here is the idea that all events are “compatible”, which is to say that a single feature set describes all events (or that all eigenvectors are orthogonal). This is nice and gives us what we’d expect from classical probability theory. However, in QM we have incompatible events, where different kinds of events are described by different feature sets. And this can be applied to extending the Bayesian framework as well.

One of the side effects of introducing incompatible events is that we start getting path effects. In classical Bayesian probability theory we have P(x)*P(y|x) == P(y)*P(x|y) whereas in quantum probability theory that need not be the case. There are also effects where P(x,y) can be greater than P(x) or P(y) alone, and so forth. Jerome Busemeyer has done some research indicating that QP is a good model for the kinds of “errors” people have from naive intuition about probability. In general, it seems to model interference, recency, and ordering effects quite well and makes predictions about how our intuitions work.
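As a toy illustration of such an ordering effect (my own sketch, not Busemeyer’s model), take a two-dimensional state and two events represented by projections onto lines 45 degrees apart. The probability of “x, then y” is the squared length of the state after projecting onto x and then onto y, which plays the role of P(x)*P(y|x). The vectors and the angle are arbitrary choices for the example.

```python
import math

def project(v, u):
    """Project the 2D vector v onto the line spanned by unit vector u."""
    dot = v[0] * u[0] + v[1] * u[1]
    return (dot * u[0], dot * u[1])

def norm_sq(v):
    return v[0] ** 2 + v[1] ** 2

state = (1.0, 0.0)                                        # unit state vector
event_x = (1.0, 0.0)                                      # event x: this axis
event_y = (math.cos(math.pi / 4), math.sin(math.pi / 4))  # event y: 45 degrees off

# Sequential probabilities: project in order, then take the squared length.
p_x_then_y = norm_sq(project(project(state, event_x), event_y))
p_y_then_x = norm_sq(project(project(state, event_y), event_x))

print(p_x_then_y, p_y_then_x)   # about 0.5 versus about 0.25
```

When the two lines are perpendicular or identical the projections commute and the two orders agree, which is the compatible (classical) case.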

One of the big issues here does come down to what “probability” really means. When human judgments are measured by the criterion of Bayesian probability, they have errors. But when measured by the criterion of QP, they’re correct. Since Bayesian theory is a proper subset of QP (namely when all events are compatible), this raises the question of which is more “right”. The next question, of course, is right for what purpose? As a cognitive issue, it makes sense for us to use QP because that can minimize the storage space for memorizing events (since the feature sets can be smaller, due to them only needing to cover one kind of event) and it is more suitable for building probabilistic models from insufficient data. But as a physical or metaphysical matter, I’m not sure where this takes us. I’ve only recently learned enough about QP to start thinking about it.

http://www.whiteninjacomics.com/images/comics/represented.gif <- this amused me in the context of the toy example given on aNoI :P

I’m thinking that maybe, just maybe, negative probabilities make sense in the same situations in which negative money makes sense… you can’t actually have negative of either, but in situations in which your actual quantity is a cumulative sum, it makes sense…

Why do the probabilities sum to 1, anyway? It’s because we expect 1 event to happen when we sample from the distribution, right? If we think of negative events as something un-happening (i.e., being subtracted in some way from reality) then perhaps things make sense if we expect 1 event “total” to happen… meaning several negative events and several positive events happen, but it all sums to 1.

We require that the probability of *observable* events each sum to positive numbers… does this somehow guarantee that we’ll never observe an actual negative happening? (Would we be erased if we saw such a thing?)

This guy has some interesting articles. He talks about quantum probability and how it is what you get when you use the 2-norm instead of the 1-norm.

http://www.scottaaronson.com/democritus/lec9.html

(It’s too bad he holds Penrose’s crackpot philosophy in such high regard).