Updating forecasts

For the next variation of the dice game, I want to model the ways players might rationally update their beliefs, and therefore their predictions, as the game proceeds. To discuss this, I need to expand the language a bit. Before, player A believed absolutely, with 100% confidence, that GUD-A is correct and GUD-B is incorrect. Neither player changed their mind or wavered in their confidence as the game proceeded. Now I want to imagine a situation in which player A favors GUD-A over GUD-B as the most likely theory, but not with 100% confidence. To keep things clear, from now on I will refer to GUD-\alpha and GUD-\beta rather than GUD-A and GUD-B, so that the labels for the theories are distinct from the labels for the players. Then I can say, for example, that player A believes in GUD-\alpha with 99% confidence while reserving 1% for the possibility that GUD-\beta is correct. Likewise, player B might be 90% certain that GUD-\beta is correct, leaving 10% for the probability that GUD-\alpha is correct.

To continue to keep things as simple as possible, I will return to the case where there are only two players. Also, I will assume that exactly one of the two theories (\alpha or \beta) is correct. Then,

P_j(\alpha) + P_j(\beta) = 1 \, , \qquad P_j(\bar{\beta}) = P_j(\alpha) \, , \qquad P_j(\bar{\alpha}) = P_j(\beta) \, .

The bars on top of \alpha and \beta in P_j(\bar{\alpha}) and P_j(\bar{\beta}) mean that these are the probabilities that \alpha and \beta, respectively, are not true.

The actual probability that player j assigns for a roll to come up “yes” is the probability of “yes” given that theory \alpha is true, weighted by the player’s belief in \alpha, plus the probability of “yes” given that theory \beta is true, weighted by the player’s belief in \beta:

P_{j}(y) = P_j(y|\alpha)P_{j}(\alpha) + P_j(y|\beta)P_{j}(\beta)

= P_j(y|\alpha)P_{j}(\alpha) + P_j(y|\beta) \left[ 1 - P_j(\alpha) \right] \, . \qquad \qquad    \text{(1)}

Here, y and n stand for “yes” and “no” respectively. I will continue to assume, for purposes of illustration, that GUD-\alpha gives a 1/3 probability for rolling y and GUD-\beta gives a 5/6 probability for y. Then, for both players A and B,

P_j(y|\alpha) = 1/3 \, , \qquad P_j(y|\beta) = 5/6 \, .
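
For example, with player A’s 99% confidence in \alpha from above, equation (1) gives

P_A(y) = (1/3)(0.99) + (5/6)(0.01) \approx 0.338 \, ,

only slightly above the pure GUD-\alpha value of 1/3.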

At the outset, a player has an initial subjective probability P_j(\alpha) for theory \alpha to be true in addition to the conditional probabilities P_j(y|\alpha) and P_j(y|\beta) for a yes roll. However, a reasonable player will surely update the strength of their belief in \alpha and/or \beta after observing the outcome of the first roll. Saying this with the symbols above, they will determine conditional probabilities P_j(\alpha|y) and P_j(\alpha|n) for theory \alpha in the event of a “yes” or “no” roll. One way they might do this is with Bayes’ theorem, which inverts a conditional probability like P_j(y|\alpha) (probability for a “yes” roll given the theory) to get P_j(\alpha|y) (probability for the theory given a “yes” roll). In its simplest form, Bayes’ formula is

P_j(\alpha|y) = \frac{P_j(y|\alpha)}{P_j(y)} P_j(\alpha) \, .

There are many good overviews of Bayes’ theorem, so I will not try to give one here. The Wikipedia page is a good place to start. My aim here is just to discuss how we might use it to model belief updating. After a bit of rearrangement, I can rewrite Bayes’ theorem for our situation in the form,


P_j(\alpha|y) = \frac{P_j(\alpha)}{L_j(y,\alpha) \left[ 1 - P_j(\alpha) \right] + P_j(\alpha)} \, ,\qquad \qquad  \qquad  \text{(2)}

where

L_j(y,\alpha) \equiv \frac{P_j(y|\bar{\alpha})}{P_j(y|\alpha)} = \frac{P_j(y|\beta)}{P_j(y|\alpha)} \,

is what I will call a likelihood ratio. An exactly analogous set of equations applies when y \leftrightarrow n. Since one and only one of the theories must be true in the current version of the game, I can calculate P_j(\beta|y) once I know P_j(\alpha|y) simply from

P_j(\beta|y) = 1 - P_j(\alpha|y) \, .
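
As a concrete illustration with the numbers above: a “yes” roll gives L_j(y,\alpha) = (5/6)/(1/3) = 5/2, so a player who, like player B, starts at P_j(\alpha) = 0.1 updates to

P_j(\alpha|y) = \frac{0.1}{(5/2)(0.9) + 0.1} \approx 0.043 \, ,

a shift toward \beta, as expected, since \beta makes “yes” more likely.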

Now say that the die is to be rolled a second time. We then have to ask what probability a player should now assign for the die to roll “yes.” According to equation (1), it is determined by a combination of the conditional probabilities P_j(y|\alpha) and P_j(y|\beta) and the player’s overall belief that theory \alpha or \beta is correct. But the players have just updated their beliefs based on the result of the first roll, so equation (1) should now be used with the P_j(\alpha|y) (or P_j(\alpha|n)) from equation (2) in place of P_j(\alpha), depending on whether it was “yes” or “no” that rolled up.

After the second roll of the die, the players then register the result and update their beliefs in theories \alpha and \beta once again using equation (2) in another iteration. With each roll of the die, each player gets a new P_j(y) using the P_j(\alpha|y) or P_j(\alpha|n) from the previous roll as their new prior probability.

I express this procedure in equations by adding another index i to P_j to label the roll number. Then, for all i \geq 1,

P_{i,j}(\alpha) = \begin{cases} P_{i-1,j}(\alpha|y) = \frac{P_{i-1,j}(\alpha)}{L_j(y,\alpha)\left[1 - P_{i-1,j}(\alpha) \right] + P_{i-1,j}(\alpha)} \,& \text{if} \;\; \text{roll} = y \\ P_{i-1,j}(\alpha|n) = \frac{P_{i-1,j}(\alpha)}{L_j(n,\alpha)\left[1 - P_{i-1,j}(\alpha) \right] + P_{i-1,j}(\alpha)}\, & \text{if} \;\; \text{roll} = n \, \end{cases} \, ,


P_{i,j}(\beta) = 1 - P_{i,j}(\alpha) \, .

Equation (1) for the forecast probability also needs an i index,


P_{i,j}(y) = P_j(y|\alpha)P_{i,j}(\alpha) + P_j(y|\beta)P_{i,j}(\beta)

= P_j(y|\alpha)P_{i,j}(\alpha) + P_j(y|\beta) \left[ 1 - P_{i,j}(\alpha) \right] \, .

After many rolls of the die, each player’s belief converges to near-certainty in whichever of \alpha and \beta is the correct theory.

It is easy to implement the equations above in a simple simulation. Below are plots of each player’s level of belief in theories \alpha and \beta over the course of 30 questions. Player A here starts with an 80% belief in theory \alpha while player B starts with a 1% belief in \alpha. The plots are for the case that \alpha is actually true.

After roughly 22 rolls, both players have converged on nearly 100% certain belief in theory \alpha.
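
For reference, here is a minimal sketch of such a simulation in Python. The random seed, starting beliefs, and printed output are my own illustrative choices; the update rule is just equation (2) iterated roll by roll.

```python
import random

# Conditional probabilities for a "yes" roll under each theory.
P_Y_ALPHA = 1 / 3  # P_j(y | alpha)
P_Y_BETA = 5 / 6   # P_j(y | beta)

def update(p_alpha, roll):
    """One iteration of equation (2): posterior belief in alpha after one roll."""
    if roll == "y":
        likelihood_ratio = P_Y_BETA / P_Y_ALPHA              # L(y, alpha) = 5/2
    else:
        likelihood_ratio = (1 - P_Y_BETA) / (1 - P_Y_ALPHA)  # L(n, alpha) = 1/4
    return p_alpha / (likelihood_ratio * (1 - p_alpha) + p_alpha)

random.seed(0)  # arbitrary seed, for reproducibility
beliefs = {"A": 0.80, "B": 0.01}  # initial P_j(alpha), as in the plots

for i in range(30):
    # Theory alpha is taken to be true, so "yes" rolls up with probability 1/3.
    roll = "y" if random.random() < P_Y_ALPHA else "n"
    for player in beliefs:
        beliefs[player] = update(beliefs[player], roll)
    print(i + 1, {p: round(b, 4) for p, b in beliefs.items()})
```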

Of course, this is all far too simplistic for the long run. For one, we need to combine belief updating with a tendency for players to disagree about the outcomes of rolls, and many participants need to be incorporated somehow. Also, the proxy for experimental measurements or observations needs to be something much more complex than just dice rolls with regular probabilities. Still, the above provides a reasonably simple demonstration of one way that participants might update their beliefs in response to new evidence.

A Dice Game: Part 2

The dice game I described in my previous post is helpful for illustrating concepts like surprisal and for demonstrating how the reward distribution system works. To keep things simple, for now I will assume that players A and B always agree on the outcome of each roll, so |V| is always 1.

The maximum possible \Delta s occurs when the die rolls “no,” because expert B gave only a 1/6 chance for “no.” In that case, the range of surprisals is a relatively large \Delta s \approx 0.69. With only 2 players, the maximum net total number of reward points distributed after a roll is

r_\text{total} \approx 2 |V| \Delta s^2 \leq 0.96 \, .
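
Plugging in the numbers as a check: with |V| = 1 and the maximal \Delta s \approx 0.69,

r_\text{total} \approx 2 \times 1 \times (0.69)^2 \approx 0.95 \, ,

consistent with the bound above.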

I will assume that player A has the correct GUD and that 12 dice-rolling experiments take place. The sequence of rolls might be, for example,


yes, yes, yes, yes, no, no, no, no, no, yes, yes, no


Since the players always agree on the result of each roll, |V| = 1 throughout. The value of q, the “outcome,” is always +1 or -1, depending on whether a “yes” or “no” rolls up. After the 12 rolls, player A has a mean surprisal of \langle s \rangle \approx 0.75 while player B has \langle s \rangle \approx 0.99. Player A has accumulated 5.8 reward points, while player B has accumulated only 2.5. This is easy to check with a calculator and Eqs. (3) and (11) from here.

A slightly more interesting illustration comes from increasing the number of players to 5 and the number of dice rolls to 500, which makes statistical trends clearer. Players A and B still forecast probabilities of 1/3 and 5/6, respectively, for “yes,” while players C, D, and E forecast 2/3, 1/6, and 1/2, respectively. At this scale it is best to use pseudorandom number generators and Monte Carlo methods. For the case just described, I reran the simulation 12 times and got the following for the mean surprisals of the 5 players:

Figure 1

Player A (blue) had the correct forecast and naturally got the lowest surprisals, while players B and C (yellow and green) were both predicting a lot of yeses that never rolled up, so they have the highest mean surprisals at the end. Player E forecast a probability of 1/2, so this player always gets a surprisal of \ln 2 \approx 0.69 regardless of whether the die rolls “yes” or “no.” This shows up as a horizontal purple line at 0.69 in the plot. The red player, player D, got surprisals comparable to player E’s but with much more statistical variation.
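
For concreteness, here is a minimal sketch in Python of the kind of Monte Carlo loop behind curves like those in figure 1. It takes a player’s surprisal on a roll to be the negative natural log of the probability they assigned to the actual outcome, which reproduces player E’s constant \ln 2 \approx 0.69 and the mean surprisals quoted above; the reward bookkeeping of Eqs. (3) and (11) is left out.

```python
import math
import random

# Forecast probability each player assigns to "yes"; GUD-alpha (1/3) is taken as true.
forecasts = {"A": 1/3, "B": 5/6, "C": 2/3, "D": 1/6, "E": 1/2}
P_TRUE = 1 / 3
N_ROLLS = 500

random.seed(0)  # arbitrary seed, for reproducibility
totals = {player: 0.0 for player in forecasts}

for i in range(N_ROLLS):
    yes = random.random() < P_TRUE  # simulate one roll of the die
    for player, p_yes in forecasts.items():
        # Surprisal: negative log of the probability given to the actual outcome.
        totals[player] += -math.log(p_yes if yes else 1.0 - p_yes)

# Mean surprisal after all rolls (player A should come out lowest, B highest,
# and E pinned near ln 2 ~ 0.69).
for player, total in totals.items():
    print(player, round(total / N_ROLLS, 3))
```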

The total reward points parceled out to each player are consistent with the surprisals. For the simulation in figure 1 above, the corresponding total reward accumulated by each player as a function of the question number is shown in the following plot:

Figure 2

As expected, player A accumulates the most rewards while player B gets the fewest.

We can start to build on simulations like these by making things more realistic. An obvious problem with the above example is that, if players B and C are at all reasonable agents, they will notice that their forecast probabilities are wrong after just a handful of dice rolls. They will then discard their GUDs and update their forecasts accordingly. Thus, we need an algorithm to mimic forecast updating. We also need to account for situations where players occasionally refuse to agree on the outcome of a dice roll, giving |V| \neq 1. This requires a method for simulating consensus breakdown. Future updates will include these improvements.

A Dice Game

I thought it might be useful to illustrate some aspects of the system described in section 4 here with a very simple game, which I will describe below. The game is limited in what it can show, especially since it involves only two participants. It is not meant to capture all facets of the problem, but it can be a starting point for building intuition.

The game is essentially a competition between a dice expert A and a dice expert B. A table placed in the middle of a room has an open-top box on it. The box contains a six-sided die. However, instead of the usual dots, it has “yes” or “no” printed on each side. We do not know how many sides say “yes” and how many say “no.” We must defer to experts A and B to tell us.

Expert A has developed a Grand Unified Theory of Dice Construction (GUD-A). Sophisticated reasoning based on their theory has convinced expert A that 2 out of 6 of the sides say “yes.”

Expert B has a competing Grand Unified Theory of Dice Construction (GUD-B) in which they are equally confident. Expert B has determined, on the basis of their theory, that 5 out of 6 of the sides of the die say “yes.”

Both expert A and expert B would like to convince us of the correctness of their GUD. To do so, they run a sequence of “experiments.” Each experiment consists of shaking the box around for a while, looking at the die, and announcing whether a “yes” or a “no” rolled up.

But neither expert A nor expert B is allowed to personally run the experiment. Instead, they must bring in an outsider to perform each roll. This “roller” does each experiment and announces a “yes” or “no” result based on what they see.

Now, there may be various reasons to be wary of a roller. They may have weak arms and not shake the box hard enough, or they may have bad eyesight that prevents them from accurately reading the top of the die. They might just be untrustworthy. One of the players might suspect the roller of being biased in favor of the other player. So either one or both of the players might end up refusing to acknowledge the announced result of an experiment. If that happens, |V| = 0 and, according to the reward distribution algorithm, neither player receives any reward points for that experiment. Thus, the experts are motivated to consult with one another before each roll to ensure that they agree that the chosen roller is acceptable.

If one of the players (player A, for example) decides that they do not approve of how the game is being played, then that player may choose to leave the game entirely. In that case, player B may continue to play alone. But with a single player, \Delta s = 0 always, so player B would collect zero points on all subsequent questions. Therefore, each participant is incentivized to avoid acting in ways that might drive their opponent to quit.

Say that the players can play up to 50 dice rolls (or “experiments”). Assuming the players are driven by a self-interested desire to accumulate as many reward points as possible, can an outsider tell whether it is GUD-A or GUD-B that is the correct dice theory just by looking at how many reward points the players accumulate?

If both players and the roller are absolutely honest and reliable, then it will be fairly obvious. But we can ask what happens if there are occasional disagreements. By refusing to acknowledge outcomes that go against their own predictions, a player can narrow the difference between their total reward points and the reward points of their competitor. However, by doing this they also lower the total number of reward points distributed to the group.

The idea is that, when individuals try to accumulate the largest possible number of reward points, the group as a whole should bring clarity to the essential underlying questions. The dice game above is an extremely stripped-down example of the type of scenario we might use to stress-test that idea. The fact that the players need to share a critical level of trust in each other and in the roller is what makes it more than just a simple dice-rolling bet.

The biggest limitation of the above example is that there are only two players. It is straightforward, however, to extend it to many dice-betters.

A self-governing, self-regulating system for assessing scientific predictive power 

I have written a long essay explaining the basic ideas behind the project discussed on this blog, and I have just made it available on the arXiv physics preprint server here. I will refer to it frequently in future posts. Comments and suggestions are welcome.

See especially the reward distribution algorithm of section 4. Much of the work I am planning for the coming year will involve simulations to test the robustness of this and/or similar algorithms.

Screenshots from Ex Quaerum

The database of predictions was made into a project for a group of undergraduate computer science majors, and they have been working diligently on it over the past year. They are calling it “Ex Quaerum,” which according to Google means “from the questions.” They are about to graduate, and I am extremely pleased with what they have created. I wanted to share a few screenshots of what is shaping up to be a very cool website:

Here is the login page for user “Charles Darwin”:

Here is what a prediction page looks like:

And here is the profile page for a user:

Thanks to Taylor Brett, Jena Essary, Ashish Kondaka, Nathan Livingston, Aaron Williams, and Craig Woodington for all their hard work!

A prediction aggregation and assessment project

I plan to use this blog mainly to report on progress in a specific project. This first post is an attempt to explain the basics of what the project is all about.

My motivations originate from concerns I have about how research is conducted in my own subfield of physics. I was inspired by websites like Good Judgment Open and Metaculus, which collect forecasts from ordinary people about a wide variety of questions and quantify collective predictive power in a sort of competition. It struck me that something like this could be incredibly useful in more technical fields of science. What I have in mind is a database of quantified and dated predictions, submitted by scientists and experts, about the future outcomes of specific scientific research projects. By engaging with the database, the participating experts would accumulate reputational award points. The idea behind such a system is that it would encourage scientists to explain the empirical, predictive reasoning behind their theories, hypotheses, and conjectures. The accumulated record of past predictions would make it clear how that reasoning has fared.

For technical subjects, the main challenge is to devise a system that attracts participation and incentivizes good predictive activity. Thus, much of my energy so far has been spent trying to formulate an effective reward distribution algorithm. My plan is to use the next few posts to try out different ways of explaining the reasoning behind the algorithm as it has developed so far.

Another component of the work involves setting up an actual website with a prototype of the system. I have hired a number of computer science students to help with that, and I will provide updates on the progress as (and if) it proceeds.