In a large sample of backgammon games, wins of the opening roll should be evenly divided between the players, the number of doubles rolled by each player should be nearly equal, and the dice totals should follow the familiar triangular distribution of two-dice sums, peaking at seven. So writing about my experience is payment for the extensive field research I have done playing the game. Opinions differ on this, but I will let accusations (leveled by my wife) of ‘time wasting’ with ‘that cursed game’ accrue to the costs of serious research into the statistical evaluation of fairness.
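Those baselines are easy to state precisely, and easy to simulate. Here is a minimal sketch in Python (my choice of language; the app publishes no code) of what a fair pair of dice should produce:

```python
import random
from collections import Counter

# A fair two-dice baseline: doubles in about 1/6 of rolls, and
# totals following the triangular distribution that peaks at seven.
random.seed(1)  # fixed seed for a reproducible illustration
N = 100_000
totals, doubles = Counter(), 0
for _ in range(N):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    totals[d1 + d2] += 1
    doubles += (d1 == d2)

print(f"doubles: {doubles / N:.3f}")        # ~0.167
for t in range(2, 13):
    print(t, round(totals[t] / N, 3))       # 2 and 12 ~0.028, 7 ~0.167
```

Any app log that strays far from these proportions over a large sample deserves suspicion.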
When choosing a backgammon app, I care about the challenge that a CPU can offer but am wary of accusations of cheating (although I cannot imagine why a programmer would purposely bias the algorithms to favor the CPU). If a player feels cheated, he will neither play the game for very long nor tolerate the pop-up adverts.
This is the sticking point and the impetus of my investigation. It would be counterproductive to design a game that cheats, and yet I cannot shake the feeling that this one does. This speaks to the notions of face validity and reasonable doubt – concepts I have written about before. I wanted to put empirical analysis to the task of figuring out whether my backgammon app is a filthy, time-sucking, low-down cheat.
Players accuse apps of unfair play by way of suspiciously fortunate rolls and other minor quirks that affect a game’s outcome. The backgammon app that I play (it will remain unnamed to protect the reputation of the developer) has intrigued me with respect to accusations like this. Although I might not beat a supercomputer in a backgammon tournament, I have established a lopsided record of victories over this particular CPU and dominated the skill metrics (rounds won, points scored, opponent pips hit). All the while, the app shows unbiased statistics in the dicey elements of doubles rolled and first rolls won. In other words, all outward appearances suggest that the playing board is fair and I am simply better than the CPU.
The app developers responded to accusations of cheating on the game’s mobile download site by pointing out that the app measures some game play statistics that would show any bias in clear terms. There is also a feature whereby the player can ‘roll’ the dice for the CPU (using a real pair of dice or a random number generator program) and then enter the neutral rolling outcome into the app for the CPU to play. The strategic and tactical algorithms of the program appear to be static, meaning it's not a sophisticated learning program.
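For the curious, a neutral roller along those lines takes only a few lines of Python; the function here is my own sketch, not the app’s (`secrets` draws from the operating system’s entropy pool, outside any game code):

```python
import secrets

def neutral_roll():
    """Roll two dice from the OS entropy pool, independent of the
    app's own RNG, for entry into the roll-for-the-CPU feature."""
    return secrets.randbelow(6) + 1, secrets.randbelow(6) + 1

print(neutral_roll())  # e.g. (4, 2)
```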
I compiled a seriously (or embarrassingly) large number of rounds and, despite exceeding a 2:1 win ratio, I still believed the CPU cheats. It took a couple of months to formulate indicators that could detect a bias. I had to define and then record the phenomena (as in the table below), being overly conservative in my counts to maintain reliability. I classify a roll as a phenomenon if it manifests at greater than double its random likelihood – for example, if the CPU rolls a 6-5 in more than 11% of the situations where that roll is needed to hit my exposed pip, double the 5.56% chance of the roll itself (see the code sketch after the table).
| Phenomenon | Definition | Example |
| --- | --- | --- |
| Ideal Rolls | The Player’s vulnerability is exploited to the fullest by the CPU getting a low-probability roll | I have a lone exposed pip that can only be hit by the CPU rolling a 6-5 (5.56% probability) |
| Bad Doubles | The Player’s double roll makes the position worse than it was before the roll | Rolling 4-4 when the only option is to leave a pip exposed and/or move pieces to weaker positions |
| Handicapping | The Player’s roll values are significantly lower than the CPU’s during the endgame | The CPU has a double-digit deficit in point count, then rolls high values while the Player rolls low, turning the game around |
For instance, ‘ideal rolls’ were counted only when the CPU hit an exposed pip that required a combined roll of nine or higher, and ‘bad doubles’ were counted only in situations where I was forced to expose a pip that the CPU could hit with a roll of seven or less (see the figure below for the probabilities). Handicapping was counted on a per-round basis when either player’s dice totals fell below 66% of the other player’s after the bear-off (unloading) phase began.
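To make those cutoffs concrete, here is a sketch of the probabilities behind them; the function names and the simplified hit count are mine, and the count ignores the four moves of doubles and any blocked intermediate points, so it slightly underestimates some distances:

```python
from itertools import product

def hit_probability(distance):
    """Chance that one roll of two dice hits a blot `distance` pips
    away, counting direct shots and the two-die combination only."""
    hits = sum(1 for d1, d2 in product(range(1, 7), repeat=2)
               if distance in (d1, d2, d1 + d2))
    return hits / 36

def looks_biased(observed_hits, opportunities, distance):
    """Apply the double-the-random-likelihood threshold from above."""
    return observed_hits / opportunities > 2 * hit_probability(distance)

print(hit_probability(11))      # 2/36 ~ 0.0556: the 6-5 shot
print(looks_biased(9, 60, 11))  # True: 15% exceeds the 11.1% threshold
```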
My doubles often came when I least needed them, and endgame rolls appeared handicapped: the CPU would overcome an odds-on point-count deficit quite frequently in the competitive bear-off stage of the game (see Ross, Benjamin & Munson 2007). Too often the CPU would get the ‘perfect’ roll for a situation, while I was rarely the beneficiary of rolling exactly what was needed to turn the game in my favor.
Backgammon pundits and the app’s developers argue that this apparent phenomenon is actually an artifact of playing strategy – good players maximize opportunities for advantage and minimize exposure to hazard. The fallacy in this explanation is that I am a good player (as evidenced by my lopsided record of victory), and yet my minimal vulnerabilities are exploited beyond the norms of probability.
The CPU does get more ideal rolls than I do, I get more bad doubles than the CPU does, and handicapping does occur. The differences in ideal rolls and handicapping are statistically significant.
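As a sketch of what such a significance check can look like, here is a one-sided binomial test in Python; the counts are hypothetical placeholders rather than my actual tallies:

```python
from scipy.stats import binomtest

# Hypothetical tallies, for illustration only.
opportunities = 120  # situations where only a 6-5 would hit my blot
hits = 15            # times the CPU actually rolled it (12.5%)

# Under fair dice, each opportunity is hit with p = 2/36 (~5.56%).
result = binomtest(hits, opportunities, p=2/36, alternative='greater')
print(f"one-sided p-value: {result.pvalue:.4f}")  # well below 0.05 here
```

Counts like these would fall far outside what chance allows, which is the sense in which my tallies diverged from a fair baseline.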
So what may be learned from this important study? Without delving into the app’s program code and conducting a thorough audit of the CPU’s every move, the lesson is this: appearances can be deceiving.
I will swear that I am being cheated but cannot prove it beyond a doubt. My qualitative assessment is validated by my apparent superiority at the game, bolstered by the intangible things that let humans compete with computers at chess and other games of calculated chance.
We all know when a computer game is getting ‘harder’ simply by stacking the odds against the player rather than by qualitatively improving its skill. I argue that this is the case for my backgammon app, and that there may even be something more intriguing going on here.
Backgammon – as a human construct encompassing randomness, strategy, and tactics – illustrates that despite the incredible processing power of our mobile phones, the human mind is still the master of the programs.