yet another developer blog: Counterfactual Regret Minimization

I implemented a classical algorithm to find nash equilibrium strategies in certain zero-sum games,
such as poker.

The code is here. In this repository in readme you can find references to understand theory behind algorithm. Here I'll give only its short description for example of poker:

Information Set - is a state of a game (combination of betting history / player cards), set of nodes in game tree.

This is regret for information set:

The regret is actually our weight of choosing action a in information set I. If regret is positive for action a - choosing action a in I gives money - positive reward. If it is negative - we should not choose this action - we lose - if we will choose this action.

This is utility in case we always choose action a given information set I:

This is utility in case we follow node strategy given information set I:

This is reach probability that player reached this game tree node:

The procedure known as regret matching - computing node strategies based on regrets:

The result is node strategy which the player should choose when reaching information set I.

Next I will describe the results of AKQ clairvoyance toy game, which also helps to understand theory of poker. Shortly this game is 2 player zero-sum game.

P1 can have A/Q, P2 can have K.
P1 can check or bet, P2 can check/call or fold.
At start of the game each player pays one blind.
The winner of pot is a player of higher card.

Here are the results of algorithm (nash equilibrium):

As you see - the results are matched ideally with known formulas in poker in perfectly polarized situations:

Bluff to value ratio:

And minimum defence frequency:

Lets analyze P2 behaviour in case of aggressive opponent P1 (increased bluffing frequency by 5%, more than gto). This can be achived via node locking. Here are the results:

As you see P2 exploits by almost always calling. Also note that there is almost no dependency
on bet size.
Let's see P2 behaviour in case of tight opponent P1 (bluffing frequency is less by 5% than optimal):

P2 here almost never calls.

yet another developer blog

Friday, July 31, 2020

Counterfactual Regret Minimization

No comments:

Post a Comment