Pattern Recognition and Reinforcement Learning¶
Overview¶
Problem: Getting son to eat broccoli
Strategies:
- Acclimatization: Serve once a week until he gets used to it
- Better than alternative: It’s better than spinach
- Enticements: We’ll watch your favorite movie after
- Trickery: It’s candy
- Testimonial: Your best friend likes it
- Wait it out: We’re not leaving this table until …
- Piecemeal: Just one little bite
- Appeal to Reason: Great source of vitamins K and C
Reinforcement Learning:
Randomly try strategies. If they work, choose them more often.
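The idea in miniature (a rough sketch, not the game code developed below; the strategies and their made-up success rates are placeholders):
from random import choices, random

strategies = ['acclimatization', 'enticements', 'piecemeal']
success_rate = {'acclimatization': 0.2, 'enticements': 0.6, 'piecemeal': 0.4}   # made-up numbers
weights = [1, 1, 1]                                    # start with no preference

def worked(strategy):
    return random() < success_rate[strategy]           # placeholder: did the broccoli get eaten?

for night in range(200):
    i = choices(range(len(strategies)), weights)[0]    # weighted random pick of a strategy
    if worked(strategies[i]):
        weights[i] += 1                                # reinforce whatever succeeded

print(dict(zip(strategies, weights)))                  # 'enticements' usually ends up with the biggest weight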
Rock Paper Scissors¶
This is an iterated adversarial zero-sum two-person game of perfect information.
The scoring logic is easily encoded in a Python dictionary:
# scorer['RS'] -> 1 meaning Rock cuts Scissors is +1 for the first player.
scorer = dict(SP=1, PR=1, RS=1, PS=-1, RP=-1, SR=-1, SS=0, PP=0, RR=0)
Now, it’s time to think about winning (not losing):
- Playing randomly assures that we lose no more than one-third of the time, regardless of our opponent’s strategy.
- If our opponent is completely predictable, we can always win.
- In the real world, humans have a hard time simulating perfect randomness, and they often have exploitable tendencies or “have a plan” to beat us:
- A mild preference for paper
- Propensity to select rock after they’ve played scissors
- If we play paper, they often select scissors on the next round
- They tend to copy our last play
- In other words, there may be patterns that we can detect and exploit to win more than a third of the time.
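A quick back-of-the-envelope simulation suggests how even a mild bias can be exploited (a rough sketch; the 40/30/30 paper preference is made-up for illustration):
from random import choices

options = ['R', 'P', 'S']
scorer = dict(SP=1, PR=1, RS=1, PS=-1, RP=-1, SR=-1, SS=0, PP=0, RR=0)

def biased_opponent():
    return choices(options, [30, 40, 30])[0]       # mild preference for paper

random_play = sum(scorer[choices(options)[0] + biased_opponent()] for _ in range(10_000))
exploit = sum(scorer['S' + biased_opponent()] for _ in range(10_000))
print(random_play, exploit)     # random play hovers near zero; always playing scissors nets roughly +1,000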
Our Approach¶
Generate multiple competing pattern recognition strategies
Choose between the strategies using a multi-arm bandit approach to reinforcement learning.
Core logic:
for trial in range(trials):
    # choose our move
    # get opponent's move
    # determine the winner
    # update move history and strategy weights
Helpful Python Skills¶
itertools.chain()
links multiple iterables into one:
>>> from itertools import chain
>>> list(chain('RPRPS', 'RPS'))
['R', 'P', 'R', 'P', 'S', 'R', 'P', 'S']
itertools.cycle()
repeats a sequence over and over again:
>>> from itertools import cycle, islice
>>> list(islice(cycle('RPS'), 9))
['R', 'P', 'S', 'R', 'P', 'S', 'R', 'P', 'S']
collections.Counter()
tallies frequencies of the inputs:
>>> from collections import Counter
>>> Counter('RPRPSRPSR')
Counter({'R': 4, 'P': 3, 'S': 2})
Counter.most_common(n)
returns the n most common elements and their counts:
>>> Counter('RPRPSRPSR').most_common(2)
[('R', 4), ('P', 3)]
zip(*somedict.items())
transposes the items into separate keys and values:
>>> d = dict(R=4, P=3, S=2)
>>> keys, values = zip(*d.items())
>>> keys
('R', 'P', 'S')
>>> values
(4, 3, 2)
zip(a, a[1:])
groups input into overlapping digraphs:
>>> a = 'abcde'
>>> list(zip(a, a[1:]))
[('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]
random.choices(population, weights)[0]
makes a weighted choice from a population:
>>> from random import choices
>>> choices(['R', 'P', 'S'], [6, 3, 1], k=10)
['P', 'P', 'R', 'S', 'R', 'P', 'P', 'R', 'R', 'P']
>>> Counter(_)
Counter({'P': 5, 'R': 4, 'S': 1})
The Code¶
''' Generic learning algorithm for adversarial two-person
zero-sum games of perfect information.
General approach: Make a list of strategies based
on pattern recognition. Use the multi-arm bandit
approach to learning which strategies win the most.
'''
from collections import Counter
from random import choices, choice
from itertools import chain, cycle
from pprint import pprint
# Game Definition ################################
# https://en.wikipedia.org/wiki/Rock%E2%80%93paper%E2%80%93scissors
# scorer['RS'] -> 1 meaning Rock cuts Scissors is +1 for the first player.
scorer = dict(SP=1, PR=1, RS=1, PS=-1, RP=-1, SR=-1, SS=0, PP=0, RR=0)
# Scissors cuts Paper; Paper covers Rock; Rock crushes Scissors
ideal_response = {'P': 'S', 'R': 'P', 'S': 'R'}
options = ['R', 'P', 'S']
# Strategy Utilities #############################
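# select_proportional() predicts in proportion to observed frequencies.
# select_maximum() makes a greedy prediction of the most frequent observation.
# The baseline of all options guarantees every prediction stays possible.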
def select_proportional(events, baseline=()):
rel_freq = Counter(chain(baseline, events))
population, weights = zip(*rel_freq.items())
return choices(population, weights)[0]
def select_maximum(events, baseline=()):
rel_freq = Counter(chain(baseline, events))
return rel_freq.most_common(1)[0][0]
# Strategies #####################################
def random_reply(p1hist, p2hist):
return choice(options)
def single_event_proportional(p1hist, p2hist):
""" When opponent plays R two-thirds of the time,
    respond with P two-thirds of the time.
"""
prediction = select_proportional(p2hist, options)
return ideal_response[prediction]
def single_event_greedy(p1hist, p2hist):
""" When opponent plays R more than P or S,
    always respond with P.
"""
prediction = select_maximum(p2hist, options)
return ideal_response[prediction]
def digraph_event_proportional(p1hist, p2hist):
""" When opponent's most recent play is S
and they usually play R two-thirds of the time
after an S, respond with P two-thirds of the time.
"""
recent_play = p2hist[-1:]
digraphs = zip(p2hist, p2hist[1:])
    followers = [b for a, b in digraphs if [a] == recent_play]   # plays that followed the most recent play
prediction = select_proportional(followers, options)
return ideal_response[prediction]
def digraph_event_greedy(p1hist, p2hist):
""" When opponent's most recent play is S
and they usually play R two-thirds of the time
after an S, respond with P all of the time.
"""
recent_play = p2hist[-1:]
digraphs = zip(p2hist, p2hist[1:])
    followers = [b for a, b in digraphs if [a] == recent_play]   # plays that followed the most recent play
prediction = select_maximum(followers, options)
return ideal_response[prediction]
strategies = [random_reply, single_event_proportional, single_event_greedy,
digraph_event_proportional, digraph_event_greedy]
def play_and_learn(opposition, strategies=strategies,
trials=1000, verbose=False):
strategy_range = range(len(strategies))
weights = [1] * len(strategies)
p1hist = []
p2hist = []
cum_score = 0
for trial in range(trials):
# choose our move
our_moves = [strategy(p1hist, p2hist) for strategy in strategies]
i = choices(strategy_range, weights)[0]
our_move = our_moves[i]
# get opponent's move
opponent_move = opposition(p2hist, p1hist)
# determine the winner
score = scorer[our_move + opponent_move]
if verbose:
print(f'{our_move} ~ {opponent_move} = {score:+d}'
f'\t\t{strategies[i].__name__}')
cum_score += score
# update move history and strategy weights
p1hist.append(our_move)
p2hist.append(opponent_move)
for i, our_move in enumerate(our_moves):
if scorer[our_move + opponent_move] == 1:
weights[i] += 1
print(f'---- vs. {opposition.__name__} ----')
print('Total score:', cum_score)
pprint(sorted([(weight, strategy.__name__) for weight, strategy in zip(weights, strategies)]))
if __name__ == '__main__':
def human(p1hist, p2hist):
return input(f'Choose one of {options!r}: ')
def fixed_ratio(p1hist, p2hist):
return choices(options, (1, 2, 3))[0]
def cycling(series):
iterator = cycle(series)
def cycle_opponent(p1hist, p2hist):
return next(iterator)
return cycle_opponent
play_and_learn(opposition=fixed_ratio)
play_and_learn(opposition=random_reply)
play_and_learn(opposition=cycling('RPRSS'))
play_and_learn(opposition=human, trials=10, verbose=True)
Critiques and Improvements¶
- When humans are losing, they generally change their strategy, so we should “age” our weights to adapt to the change (see the sketch after this list).
- The random_reply strategy wins one-third of the time, so it keeps getting selected even when other strategies are doing better.
- We need to keep random_reply in the mix in case our strategies get figured out and gamed. Randomness is your best defense against a smarter adversary.
- The multi-arm bandit approach (choosing the strategy proportionally to wins and losses) results in very slow learning.
- Humans get bored with this game quickly, so it is hard to get people to test it.
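One possible way to “age” the weights (a rough sketch; the 0.97 decay factor is an arbitrary illustration, not a tuned value):
def age_weights(weights, decay=0.97):
    # Pull every weight back toward 1 so that old wins fade
    # and recently successful strategies dominate the choice.
    return [1 + (w - 1) * decay for w in weights]

# Inside the trial loop, right after the weights are updated:
#     weights = age_weights(weights)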
Where you can go from here¶
- None of the code logic is hard-wired to the basic game.
- It is easy to add new strategies.
- Or to evaluate various published how-to-win strategies.
- Changing the game definition logic allows for more complex games such as rock-paper-scissors-lizard-spock:
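Here is one sketch of what that might look like (with 'K' standing in for Spock; the names wins and beaten_by are illustrative additions, not part of the code above):
options = ['R', 'P', 'S', 'L', 'K']

# Rock crushes Scissors and Lizard; Paper covers Rock and disproves Spock;
# Scissors cuts Paper and decapitates Lizard; Lizard eats Paper and poisons Spock;
# Spock vaporizes Rock and smashes Scissors.
wins = ['RS', 'RL', 'PR', 'PK', 'SP', 'SL', 'LP', 'LK', 'KR', 'KS']

scorer = {a + b: 0 for a in options for b in options}
scorer.update({pair: 1 for pair in wins})
scorer.update({pair[::-1]: -1 for pair in wins})

# With five moves, every move loses to two others, so the single
# ideal_response becomes a choice among two winning replies.
beaten_by = {move: [w[0] for w in wins if w[1] == move] for move in options}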