To frame our discussion of various approaches in game-playing AI, we can broadly categorize them as follows:
Heuristic-Based Approaches
Definition: Rely on fixed, domain-specific rules or evaluation functions.
Examples:
Positional Strategies: Simple rules such as “corners > edges > center” in Othello.
Handcrafted Evaluation Functions: Score board positions based on static criteria.
Planning and Search Algorithms
Definition: Explore future moves by building and analyzing a search tree.
Subcategories:
Deterministic Tree Search:
Alpha–Beta Pruning: Uses minimax with pruning based on heuristic evaluation.
Simulation-Based Search:
Monte Carlo Tree Search (MCTS): Uses random playouts and statistical methods (like UCT) to guide move selection.
Key Insight: While these methods may rely on heuristic evaluations (e.g., alpha–beta's leaf evaluation function or heuristic playout policies in MCTS), they are fundamentally planning methods: they decide by searching forward from the current position rather than by recalling learned parameters.
Connectionist / Learning-Based Approaches
Definition: Leverage neural networks to learn strategies and board evaluations from data.
Examples:
Transformer Models: Models like Othello-GPT, which learn from game trajectories and build internal representations of the board.
Combined Methods (AlphaZero): Neural networks provide policy and value estimates to guide planning algorithms like MCTS.
Deep Dive: Design Considerations for Othello AI
Below is a more detailed look at the design considerations around using Othello as a testbed for “game intelligence”: how much gameplay experience is necessary, how to conduct self-play learning (e.g., AlphaZero-style), how to build datasets, and how to model rewards effectively.
1. How Much Gameplay Experience Is Necessary?
Experience Requirements by Approach
Rule-Based or Heuristic Agents
Minimal direct training data: A rule-based agent simply encodes heuristics (e.g., “corners > edges > center” or “avoid X-squares”) and does not require large-scale game logs. Performance, however, often plateaus well below expert levels.
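For concreteness, here is a minimal sketch of such a heuristic evaluation: a hand-tuned positional weight table that prizes corners and penalizes the X-squares diagonally adjacent to them. The specific weights and the board encoding (+1 own disc, -1 opponent disc, 0 empty) are illustrative assumptions, not values taken from any particular engine.

```python
# Hand-tuned positional weights: corners are prized, X-squares (the squares
# diagonally adjacent to corners) are penalized. Values are illustrative only.
CORNER, X_SQUARE, EDGE, CENTER = 100, -50, 10, 1

WEIGHTS = [[CENTER] * 8 for _ in range(8)]
for r, c in [(0, 0), (0, 7), (7, 0), (7, 7)]:
    WEIGHTS[r][c] = CORNER
for r, c in [(1, 1), (1, 6), (6, 1), (6, 6)]:
    WEIGHTS[r][c] = X_SQUARE
for i in range(8):
    for r, c in [(0, i), (7, i), (i, 0), (i, 7)]:
        if WEIGHTS[r][c] == CENTER:          # don't overwrite corners
            WEIGHTS[r][c] = EDGE

def evaluate(board):
    """Static score of an 8x8 board encoded as +1 (own), -1 (opponent), 0 (empty)."""
    return sum(WEIGHTS[r][c] * board[r][c] for r in range(8) for c in range(8))
```

A rule-based agent of this kind would simply play the legal move that leads to the highest static score.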
Classic Search-Based Approaches (Minimax, MCTS)
Online search rather than offline training: These systems do not necessarily learn from tens of thousands of offline games. Instead, they search forward from the current state during each move (using, for example, alpha–beta pruning or Monte Carlo simulations).
Key limitation: For deeper search you need significant computational resources or good heuristics; you don’t necessarily need a separate “training dataset,” but you do pay a time cost during gameplay.
Neural-Network Self-Play Approaches (AlphaZero-Style)
Potentially large-scale self-play experience: The AlphaZero framework for smaller board games like Othello can require tens or hundreds of thousands of self-play games – or more – to converge to a strong policy and value function.
The good news is that Othello is simpler than Go or chess:
The branching factor in Othello is typically lower (especially in the endgame, though it can be moderate in the mid-game).
Convergence can happen faster than in more complex games if the model architecture is well-tuned.
In practice, many Othello-specific AlphaZero implementations show that a few hundred thousand self-play games (or fewer) can already exceed strong amateur or even near-expert human performance, especially if combined with an efficient search (MCTS).
Hybrid or Offline + Online Approaches
Dataset bootstrapping: Combining offline expert game records with online self-play can reduce the required self-play experience. The model begins with a decent policy from the offline data, then refines it through self-play (similar to AlphaGo’s early steps).
2. Self-Play Learning (AlphaZero-Style)
Self-play remains one of the most successful techniques for learning Othello from scratch:
Policy and Value Network Training
A single neural network outputs both a policy (move probabilities) and a value estimate (chance of eventually winning).
Training loop:
Self-play: Use Monte Carlo Tree Search (MCTS) guided by the current network to play full games.
Data collection: Record (state, MCTS-improved policy, game outcome) triplets.
Network update: Train the network to predict both the MCTS-improved policy and the final outcome from those states.
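As a rough sketch of the network-update step, assuming PyTorch and a network `net` that maps a batch of board planes to (policy_logits, value); the tensor shapes, variable names, and unit loss weighting are illustrative assumptions rather than any particular implementation’s API:

```python
import torch.nn.functional as F

def train_step(net, optimizer, states, mcts_policies, outcomes):
    """One update on a batch of (state, MCTS-improved policy, final outcome) triplets.

    states:        float tensor [B, C, 8, 8], board feature planes
    mcts_policies: float tensor [B, A], MCTS visit-count distributions (A = action-space size)
    outcomes:      float tensor [B, 1], +1 win / 0 draw / -1 loss from the mover's view
    """
    policy_logits, value = net(states)
    # Policy head: cross-entropy against the MCTS-improved move distribution.
    policy_loss = -(mcts_policies * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    # Value head: regress the final game outcome.
    value_loss = F.mse_loss(value, outcomes)
    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```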
Experience Replay
Collect states from many games into a replay buffer.
Randomly sample from this buffer to train so that the model sees a broad distribution of positions.
This prevents the model from overfitting to only the most recent self-play games.
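A minimal replay-buffer sketch in plain Python (the class name, capacity, and batch size are illustrative choices):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, mcts_policy, outcome) samples from self-play."""

    def __init__(self, capacity=500_000):
        self.buffer = deque(maxlen=capacity)   # oldest samples fall out automatically

    def add_game(self, samples):
        """samples: list of (state, mcts_policy, outcome) triplets from one finished game."""
        self.buffer.extend(samples)

    def sample(self, batch_size=256):
        """Uniformly sample across many games so training sees a broad mix of positions."""
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```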
Hyperparameter Tuning
MCTS simulations per move: More simulations → better move selection but slower gameplay.
Network capacity: Othello is not as large as Go, so a relatively modest convolutional or residual architecture is often enough.
Learning rate, batch size, etc.: Standard hyperparameters can be adapted from existing AlphaZero-like codebases.
Phase Awareness
Some AlphaZero-like approaches incorporate phase-specific heuristics (e.g., lower search depth in the early game, deeper near the endgame).
Another trick is to let the value network condition on how many moves have been made, as mid-game vs. endgame strategies differ significantly.
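One simple way to implement the second idea is to append the (normalized) move number as an extra input plane so the value network can condition on game phase; a sketch assuming the board is already encoded as NumPy feature planes:

```python
import numpy as np

def add_phase_plane(planes, move_number, max_moves=60):
    """Append a constant plane encoding game progress (~0 in the opening, ~1 in the endgame).

    planes: float32 array of shape [C, 8, 8] holding the usual board features.
    """
    phase = np.full((1, 8, 8), move_number / max_moves, dtype=np.float32)
    return np.concatenate([planes, phase], axis=0)
```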
3. Dataset Construction
Although AlphaZero itself learns purely from self-play, many practitioners combine or compare it with curated data:
Expert Game Records
Othello tournament databases (for instance, from the World Othello Championship) are often publicly available.
These records can bootstrap a supervised learning stage: the network learns to predict top moves from strong players.
Situation-Specific Response Databases
You can build small “opening books” of known strong responses for typical early configurations.
Similarly, for the endgame, you can store a partial perfect-play tablebase for smaller subsets of the board.
Merging these with a trained model can reduce the model’s burden in well-understood phases of the game.
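A sketch of how an opening book and a partial endgame table might be consulted before falling back to the learned model; the lookup structures, the 12-empty threshold, and the `canonical_key` helper are all hypothetical:

```python
def canonical_key(position):
    """Hypothetical canonicalization; a real engine would also normalize
    over the 8 board symmetries before lookup."""
    return tuple(tuple(row) for row in position)

def choose_move(position, empties, opening_book, endgame_table, model_move):
    """Prefer curated knowledge in well-understood phases, else trust the model.

    opening_book:  dict mapping a position key -> known strong reply
    endgame_table: dict mapping a position key -> perfect-play move (late endgame only)
    model_move:    fallback move from the trained network + search
    empties:       number of empty squares remaining
    """
    key = canonical_key(position)
    if key in opening_book:
        return opening_book[key]
    if empties <= 12 and key in endgame_table:   # 12-empty threshold is illustrative
        return endgame_table[key]
    return model_move
```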
Time-Constrained Decision Patterns
Observe how strong (human or engine) players react when time is low. The system can learn “fast approximate solutions” to complex positions, which is especially relevant if you implement a time-limited or real-time Othello variant.
4. Reward Modeling
Reward modeling is crucial so that your agent doesn’t just greedily flip the most discs at every turn (which often backfires).
Short-Term vs. Long-Term Rewards
Naive approach: Reward = net discs captured per move. This encourages greedy play that grabs many discs early, only to lose mobility and be forced to hand the corners to the opponent later.
Improved approach: Reward = final game outcome (win/lose/draw). The network or search gets no immediate credit for flipping discs – a move is only rewarded insofar as it eventually leads to a win. However, learning good strategic play from a single binary outcome signal can be slow.
Phase-Differentiated Reward
Opening/Mid-Game Reward Emphasis: Prioritize mobility or stable edges more than disc count.
Endgame Reward Emphasis: The disc count or forced lines matter more in the final 10 moves.
In practice, you can modulate the reward or features used in the value function across the game’s timeline.
Adaptive Reward Systems
Curriculum Learning or Shaped Rewards: Early in training, provide small positive rewards for moves that:
Increase mobility
Capture stable discs in corners
Gradually reduce these shaping terms so that the model eventually learns the unshaped objective (i.e., winning the game).
Self-Adjusting Heuristics: As the agent’s policy matures, the weighting of “positional advantage” vs. “disc advantage” can be tuned.
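A sketch combining the shaping bonuses and annealing schedule described above; the bonus weights, the linear decay, and the function signature are illustrative assumptions:

```python
def shaped_reward(final_outcome, mobility_gain, stable_corner_discs,
                  training_step, anneal_steps=100_000):
    """Blend the true objective (win/draw/loss) with small strategic bonuses early in training.

    final_outcome:       +1 win, 0 draw, -1 loss, from the agent's perspective
    mobility_gain:       change in the agent's number of legal moves caused by its move
    stable_corner_discs: corner discs gained by this move (stable by definition)
    """
    # The shaping coefficient decays linearly to zero, so the agent ends up
    # optimizing only the unshaped objective: winning the game.
    shaping = max(0.0, 1.0 - training_step / anneal_steps)
    bonus = 0.01 * mobility_gain + 0.05 * stable_corner_discs   # illustrative weights
    return final_outcome + shaping * bonus
```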
Planning and Searching Techniques
Alpha-Beta Pruning in Othello
Minimax Basics in Othello
Othello is a zero-sum, two-player game where players take turns placing pieces.
The minimax algorithm evaluates each move by assuming:
Maximizing Player (MAX, e.g., Black) tries to maximize their score.
Minimizing Player (MIN, e.g., White) tries to minimize MAX’s score.
The search tree expands until a terminal state (win/loss/draw) or a depth limit is reached.
A heuristic evaluation function (e.g., piece count, mobility, stability) estimates the position’s value.
Alpha-Beta Pruning for Efficient Search
Instead of evaluating all possible moves, alpha-beta pruning cuts off unnecessary branches:
Alpha ($\alpha$): The best (highest) score that MAX is already guaranteed along the current line of play.
Beta ($\beta$): The best (lowest) score that MIN is already guaranteed along the current line of play.
At a MAX node, if a move’s evaluation reaches $\beta$ or above, the rest of that branch is pruned: MIN would never let play reach it.
At a MIN node, if a move’s evaluation falls to $\alpha$ or below, the branch is pruned: MAX already has a better alternative elsewhere.
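A compact minimax-with-alpha-beta sketch for reference; `legal_moves`, `apply_move`, and `evaluate` are placeholders for your own Othello rules and heuristic, and a full engine would also handle forced passes:

```python
def alphabeta(board, depth, alpha, beta, maximizing):
    """Minimax with alpha-beta pruning.

    `legal_moves`, `apply_move`, and `evaluate` are placeholders for your own
    Othello rules and heuristic; a full engine would also handle forced passes.
    """
    moves = legal_moves(board, maximizing)
    if depth == 0 or not moves:              # depth limit reached or no legal moves
        return evaluate(board)               # heuristic value from MAX's point of view
    if maximizing:
        value = float("-inf")
        for move in moves:                   # good move ordering here means more pruning
            value = max(value, alphabeta(apply_move(board, move, maximizing),
                                         depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:                # beta cutoff: MIN will avoid this branch
                break
        return value
    else:
        value = float("inf")
        for move in moves:
            value = min(value, alphabeta(apply_move(board, move, maximizing),
                                         depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:                # alpha cutoff: MAX will avoid this branch
                break
        return value
```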
How Alpha-Beta Pruning Helps in Othello
Othello has a moderate branching factor (roughly 10 legal moves per position on average, somewhat higher in the mid-game), yet the game tree still grows exponentially with depth.
Without pruning, minimax would evaluate every move at every depth, making deep searches computationally expensive.
With alpha-beta pruning, irrelevant branches are skipped, reducing the number of nodes evaluated significantly.
Effectiveness depends on move ordering—better ordering (e.g., evaluating best moves first) results in more pruning.
Practical Use in Othello AI
Strong Othello AI engines (e.g., Logistello, WZebra) use alpha-beta pruning with heuristic evaluations to search deeper.
Advantages
Deterministic algorithm based on perfect information
Finds the optimal move given sufficient depth (i.e., when the search reaches terminal positions or the evaluation at the depth limit is accurate)
Limitations
Time-consuming for deep searches
Requires a good evaluation function
Summary: Alpha-beta pruning significantly improves Othello AI by reducing unnecessary evaluations in minimax search, allowing deeper searches within computational limits. Its effectiveness is maximized with good move ordering and heuristic evaluations.
Monte Carlo Tree Search (MCTS)
Monte Carlo Tree Search (MCTS) is an iterative algorithm that builds a search tree by balancing exploration and exploitation. It is particularly effective in games like Othello, where the state space is large and a detailed evaluation function for every state is hard to define.
MCTS consists of four main phases:
Selection: Starting at the root node (current board state), the algorithm traverses the tree by selecting child nodes based on a policy—commonly using the Upper Confidence Bound (UCT) formula. This phase aims to balance between exploring new moves and exploiting known good moves.
Expansion: When a leaf node is reached that is not terminal, the algorithm expands the tree by adding one or more child nodes corresponding to legal moves from that state.
Simulation (Playout): A simulation is run from the newly expanded node until a terminal state (win, loss, or draw) is reached. These simulations typically use random moves or simple heuristics to estimate the outcome.
Backpropagation: The result of the simulation is propagated back up the tree, updating win/visit counts for each node along the path. This statistical information then guides future move selections.
Below is the pseudocode that outlines the core MCTS loop:
```
function MCTS(root):
    while within computational budget:
        leaf = selection(root)              # select a promising node using UCT
        child = expansion(leaf)             # expand by adding child node(s)
        result = simulation(child)          # random playout to a terminal state
        backpropagation(child, result)      # update statistics along the path
    return best child of root
```
The key to the selection phase is the UCT (Upper Confidence Bound applied to Trees) formula:

$$\text{UCT} = \frac{\text{winScore}}{\text{visitCount}} + c \sqrt{\frac{\ln(\text{parentVisitCount})}{\text{visitCount}}}$$
Exploitation term (winScore/visitCount): This part calculates the average win rate (or success rate) for that move based on past simulations. A higher value means that when the move was tried, it led to better outcomes. It “exploits” what is already known to work well.
Exploration term (c * sqrt(ln(parentVisitCount)/visitCount)): This term gives a bonus to moves that have not been tried as often. If a move has been visited only a few times, the denominator is small, which increases the exploration bonus. The logarithm of the parent’s visit count ensures that the bonus grows only slowly as the overall number of simulations increases. The constant c controls how much extra weight is given to exploring less-visited moves. This term “explores” possibilities that might have high potential even if their past performance is uncertain.
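A direct translation of the formula into a selection helper; the node fields (`win_score`, `visit_count`, `children`) and the default c ≈ √2 are illustrative:

```python
import math

def uct_value(child, parent_visits, c=1.41):
    """UCT score = exploitation (average result) + exploration bonus."""
    if child.visit_count == 0:
        return float("inf")                  # always try unvisited children first
    exploitation = child.win_score / child.visit_count
    exploration = c * math.sqrt(math.log(parent_visits) / child.visit_count)
    return exploitation + exploration

def select_child(node):
    """Selection phase: descend to the child that maximizes the UCT score."""
    return max(node.children, key=lambda ch: uct_value(ch, node.visit_count))
```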
Advantages
No detailed evaluation function required: MCTS relies on simulation outcomes, making it robust in complex environments.
Focuses search on promising regions: The UCT formula directs computational resources to moves with high potential.
Anytime algorithm: It can be interrupted at any time and still provide a reasonable move.
Limitations
Dependence on simulation count: The quality of the move selection depends on the number and quality of simulations.
Probabilistic outcomes: Since the algorithm relies on random playouts, its results are based on statistical approximations which can sometimes be inconsistent.
MCTS has been successfully integrated with neural network-based approaches (such as in AlphaZero) where the network provides a policy and value estimate to guide the search. This hybrid method has proven especially effective in Othello, accelerating learning and improving performance with fewer self-play games.
Emergent World Representations: Othello-GPT (Li et al.)
Objective: The paper investigates whether sequence models, such as GPT variants, develop internal representations of the processes that generate their training sequences. Specifically, the authors train a transformer-based model (Othello-GPT) to predict legal moves in the board game Othello and examine whether the model internally constructs a representation of the board state.
Key Contributions & Findings
Training a GPT Model on Othello Moves: The authors train a GPT-style transformer to predict legal Othello moves purely from game transcripts (a championship dataset of real games and a synthetic dataset of random legal games), without explicit knowledge of board rules or structure. The model achieves high accuracy in predicting legal moves, suggesting it is not relying solely on memorization.
(Li et al., page 3) We trained an 8-layer GPT model (Radford et al., 2018; 2019; Brown et al., 2020) with an 8-head attention mechanism and a 512-dimensional embedding. The training was performed in an autoregressive fashion. For each partial game $\{y_t\}_{t=1}^T$, the computation process starts from indexing a trainable word embedding consisting of the 60 vectors, each for one move, to get $\{x^0_{t-1}\}_{t=1}^T$. Then, the vectors are processed by 8 multi-head attention layers. We denote the intermediate feature for the $i$-th token after the $l$-th layer as $x^l_i$. By employing a causal mask, only the features at the immediately preceding layer and earlier time steps $x^l_{j<i}$ are visible to $x^l_i$. Finally, $x^8_t$ goes through a linear classifier to predict logits for $y_t$. We minimize the cross-entropy loss between the ground-truth move and the predicted logits by gradient descent. The model starts from randomly initialized weights, including in the word embedding layer. Though there are geometrical relationships between the 60 words (e.g., C4 is below B4), this inductive bias is not explicitly given to the model but rather left to be learned.
Emergent World Representation: Using probing techniques, the authors find that the model encodes a representation of the board state in its internal activations. Nonlinear probes successfully recover board states from model activations, whereas linear probes fail, suggesting the representation is complex and nonlinear.
(Li et al., page 4) A probe is a classifier or regressor whose input consists of internal activations of a network, and which is trained to predict a feature of interest. If we are able to train an accurate probe, it suggests that a representation of the feature is encoded in the network’s activations. We take the autoregressive features $x_t^l$ that summarize the partial sequence $y_{\leq t}$ as the input to the probe and study results from different layers $l$. The output $p_\theta{(x_t^l)}$ is a 3-way categorical probability distribution.
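For intuition, a nonlinear probe of this kind can be as small as a per-tile MLP trained on the model’s activations; this sketch assumes PyTorch, 512-dimensional features, and a 3-way tile-state label, with all names chosen for illustration rather than taken from the paper’s code:

```python
import torch.nn as nn

class TileProbe(nn.Module):
    """Nonlinear probe: maps a 512-d activation to a 3-way tile-state distribution."""

    def __init__(self, d_model=512, hidden=128, num_states=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),                      # the nonlinearity is what linear probes lack
            nn.Linear(hidden, num_states),
        )

    def forward(self, activation):
        # activation: [batch, d_model] internal features x_t^l from the GPT model
        return self.net(activation).log_softmax(dim=-1)
```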
Interventional Experiments to Probe Causality: By modifying internal activations to represent counterfactual board states, the authors demonstrate that the model’s internal representations causally influence its move predictions. This confirms that the learned world model is not just a statistical artifact but plays an active role in decision-making.
Figure (Li et al.): (A) An intervention is applied to a single board tile (e.g., E6), flipping its state from white to black. (B) Four snapshots illustrate the effect on the model’s internal world state and predictions: the lower left shows the pre-intervention board, with the upper left displaying its corresponding legal-move probability distribution; after the intervention, the lower right shows the updated board with E6 flipped, and the upper right the revised legal-move probabilities. (C) A schematic of the intervention process: starting from a chosen layer, activations at the last token are modified (dark blue) via gradient descent and then recomputed layer by layer (light blue indicates unmodified activations) until the final prediction is made.
Latent Saliency Maps for Interpretability: The authors introduce latent saliency maps, which visualize how the model’s internal board representation affects its predictions. Comparing a model trained on synthetic random games with one trained on championship games reveals differences in decision-making: the former attends strictly to move legality, while the latter incorporates strategic play patterns.
Figure (Li et al.): Latent saliency maps. Each subplot shows a different game state, with the model’s top-1 prediction enclosed in a black box. Colors (red high, blue low) indicate how much a tile’s state contributes to this prediction; the contribution is higher when changing the internal representation of that tile makes the prediction less likely.
Implications & Future Work
The results suggest that transformers trained on structured sequence data can develop implicit world models.
These findings may generalize to natural language models, providing insights into how LLMs internally model the world.
The paper proposes extending this approach to more complex games and applying similar interpretability techniques to natural language tasks.
This paper contributes to understanding how transformers can implicitly learn structured representations from sequential data. The use of Othello as a controlled environment allows for precise analysis of internal representations, providing valuable insights into the capabilities and limitations of deep learning models in building world models.
Putting It All Together
Below is a rough outline of how you might structure an Othello experiment that progresses from basic heuristics to advanced self-play:
Phase 0: Baseline/Heuristic
Implement a quick heuristic agent (e.g., corners > edges > center, avoid “X-squares”).
Evaluate vs. a random player to establish a performance baseline.
Phase 1: Supervised Bootstrapping
Collect ~10,000 expert-level Othello positions (opening to mid-game).
Train a small neural network to predict the expert’s moves from each position.
Phase 2: Self-Play Reinforcement Learning (AlphaZero-Style)
Initialize the policy/value network with weights from Phase 1.
Conduct self-play with MCTS, storing (state, search-based policy, result) in a replay buffer.
Periodically train on batches from the replay buffer, aiming to match the MCTS policy and the final game outcomes.
Phase 3: Reward Shaping & Curriculum
In early training iterations, give additional positive feedback for moves that:
Increase mobility
Capture stable discs in corners
Gradually reduce these shaping terms so the model eventually learns the unshaped objective (winning the game).
Phase 4: Evaluate and Refine
Pit your self-play agent vs. known heuristic agents and measure win rates.
Adjust MCTS search budget, network architecture, or incorporate a small opening book.
For endgame, you might use a shallower or deeper search specialized for forced sequences.
Beyond Othello
Once you have a stable pipeline, test how quickly it can adapt to different board sizes (6×6 vs. 10×10) or slightly modified rules.
This tests “generalization” – a core element of broader intelligence.
Key Takeaways
Othello’s Complexity
Rules are simple, but strategic depth (especially corners, mobility, parity) is significant.
A purely naive approach fails quickly in mid- to high-level games.
Self-Play Convergence
Hundreds of thousands of self-play games with MCTS + neural networks can produce near-expert performance.
For Othello, that can be done in a fraction of the compute required by more complex games like Go.
Importance of Good Reward Design
A naive disc-count reward can trap the agent in locally optimal but strategically weak play.
Combine final-outcome reinforcement with intermediate strategic signals.
Data Augmentation & Transfer
Expert game logs + self-play yield faster learning than self-play alone.
Phase-based search or “tablebases” for endgame can drastically improve final accuracy.
In short, using Othello as a testbed can illuminate many aspects of what we often call “game intelligence”: strategic reasoning, planning under time constraints, learning from self-play, and transferring knowledge to variations of the same game. All of these are stepping stones that echo fundamental challenges on the path to more general forms of intelligence.