SPRT: Sequential Probability Ratio Test in Chess
Definition
SPRT stands for Sequential Probability Ratio Test. It is a statistical method, originally developed by Abraham Wald in the 1940s, that decides between two hypotheses by evaluating results as they come in, rather than fixing a sample size in advance. In computer chess, SPRT is used to determine whether a new engine version (or parameter change) is measurably stronger than a baseline, using no more games than necessary.
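To make the early-stopping idea concrete, here is a minimal Python sketch of the general decision rule: observations arrive one at a time, a running log-likelihood ratio accumulates the evidence, and the test stops as soon as that statistic leaves the "keep sampling" band. The boundary values shown are illustrative placeholders, not the settings of any particular framework.

    def sequential_test(observations, llr_increment, lower=-2.9, upper=2.9):
        """Generic SPRT skeleton: stop as soon as the accumulated evidence
        (a log-likelihood ratio) crosses either decision boundary."""
        evidence = 0.0
        count = 0
        for obs in observations:
            count += 1
            evidence += llr_increment(obs)   # log P(obs | H1) - log P(obs | H0)
            if evidence >= upper:
                return "accept H1", count
            if evidence <= lower:
                return "accept H0", count
        return "undecided", count            # data ran out before a boundary was reached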
How It Is Used in Chess
Modern chess engines are improved through rapid A/B testing: a candidate build (B) is played against a reference build (A). SPRT monitors the stream of game results and stops early once there is enough statistical evidence to:
- Accept the candidate (e.g., it is likely stronger than the baseline by at least a target Elo amount), or
- Reject the candidate (no significant improvement detected, or it is weaker).
This process is widely used by open-source and research projects (e.g., large community testing frameworks for top engines). It enables very fast iteration cycles, turning thousands of small code changes into consistent strength gains over time.
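Because these decisions are phrased in Elo, it helps to keep in mind how a small Elo edge translates into an expected score under the standard logistic Elo relation. The tiny helper below shows why even a genuine +5 Elo gain is hard to see without many games.

    def expected_score(elo_diff):
        """Expected score (win = 1, draw = 0.5) for a player rated elo_diff
        Elo points above the opponent, under the standard logistic Elo model."""
        return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

    # A +5 Elo edge is only about a 50.7% expected score, which is why
    # detecting it reliably takes hundreds or thousands of games.
    print(expected_score(5.0))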
Key Concepts and Parameters
- Hypotheses (H0 and H1): Typically framed in Elo terms. H0 (null) might be “no improvement” (0 Elo), while H1 (alternative) is “a meaningful gain” (for example +3 to +10 Elo). Exact targets vary by project and time control.
- Error rates: Alpha (type I error: passing a patch that is not actually an improvement) and Beta (type II error: failing a patch that actually meets the target) bound the risk of a wrong decision; the sketch after this list shows how they translate into stopping boundaries. Common settings are around 5–10% but vary by framework and phase of testing.
- LLR (Log-Likelihood Ratio): The running statistic that compares how likely the observed results are under H1 versus H0. When the LLR crosses the “accept” boundary, the test passes; when it crosses the “reject” boundary, the test fails.
- Truncated SPRT: Many frameworks set a maximum game cap. If neither boundary is reached by then, a fallback decision (often “fail” or “inconclusive”) is made to guard against indefinite runs.
- Design choices that matter: Opening book selection, pairing method, draw adjudication, time control (e.g., short vs. long time control), and engine options (Syzygy tablebases, NNUE networks, hash size) all affect variance and sensitivity.
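To illustrate how the error rates, Elo hypotheses, and LLR fit together, here is a small Python sketch: Wald's classic approximation turns alpha and beta into two stopping boundaries, and a win/draw/loss model turns each Elo hypothesis into per-game probabilities. The trinomial "BayesElo-style" draw model and the draw_elo value below are simplifying assumptions for illustration; production frameworks typically use more refined models (for example, pentanomial statistics over paired games).

    import math

    def sprt_bounds(alpha, beta):
        """Wald's approximate stopping boundaries for the LLR.
        alpha: risk of passing a patch that is not an improvement (type I error)
        beta:  risk of failing a patch that meets the target (type II error)"""
        lower = math.log(beta / (1.0 - alpha))   # crossing this rejects the patch
        upper = math.log((1.0 - beta) / alpha)   # crossing this accepts the patch
        return lower, upper

    def wdl_probs(elo, draw_elo=240.0):
        """Win/draw/loss probabilities under a simple BayesElo-style draw model;
        draw_elo controls how draw-prone the test conditions are."""
        p_win = 1.0 / (1.0 + 10.0 ** ((-elo + draw_elo) / 400.0))
        p_loss = 1.0 / (1.0 + 10.0 ** ((elo + draw_elo) / 400.0))
        return p_win, 1.0 - p_win - p_loss, p_loss

    def llr(wins, draws, losses, elo0, elo1, draw_elo=240.0):
        """Log-likelihood ratio of the observed counts under H1 (elo1) vs. H0 (elo0)."""
        w0, d0, l0 = wdl_probs(elo0, draw_elo)
        w1, d1, l1 = wdl_probs(elo1, draw_elo)
        return (wins * math.log(w1 / w0)
                + draws * math.log(d1 / d0)
                + losses * math.log(l1 / l0))

    # With alpha = 0.05 and beta = 0.10, the boundaries are roughly -2.25 and +2.89.
    print(sprt_bounds(0.05, 0.10))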
Practical Usage Workflow
- Define targets and risks: choose H0 (e.g., 0 Elo) and H1 (e.g., +5 Elo), along with acceptable alpha and beta (for example, 0.05 and 0.10).
- Set testing conditions: time control, hardware, concurrency, fixed opening pairs (often diverse, balanced suites), and adjudication rules covering the 50-move rule, repetitions, and tablebase endings.
- Run A vs. B games, recording results as wins, draws, and losses from a consistent perspective (e.g., candidate’s perspective).
- Update the LLR after each result (the loop sketch after this list puts these steps together). If the LLR crosses the upper boundary, accept B (it likely meets the Elo target); if it crosses the lower boundary, reject B.
- If neither boundary is crossed by the truncation point, make the predefined fallback decision (often “reject” or “needs more testing”).
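A minimal version of that loop, reusing sprt_bounds and llr from the earlier sketch; the result encoding, the 40,000-game cap, and the "inconclusive" fallback are assumptions for illustration:

    def run_sprt(results, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.10,
                 max_games=40_000, draw_elo=240.0):
        """Update the LLR after every game and stop at the first boundary crossed.
        `results` yields 1.0 (candidate win), 0.5 (draw) or 0.0 (candidate loss)."""
        lower, upper = sprt_bounds(alpha, beta)
        wins = draws = losses = 0
        for games, result in enumerate(results, start=1):
            if result == 1.0:
                wins += 1
            elif result == 0.5:
                draws += 1
            else:
                losses += 1
            stat = llr(wins, draws, losses, elo0, elo1, draw_elo)
            if stat >= upper:
                return "accept", games        # candidate likely meets the Elo target
            if stat <= lower:
                return "reject", games        # no evidence of the targeted gain
            if games >= max_games:
                return "inconclusive", games  # truncated run: apply the fallback policy
        return "inconclusive", wins + draws + losses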
Example
Suppose you test a candidate Engine B against baseline Engine A with H0 = 0 Elo and H1 = +5 Elo, alpha = 0.05, beta = 0.10. You use a balanced suite of opening positions at a rapid time control.
- After 400 games, results might be: 88 wins, 52 losses, 260 draws for B (a 54.5% score). The running LLR is positive but not yet above the accept boundary, so the test continues.
- After 700 games, the score stabilizes around 55.5%. The LLR crosses the accept boundary. The test stops and B is accepted as likely meeting or exceeding +5 Elo under these conditions.
- On another day, a different patch might reach 1,200 games with a 51% score and never approach the accept line; the LLR eventually drops below the reject boundary, and the patch fails.
Note: The exact LLR values and boundaries depend on the chosen model for wins/draws/losses and the framework’s implementation. The numbers above are illustrative.
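To make that concrete with the toy model from the earlier sketches (an assumption, not the model any particular framework uses), the 400-game tally above gives an LLR of roughly +0.8, well inside the decision band for alpha = 0.05 and beta = 0.10, so the run would indeed continue:

    lower, upper = sprt_bounds(0.05, 0.10)            # about -2.25 and +2.89
    stat = llr(wins=88, draws=260, losses=52, elo0=0.0, elo1=5.0)
    print(f"{lower:.2f} < {stat:.2f} < {upper:.2f}")  # roughly -2.25 < 0.78 < 2.89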
Strategic and Historical Significance
SPRT is one of the core reasons engine projects can move fast without sacrificing rigor. Its early-stopping property saves enormous compute compared to fixed-length tests, enabling:
- Rapid iteration on code changes, NNUE nets, and parameters.
- Efficient sifting of many small improvements, where most patches are neutral or slightly negative.
- Consistent, reproducible standards across large, distributed testing communities.
Historically, Wald’s SPRT was designed for industrial quality control and later adapted widely. In computer chess, it became a cornerstone of large-scale frameworks that have pushed top engines to superhuman strength.
Best Practices and Pitfalls
- Use diverse, balanced opening books to reduce bias and improve generality.
- Pair games by color and opening to control for first-move advantage and outlier lines.
- Be consistent with hardware and time controls; changes alter variance and Elo sensitivity.
- Monitor draw rates: very high draw rates can slow decision-making; curated openings can help.
- Beware of overfitting to a specific opening set, time control, or test harness.
- Confirm borderline passes at longer time controls or with stricter settings before merging.
Interesting Facts and Anecdotes
- Wald’s SPRT was praised for saving time and resources in WWII-era testing—exactly the advantage it brings to engine development today.
- Large community projects often stage tests: a quick “short TC” SPRT screen filters patches, and promising ones proceed to “long TC” SPRT for deeper validation.
- Even tiny Elo targets (+1 to +5) compound over thousands of accepted patches, contributing significantly to the relentless rise in engine strength.