Polymarket Research Toolkit

Trading Research

Prediction Markets

Backtesting

Research-first scraping + walk-forward backtester for Polymarket. Six strategies, deflated Sharpe ratios, and conservative cost models. Designed to fail loudly when no real edge exists — and it does.

Published

March 30, 2026

In plain English

Polymarket is a website where people bet real money on real-world questions: “Will Trump win the 2024 election?”, “Will Bitcoin be above $100k by year-end?”, “Will the Fed cut rates next meeting?” Each question has two sides — YES and NO — and the prices fluctuate between $0 and $1 based on what the market thinks the probability is.

If a market is mispriced (for example, NO is still trading at $0.10 even though the event has looked almost certain for weeks), there’s potential profit in buying the underpriced side. The question is: are these mispricings real, persistent, and tradeable after fees? Or do they look real in a backtest because the backtester is lying to you?

This project is a toolkit for answering that honestly. It does three things in order:

Scrapes every public number Polymarket exposes: every market, every historical price tick, every order book snapshot. It also pulls Kalshi (a US-regulated competitor) for cross-venue comparison.
Tests trading ideas against that historical record with a backtester deliberately designed to fail when no real edge exists.
Scans live for the few signals that survive the test, so they can actually be traded.

The interesting findings turned out to be negative — the most promising-looking strategy collapsed when tested honestly, for a specific data-quality reason explained below. That’s the project working as intended.

Anti-overfit methodology

Every result is structured to fail loudly when no real edge exists:

Walk-forward only. Strategies see prefixes of price series, never the future.
Discovery / test split at the universe level — the calibration strategy is fit on the first half of resolved markets and scored on the second.
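Deflated Sharpe. When you test N strategies, the best-of-N Sharpe is inflated by selection. Deflate it for the N trials before claiming anything; the deflated Sharpe ratio of Bailey & López de Prado also accounts for sample length and non-normal returns.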
Conservative cost model. 1% taker fee + 0.5% half-spread per leg.
Trade-count floor. Anything with fewer than 100 holdout trades is reported as “no signal yet,” not as a result.
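
To make the deflation and the trade-count floor concrete, here is a minimal sketch of the Bailey & López de Prado deflated Sharpe ratio using only the standard library. It is an illustration, not the toolkit's actual code: the function names and `MIN_HOLDOUT_TRADES` are hypothetical, returns are assumed to be per-trade and already net of costs, and `trial_sharpes` is assumed to hold the Sharpe of every strategy variant that was tried.

```python
import math
from statistics import NormalDist, mean, pstdev, stdev

EULER_GAMMA = 0.5772156649015329
MIN_HOLDOUT_TRADES = 100  # trade-count floor (hypothetical constant name)

def sharpe(returns):
    """Per-trade (non-annualized) Sharpe ratio of net returns."""
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0

def probabilistic_sharpe(returns, benchmark):
    """P(true Sharpe > benchmark), adjusted for skew, kurtosis and sample size."""
    t, m, s = len(returns), mean(returns), pstdev(returns)
    if s == 0:
        return None
    skew = sum((r - m) ** 3 for r in returns) / (t * s ** 3)
    kurt = sum((r - m) ** 4 for r in returns) / (t * s ** 4)  # normal = 3
    sr = sharpe(returns)
    denom = 1 - skew * sr + (kurt - 1) / 4 * sr ** 2
    if denom <= 0:
        return None
    z = (sr - benchmark) * math.sqrt(t - 1) / math.sqrt(denom)
    return NormalDist().cdf(z)

def expected_max_sharpe(trial_sharpes):
    """Expected best Sharpe among N trials when none has a true edge."""
    n = len(trial_sharpes)
    z = NormalDist()
    return stdev(trial_sharpes) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n * math.e))
    )

def deflated_sharpe(returns, trial_sharpes):
    """Deflated Sharpe ratio, or None meaning 'no signal yet'."""
    if len(returns) < MIN_HOLDOUT_TRADES or len(trial_sharpes) < 2:
        return None
    return probabilistic_sharpe(returns, expected_max_sharpe(trial_sharpes))
```

The benchmark the winner has to beat is not zero but the Sharpe you would expect from the luckiest of N edgeless strategies, so the more variants you try (and the more their Sharpes disagree), the harder it gets to clear.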

Strategy suite

Strategy	Hypothesis
`extreme_price_decay`	Buy NO when YES collapses near close — fade late confidence
`favorite_hold`	Buy YES when YES is persistently ≥ 0.95 near close
`longshot_bias`	Short the longshot — buy NO at 0.85–0.95
`complementary_arb`	YES + NO < $1 — needs the live book
`mean_reversion`	Fade single-bar 10c spikes mid-life
`calibration_edge`	Data-driven, fit on first half of universe only
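
As an illustration of how one row of this table can be run walk-forward under the cost model, here is a hedged sketch of `favorite_hold`. Everything beyond the strategy name and the 0.95 threshold is an assumption made for the example: the `Market` type, the `lookback` and `near_close` parameters, filling on the bar after the signal, and charging the 1.5% per-leg cost on the entry price only (the position is held to resolution, so no exit leg is modeled).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

TAKER_FEE = 0.01      # 1% taker fee per leg
HALF_SPREAD = 0.005   # 0.5% half-spread per leg
COST_PER_LEG = TAKER_FEE + HALF_SPREAD

@dataclass
class Market:
    yes_prices: Sequence[float]  # hourly YES bars, oldest first
    resolved_yes: bool           # final outcome, never shown to the strategy

def favorite_hold(prefix: Sequence[float], bars_left: int,
                  lookback: int = 6, threshold: float = 0.95,
                  near_close: int = 24) -> bool:
    """Buy YES if it has stayed >= threshold for `lookback` bars near close."""
    if bars_left > near_close or len(prefix) < lookback:
        return False
    return all(p >= threshold for p in prefix[-lookback:])

Signal = Callable[[Sequence[float], int], bool]

def walk_forward(market: Market, signal: Signal) -> Optional[float]:
    """Per-share PnL of the first entry, or None if the signal never fires."""
    prices = market.yes_prices
    for t in range(1, len(prices)):
        # The strategy only ever sees the prefix prices[:t].
        if signal(prices[:t], len(prices) - t):
            entry = prices[t] * (1 + COST_PER_LEG)  # fill on the next bar
            payout = 1.0 if market.resolved_yes else 0.0
            return payout - entry
    return None
```

Under these assumed costs the margin is thin by construction: a YES filled at 0.96 costs about 0.974 all-in, and a YES filled at 0.99 costs just over $1, so that trade loses even when it resolves YES.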

Honest empirical findings

`complementary_arb` looked great in train and collapsed in test. Investigation showed the training “edge” was a forward-fill artifact: bar-resolution price history shows YES + NO summing to anything between 0.5 and 1.7, because the two legs’ prints don’t share timestamps. After bucketing to the hour and inner-joining, real imbalances ≤ 2c essentially never appear in bar data. The arb strategy can only work against the live book. The artifact was found because the test split was frozen.
Calibration analysis at the 24h horizon shows the 0–10% YES band actually resolves YES ~11% of the time (vs. 2.4% priced) — enough sample to be suggestive, not enough to bet on. Watch this band as more data accumulates.
Bar-data limitations. Hourly bars are too coarse for any real microstructure work; live websocket feeds are needed for liquidity / spread strategies.
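
The forward-fill artifact in the first finding is easy to reproduce on toy data. The sketch below uses made-up prints and hypothetical helper names; it is not the project's pipeline, only a demonstration of why carrying a stale leg forward manufactures a YES + NO gap that an hourly inner join does not show.

```python
HOUR = 3600

def hourly_last(prints):
    """Map hour bucket -> last price printed in that hour.

    `prints` is a list of (unix_timestamp, price) tuples.
    """
    buckets = {}
    for ts, price in sorted(prints):
        buckets[ts // HOUR] = price
    return buckets

def forward_filled_sums(yes_prints, no_prints):
    """Naive approach: carry each leg's last known price across every hour."""
    yes, no = hourly_last(yes_prints), hourly_last(no_prints)
    first = min(min(yes), min(no))
    last = max(max(yes), max(no))
    sums, last_yes, last_no = {}, None, None
    for hour in range(first, last + 1):
        last_yes = yes.get(hour, last_yes)
        last_no = no.get(hour, last_no)
        if last_yes is not None and last_no is not None:
            sums[hour] = last_yes + last_no
    return sums

def joined_sums(yes_prints, no_prints):
    """Inner join: keep only hours in which both legs actually printed."""
    yes, no = hourly_last(yes_prints), hourly_last(no_prints)
    return {h: yes[h] + no[h] for h in sorted(yes.keys() & no.keys())}

# Synthetic example: YES trades every hour while it falls, NO prints twice.
yes_prints = [(0 * HOUR + 60, 0.80), (1 * HOUR + 120, 0.65),
              (2 * HOUR + 30, 0.50), (3 * HOUR + 600, 0.40)]
no_prints = [(0 * HOUR + 900, 0.20), (3 * HOUR + 1800, 0.60)]

filled = forward_filled_sums(yes_prints, no_prints)
joined = joined_sums(yes_prints, no_prints)
```

In this toy series the forward-filled sums dip well below $1 in the hours where NO did not trade, which looks like free money to a backtester. The inner join keeps only the two hours where both legs printed, and in both the sum is back at $1.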

Stack

requests + retry/rate-limit aware HTTP client; SQLite for markets / prices / books
Walk-forward engine with deflated Sharpe; reliability tables and Brier / log loss for calibration
Live-scan loop for complementary-pair edges
Six sprint reports + a research memo documenting the dead ends as carefully as the live ones
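
For readers unfamiliar with the calibration tooling named above, here is a small standard-library sketch of a reliability table plus Brier score and log loss. The function names, the ten equal-width bins, and the clipping constant are assumptions for illustration; the inputs are the market-implied YES probability at some horizon and the 0/1 resolution.

```python
import math

def reliability_table(probs, outcomes, n_bins=10):
    """Rows of (bin_low, bin_high, count, mean_priced, realized_rate)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bands rather than report 0/0
        count = len(members)
        mean_priced = sum(p for p, _ in members) / count
        realized = sum(y for _, y in members) / count
        rows.append((i / n_bins, (i + 1) / n_bins, count, mean_priced, realized))
    return rows

def brier_score(probs, outcomes):
    """Mean squared error between priced probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-6):
    """Mean negative log-likelihood, with prices clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(probs)
```

A band is interesting when `realized_rate` sits far from `mean_priced` and `count` is large; with a small count the gap is only suggestive, which is the situation the findings describe for the 0–10% band.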

What it demonstrates

Treating a backtest as a hypothesis test, not a marketing screenshot
The discipline of letting your own strategies fail
Microstructure thinking: knowing the difference between bar data and the book