<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Ryan Tolone</title>
<link>https://ryan-tolone.com/projects/</link>
<atom:link href="https://ryan-tolone.com/projects/index.xml" rel="self" type="application/rss+xml"/>
<description>Research and engineering across ML, causal inference, poker AI, and trading.</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Sat, 25 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Deep CFR for 5-Card PLO Heads-Up</title>
  <link>https://ryan-tolone.com/projects/plo5-deepcfr/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p><strong>PLO</strong> stands for <em>Pot-Limit Omaha</em>, a poker variant similar to Texas Hold’em except each player gets four hole cards instead of two and must use exactly two of them combined with three from the board. <strong>5-card PLO</strong> is the same thing but with <em>five</em> hole cards. PLO is the most popular high-stakes cash game variant outside of no-limit Hold’em, and the 5-card version is famously the most mathematically complex form of poker — there are far more possible hands and the equities between hands run much closer, so the strategy involves much more nuance.</p>
<p>This project teaches a neural network to play heads-up 5-card PLO at near-equilibrium strength, using the same Deep CFR algorithm I built up to in the <a href="../hunl-deepcfr/">HUNL project</a> (the no-limit Hold’em version). PLO5 has <em>no published Deep CFR work</em> — even Pluribus and Libratus stop at Hold’em — so this is original research-grade implementation, not a port of someone else’s code.</p>
<p>The headline isn’t a final exploitability number (the training is still running and won’t finish for weeks of compute). The headline is <strong>engineering</strong>: I diagnosed a 12–25× performance gap between what I had and what’s needed to finish in reasonable time, applied a series of optimizations that delivered a 58× speedup on the dominant cost, and then <em>honestly reported</em> that the encoder is no longer the bottleneck and the remaining work has to come from a different lever (multi-process traversal). It’s the kind of profile-driven optimization story that’s more useful than a glossy result, because it shows what scaling really looks like.</p>
</section>
<section id="why-plo5" class="level2">
<h2 class="anchored" data-anchor-id="why-plo5">Why PLO5</h2>
<p>Heads-up no-limit Hold’em was covered in Stage 4. 4-card PLO is in the literature. <strong>5-card PLO</strong> — five hole cards, choose-2 in your hand combined with choose-3 from the board — has <em>no published Deep CFR work</em> and adds two non-trivial complications over PLO4:</p>
<ol type="1">
<li><strong>Hand-evaluator combinatorics</strong>: <code>C(5,2) × C(5,3) = 100</code> two-and-three combos per (hand, board) pair. The PLO5 evaluator (<code>plo5_evaluator.py</code>) computes all 100 and takes the best — this is not a <code>phevaluator</code> call, it’s its own thing; a minimal sketch follows this list.</li>
<li><strong>Encoder cost</strong>: 5-card hands push state-feature dimensionality up, and naive equity computation per state explodes.</li>
</ol>
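<p>For concreteness, here is a minimal sketch of the choose-2 × choose-3 evaluation from point 1 — <code>rank5</code> is a hypothetical stand-in for the project’s actual 5-card ranker, and lower-rank-is-stronger is an assumption of this sketch, not necessarily how <code>plo5_evaluator.py</code> orders hands:</p>
<pre><code># Sketch only: exhaustive C(5,2) x C(5,3) = 100 combo evaluation.
# rank5() is a hypothetical 5-card ranker; lower value = stronger hand here.
from itertools import combinations

def plo5_best_rank(hole, board, rank5):
    """hole: 5 hole cards, board: 5 board cards; use exactly 2 + 3."""
    best = None
    for h2 in combinations(hole, 2):          # C(5,2) = 10
        for b3 in combinations(board, 3):     # C(5,3) = 10
            r = rank5(h2 + b3)                # 100 evaluations per (hand, board)
            if best is None or r &lt; best:
                best = r
    return best</code></pre>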
<p>That second complication is what dominated Stage 6 engineering.</p>
</section>
<section id="the-encoder-optimization-that-mattered" class="level2">
<h2 class="anchored" data-anchor-id="the-encoder-optimization-that-mattered">The encoder optimization that mattered</h2>
<p>In the K=500 smoke profile, the original encoder ran at <strong>2,620 µs / encode</strong>. A K=10,000 traversal at 300 iters would have taken <strong>100 days</strong>. So I rebuilt the hot path:</p>
<ul>
<li><strong>Opp-value cache</strong> keyed by canonical board → distribution over opponent ranges; populated lazily, persists across the iter (sketched after this list)</li>
<li><strong>Numpy vectorization</strong> of the per-combo equity rollouts</li>
<li><strong>Classify memoization</strong> of board-class lookups (flush / straight / paired structures)</li>
</ul>
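<p>A minimal sketch of how the cache and memoization could be wired together — <code>canonical_fn</code>, <code>compute_fn</code>, and the placeholder <code>classify_board</code> body are hypothetical stand-ins, not the encoder’s actual API:</p>
<pre><code># Sketch only: lazily populated opp-value cache + memoized board classes.
from functools import lru_cache
import numpy as np

class OppValueCache:
    """Keyed by canonical board; populated lazily; persists across one iter."""
    def __init__(self, canonical_fn, compute_fn):
        self.canonical_fn = canonical_fn   # board to hashable canonical key
        self.compute_fn = compute_fn       # (board, opp_range) to np.ndarray
        self.table, self.hits, self.misses = {}, 0, 0

    def values(self, board, opp_range):
        key = self.canonical_fn(board)
        if key in self.table:
            self.hits += 1
        else:
            self.misses += 1
            # one vectorized equity rollout covers the whole opponent range
            self.table[key] = self.compute_fn(board, np.asarray(opp_range))
        return self.table[key]

@lru_cache(maxsize=None)
def classify_board(canonical_key):
    """Memoized board-class lookup (flush / straight / paired structure).
    Placeholder body; the real classifier lives in the encoder."""
    return canonical_key</code></pre>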
<p>K=500 diagnostic results, post-optimization:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th></th>
<th style="text-align: right;">µs / encode</th>
<th>cache hits / misses</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Cold cache (warm-up pass)</td>
<td style="text-align: right;">79.0</td>
<td>525 / 367</td>
</tr>
<tr class="even">
<td>Warm cache (steady state)</td>
<td style="text-align: right;"><strong>44.8</strong></td>
<td>892 / 0</td>
</tr>
<tr class="odd">
<td>Pre-optimization baseline</td>
<td style="text-align: right;">2,620</td>
<td>n/a</td>
</tr>
</tbody>
</table>
<p><strong>58× steady-state encoder speedup</strong>, <strong>99.39% opp-value cache hit rate</strong> across a 1.46M-decision-point traverse, GPU forwards down to <strong>0.8% of traverse wall</strong>. Cache works.</p>
</section>
<section id="the-honest-readout-encoder-is-no-longer-the-bottleneck" class="level2">
<h2 class="anchored" data-anchor-id="the-honest-readout-encoder-is-no-longer-the-bottleneck">The honest readout: encoder is no longer the bottleneck</h2>
<p>After the optimization, a single K=500 traverse pass took 703.7 s for 1,461,675 decision points = <strong>481 µs / query</strong>. Of that, the encoder is ~45 µs (warm). The other <strong>~436 µs / state is Python overhead in the traversal loop itself</strong>:</p>
<ul>
<li>generator <code>yield</code>/<code>send</code>/stack save-restore across a 2.3M-call hot path</li>
<li>frozen-dataclass <code>PLO5State</code> reconstruction in <code>apply_action</code></li>
<li><code>legal_action_mask</code> recomputation</li>
<li>reservoir-buffer writes for traverser nodes</li>
<li>numpy mask/sigma allocations in the inner loop</li>
</ul>
<p><strong>GPU is essentially idle during traversal (0.8% of wall in forwards).</strong> Adding bigger nets or larger batches would barely move iter wall.</p>
</section>
<section id="component-level-iter-budget-at-k500" class="level2">
<h2 class="anchored" data-anchor-id="component-level-iter-budget-at-k500">Component-level iter budget at K=500</h2>
<table class="caption-top table">
<thead>
<tr class="header">
<th>component</th>
<th style="text-align: right;">wall</th>
<th style="text-align: right;">%</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>traverse × 2 players</td>
<td style="text-align: right;">1,407.4 s</td>
<td style="text-align: right;"><strong>98.0%</strong></td>
</tr>
<tr class="even">
<td>aux rollouts × 2,000</td>
<td style="text-align: right;">1.5 s</td>
<td style="text-align: right;">0.1%</td>
</tr>
<tr class="odd">
<td>V rollouts × 2,000</td>
<td style="text-align: right;">1.4 s</td>
<td style="text-align: right;">0.1%</td>
</tr>
<tr class="even">
<td>train R × 800 (× 2)</td>
<td style="text-align: right;">9.6 s</td>
<td style="text-align: right;">0.7%</td>
</tr>
<tr class="odd">
<td>train S × 1,000 (× 2)</td>
<td style="text-align: right;">11.8 s</td>
<td style="text-align: right;">0.8%</td>
</tr>
<tr class="even">
<td>train V × 500 (× 2)</td>
<td style="text-align: right;">3.2 s</td>
<td style="text-align: right;">0.2%</td>
</tr>
<tr class="odd">
<td>TOTAL</td>
<td style="text-align: right;"><strong>23.9 min/iter</strong></td>
<td style="text-align: right;">100%</td>
</tr>
</tbody>
</table>
<p>Extrapolated to K=10,000 × 300 iters: ~<strong>100 days</strong> vs.&nbsp;the 4–8 day target. <strong>12–25× too slow.</strong> The encoder lever is exhausted.</p>
</section>
<section id="the-remaining-levers-ranked" class="level2">
<h2 class="anchored" data-anchor-id="the-remaining-levers-ranked">The remaining levers, ranked</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 23%">
<col style="width: 30%">
<col style="width: 23%">
<col style="width: 23%">
</colgroup>
<thead>
<tr class="header">
<th>Approach</th>
<th style="text-align: right;">Speedup</th>
<th>Cost</th>
<th>Risk</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>(A) Multi-process traversal (8 workers)</td>
<td style="text-align: right;">6–8×</td>
<td>1–2 days; needs IPC for state batching</td>
<td>Memory amplification; pickling cost</td>
</tr>
<tr class="even">
<td>(B) Reduce K from 10k → 2k</td>
<td style="text-align: right;">5×</td>
<td>trivial</td>
<td>Slower convergence per iter</td>
</tr>
<tr class="odd">
<td>(C) Reduce action grid 10 → 5–6 slots</td>
<td style="text-align: right;">2–3×</td>
<td>small game/encoder change</td>
<td>Loses pot-fraction granularity</td>
</tr>
<tr class="even">
<td>(D) Iterative explicit-stack traversal</td>
<td style="text-align: right;">1.5–2×</td>
<td>half-day refactor</td>
<td>Code complexity</td>
</tr>
</tbody>
</table>
<p>The current direction is (A) + (B) in tandem: 8-worker MP traversal at K=2,000.</p>
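<p>A minimal sketch of lever (A), assuming the existing single-process traverse loop can be wrapped as a picklable <code>run_traversals(seed, k)</code> that returns its buffer samples — the pickling cost flagged in the table above is exactly this return path:</p>
<pre><code># Sketch only: split the K traversals per iteration across worker processes.
from multiprocessing import Pool

def traverse_parallel(run_traversals, k_total=2000, n_workers=8, base_seed=0):
    per_worker = k_total // n_workers
    jobs = [(base_seed + w, per_worker) for w in range(n_workers)]
    with Pool(processes=n_workers) as pool:
        chunks = pool.starmap(run_traversals, jobs)   # one result chunk per worker
    # merge worker-local samples back into the main reservoir buffers
    return [sample for chunk in chunks for sample in chunk]</code></pre>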
</section>
<section id="equity-pretraining-for-warm-starts" class="level2">
<h2 class="anchored" data-anchor-id="equity-pretraining-for-warm-starts">Equity pretraining for warm starts</h2>
<p>Stage 6d also pretrains the V/aux head on a 50k-state equity dataset (<code>stage6d_equity_dataset.npz</code>) generated by Monte-Carlo rollouts on canonical (hand, board) pairs. The pretrained checkpoint (<code>stage6d_equity_pretrained.pt</code>) lets the v3 ensemble start with non-random equity priors instead of bootstrapping them in the first 20 iters. The “BB60” variant in the v3 logs is this pretrained-init run.</p>
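<p>A minimal sketch of the warm start, assuming PyTorch and hypothetical array names (<code>features</code>, <code>equity</code>) inside the <code>.npz</code> — the real dataset layout and V/aux architecture may differ:</p>
<pre><code># Sketch only: supervised equity pretraining for the V/aux head.
import numpy as np
import torch

data = np.load("stage6d_equity_dataset.npz")
x = torch.tensor(data["features"], dtype=torch.float32)
y = torch.tensor(data["equity"], dtype=torch.float32).unsqueeze(1)

v_net = torch.nn.Sequential(
    torch.nn.Linear(x.shape[1], 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 1),
)
opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x, y), batch_size=4096, shuffle=True)

for epoch in range(20):                       # a few passes over the 50k states
    for xb, yb in loader:
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(v_net(xb), yb)
        loss.backward()
        opt.step()

torch.save(v_net.state_dict(), "stage6d_equity_pretrained.pt")</code></pre>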
</section>
<section id="reference-benchmark-kuhn-poker-exploitability" class="level2">
<h2 class="anchored" data-anchor-id="reference-benchmark-kuhn-poker-exploitability">Reference benchmark — Kuhn poker exploitability</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/plo5-deepcfr/kuhn_exploitability.png" class="img-fluid figure-img"></p>
<figcaption>Kuhn poker — CFR+ vs.&nbsp;reference, exact convergence to ε ≈ 0.</figcaption>
</figure>
</div>
<p>Stage 0 sanity check from the bottom of the arc: the tabular CFR+ implementation reproduces the published Kuhn equilibrium exactly. Every later stage’s pipeline is built on this foundation; if the floor is wrong, the ceiling is decoration.</p>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Engineering a research project that can’t fit in RAM or wall-time at the obvious settings, and re-architecting until it does</li>
<li>Profile-first optimization: not “I optimized the encoder” but “<em>here’s the 58× the cache buys, here’s the 9% of iter wall that’s left, here’s why we now need MP</em>”</li>
<li>Knowing when to stop optimizing one lever and switch to another</li>
<li>Custom evaluator for a game with no off-the-shelf solver</li>
<li>Equity pretraining as a warm-start technique for neural CFR</li>
</ul>


</section>

 ]]></description>
  <category>Reinforcement Learning</category>
  <category>Game Theory</category>
  <category>Poker AI</category>
  <guid>https://ryan-tolone.com/projects/plo5-deepcfr/</guid>
  <pubDate>Sat, 25 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/plo5-deepcfr/kuhn_exploitability.png" medium="image" type="image/png" height="90" width="144"/>
</item>
<item>
  <title>Switchback Experiments on a Simulated Marketplace</title>
  <link>https://ryan-tolone.com/projects/switchback/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>Imagine Uber wants to test a small price change. The obvious experiment: flip a coin for each rider — half see the new price (treatment), half see the old price (control). After a few weeks, compare conversion rates. Whichever arm did better wins.</p>
<p><strong>This is wrong on a marketplace, and it’s wrong in a way that fools almost everyone.</strong> When a treated rider books, they tie up a driver — a driver that <em>would have</em> served the next control rider. Treatment doesn’t just affect the treatment group; it eats into the control group’s experience. The two arms aren’t independent. The conversion gap you measure is much larger than the actual effect of the change, because control’s number is artificially depressed.</p>
<p>I wanted to <em>prove</em> this with numbers, not just describe it. So I built a simulated rideshare marketplace where I knew the true effect of the price change exactly (because I picked it), then ran both the naive coin-flip experiment and the production-standard fix — a <strong>switchback design</strong>, where the entire marketplace flips between old and new prices in time blocks. Comparing them against ground truth shows the naive design is <strong>208% biased</strong> and switchback recovers the true effect within 11%.</p>
<p>Switchback experiments are how Uber, Lyft, DoorDash, Instacart, and Airbnb actually run pricing tests. They’re rarely covered in coursework. This project is the worked walkthrough.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/switchback/01_bias_vs_recovery.png" class="img-fluid figure-img"></p>
<figcaption>Bias vs.&nbsp;recovery: naive A/B vs.&nbsp;switchback (W = 30m / 120m).</figcaption>
</figure>
</div>
</section>
<section id="headline-result" class="level2">
<h2 class="anchored" data-anchor-id="headline-result">Headline result</h2>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Design</th>
<th>Mean τ̂</th>
<th>Bias</th>
<th>% of true τ</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Ground truth</td>
<td>+0.00796</td>
<td>—</td>
<td>—</td>
</tr>
<tr class="even">
<td>Naive A/B</td>
<td>+0.02449</td>
<td>+0.01653</td>
<td><strong>+208%</strong></td>
</tr>
<tr class="odd">
<td>Switchback (W = 30m)</td>
<td>+0.01051</td>
<td>+0.00256</td>
<td>+32%</td>
</tr>
<tr class="even">
<td>Switchback (W = 120m)</td>
<td>+0.00881</td>
<td>+0.00085</td>
<td>+11%</td>
</tr>
</tbody>
</table>
<p>200 Monte Carlo replicates per design; each replicate is a 30-day simulated marketplace at λ = 5/min, N = 22 drivers, mean trip = 15 min, baseline conversion 0.30, per-rider lift τ = 0.03.</p>
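<p>The mechanics of the switchback arm assignment, as a minimal sketch — the column names and the fair coin per window are assumptions of this sketch, not the simulator’s exact interface:</p>
<pre><code># Sketch only: whole-market switchback assignment + window-mean estimator.
import numpy as np
import pandas as pd

def switchback_estimate(riders: pd.DataFrame, w_minutes: float, seed: int = 0):
    """riders needs 'arrival_min' and 'converted'; the whole market flips per window."""
    rng = np.random.default_rng(seed)
    window = (riders["arrival_min"] // w_minutes).astype(int)
    arm = rng.integers(0, 2, size=window.max() + 1)   # 0 = control, 1 = treatment
    riders = riders.assign(window=window, treated=arm[window.to_numpy()])
    # window-level conversion means, then difference of averages across arms
    by_win = riders.groupby(["window", "treated"])["converted"].mean().reset_index()
    return (by_win.loc[by_win.treated == 1, "converted"].mean()
            - by_win.loc[by_win.treated == 0, "converted"].mean())</code></pre>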
</section>
<section id="two-estimands-only-one-of-which-matters" class="level2">
<h2 class="anchored" data-anchor-id="two-estimands-only-one-of-which-matters">Two estimands, only one of which matters</h2>
<p>The naive A/B isn’t merely noisy — it’s answering the wrong question. It estimates the <em>conditional effect on a treated rider holding supply at the mixed-arm operating point</em> (≈ 0.025), which lines up with the +0.0245 we observe. But the launch decision depends on the <em>equilibrium</em> effect: treatment for everyone vs.&nbsp;control for everyone, which is +0.008, because in an all-treatment world supply is more depleted than in the mixed world. The two estimands diverge whenever supply is finite.</p>
</section>
<section id="window-length-bias-variance-tradeoff" class="level2">
<h2 class="anchored" data-anchor-id="window-length-bias-variance-tradeoff">Window-length bias-variance tradeoff</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/switchback/02_window_tradeoff.png" class="img-fluid figure-img"></p>
<figcaption>RMSE-optimal W ≈ 8× mean trip duration.</figcaption>
</figure>
</div>
<table class="caption-top table">
<thead>
<tr class="header">
<th>W (min)</th>
<th>Bias</th>
<th>Std</th>
<th>RMSE</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>5</td>
<td>+0.00880</td>
<td>0.00196</td>
<td>0.00902</td>
</tr>
<tr class="even">
<td>30</td>
<td>+0.00243</td>
<td>0.00163</td>
<td>0.00292</td>
</tr>
<tr class="odd">
<td><strong>120</strong></td>
<td>+0.00085</td>
<td>0.00164</td>
<td><strong>0.00184</strong></td>
</tr>
<tr class="even">
<td>240</td>
<td>+0.00057</td>
<td>0.00190</td>
<td>0.00199</td>
</tr>
</tbody>
</table>
<p>Bias decays roughly geometrically in W; variance is roughly flat at this horizon. RMSE-optimal W ≈ 8× mean trip duration — that’s a generalizable heuristic, but the lesson is don’t import a fixed W from another company. Re-derive it at your own operating point.</p>
</section>
<section id="carryover-diagnostic" class="level2">
<h2 class="anchored" data-anchor-id="carryover-diagnostic">Carryover diagnostic</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/switchback/03_carryover.png" class="img-fluid figure-img"></p>
<figcaption>Within-window carryover by previous-arm.</figcaption>
</figure>
</div>
<p>Within each window I bin riders by position from window-start and compare conversion rates across the four <code>(this_arm, prev_arm)</code> regimes. Control windows that follow a treatment window have depressed conversion in the early bins — drivers are still busy from the prior treatment regime. The signal is small (~0.4pp) but consistent.</p>
</section>
<section id="variance-estimation" class="level2">
<h2 class="anchored" data-anchor-id="variance-estimation">Variance estimation</h2>
<p>For W = 30 on 30 simulated days (1,440 windows, 216k riders), three SE estimators agree closely — naive per-rider, cluster-robust by window, and a 2,000-rep block bootstrap. The within-window dependence is weak enough at this λ that cluster-robust ≈ i.i.d., but I implemented all three so the diagnostic exists.</p>
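<p>A minimal sketch of the three estimators, assuming a rider-level frame with <code>converted</code>, <code>treated</code> (0/1), and <code>window</code> columns — hypothetical names, not the project’s schema:</p>
<pre><code># Sketch only: i.i.d., cluster-robust-by-window, and block-bootstrap SEs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def standard_errors(riders: pd.DataFrame, n_boot=2000, seed=0):
    iid = smf.ols("converted ~ treated", data=riders).fit()
    clus = smf.ols("converted ~ treated", data=riders).fit(
        cov_type="cluster", cov_kwds={"groups": riders["window"]})
    # block bootstrap: resample whole windows to respect within-window dependence
    rng = np.random.default_rng(seed)
    blocks = {w: g for w, g in riders.groupby("window")}
    keys = np.array(list(blocks))
    taus = []
    for _ in range(n_boot):
        draw = pd.concat(blocks[k] for k in rng.choice(keys, size=len(keys)))
        m = draw.groupby("treated")["converted"].mean()
        taus.append(m[1] - m[0])
    return iid.bse["treated"], clus.bse["treated"], float(np.std(taus))</code></pre>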
</section>
<section id="power-analysis-the-practical-cost-of-doing-this-honestly" class="level2">
<h2 class="anchored" data-anchor-id="power-analysis-the-practical-cost-of-doing-this-honestly">Power analysis: the practical cost of doing this honestly</h2>
<p>Because the equilibrium τ is small (+0.008), even the right-design experiment is power-constrained. With switchback SE ≈ 0.0019 at W = 120, the 80%-power MDE at α = 0.05 is roughly <code>2.8 × SE ≈ 0.0053</code>. The true effect sits just above the MDE — a 30-day switchback would detect it but with limited margin. A practitioner should plan for <strong>6–8 weeks</strong> to shrink the SE further. Naive power calculations using the per-rider effect would tell you a few days suffices. They’d be wrong.</p>
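<p>The arithmetic behind that MDE, for reference — with α = 0.05 two-sided and 80% power the multiplier is z(0.975) + z(0.80) = 1.96 + 0.84 ≈ 2.8:</p>
<pre><code># Sketch only: minimum detectable effect from a standard error.
from scipy.stats import norm

def mde(se, alpha=0.05, power=0.80):
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se

print(mde(0.0019))   # ~0.0053, just under the +0.008 equilibrium effect</code></pre>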
</section>
<section id="sign-flip-symmetry" class="level2">
<h2 class="anchored" data-anchor-id="sign-flip-symmetry">Sign-flip symmetry</h2>
<p>If contamination is the real cause, flipping the sign of τ should flip the sign of the bias. It does:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>τ parameter</th>
<th>Equilibrium τ</th>
<th>Naive bias</th>
<th>Switchback bias</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>+0.030</td>
<td>+0.0075</td>
<td>+0.0167</td>
<td>+0.0009</td>
</tr>
<tr class="even">
<td>−0.030</td>
<td>−0.0114</td>
<td>−0.0148</td>
<td>−0.0002</td>
</tr>
</tbody>
</table>
<p>The asymmetry in equilibrium τ is itself interesting — a price <em>increase</em> produces a larger-magnitude equilibrium effect than a comparable price decrease, because freed-up supply partially offsets a lift but reinforces a depression. Marketplaces amplify negative effects and damp positive ones.</p>
</section>
<section id="supply-scaling-sanity-check" class="level2">
<h2 class="anchored" data-anchor-id="supply-scaling-sanity-check">Supply-scaling sanity check</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/switchback/04_supply_scaling.png" class="img-fluid figure-img"></p>
<figcaption>Bias vanishes when supply is abundant.</figcaption>
</figure>
</div>
<p>The contamination bias vanishes as N → ∞. Confirms the mechanism is supply-side, not anything else.</p>
</section>
<section id="stack" class="level2">
<h2 class="anchored" data-anchor-id="stack">Stack</h2>
<ul>
<li>Discrete-event simulator (event-driven, not time-stepped) — <code>numpy</code>, <code>scipy</code>, <code>statsmodels</code></li>
<li>Cluster-robust SEs by window, block bootstrap on dependent data</li>
<li>Phases: zero-effect validation → main A/B vs.&nbsp;switchback Monte Carlo → variance + carryover → window sweep → figures</li>
<li>End-to-end runtime ≈ 6 minutes single-core</li>
</ul>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Identifying bias from interference / SUTVA violation</li>
<li>Implementing a real production technique used at every marketplace company</li>
<li>Bias-variance tradeoffs in experimental design under dependent data</li>
<li>Reading a result honestly: the naive A/B isn’t “noisy,” it’s answering a different question</li>
</ul>
</section>
<section id="references" class="level2">
<h2 class="anchored" data-anchor-id="references">References</h2>
<ul>
<li>Bojinov, Simchi-Levi &amp; Shephard (2023), <em>Design and Analysis of Switchback Experiments</em>, <strong>Management Science</strong>.</li>
<li>DoorDash engineering blog, <em>Switchback Tests and Randomized Experimentation Under Network Effects</em>.</li>
</ul>


</section>

 ]]></description>
  <category>Causal Inference</category>
  <category>Experimentation</category>
  <category>Simulation</category>
  <guid>https://ryan-tolone.com/projects/switchback/</guid>
  <pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/switchback/01_bias_vs_recovery.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Crypto Strategy Discovery: Robust BTC &amp; ETH Research</title>
  <link>https://ryan-tolone.com/projects/crypto-research/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>A <strong>trading strategy</strong> is a rule for when to buy and sell — for example, “buy Bitcoin when its 50-day average crosses above its 200-day average, sell when it crosses below.” A <strong>backtest</strong> runs that rule against historical prices to see what the P&amp;L would have been.</p>
<p>The dirty secret: it is <em>trivially</em> easy to invent a backtest that looks profitable but would lose money in real life. You try a hundred different rules, pick the best one, and report it as if you’d discovered it. You don’t include trading fees. You optimize the rule’s parameters on the same data you’re testing it on. Each of these adds a little bit of “lookahead” or “selection” bias, and stacked together they turn random noise into a Sharpe-2 strategy on paper.</p>
<p>This project hunts for <em>real</em> edges in Bitcoin and Ethereum trading by deliberately designing the experiment to fail when no real edge exists. The eight rules below are not technical curiosities — they are the difference between a project that produces honest answers and one that produces wishful thinking. Several of the strategies looked great in early phases and then died on the frozen test split, <em>which is the project working correctly.</em></p>
</section>
<section id="anti-overfitting-principles" class="level2">
<h2 class="anchored" data-anchor-id="anti-overfitting-principles">Anti-overfitting principles</h2>
<ol type="1">
<li><strong>Frozen out-of-sample (OOS) holdout.</strong> The final ~30% of history is never used for parameter selection or strategy choice. It is touched exactly once, at the end of each phase.</li>
<li><strong>Walk-forward analysis.</strong> Parameters are re-fit on rolling windows; only <em>next-window</em> returns are recorded. No single-point fits.</li>
<li><strong>Deflated Sharpe Ratio (DSR).</strong> Every reported Sharpe is deflated by the number of trials run, following Bailey &amp; López de Prado (2014). A nominal Sharpe of 1.5 across 50 trials is <em>not</em> a discovery. A sketch of the computation follows this list.</li>
<li><strong>Realistic frictions.</strong> 10 bps per side (20 bps round-trip) + 5 bps slippage on every trade.</li>
<li><strong>Parameter robustness.</strong> A strategy is only accepted if a <em>neighborhood</em> of parameters works — not a single sweet spot.</li>
<li><strong>Minimum-trades guard.</strong> Strategies with &lt; 30 trades on the test window are rejected for lack of statistical power.</li>
<li><strong>Concentration check.</strong> If &gt; 40% of profit comes from &lt; 5% of trades, the strategy is flagged as fragile.</li>
<li><strong>Buy-and-hold benchmark.</strong> Risk-adjusted outperformance vs.&nbsp;buy-and-hold, not absolute return, is the bar.</li>
</ol>
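<p>A minimal sketch of the DSR computation as I read Bailey &amp; López de Prado (2014) — the project’s own implementation may differ in detail, and the inputs here (per-period returns of the selected strategy plus the Sharpe ratios of every trial run) are assumptions of this sketch:</p>
<pre><code># Sketch only: Deflated Sharpe Ratio (probability the observed Sharpe beats
# the expected best-of-N Sharpe under the null of zero skill).
import numpy as np
from scipy.stats import norm, skew, kurtosis

EULER_GAMMA = 0.5772156649

def deflated_sharpe(returns, trial_sharpes):
    returns = np.asarray(returns, dtype=float)
    sr = returns.mean() / returns.std(ddof=1)          # per-period Sharpe
    n = len(trial_sharpes)
    # expected maximum Sharpe among n zero-skill trials
    sr0 = np.std(trial_sharpes, ddof=1) * (
        (1 - EULER_GAMMA) * norm.ppf(1 - 1 / n)
        + EULER_GAMMA * norm.ppf(1 - 1 / (n * np.e)))
    g3, g4 = skew(returns), kurtosis(returns, fisher=False)
    t = len(returns)
    denom = np.sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr**2)
    return norm.cdf((sr - sr0) * np.sqrt(t - 1) / denom)</code></pre>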
</section>
<section id="phase-progression" class="level2">
<h2 class="anchored" data-anchor-id="phase-progression">Phase progression</h2>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Phase</th>
<th>Focus</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>1–3</td>
<td>Trend, mean-reversion, breakout primitives</td>
</tr>
<tr class="even">
<td>4</td>
<td>Candidate selection on validation</td>
</tr>
<tr class="odd">
<td>6</td>
<td>Carry / funding-rate signals</td>
</tr>
<tr class="even">
<td>7</td>
<td>On-chain features (active addresses, whale flows)</td>
</tr>
<tr class="odd">
<td>8</td>
<td>Meta-ensemble of phase-1–7 survivors</td>
</tr>
<tr class="even">
<td>9</td>
<td>Hybrid strategies blending vol-regime gating with carry/momentum</td>
</tr>
<tr class="odd">
<td>10</td>
<td>Production candidate — final OOS evaluation</td>
</tr>
</tbody>
</table>
</section>
<section id="oos-results" class="level2">
<h2 class="anchored" data-anchor-id="oos-results">OOS results</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/crypto-research/BTC_holdout.png" class="img-fluid figure-img"></p>
<figcaption>BTC holdout — strategy survives the frozen test split.</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/crypto-research/ETH_holdout.png" class="img-fluid figure-img"></p>
<figcaption>ETH holdout — comparable behavior on the cross-asset test.</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/crypto-research/phase10_production.png" class="img-fluid figure-img"></p>
<figcaption>Phase 10 production candidate.</figcaption>
</figure>
</div>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Frozen-holdout discipline that catches lookahead bias <em>by design</em> — found and killed multiple “promising” strategies whose edge collapsed on the OOS split</li>
<li>DSR as a routine reporting metric, not a footnote</li>
<li>Cross-sectional analysis (BTC vs.&nbsp;ETH) to test whether an “edge” is asset-specific or generalizes</li>
<li>Honest reports: every phase has a <code>REPORT.md</code> with what survived, what didn’t, and <em>why</em> — including the dead ends</li>
</ul>
<p>The point of the project isn’t the equity curves. It’s that the equity curves you see survived a process designed to murder them.</p>


</section>

 ]]></description>
  <category>Trading Research</category>
  <category>Crypto</category>
  <category>Backtesting</category>
  <guid>https://ryan-tolone.com/projects/crypto-research/</guid>
  <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/crypto-research/BTC_holdout.png" medium="image" type="image/png" height="72" width="144"/>
</item>
<item>
  <title>LEAP Trading Strategy: Leveraged Long-Dated Options Backtest</title>
  <link>https://ryan-tolone.com/projects/leap/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>A <strong>LEAP</strong> (“Long-term Equity AnticiPation”) is a long-dated stock option — typically a call option that expires 1–3 years out. Because options give you leverage, buying LEAPs is a way to get something like 3–5× the upside of a stock for a fraction of the capital. Retail finance Twitter loves them: “Why buy 100 shares of QQQ when you can buy a LEAP and get the same dollar exposure for a quarter of the cost?”</p>
<p>The pitch is correct that LEAPs are leveraged. The pitch is <strong>wrong</strong> that the leverage is free. Options decay over time even when the stock is flat (theta), they get crushed when volatility drops (vega), and rolling them when they expire costs spread and slippage every cycle. A naive LEAP strategy can underperform just <em>holding the stock</em> over long horizons, and the drawdowns when volatility spikes can be brutal — while the stock is making new highs, your LEAP basket can still be down 60%.</p>
<p>This project asks the question carefully: across many combinations of how-deep-in-the-money, how-long-until-expiry, and how-often-you-rebalance, are there <em>any</em> LEAP strategies that risk-adjust above just buying and holding the underlying? The answer turns out to be qualified — yes, but only in a small allocation inside a mostly-stock portfolio, and at the cost of much larger drawdowns. The all-LEAP and self-funded “infinite money glitch” designs popular online don’t survive an honest backtest.</p>
</section>
<section id="designs-tested" class="level2">
<h2 class="anchored" data-anchor-id="designs-tested">Designs tested</h2>
</section>
<section id="designs-tested-1" class="level2">
<h2 class="anchored" data-anchor-id="designs-tested-1">Designs tested</h2>
<ul>
<li><strong>Fixed-deposit LEAPs</strong> — $1k every 2 weeks into a fixed moneyness/tenor LEAP</li>
<li><strong>Continuous DCA</strong> vs.&nbsp;<strong>lump + DCA</strong> vs.&nbsp;<strong>self-funded</strong> (no fresh deposits after year 1)</li>
<li><strong>Blended portfolios</strong> — stock + LEAP, stock + LEAP rolled, stock + LEAP held-to-vertical</li>
<li><strong>Barbell</strong> — small allocation to long-dated LEAPs + larger cash buffer</li>
<li><strong>Improvements</strong>: drawdown-stable variants that exit on volatility regime change</li>
</ul>
</section>
<section id="headline-self-funded-one-year-deposit-only-1k-2wks-for-year-1-then-ride" class="level2">
<h2 class="anchored" data-anchor-id="headline-self-funded-one-year-deposit-only-1k-2wks-for-year-1-then-ride">Headline: self-funded one-year-deposit-only ($1k / 2wks for year 1, then ride)</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/leap/leap_self_funded.png" class="img-fluid figure-img"></p>
<figcaption>Self-funded sweep — moneyness × tenor × IRR / max drawdown.</figcaption>
</figure>
</div>
<p>The middle panel is what matters. <strong>Self-funded LEAPs underperform stock-only on $/year</strong> across most moneyness × tenor cells once you factor in friction. The “wins” are concentrated in deep-OTM long-tenor cells — exactly the cells with the worst path-dependence and the largest drawdowns.</p>
<p>The right panel is the IRR delta vs.&nbsp;continuous DCA. <strong>DCA wins</strong> in 14 of 21 cells, often by 5–10 percentage points. The intuition is mechanical: continuous DCA averages your cost basis through volatility regimes and sells less of the long-vega exposure into vol crashes.</p>
</section>
<section id="drawdown-stability" class="level2">
<h2 class="anchored" data-anchor-id="drawdown-stability">Drawdown stability</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/leap/leap_self_funded_dd.png" class="img-fluid figure-img"></p>
<figcaption>Self-funded drawdown profiles.</figcaption>
</figure>
</div>
<p>LEAP drawdowns are not just <em>bigger</em> than stock drawdowns — they are <em>differently shaped</em>. Stock drawdowns mean-revert; deep-ITM LEAP drawdowns become permanent capital loss when realized vol crushes during the holding window. The chart shows the period where buy-and-hold is making new highs and the LEAP basket is still down 60%.</p>
</section>
<section id="drip-dca-sweep" class="level2">
<h2 class="anchored" data-anchor-id="drip-dca-sweep">Drip-DCA sweep</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/leap/leap_drip_sweep.png" class="img-fluid figure-img"></p>
<figcaption>Drip-DCA over moneyness × tenor — IRR heatmaps.</figcaption>
</figure>
</div>
<p>For each (moneyness, tenor) cell I run a parameter sweep on the drip rate. The optimum rate is <em>not</em> a single point — it varies sharply with moneyness, which means a strategy chosen on one moneyness band will not generalize.</p>
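<p>The sweep skeleton, as a minimal sketch — the grid values and <code>backtest_irr()</code> are hypothetical stand-ins; the real scripts save their own grids and summary CSVs:</p>
<pre><code># Sketch only: moneyness x tenor x drip-rate sweep with a saved grid.
from itertools import product
import pandas as pd

MONEYNESS  = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]   # strike / spot (assumed grid)
TENORS_M   = [12, 18, 24]                          # months to expiry
DRIP_RATES = [0.25, 0.5, 1.0]                      # fraction of deposit into LEAPs

def run_sweep(backtest_irr):
    rows = []
    for m, tenor, drip in product(MONEYNESS, TENORS_M, DRIP_RATES):
        irr, max_dd = backtest_irr(moneyness=m, tenor_months=tenor, drip=drip)
        rows.append({"moneyness": m, "tenor": tenor, "drip": drip,
                     "irr": irr, "max_dd": max_dd})
    grid = pd.DataFrame(rows)
    grid.to_csv("drip_sweep_grid.csv", index=False)
    return grid</code></pre>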
</section>
<section id="blended-portfolio-where-the-realistic-wins-are" class="level2">
<h2 class="anchored" data-anchor-id="blended-portfolio-where-the-realistic-wins-are">Blended portfolio: where the realistic wins are</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/leap/leap_portfolio_v2.png" class="img-fluid figure-img"></p>
<figcaption>Stock + LEAP blends with rolling vs.&nbsp;held-to-vertical.</figcaption>
</figure>
</div>
<p>The realistic finding from the project: <strong>a small LEAP allocation inside a mostly-stock portfolio risk-adjusts modestly above stock-only</strong>, especially in regimes where realized vol stays below 25%. The all-LEAP and self-funded designs do not. Drawdowns are 75–84% on the blended versions vs.&nbsp;~33% on stock-only — that’s the cost of the IRR uplift to 23–26% vs.&nbsp;9.4% for stock-only.</p>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Real options backtest on historical chain data, not synthetic Black-Scholes pricing</li>
<li>Honest about tail risk: when the asymmetry of leverage cuts the wrong way, it cuts very deep</li>
<li>Sweep design that catches single-cell optima before they become “discoveries”</li>
<li>Comparing strategies on <strong>risk-adjusted</strong> terms, not headline IRR</li>
</ul>
</section>
<section id="stack" class="level2">
<h2 class="anchored" data-anchor-id="stack">Stack</h2>
<ul>
<li>Python — <code>numpy</code>, <code>pandas</code>, <code>matplotlib</code> for analysis</li>
<li>Historical option chain data + Monte Carlo for the synthetic regime stress tests</li>
<li>~30 distinct experiment scripts, each saving its grid + summary CSV alongside the figure</li>
</ul>


</section>

 ]]></description>
  <category>Trading Research</category>
  <category>Options</category>
  <category>Backtesting</category>
  <guid>https://ryan-tolone.com/projects/leap/</guid>
  <pubDate>Sun, 12 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/leap/leap_self_funded.png" medium="image" type="image/png" height="46" width="144"/>
</item>
<item>
  <title>Deep CFR for Heads-Up No-Limit Hold’em</title>
  <link>https://ryan-tolone.com/projects/hunl-deepcfr/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>I’m teaching a neural network to play <strong>heads-up no-limit Texas Hold’em</strong> (the two-player version of poker that’s been the long-standing benchmark for AI in games of imperfect information) at near-equilibrium strength.</p>
<p>A few things make this hard. Poker isn’t chess — you can’t see your opponent’s cards, so the optimal strategy is <em>probabilistic</em> (sometimes bluff, sometimes don’t, in carefully tuned proportions). The number of possible game situations is astronomically large. And the way you “solve” poker isn’t by predicting moves — it’s by computing a <strong>Nash equilibrium</strong>, the strategy that no opponent can exploit. The standard algorithm for this is <strong>Counterfactual Regret Minimization (CFR)</strong>, and the modern neural variant is <strong>Deep CFR</strong> (Brown et al.&nbsp;2019).</p>
<p>Systems like Libratus and Pluribus have beaten top professionals at no-limit Hold’em, but their published code is incomplete. So I’m building the whole thing from scratch in six stages, climbing from toy poker games up to the real one:</p>
<blockquote class="blockquote">
<p>Kuhn (3 cards, 1 round) → Leduc (6 cards, 2 rounds) → Leduc-3 (3 players) → Limit Hold’em → <strong>No-Limit Hold’em (this stage)</strong> → 5-card PLO</p>
</blockquote>
<p>This stage is the no-limit version. Each stage validates the algorithm on a smaller game before scaling up — if Kuhn doesn’t reach exact equilibrium, no-limit definitely won’t. The result here is a 200-iteration training run on the full game (52-card deck, 100 big-blind stacks, all betting actions): 17.86 hours of compute, no NaN/Inf in any loss, and a checkpoint-averaged final policy ready for the next stage.</p>
</section>
<section id="context-technical" class="level2">
<h2 class="anchored" data-anchor-id="context-technical">Context (technical)</h2>
<p>Stage 4 of a six-stage neural-CFR research arc. The goal of this stage is a complete Deep CFR blueprint — game logic, encoder, networks, training loop, evaluation — that serves as the substrate for Stage 5 (depth-limited online search) and Stage 6 (the PLO5 port).</p>
</section>
<section id="algorithm-external-sampling-deep-cfr" class="level2">
<h2 class="anchored" data-anchor-id="algorithm-external-sampling-deep-cfr">Algorithm — external-sampling Deep CFR</h2>
<ul>
<li>Traverser recurses on <strong>all</strong> legal action slots at own nodes; opponent samples one action from current sigma; chance samples one outcome.</li>
<li>Regret target at traverser node: <code>q(I,a) − Σ σ(a|I) q(I,a)</code> over legal slots, exact via subtree recursion (sketched after this list).</li>
<li><strong>R-net retrained from scratch</strong> each iteration (Brown 2019 spec). The Phase-1 Leduc sanity check showed a 1.25× exploitability improvement over warm-start, so I left it.</li>
<li>S-net warm-started across iterations; <strong>checkpoint-averaged</strong> over the last 20 snapshots (iters 10, 20, …, 200) at eval time.</li>
<li>V-net auxiliary (predicts expected utility from viewpoint features); trained per spec but not consumed by regret loss after Phase 1’s ESCHER investigation showed V-bootstrap diverged at this scale.</li>
<li>Linear-t weighting on replay-buffer regression.</li>
</ul>
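<p>A minimal sketch of one traversal, spelling out the regret target from the second bullet — the state/buffer API here is a hypothetical stand-in for the project’s own game classes:</p>
<pre><code># Sketch only: external-sampling Deep CFR traversal with exact regret targets.
import numpy as np

def traverse(state, traverser, sigma, r_buffer, s_buffer, rng, t):
    if state.is_terminal():
        return state.utility(traverser)
    if state.is_chance():
        return traverse(state.sample_chance(rng), traverser,
                        sigma, r_buffer, s_buffer, rng, t)

    legal = state.legal_action_mask()
    probs = sigma(state)                        # current policy; zero on illegal slots
    if state.to_act() == traverser:
        # recurse on ALL legal slots: exact subtree values q(I, a)
        q = np.zeros(len(legal))
        for a in np.flatnonzero(legal):
            q[a] = traverse(state.apply_action(a), traverser,
                            sigma, r_buffer, s_buffer, rng, t)
        v = float(np.dot(probs, q))             # Σ σ(a|I) q(I, a)
        regrets = np.where(legal, q - v, 0.0)   # regret target per slot
        r_buffer.add(state.features(), regrets, weight=t)   # linear-t weighting
        return v
    # opponent node: record σ for the S-net, then sample one action
    s_buffer.add(state.features(), probs, weight=t)
    a = rng.choice(len(legal), p=probs)
    return traverse(state.apply_action(a), traverser,
                    sigma, r_buffer, s_buffer, rng, t)</code></pre>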
</section>
<section id="the-batched-sigma-scheduler-phase-2a.5" class="level2">
<h2 class="anchored" data-anchor-id="the-batched-sigma-scheduler-phase-2a.5">The batched sigma scheduler (Phase 2a.5)</h2>
<p>Profiling on the naive recursive traversal showed <strong>86% of per-iter time in single-sample GPU forwards</strong> through <code>SigmaCache._flush</code>. The fix:</p>
<ul>
<li>K concurrent generator-trajectories per iter</li>
<li>Each yields <code>(infoset_key, features, legal_mask)</code> when it needs σ</li>
<li>A scheduler collects pending yields per round, batches them into one GPU forward (~200 queries typical, 5,000+ in early iters), caches results, resumes</li>
<li>GPU forwards per iter drop from O(K × queries) to O(rounds)</li>
<li>Bit-equivalence verified at the buffer-statistic level vs.&nbsp;the unbatched reference</li>
</ul>
<p><strong>Result: 3.9× traversal speedup, 17.86h actual vs.&nbsp;25h projection.</strong> Determinism preserved via per-trajectory RNGs (<code>master + iter + player + traj_idx</code>).</p>
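<p>The batching idea in a minimal sketch — the generators and the batched forward are hypothetical stand-ins for <code>SigmaCache</code> and the real trajectory code:</p>
<pre><code># Sketch only: collect pending σ requests from K generator-trajectories,
# answer them with one batched forward, and resume each generator.
import numpy as np

def run_batched(trajectory_gens, sigma_net_batch):
    """Each generator yields (features, legal_mask) when it needs σ and is
    resumed with the σ vector via .send(). sigma_net_batch does one forward
    over the stacked batch."""
    pending, live = {}, {}
    for i, g in enumerate(trajectory_gens):
        try:
            pending[i] = next(g)                  # advance to the first σ request
            live[i] = g
        except StopIteration:
            pass
    while pending:
        idxs = list(pending)
        feats = np.stack([pending[i][0] for i in idxs])
        masks = np.stack([pending[i][1] for i in idxs])
        sigmas = sigma_net_batch(feats, masks)    # ONE forward for all requests
        pending = {}
        for i, s in zip(idxs, sigmas):
            try:
                pending[i] = live[i].send(s)      # resume until its next request
            except StopIteration:
                live.pop(i)</code></pre>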
</section>
<section id="hunl-game" class="level2">
<h2 class="anchored" data-anchor-id="hunl-game">HUNL game</h2>
<ul>
<li>52-card deck, 100 BB stacks (200 chips), <code>phevaluator</code> showdown</li>
<li>4 streets, 7 canonical action slots with per-state legal mask:
<ul>
<li>preflop SB first: {F, C, raise-to-4/5/6/7, AI}</li>
<li>preflop re-raise: {F, C, 4×/5× bet-faced, AI}</li>
<li>flop/river: {check/call, 0.33pot, 0.75pot, 1.5pot, AI}</li>
<li>turn: {check/call, 0.5pot, 1.0pot, AI}</li>
</ul></li>
<li>Card abstractions from Stage 3: 50 preflop, 1,000 flop, 200 turn, 200 river buckets — k-means on equity features</li>
</ul>
</section>
<section id="networks-hyperparameters" class="level2">
<h2 class="anchored" data-anchor-id="networks-hyperparameters">Networks &amp; hyperparameters</h2>
<ul>
<li>3 networks per player (V, R, S) × 2 players = 6 total</li>
<li>4 hidden × 512 units, LayerNorm + ReLU, Linear out, float32</li>
<li>~1.33M params each, ~8M total</li>
</ul>
<pre><code>T=200, K=10,000
n_v=500, n_r=800 (from-scratch), n_s=1000
batch 4096, Adam lr 1e-3
buffer caps: R 500k, S 500k, V 200k
snapshot S every 10 iters; V/R every 50 iters
seed 42</code></pre>
<p>The spec called for 5M / 5M / 2M buffer caps, but in the first attempt those caps saturated host RAM at iter 2 (Python hit 23 GB, triggered swap, and the training phase slowed 8×). Killed and restarted at 10× smaller caps; ran cleanly at ~12 GB Python RAM, no swap pressure.</p>
</section>
<section id="training-results" class="level2">
<h2 class="anchored" data-anchor-id="training-results">Training results</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/hunl-deepcfr/stage4_training_curves.png" class="img-fluid figure-img"></p>
<figcaption>200-iter HUNL training curves: V-loss decreases, R-loss increases, S-loss stable.</figcaption>
</figure>
</div>
<table class="caption-top table">
<thead>
<tr class="header">
<th>metric</th>
<th style="text-align: right;">value</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>wall time</td>
<td style="text-align: right;"><strong>17.86h</strong></td>
</tr>
<tr class="even">
<td>iterations completed</td>
<td style="text-align: right;">200 / 200</td>
</tr>
<tr class="odd">
<td>per-iter wall (mean / min / max)</td>
<td style="text-align: right;">321 / 175 / 989 s</td>
</tr>
<tr class="even">
<td>traversal mean / training mean</td>
<td style="text-align: right;">224 / 98 s</td>
</tr>
<tr class="odd">
<td>NaN/Inf in any loss</td>
<td style="text-align: right;"><strong>no</strong></td>
</tr>
</tbody>
</table>
<p><strong>Curve interpretation:</strong></p>
<ul>
<li><strong>V-loss</strong> monotonically decreased ~33% (19,700 → 13,300). V predicts terminal utility from state features — a regression task that converges cleanly as the buffer fills.</li>
<li><strong>R-loss increased</strong> 19,700 → 27,900 over training. Counter-intuitive but expected: R is <em>retrained from scratch</em> each iter to fit the <em>instantaneous</em> regret target <code>q − v</code>. As agents become more sophisticated, the regret targets become more diverse, and fitting them with a fresh 4×512 net on 500k samples gets harder. R-loss going up is consistent with the network correctly tracking a moving target — what would be alarming is R-loss going up <em>and</em> exploitability going up together. They don’t.</li>
<li><strong>S-loss</strong> stable, in spec.</li>
</ul>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Implementing Brown et al.&nbsp;(2019) Deep CFR end-to-end without published code</li>
<li>Profile-driven optimization: identifying the GPU-forward bottleneck and engineering a batched scheduler with bit-equivalence guarantees</li>
<li>Honest reading of training curves: knowing when increasing loss is <em>fine</em></li>
<li>Memory engineering: catching swap-thrashing, bisecting buffer caps to stable RAM</li>
<li>Determinism under concurrency</li>
</ul>
</section>
<section id="next-stages" class="level2">
<h2 class="anchored" data-anchor-id="next-stages">Next stages</h2>
<ul>
<li><strong>Stage 5</strong>: depth-limited online search at decision time (DeepStack-style continual re-solving)</li>
<li><strong>Stage 6</strong>: port the whole pipeline to <strong>5-card PLO</strong> with composition-dependent encoders — see the <a href="../plo5-deepcfr/">PLO5 project</a></li>
</ul>


</section>

 ]]></description>
  <category>Reinforcement Learning</category>
  <category>Game Theory</category>
  <category>Poker AI</category>
  <guid>https://ryan-tolone.com/projects/hunl-deepcfr/</guid>
  <pubDate>Wed, 08 Apr 2026 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/hunl-deepcfr/stage4_training_curves.png" medium="image" type="image/png" height="93" width="144"/>
</item>
<item>
  <title>Polymarket Research Toolkit</title>
  <link>https://ryan-tolone.com/projects/polymarket/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p><strong>Polymarket</strong> is a website where people bet real money on real-world questions: “Will Trump win the 2024 election?”, “Will Bitcoin be above $100k by year-end?”, “Will the Fed cut rates next meeting?” Each question has two sides — YES and NO — and the prices fluctuate between $0 and $1 based on what the market thinks the probability is.</p>
<p>If a market is mispriced — for example, NO is trading at $0.10 but the event has been almost certain for weeks — there’s potential profit in buying the cheap side. The question is: are these mispricings real, persistent, and tradeable after fees? Or do they look real in a backtest because the backtester is lying to you?</p>
<p>This project is a toolkit for answering that honestly. It does three things in order:</p>
<ol type="1">
<li><strong>Scrapes every public number</strong> Polymarket exposes — every market, every historical price tick, every order book snapshot. Plus Kalshi (a US-regulated competitor) for cross-venue comparison.</li>
<li><strong>Tests trading ideas</strong> against that historical record with a backtester deliberately designed to <em>fail</em> when no real edge exists.</li>
<li><strong>Scans live</strong> for the few signals that survive the test, so they can actually be traded.</li>
</ol>
<p>The interesting findings turned out to be negative — the most promising-looking strategy collapsed when tested honestly, for a specific data-quality reason explained below. That’s the project working as intended.</p>
</section>
<section id="anti-overfit-methodology" class="level2">
<h2 class="anchored" data-anchor-id="anti-overfit-methodology">Anti-overfit methodology</h2>
<p>Every result is structured to fail loudly when no real edge exists:</p>
<ul>
<li><strong>Walk-forward only.</strong> Strategies see prefixes of price series, never the future.</li>
<li><strong>Discovery / test split</strong> at the universe level — the calibration strategy is fit on the first half of resolved markets and scored on the second.</li>
<li><strong>Deflated Sharpe.</strong> When you test N strategies, the best-of-N is inflated by selection. Deflate by N before claiming anything (Bailey &amp; López de Prado).</li>
<li><strong>Conservative cost model.</strong> 1% taker fee + 0.5% half-spread per leg.</li>
<li><strong>Trade-count floor.</strong> Anything with fewer than 100 holdout trades is reported as “no signal yet,” not as a result.</li>
</ul>
</section>
<section id="strategy-suite" class="level2">
<h2 class="anchored" data-anchor-id="strategy-suite">Strategy suite</h2>
<table class="caption-top table">
<colgroup>
<col style="width: 45%">
<col style="width: 54%">
</colgroup>
<thead>
<tr class="header">
<th>Strategy</th>
<th>Hypothesis</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>extreme_price_decay</code></td>
<td>Buy NO when YES collapses near close — fade late confidence</td>
</tr>
<tr class="even">
<td><code>favorite_hold</code></td>
<td>Buy YES when YES is persistently ≥ 0.95 near close</td>
</tr>
<tr class="odd">
<td><code>longshot_bias</code></td>
<td>Short the longshot — buy NO at 0.85–0.95</td>
</tr>
<tr class="even">
<td><code>complementary_arb</code></td>
<td>YES + NO &lt; $1 — needs the live book</td>
</tr>
<tr class="odd">
<td><code>mean_reversion</code></td>
<td>Fade single-bar 10c spikes mid-life</td>
</tr>
<tr class="even">
<td><code>calibration_edge</code></td>
<td>Data-driven, fit on first half of universe only</td>
</tr>
</tbody>
</table>
</section>
<section id="honest-empirical-findings" class="level2">
<h2 class="anchored" data-anchor-id="honest-empirical-findings">Honest empirical findings</h2>
<ul>
<li><strong><code>complementary_arb</code> looked great in train, collapsed in test.</strong> Investigation: the training “edge” was a forward-fill artifact. Bar-resolution price history shows YES + NO summing to anything between 0.5 and 1.7 because each leg’s prints don’t share timestamps. After bucketing to the hour and inner-joining, real imbalances beyond ~2c essentially never appear in bar data. <strong>The arb strategy can only work against the live book.</strong> Found <em>because</em> the test split was frozen.</li>
<li><strong>Calibration analysis</strong> at the 24h horizon shows the 0–10% YES band actually resolves YES ~11% of the time (vs.&nbsp;2.4% priced) — enough sample to be suggestive, not enough to bet on. Watch this band as more data accumulates. A sketch of the reliability computation follows this list.</li>
<li><strong>Bar-data limitations.</strong> Hourly bars are too coarse for any real microstructure work; live websocket feeds are needed for liquidity / spread strategies.</li>
</ul>
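<p>A minimal sketch of the reliability computation behind the calibration finding — column names are assumptions, not the toolkit’s schema:</p>
<pre><code># Sketch only: bin resolved markets by priced YES probability at a fixed
# horizon and compare priced vs. realized frequency per band.
import numpy as np
import pandas as pd

def reliability_table(markets: pd.DataFrame, n_bins=10):
    """markets needs 'yes_price_24h_before' in [0, 1] and 'resolved_yes' (0/1)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    band = pd.cut(markets["yes_price_24h_before"], bins, include_lowest=True)
    table = markets.groupby(band).agg(
        n=("resolved_yes", "size"),
        priced=("yes_price_24h_before", "mean"),
        realized=("resolved_yes", "mean"))
    table["gap"] = table["realized"] - table["priced"]
    return table</code></pre>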
</section>
<section id="stack" class="level2">
<h2 class="anchored" data-anchor-id="stack">Stack</h2>
<ul>
<li><code>requests</code> + retry/rate-limit aware HTTP client; SQLite for markets / prices / books</li>
<li>Walk-forward engine with deflated Sharpe; reliability tables and Brier / log loss for calibration</li>
<li>Live-scan loop for complementary-pair edges</li>
<li>Six sprint reports + a research memo documenting the dead ends as carefully as the live ones</li>
</ul>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Treating a backtest as a hypothesis test, not a marketing screenshot</li>
<li>The discipline of letting your own strategies fail</li>
<li>Microstructure thinking: knowing the difference between bar data and the book</li>
</ul>


</section>

 ]]></description>
  <category>Trading Research</category>
  <category>Prediction Markets</category>
  <category>Backtesting</category>
  <guid>https://ryan-tolone.com/projects/polymarket/</guid>
  <pubDate>Mon, 30 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>No-Bust 21st Century Blackjack — Monte Carlo + CDZ⁻ Solver</title>
  <link>https://ryan-tolone.com/projects/blackjack-cdz/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>California has weird gambling laws. To get around the prohibition on banked house games, casinos invented variants of blackjack with twisted rules — the most famous of which is <em>No Bust 21st Century Blackjack</em>. It’s blackjack, but several rules are different in ways that look small and turn out to matter a lot.</p>
<p>The biggest change: <strong>busting (going over 21) doesn’t always lose.</strong> If both you and the dealer bust, whoever is closer to 21 comes out ahead — the dealer still wins when the dealer is closer, but when <em>you’re</em> closer you <em>push</em> (get your bet back) instead of losing. That single rule shift means hitting on a hand that would normally be a clear stand can suddenly be correct, because busting carries an option value that doesn’t exist in standard blackjack. Other tweaks — surrender legal at any decision point, special rules after splitting aces, an unusual dealer-bust side bet — pile on top.</p>
<p>If you walk into a California card room and play with the basic strategy you learned from a Vegas chart, you’re playing the wrong game. The chart is wrong. The right strategy depends not just on your hand and the dealer’s up-card but on the <em>exact composition of cards left in the shoe</em> (the technical term is <strong>CDZ⁻</strong>, “composition-dependent zero-memory”), and no published blackjack table covers this rule set.</p>
<p>I built two things: a <strong>Tkinter GUI Monte Carlo simulator</strong> that plays out millions of hands with multi-process workers, and a <strong>CDZ⁻ exact solver</strong> that derives the EV-optimal action for any (hand, dealer up-card, deck composition) combination by full subtree expansion. Together they let you see whether a given configuration of bet, deck count, penetration, and side-bet inclusion is actually +EV, in this game, and what the optimal play looks like at every decision point.</p>
</section>
<section id="the-game-rules" class="level2">
<h2 class="anchored" data-anchor-id="the-game-rules">The game (rules)</h2>
<ul>
<li><strong>No-bust comparison rule</strong> (the namesake): when <em>both</em> player and dealer bust, dealer-closer-to-21 wins, <strong>player-closer-to-21 pushes</strong> (player saves the bet), tied → dealer wins. So busting isn’t terminal in the usual sense (a sketch of this settle rule follows the list).</li>
<li><strong>Surrender legal at any decision point</strong> — initial 2-card, mid-hand after any number of hits, on split sub-hands, after split-and-hit. Costs half the bet. Not legal after doubling, not on a post-split-aces sub-hand.</li>
<li><strong>Split aces special rule</strong>: each post-split-aces sub-hand receives <strong>exactly one</strong> draw card, then stands. If that draw card is also an ace and <code>max_splits</code> not reached and the chart action is “split,” the sub-hand is re-split.</li>
<li><strong>Configurable <code>max_splits</code></strong> (default 3 → up to 4 sub-hands; set to 1 for double-deck).</li>
<li><strong>Double-after-split (DAS)</strong> hardcoded on for non-ace splits.</li>
<li><strong>Buster side bet</strong>: pays on dealer-bust by card-count (3–4 cards 2:1, 5 cards 4:1, 6 cards 16:1, 7 cards 50:1, 8+ cards 200:1).</li>
</ul>
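<p>The both-bust settle rule from the first bullet, as a minimal sketch — standard outcomes outside the both-bust case are elided, and the function name is hypothetical:</p>
<pre><code># Sketch only: the no-bust comparison when BOTH player and dealer bust.
def settle_both_bust(player_total, dealer_total, bet):
    """Both totals are over 21; the smaller total is closer to 21."""
    assert player_total &gt; 21 and dealer_total &gt; 21
    if player_total &lt; dealer_total:
        return 0.0        # player is closer to 21: push, bet returned
    return -bet           # dealer closer to 21, or tied: dealer wins</code></pre>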
<p>These rule shifts make stock blackjack basic strategy <em>wrong</em>, sometimes by several EV percentage points per hand. The simulator’s reason to exist is solving the right strategy for <em>this</em> game.</p>
</section>
<section id="what-it-does" class="level2">
<h2 class="anchored" data-anchor-id="what-it-does">What it does</h2>
<ul>
<li><strong>CDZ⁻ exact solver</strong>: composition-dependent strategy solving — for each (player hand composition, dealer up-card, deck composition), compute the EV-optimal action (hit / stand / double / split / surrender) by full subtree expansion. CDZ⁻ means the strategy accounts for what’s left in the shoe but makes each decision with zero memory — only the current hand and up-card, with no peeking at future cards.</li>
<li><strong>Numba-JIT’d hand play</strong> for the simulation loop. Hand-by-hand replay through the solved chart at simulator throughput, not interpreter throughput.</li>
<li><strong>Multi-process Monte Carlo</strong> with a configurable number of workers — each plays a fresh shoe, results aggregated for variance estimation.</li>
<li><strong>Tkinter GUI</strong> for live experimentation: configure rules, deck count, bet sizing, splits, DAS, surrender, buster bet — see EV per hand, hourly EV at a chosen pace, ROR for given bankrolls.</li>
</ul>
</section>
<section id="implementation-notes" class="level2">
<h2 class="anchored" data-anchor-id="implementation-notes">Implementation notes</h2>
<ul>
<li><strong>8-deck shoe default</strong>, configurable down to 2-deck (forces <code>max_splits = 1</code> to match house rules).</li>
<li><strong>Penetration handling</strong>: shoe reshuffled at configurable penetration depth; the solver re-derives the chart at the post-penetration composition.</li>
<li><strong>Surrender logic</strong>: separate code path because surrender’s legality interacts with double, split-aces, and the no-bust rule in non-obvious ways. Edge cases verified against published CDZ tables for non-California variants then extended.</li>
<li><strong>Buster bet EV</strong>: computed analytically per dealer up-card from the conditional bust-card-count distribution. Exposed in the GUI alongside the main-bet EV so a player can see whether the side bet is +EV or −EV in their chosen composition.</li>
<li><strong>One-click launchers</strong> (<code>run_sim.bat</code> / <code>run_sim.sh</code>) that auto-install dependencies on first run — the simulator ships to non-developer testers as a working binary, not a setup project.</li>
</ul>
</section>
<section id="why-this-is-worth-doing" class="level2">
<h2 class="anchored" data-anchor-id="why-this-is-worth-doing">Why this is worth doing</h2>
<p>The standard published blackjack tables are wrong for this game. The no-bust rule alone changes the optimal stand-vs-hit threshold for stiff hands against high dealer cards, because busting carries a saved-bet option value. Surrender-at-any-decision-point creates a continuation-value calculation that doesn’t exist in standard rule sets. And the buster bet is a side-game with composition-sensitive EV that the casino doesn’t post.</p>
<p>Solving this isn’t a paper exercise — it’s the difference between playing the game at +EV (with proper composition-dependent strategy and selective buster betting in penetrated shoes) vs.&nbsp;the −EV outcome of stock-rule basic strategy.</p>
</section>
<section id="stack" class="level2">
<h2 class="anchored" data-anchor-id="stack">Stack</h2>
<ul>
<li>Python 3.10+ — <code>numpy</code>, <code>numba</code>, <code>matplotlib</code></li>
<li>Tkinter for the live GUI</li>
<li>Multi-process worker pool for Monte Carlo</li>
<li>Full memo (<code>nobust21_sim.md</code>) covering rules, GUI, every variable, and the variable/function index — written so a non-developer card-room player can use it</li>
</ul>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Composition-dependent solving from scratch (no off-the-shelf for this rule set)</li>
<li>Multi-process Monte Carlo with seedable RNG per worker</li>
<li>Numba JIT compilation of the inner play loop with measured speedup over pure Python</li>
<li>A shippable end-user tool (one-click launcher, GUI) — not just a research notebook</li>
</ul>


</section>

 ]]></description>
  <category>Simulation</category>
  <category>Game Theory</category>
  <guid>https://ryan-tolone.com/projects/blackjack-cdz/</guid>
  <pubDate>Wed, 25 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>LSTM-Driven Poker Analytics &amp; Bluff Prediction Platform</title>
  <link>https://ryan-tolone.com/projects/poker-LSTM/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>When someone makes a big bet in poker, they’re either <strong>bluffing</strong> (their hand is weak and they want you to fold) or <strong>value-betting</strong> (their hand is strong and they want you to call). Telling the difference is the entire game. Skilled players use timing, bet sizing, board texture, and their opponent’s history of plays to make educated guesses.</p>
<p>This project asks: can a neural network learn to tell the difference, given the same information a human player has? I scraped over 7,000 hands from real-money games hosted on PokerNow.club (a popular site for hosting private online games, with blinds from $0.25/$0.50 to $2/$5), engineered features that capture how each hand played out — bet sizes relative to the pot, decision times (a human takes longer when the decision is close), board texture (paired? flush-draw? Ace on board?), positional context — and trained an <strong>LSTM</strong> (a type of recurrent neural network designed for variable-length sequences) to predict, at the moment a player makes a big bet, whether it’s a bluff or a value bet.</p>
<p>The final test AUC is <strong>0.77</strong>, meaning that, given one random bluff and one random value bet from hands it has never seen, the model ranks the bluff as the more likely bluff about 77% of the time. The interesting part isn’t just the number — it’s <em>which features the model relies on</em>, which gives a quantitative picture of what tells human players are actually leaking at low-to-mid stakes.</p>
</section>
<section id="technical-introduction" class="level2">
<h2 class="anchored" data-anchor-id="technical-introduction">Technical introduction</h2>
<p>The system processes over 7,000 hands (with blinds from $0.25/$0.50 up to $2/$5) to engineer advanced features—such as bet ratios, log-transformed decision times, comprehensive board evaluations with Ace detection, and dynamic positional metrics. A custom LSTM model, utilizing dynamic bucketing to manage variable-length sequences, was developed to predict whether the villain’s betting action is a bluff or a value bet, achieving a test AUC of 0.77.</p>
</section>
<section id="output" class="level2">
<h2 class="anchored" data-anchor-id="output">Output</h2>
<p>Below is a screenshot from the model evaluation dashboard displaying the confusion matrix, ROC curve, and feature importance chart:</p>
<p><img src="https://ryan-tolone.com/projects/poker-LSTM/confusion_matrix.png" class="img-fluid" alt="confusion matrix"> <img src="https://ryan-tolone.com/projects/poker-LSTM/feature_importance.png" class="img-fluid" alt="feature importance chart"> <img src="https://ryan-tolone.com/projects/poker-LSTM/roc_curve.png" class="img-fluid" alt="roc curve"></p>
</section>
<section id="models-techniques-used" class="level2">
<h2 class="anchored" data-anchor-id="models-techniques-used">Models &amp; Techniques Used</h2>
<ul>
<li><strong>LSTM Network with Dynamic Bucketing</strong>: Processes variable-length sequences of poker actions.</li>
<li><strong>Bidirectional LSTM Layers</strong>: Capture context from both past and future actions in the sequence (a minimal model sketch follows this list).</li>
<li><strong>Advanced Feature Engineering</strong>: Incorporates bet ratios, decision times (log-transformed), board evaluations (with Ace detection), and positional metrics.</li>
<li><strong>Cross-Validation &amp; Class Balancing</strong>: Ensures robust model performance despite class imbalance (52% bluffs).</li>
</ul>
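<p>A minimal sketch of what such a model can look like in TensorFlow/Keras, assuming padded sequences of per-action feature vectors. The layer sizes, dropout rate, and <code>N_FEATURES</code> are illustrative stand-ins, not the project’s actual configuration.</p>
<pre><code>import tensorflow as tf
from tensorflow.keras import layers

N_FEATURES = 16   # per-action features: bet/pot ratio, log decision time, board flags, ...

model = tf.keras.Sequential([
    layers.Input(shape=(None, N_FEATURES)),   # variable-length action sequences
    layers.Masking(mask_value=0.0),           # ignore padded timesteps
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu",
                 kernel_regularizer=tf.keras.regularizers.l1_l2(1e-5, 1e-4)),
    layers.Dense(1, activation="sigmoid"),    # P(bluff) at the big-bet decision point
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])</code></pre>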
</section>
<section id="training" class="level2">
<h2 class="anchored" data-anchor-id="training">Training</h2>
<ul>
<li><strong>Data Preprocessing</strong>: Raw hand histories are cleansed, features are engineered, and sequences are built per hand. Numerical features are standardized and categorical features are one-hot encoded.</li>
<li><strong>LSTM Model Training</strong>: The model is trained using a combination of Bidirectional LSTMs, dropout, batch normalization, and L1/L2 regularization. Training is optimized via early stopping and learning rate reduction with cross-validation.</li>
<li><strong>Dynamic Bucketing</strong>: Instead of padding all sequences to a global maximum, hands are bucketed by similar sequence lengths to reduce wasted computation and improve training efficiency (sketched below).</li>
</ul>
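<p>A sketch of the bucketing idea under the same assumptions: group hands into a few length buckets and pad each bucket only to its own maximum. The bucket boundaries here are arbitrary; in practice they come from the sequence-length histogram.</p>
<pre><code>import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def bucket_by_length(sequences, labels, boundaries=(4, 8, 16)):
    """Group (sequence, label) pairs by length, then pad each bucket to its own max."""
    buckets = {b: ([], []) for b in (*boundaries, float("inf"))}
    for seq, y in zip(sequences, labels):
        for b in buckets:                     # smallest boundary that fits this hand
            if len(seq) &lt;= b:
                buckets[b][0].append(seq)
                buckets[b][1].append(y)
                break
    batches = []
    for seqs, ys in buckets.values():
        if seqs:
            maxlen = max(len(s) for s in seqs)   # pad only to the bucket's own max
            X = pad_sequences(seqs, maxlen=maxlen, dtype="float32", padding="post")
            batches.append((X, np.asarray(ys)))
    return batches</code></pre>
<p>Each padded bucket can then be passed to <code>model.fit</code> in turn (or wrapped in a generator), so short hands never pay for the longest hand’s padding.</p>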
</section>
<section id="requirements" class="level2">
<h2 class="anchored" data-anchor-id="requirements">Requirements</h2>
<ul>
<li>Python 3.8+</li>
<li>TensorFlow 2.x</li>
<li>Pandas, NumPy, Scikit-Learn</li>
<li>Matplotlib, Seaborn (for visualization)</li>
</ul>


</section>

 ]]></description>
  <category>LSTM</category>
  <category>Deep Learning</category>
  <category>Time Series</category>
  <category>Poker</category>
  <guid>https://ryan-tolone.com/projects/poker-LSTM/</guid>
  <pubDate>Sat, 08 Mar 2025 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/poker-LSTM/pokerbluff.png" medium="image" type="image/png" height="79" width="144"/>
</item>
<item>
  <title>Pickleball Vision: CV-Driven Match Analytics</title>
  <link>https://ryan-tolone.com/projects/pickleball/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>Tennis broadcasts have shot tracking. Major League Baseball has Statcast. <strong>Pickleball</strong>, the fastest-growing sport in the US, has nothing — match footage is just video, with no automated stats overlaid.</p>
<p>This project takes a fixed-camera video of a pickleball match and turns it into an <strong>annotated broadcast</strong> with player tracking, ball tracking, court geometry, a top-down minimap, ball speed in mph, per-player movement distance in feet, and shot count. All of it is computed automatically from raw video — no sensors, no manually placed cameras, no Hawk-Eye-style installation. Just whatever phone or DSLR is filming the match.</p>
<p>The pieces are well-known computer vision tools assembled carefully: <strong>YOLOv8</strong> detects players and the ball; a fine-tuned <strong>ResNet50</strong> finds the court’s lines and corners; the corners give a <strong>homography</strong> (the math that converts between “pixels in the video” and “feet on the actual court”). Once you have that homography, every other measurement — speed, distance, minimap position — is just geometry.</p>
<p>The hard part isn’t the detection. It’s making the <em>court geometry</em> trustworthy under realistic camera angles, occlusion from players, and varying lighting. Without that, the speeds are made up. So most of the engineering is in the iterative homography refinement — the part that makes every “23 mph” number on the scoreboard <em>true</em>.</p>
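<p>A minimal sketch of that pixel-to-feet step with OpenCV, assuming the four court corners have already been located. The corner pixel coordinates and the frame rate below are placeholders; the real pipeline gets the corners from the detection and refinement stages described later.</p>
<pre><code>import cv2
import numpy as np

COURT_W_FT, COURT_L_FT = 20.0, 44.0           # regulation pickleball court, in feet

# placeholder corner pixels (top-left, top-right, bottom-right, bottom-left in the frame)
corners_px = np.float32([[412, 180], [1508, 176], [1716, 1040], [204, 1044]])
corners_ft = np.float32([[0, 0], [COURT_W_FT, 0],
                         [COURT_W_FT, COURT_L_FT], [0, COURT_L_FT]])

H, _ = cv2.findHomography(corners_px, corners_ft)

def px_to_ft(points_px):
    """Project pixel coordinates onto the top-down court plane (feet)."""
    pts = np.float32(points_px).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

def speed_mph(p0_px, p1_px, fps=30.0):
    """Ball speed from two consecutive frames: feet per frame -&gt; miles per hour."""
    d_ft = np.linalg.norm(px_to_ft([p1_px])[0] - px_to_ft([p0_px])[0])
    return d_ft * fps * 3600.0 / 5280.0</code></pre>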
</section>
<section id="whats-in-the-output" class="level2">
<h2 class="anchored" data-anchor-id="whats-in-the-output">What’s in the output</h2>
<ul>
<li><strong>Player boxes</strong> — only the players actually on court, filtered from raw YOLO <code>person</code> detections (spectators dropped via court-geometry containment + minimum-track-length thresholding)</li>
<li><strong>Ball box + trail</strong> — smoothed and gap-interpolated trajectory with a fading trail</li>
<li><strong>Court keypoints</strong> — 12-point grid (4 horizontal lines × 3 columns) regressed by a fine-tuned ResNet50</li>
<li><strong>Minimap (top-right)</strong> — top-down 20×44 ft court showing each player’s foot position and the ball location, projected via homography</li>
<li><strong>Scoreboard (top-left)</strong> — running shot count, current and max ball speed (mph), per-player speed and total distance (ft)</li>
<li><strong>Shot markers</strong> — flash on screen when the ball is struck (velocity reversal near a player; the heuristic is sketched after this list)</li>
</ul>
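<p>The shot-marker heuristic can be sketched as follows, assuming ball and player positions have already been projected into court feet. The distance and velocity thresholds are illustrative, not the tuned values.</p>
<pre><code>import numpy as np

def detect_shots(ball_ft, players_ft, near_ft=6.0, min_flip=2.0):
    """ball_ft: (T, 2) ball positions in feet; players_ft: (T, P, 2) player positions."""
    shots = []
    v = np.diff(ball_ft, axis=0)                       # per-frame velocity in feet
    for t in range(1, len(v)):
        flipped = v[t - 1, 1] * v[t, 1] &lt; 0            # direction reversal along the court length
        strong = abs(v[t, 1] - v[t - 1, 1]) &gt;= min_flip
        near = np.min(np.linalg.norm(players_ft[t] - ball_ft[t], axis=1)) &lt;= near_ft
        if flipped and strong and near:
            shots.append(t)                            # frame index of the strike
    return shots</code></pre>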
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/pickleball/pickleball.png" class="img-fluid figure-img"></p>
<figcaption>Pickleball CV output frame — player boxes, ball trail, court keypoints, minimap, scoreboard.</figcaption>
</figure>
</div>
</section>
<section id="court-keypoint-accuracy-is-the-hard-problem" class="level2">
<h2 class="anchored" data-anchor-id="court-keypoint-accuracy-is-the-hard-problem">Court keypoint accuracy is the hard problem</h2>
<p>A trained ResNet keypoint regressor can be 30–100 px off on unfamiliar camera angles. Player tracking is easy; <em>getting the homography right</em> is what makes every downstream metric (speed, distance, minimap projection) trustworthy. The pipeline applies multiple refinement strategies in priority order:</p>
<ol type="1">
<li><strong>Manual override</strong> — if <code>input_videos/keypoints.json</code> exists, use it directly. Most accurate option for fixed-camera shots.</li>
<li><strong>4-boundary detection</strong> — locate the 2 baselines + 2 sidelines (the strongest court features) by clustering Hough segments and picking the <em>extreme</em> y/x clusters. Inner lines (NVZ, net) are deliberately ignored to avoid the common “snap to net” failure mode. The 4 corners → homography → all 12 canonical keypoints.</li>
<li><strong>Model-prior snapping</strong> — fall back to per-row line snapping driven by the ResNet’s prediction.</li>
<li><strong>Iterative pixel-level optimization</strong> — runs after either (2) or (3): back-project every white-line pixel into court-feet, assign each to its nearest grid line, refit, recompute the homography. Typically converges to ~0.2 ft mean residual. One refinement pass is sketched after this list.</li>
</ol>
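<p>One pass of that iterative refit, in simplified form. <code>GRID_X_FT</code>, <code>GRID_Y_FT</code>, the snapping rule, and the RANSAC threshold are illustrative; the real pipeline works against its 12-point canonical grid and iterates until the residual stops improving.</p>
<pre><code>import cv2
import numpy as np

GRID_X_FT = [0.0, 10.0, 20.0]           # vertical court lines (values illustrative)
GRID_Y_FT = [0.0, 15.0, 29.0, 44.0]     # horizontal court lines (values illustrative)

def refine_once(H, line_px):
    """line_px: (N, 2) white-line pixel coordinates. Returns an updated pixel-&gt;feet homography."""
    pts_ft = cv2.perspectiveTransform(
        np.float32(line_px).reshape(-1, 1, 2), H).reshape(-1, 2)
    snapped = pts_ft.copy()
    for i, (x, y) in enumerate(pts_ft):
        dx = min(abs(x - gx) for gx in GRID_X_FT)
        dy = min(abs(y - gy) for gy in GRID_Y_FT)
        if dx &lt;= dy:                    # closer to a vertical line: snap its x coordinate
            snapped[i, 0] = min(GRID_X_FT, key=lambda gx: abs(x - gx))
        else:                           # closer to a horizontal line: snap its y coordinate
            snapped[i, 1] = min(GRID_Y_FT, key=lambda gy: abs(y - gy))
    H_new, _ = cv2.findHomography(np.float32(line_px), snapped, cv2.RANSAC, 0.5)
    return H_new</code></pre>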
</section>
<section id="auto-labelled-training-data" class="level2">
<h2 class="anchored" data-anchor-id="auto-labelled-training-data">Auto-labelled training data</h2>
<p>To grow the keypoint training set without hand-labelling:</p>
<pre><code>python tools/generate_training_data.py --stride 6 --max-frames 60</code></pre>
<p>Extracts every Nth frame, runs the 4-boundary detector + iterative refinement, and saves frame JPGs + LabelMe-format JSONs that match the existing dataset format so they can be merged and used to retrain the ResNet for better generalization. Frames where boundary detection fails (heavy player occlusion) are <em>skipped</em>, not given a bad label — generating bad labels would silently degrade the next training round.</p>
</section>
<section id="architecture" class="level2">
<h2 class="anchored" data-anchor-id="architecture">Architecture</h2>
<pre><code>trackers/
  player_tracker.py    YOLO + court-aware filtering
  ball_tracker.py      anchor-based linker (high-conf seeds, forward/backward
                       propagation with per-frame max-step constraint)
  motion_ball.py       frame-diff fallback for frames YOLO misses
court_line_detector/
  court_line_detector.py  ResNet50 -&gt; 12 keypoints (24 floats)
  refine.py               Hough-line snapping + 4-corner homography refit
mini_court/
  court_geometry.py    canonical keypoint -&gt; feet, homography
  mini_court.py        top-down renderer
analytics/
  shot_detector.py     velocity-reversal-near-player heuristic
  speed.py             per-player and ball speeds in mph (via homography)</code></pre>
</section>
<section id="improvements-over-the-original" class="level2">
<h2 class="anchored" data-anchor-id="improvements-over-the-original">Improvements over the original</h2>
<ul>
<li><strong>Higher inference resolution</strong>: <code>imgsz=1280</code> (up from YOLO default 640) gives ~100% ball detection vs ~60%. First run is slow because YOLO runs every frame; detections cached to <code>tracker_stubs/</code>. Subsequent runs skip inference unless <code>--no-cache</code>.</li>
<li><strong>Anchor-based ball linker</strong>: replaced naive nearest-neighbor frame-to-frame association with high-confidence detection seeding + forward/backward propagation under a per-frame max-step constraint (the ball cannot teleport). Halves the ID-switch rate on noisy chunks.</li>
<li><strong>Court-aware player filter</strong>: drops <code>person</code> detections that fall outside the court polygon and tracks shorter than min-frames threshold. Spectators no longer pollute the metrics in stadium footage.</li>
<li><strong>Iterative homography refit</strong>: the white-line pixel back-projection loop. Previously fixed at the model’s first-pass prediction; now self-corrects to ~0.2 ft mean residual.</li>
<li><strong>Pure-Python test suite</strong>: 20 unit tests covering geometry, smoothing, shot detection, speed math — run without torch / YOLO weights, finish in well under a second. Catches regressions in the analytics pieces without paying for full inference.</li>
</ul>
</section>
<section id="stack" class="level2">
<h2 class="anchored" data-anchor-id="stack">Stack</h2>
<ul>
<li>Python 3.10+, <code>ultralytics</code> (YOLOv8), <code>pytorch</code>, <code>opencv-python</code>, <code>numpy</code>, <code>pandas</code></li>
<li>Trained ball detector (<code>models/yolo5_last.pt</code>) on hand-labelled pickleball footage</li>
<li>Trained court keypoint model (<code>models/keypoints_model.pth</code>) — ResNet50 backbone + regression head</li>
</ul>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Multi-stage CV pipeline where each stage is testable in isolation</li>
<li>Homography-driven measurement: every ft / mph number is geometry, not guesswork</li>
<li>Auto-labelling loop that knows when to refuse to label</li>
<li>Caching strategy that makes iterative work tractable on a single machine</li>
</ul>


</section>

 ]]></description>
  <category>Computer Vision</category>
  <guid>https://ryan-tolone.com/projects/pickleball/</guid>
  <pubDate>Fri, 07 Feb 2025 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/pickleball/pickleball.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>ORB Algorithmic Day-Trading System</title>
  <link>https://ryan-tolone.com/projects/orb-trading/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>Day-traders have a strategy called the <strong>Opening Range Breakout</strong> (ORB). The idea: in the first 15–60 minutes of the trading day, the stock makes a high and a low. If the price later breaks <em>above</em> that opening high, you go long (bet it keeps rising). If it breaks <em>below</em> the opening low, you go short. The bet is that early-session momentum continues.</p>
<p>ORB is the kind of strategy you see all over finance YouTube, usually with a screenshot of one good month. The honest version is much less impressive: tested across years of TQQQ (a 3× leveraged Nasdaq ETF), pure ORB works on some days, gets stopped out on most, and slowly bleeds equity in months where the market just chops sideways without trending.</p>
<p>The interesting question isn’t “does ORB work” — it’s “<strong>on which days does ORB work</strong>?” Some days have the kind of one-directional momentum ORB needs; other days have nothing of the sort. If we could predict the difference at 9:45 AM (right after the opening range completes), we’d only take the trade when conditions favor it.</p>
<p>That’s a machine-learning problem. Given the morning’s features — pre-market range, gap from yesterday’s close, VIX, sector strength, day-of-week, etc. — predict whether <em>this</em> day belongs to the trade-this regime or the skip-it regime. I trained an XGBoost classifier on this, used it as a gate on the underlying ORB strategy, and got a <strong>+19.1% annualized return improvement</strong> vs.&nbsp;running ORB unfiltered. The gated version preserves the upside of trend days and sits flat through the chop.</p>
</section>
<section id="system" class="level2">
<h2 class="anchored" data-anchor-id="system">System</h2>
</section>
<section id="system-1" class="level2">
<h2 class="anchored" data-anchor-id="system-1">System</h2>
<ul>
<li><strong>Data pipeline (SQL)</strong> — minute-bar TQQQ feature engineering: pre-market range, prior-day close-to-open gap, ATR-normalized opening range, sector-relative strength, VIX regime, day-of-week, position in week, time-since-last-stop-out</li>
<li><strong>ORB simulator</strong> — runs the strategy across multiple interval sizes (5/15/30/60-min ORB) plus a no-trade scenario for benchmark; produces per-day P&amp;L tagged with the feature snapshot at decision time</li>
<li><strong>XGBoost gating model</strong> — predicts whether <em>today’s</em> feature snapshot belongs to a profitable-ORB regime, with hyperparameter tuning and time-series cross-validation (a walk-forward training sketch follows this list)</li>
<li><strong>Composite strategy</strong> — only takes the ORB signal when the gating model fires positive; falls through to flat otherwise</li>
</ul>
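<p>A sketch of the gate under assumed column names and hyperparameters (<code>orb_days.csv</code>, the feature list, and the 0.55 threshold are all placeholders): fit XGBoost on day-level features, validate with a walk-forward split so no fold trains on the future, and only take the breakout when the gate fires.</p>
<pre><code>import pandas as pd
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("orb_days.csv", parse_dates=["date"]).sort_values("date")
features = ["premkt_range", "gap_vs_close", "atr_norm_or", "vix", "sector_rel", "dow"]
X, y = df[features], (df["orb_pnl"] &gt; 0).astype(int)   # 1 = ORB was profitable that day

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05, subsample=0.8)

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model.fit(X.iloc[train_idx], y.iloc[train_idx])    # each fold trains strictly on the past
    p = model.predict_proba(X.iloc[test_idx])[:, 1]
    print("fold AUC:", round(roc_auc_score(y.iloc[test_idx], p), 3))

# composite rule: take the ORB breakout only on days the gate fires
take_trade = model.predict_proba(X)[:, 1] &gt; 0.55       # threshold tuned on validation data</code></pre>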
</section>
<section id="result" class="level2">
<h2 class="anchored" data-anchor-id="result">Result</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/orb-trading/orb_graph_plsupdate.png" class="img-fluid figure-img"></p>
<figcaption>Equity curve: gated ORB vs.&nbsp;buy-and-hold TQQQ.</figcaption>
</figure>
</div>
<p><strong>+19.1% annualized return improvement</strong> over unfiltered ORB on the test split. The gated version clears buy-and-hold on a Sharpe basis, and crucially preserves performance in regimes where naked ORB blows up (Q4 2022 trendless / mean-reverting environment).</p>
</section>
<section id="improvements-post-original" class="level2">
<h2 class="anchored" data-anchor-id="improvements-post-original">Improvements (post-original)</h2>
<p>The original 2024 build was a single-model gated ORB. Subsequent improvements:</p>
<ul>
<li><strong>Time-series cross-validation</strong> instead of standard CV — the original had look-ahead leakage from random folds across non-stationary feature distributions</li>
<li><strong>Feature drift monitoring</strong> — KL divergence between training-period and recent-period feature distributions, with a model-refit trigger when drift exceeds a threshold (a minimal version is sketched after this list)</li>
<li><strong>Cost model</strong> — added per-trade slippage + half-spread costs at TQQQ realistic levels; the IRR uplift held but the Sharpe improvement compressed, which is the honest signal</li>
<li><strong>Multi-interval ensemble</strong> — instead of fitting one gating model per interval, train a single model that predicts the <em>best</em> interval for the day (multinomial), then trade that one</li>
<li><strong>SQL pipeline reorg</strong> — moved from per-day feature recomputation to incremental updates keyed on the latest bar; cut the daily feature build from minutes to seconds</li>
</ul>
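<p>The drift check can be sketched with plain NumPy, assuming day-level feature frames for the training window and the recent window. The bin count and the 0.25 trigger are illustrative.</p>
<pre><code>import numpy as np

def kl_drift(train_col, recent_col, bins=20, eps=1e-9):
    """KL(recent || train) for one feature, over a shared binning."""
    edges = np.histogram_bin_edges(np.concatenate([train_col, recent_col]), bins=bins)
    p, _ = np.histogram(recent_col, bins=edges)
    q, _ = np.histogram(train_col, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def needs_refit(train_df, recent_df, threshold=0.25):
    """Trigger a model refit when any feature's distribution has drifted past the threshold."""
    return any(kl_drift(train_df[c].to_numpy(), recent_df[c].to_numpy()) &gt; threshold
               for c in train_df.columns)</code></pre>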
</section>
<section id="stack" class="level2">
<h2 class="anchored" data-anchor-id="stack">Stack</h2>
<ul>
<li>Python 3.10+ — <code>xgboost</code>, <code>pandas</code>, <code>numpy</code>, <code>scikit-learn</code></li>
<li>SQLite for the feature store (Postgres-ready schema)</li>
<li>Backtest reporting in Jupyter</li>
<li>TQQQ minute-bar history via <code>financeds</code> (custom data layer I built for this and the LEAP project)</li>
</ul>
</section>
<section id="what-it-demonstrates" class="level2">
<h2 class="anchored" data-anchor-id="what-it-demonstrates">What it demonstrates</h2>
<ul>
<li>Treating the strategy and the gating model as separable problems</li>
<li>Honest ML evaluation: time-series CV, drift monitoring, cost model</li>
<li>A working composite strategy, not just a backtest screenshot</li>
</ul>
</section>
<section id="caveats" class="level2">
<h2 class="anchored" data-anchor-id="caveats">Caveats</h2>
<ul>
<li>TQQQ-specific. The feature distribution and the optimal ORB interval don’t transfer cleanly to underlying QQQ or to single-name stocks; the gating model would need to be re-fit per asset</li>
<li>Survivorship-free underlying (TQQQ has been listed continuously since 2010), so no listing-bias correction is needed; transferring to a name with corporate actions would require those adjustments</li>
</ul>


</section>

 ]]></description>
  <category>Financial Machine Learning</category>
  <category>Algorithmic Trading</category>
  <guid>https://ryan-tolone.com/projects/orb-trading/</guid>
  <pubDate>Mon, 11 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/orb-trading/orb_graph_plsupdate.png" medium="image" type="image/png" height="117" width="144"/>
</item>
<item>
  <title>CNN-Based Age Prediction System</title>
  <link>https://ryan-tolone.com/projects/age-pred/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>Given a photo of a person’s face, predict their age. It’s a classic computer-vision benchmark and the kind of project most ML practitioners build at some point — the interesting parts are the loss function (age is continuous, so this is regression, not classification), the dataset bias (most face datasets skew young and Western), and the deployment surface (a model that lives in a Jupyter notebook isn’t a product).</p>
<p>I trained a <strong>convolutional neural network</strong> (a ResNet10 architecture, in PyTorch) on the UTK dataset — about 9,000 face images labeled with age, gender, and ethnicity. The model achieves an average prediction error of <strong>±4 years</strong> on held-out test images. Then I wrapped it in a <strong>Streamlit</strong> web UI so anyone can drop in a photo and see the prediction live, instead of needing to run the model from a notebook.</p>
<p>The hyperparameter tuning improved accuracy by 28% over the initial baseline — most of that came from learning-rate scheduling and proper regularization, not from fancy architecture changes.</p>
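<p>For concreteness, the training-loop pieces that mattered look roughly like the sketch below. The model, data loaders, <code>num_epochs</code>, and <code>evaluate</code> helper are placeholders (the project itself uses its ResNet10 and UTK loaders); the point is the combination of weight decay and a plateau-based learning-rate schedule.</p>
<pre><code>import torch
import torch.nn as nn

criterion = nn.L1Loss()                                   # mean absolute error, in years
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=3)

for epoch in range(num_epochs):
    model.train()
    for images, ages in train_loader:                     # placeholder DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images).squeeze(1), ages.float())
        loss.backward()
        optimizer.step()
    val_mae = evaluate(model, val_loader)                 # placeholder validation routine
    scheduler.step(val_mae)                               # cut the LR when validation error plateaus</code></pre>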
</section>
<section id="technical-introduction" class="level2">
<h2 class="anchored" data-anchor-id="technical-introduction">Technical introduction</h2>
<p>This project leverages a CNN-based approach to predict age from facial images using PyTorch and the UTK dataset. Utilizing a ResNet10 architecture, the model processes over 9,000 images to achieve an average prediction error of ±4 years. Through extensive hyperparameter optimization—including learning rate scheduling and regularization—the model’s accuracy improved by 28%, and a Streamlit UI was developed for real-time demographic analysis.</p>
</section>
<section id="output" class="level2">
<h2 class="anchored" data-anchor-id="output">Output</h2>
<p>Here is a screenshot from the Streamlit UI demonstrating real-time age prediction:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/age-pred/streamlitui.png" class="img-fluid figure-img"></p>
<figcaption>Age Prediction UI Screenshot</figcaption>
</figure>
</div>
</section>
<section id="models-used" class="level2">
<h2 class="anchored" data-anchor-id="models-used">Models Used</h2>
<ul>
<li><strong>ResNet10 CNN Architecture</strong> for age prediction<br>
</li>
<li><strong>PyTorch</strong> for model training and inference<br>
</li>
<li><strong>Streamlit</strong> for deploying a real-time user interface</li>
</ul>
</section>
<section id="training" class="level2">
<h2 class="anchored" data-anchor-id="training">Training</h2>
<ul>
<li><strong>CNN Model Training</strong>
<ul>
<li>Includes data preprocessing, model training, and hyperparameter tuning.</li>
</ul></li>
</ul>
</section>
<section id="requirements" class="level2">
<h2 class="anchored" data-anchor-id="requirements">Requirements</h2>
<ul>
<li>python 3.8+</li>
<li>pytorch</li>
<li>torchvision</li>
<li>pandas</li>
<li>numpy</li>
<li>streamlit</li>
<li>matplotlib or seaborn (for visualization)</li>
</ul>


</section>

 ]]></description>
<category>CNNs</category>
  <guid>https://ryan-tolone.com/projects/age-pred/</guid>
  <pubDate>Tue, 17 Sep 2024 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/age-pred/lebron.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Ethereum Smart Contract for NFT Generation &amp; Minting</title>
  <link>https://ryan-tolone.com/projects/pimpin-pandas/</link>
  <description><![CDATA[ 




<section id="in-plain-english" class="level2">
<h2 class="anchored" data-anchor-id="in-plain-english">In plain English</h2>
<p>An <strong>NFT</strong> (“non-fungible token”) is a unique digital item — usually a piece of art — whose ownership is recorded on a blockchain. NFT collections are typically generated <em>programmatically</em>: artists design a small number of “traits” (different hats, eyes, backgrounds, accessories), and a script combines them randomly to produce thousands of unique pieces. The collection is then put on a <strong>smart contract</strong> — a small program living on the Ethereum blockchain — that lets people pay to “mint” (claim) one of the pieces.</p>
<p>This was a 10,000-piece NFT collection (“Pimpin’ Pandas”) I designed end-to-end:</p>
<ul>
<li><strong>Image generation pipeline</strong> in Python that combined hundreds of trait layers into 10,000 unique pandas with no duplicates and 99.9% metadata integrity (every piece’s recorded traits actually match its image). A minimal version of the composition loop is sketched after this list.</li>
<li><strong>Ethereum smart contract</strong> in Solidity (the language used for Ethereum programs), implementing the <strong>ERC-721</strong> standard (the technical spec for unique-item NFTs). Optimizations to the contract reduced the gas fees buyers paid to mint by 15%.</li>
<li><strong>Minting UI</strong> so non-technical buyers could connect a wallet and claim a panda without needing to interact with the contract directly.</li>
</ul>
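<p>A minimal version of the composition loop, for illustration only: the trait names, directory layout, and counts are placeholders, and the actual pipeline also enforces trait rarity weights and extra integrity checks.</p>
<pre><code>import json, random
from pathlib import Path
from PIL import Image

LAYERS = ["background", "body", "eyes", "hat", "accessory"]   # composited bottom to top
LAYER_DIR, OUT_DIR = Path("traits"), Path("output")
OUT_DIR.mkdir(exist_ok=True)

def sample_traits(rng):
    """Pick one PNG variant per trait layer."""
    return tuple((layer, rng.choice(sorted(p.stem for p in (LAYER_DIR / layer).glob("*.png"))))
                 for layer in LAYERS)

def render(traits, out_path):
    canvas = None
    for layer, variant in traits:
        img = Image.open(LAYER_DIR / layer / f"{variant}.png").convert("RGBA")
        canvas = img if canvas is None else Image.alpha_composite(canvas, img)
    canvas.save(out_path)

rng, seen, metadata, i = random.Random(42), set(), [], 0
while i &lt; 10_000:
    traits = sample_traits(rng)
    if traits in seen:                        # reject duplicates and resample
        continue
    seen.add(traits)
    render(traits, OUT_DIR / f"panda_{i}.png")
    metadata.append({"id": i, "traits": dict(traits)})    # metadata mirrors the rendered image
    i += 1
(OUT_DIR / "metadata.json").write_text(json.dumps(metadata, indent=2))</code></pre>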
<p>The collection successfully facilitated over 1,000 mint transactions.</p>
</section>
<section id="technical-introduction" class="level2">
<h2 class="anchored" data-anchor-id="technical-introduction">Technical introduction</h2>
<p>This project involves designing and deploying an Ethereum smart contract to generate and mint NFTs, supporting over 10,000 unique ERC-721 compliant assets. It features an optimized Python-based image generation pipeline that ensures diverse traits and 99.9% metadata integrity. Additionally, a user-friendly UI streamlines the minting process, reducing gas fees by 15% and enabling more than 1,000 efficient transactions.</p>
</section>
<section id="output" class="level2">
<h2 class="anchored" data-anchor-id="output">Output</h2>
<p>Here is a display of some of the NFTs minted:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://ryan-tolone.com/projects/pimpin-pandas/pandas.png" class="img-fluid figure-img"></p>
<figcaption>NFT Minting UI Screenshot</figcaption>
</figure>
</div>
</section>
<section id="technologies-used" class="level2">
<h2 class="anchored" data-anchor-id="technologies-used">Technologies Used</h2>
<ul>
<li><strong>Ethereum Smart Contract</strong> (Solidity) for NFT generation and minting<br>
</li>
<li><strong>Python-based Image Generation Pipeline</strong> for creating unique NFT assets<br>
</li>
<li><strong>User Interface</strong> for streamlined minting and transaction management</li>
</ul>
</section>
<section id="deployment-documentation" class="level2">
<h2 class="anchored" data-anchor-id="deployment-documentation">Deployment &amp; Documentation</h2>
<ul>
<li><p><strong>Smart Contract Deployment</strong></p></li>
<li><p><strong>Image Generation Pipeline Documentation</strong></p></li>
</ul>
</section>
<section id="requirements" class="level2">
<h2 class="anchored" data-anchor-id="requirements">Requirements</h2>
<ul>
<li>Node.js and npm (for smart contract development)<br>
</li>
<li>Solidity compiler (e.g., via Hardhat or Truffle)<br>
</li>
<li>python 3.8+<br>
</li>
<li>web3.py (or similar library for blockchain interaction)<br>
</li>
<li>pandas, numpy (for image pipeline processing)</li>
</ul>


</section>

 ]]></description>
  <category>Crypto</category>
  <category>Algorithmic Art Generation</category>
  <guid>https://ryan-tolone.com/projects/pimpin-pandas/</guid>
  <pubDate>Fri, 15 Dec 2023 00:00:00 GMT</pubDate>
  <media:content url="https://ryan-tolone.com/projects/pimpin-pandas/pimpinpandasgif.gif" medium="image" type="image/gif"/>
</item>
</channel>
</rss>
