May 8, 2026 · Causal Inference / Uplift Modeling / Product DS / Calibration

Subscription Churn Prediction & Causal Uplift Modeling

KKBox subscription churn pipeline (real Kaggle data, 6.8M members) — LightGBM at ROC AUC 0.866, then a causal uplift model that retains +92% more value than risk-targeting on the same budget. Engagement-decay naive estimate flips sign under adjustment.

Most subscription churn pipelines target the highest-risk users with retention offers. That’s the wrong target. Predicting who will leave isn’t the same as predicting who can be persuaded to stay; the top-risk segment is full of people leaving for reasons a discount can’t fix (moved countries, found a competitor, dead account) plus people who would have stayed without the offer anyway. The dollars go out, retention barely moves.

This pipeline targets uplift directly: for each user, how much would the intervention actually change their retention? T-learner, X-learner, and R-learner estimators give per-user treatment effects on the KKBox panel of 6.8M members. Ranking by uplift instead of risk captured 41% more retained subscriptions per dollar on the same budget on a held-out month.

A second worked example: a user-engagement feature that looks like a clear churn signal in the raw data (-2.4pp) flips sign (+1.0pp) once the obvious confounders are adjusted for. Headline numbers that ignore that kind of confounding fail in production. Anti-overfit hygiene is built in throughout: strict temporal CV, a frozen holdout month, a leakage-proof feature contract, calibrated probabilities.

Data

KKBox is a Taiwanese music-streaming subscription service; the WSDM 2017 Kaggle release is 6.8M members, 21.5M transactions, and 392M user-log rows (~31GB unzipped). The full panel is too large to feature-engineer comfortably, so the pipeline reservoir-samples 50K members through DuckDB before joining the other tables.
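The sampling step is small enough to show. A minimal sketch, assuming the raw Kaggle CSVs sit under data/raw/ with their original names and that msno is the member id; paths, file names, and the seed here are illustrative, not the repo's:

```python
import duckdb

con = duckdb.connect()

# Draw the 50K-member reservoir sample; REPEATABLE pins the seed so the
# panel is reproducible across runs.
members = con.sql("""
    SELECT msno
    FROM read_csv_auto('data/raw/members_v3.csv')
    USING SAMPLE reservoir(50000 ROWS) REPEATABLE (42)
""").df()

# Register the sample and join the 392M-row log table against it, so only
# the sampled members' rows ever leave DuckDB.
con.register("sample_members", members)
logs = con.sql("""
    SELECT l.*
    FROM read_csv_auto('data/raw/user_logs.csv') AS l
    JOIN sample_members USING (msno)
""").df()
```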

After temporal clamping (drops a label-contaminated band at the tail of the data window), the model panel is 125,749 user × renewal-date rows over 25 months, with an 11.5% overall churn rate.

The final test month (2016-12) is split off as a frozen holdout before any modeling decisions are made.

Panel overview — monthly churn rate, sample sizes, and feature coverage.

Temporal leakage is the only thing that matters

Every feature is a SQL aggregate of the form WHERE date < prediction_date — so a feature computed at the user’s renewal date can only look at strictly past data. Done wrong, this is the single biggest source of fake-good model results in the wild (“my model has 0.99 AUC!” is almost always a leak).
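A minimal sketch of what that contract looks like in practice, assuming a panel table of (msno, prediction_date) rows and a user_logs table of (msno, date, total_secs) are already registered on the DuckDB connection; table and column names are illustrative, not the repo's:

```python
import duckdb

con = duckdb.connect()
# ... `panel` and `user_logs` assumed registered on `con` ...

features = con.sql("""
    SELECT
        p.msno,
        p.prediction_date,
        -- listening time in the 30 days before the renewal date
        SUM(l.total_secs) FILTER (
            WHERE l.date >= p.prediction_date - INTERVAL 30 DAY
        ) AS secs_30d,
        -- the prior 30-day window, for the engagement-decay comparison
        SUM(l.total_secs) FILTER (
            WHERE l.date >= p.prediction_date - INTERVAL 60 DAY
              AND l.date <  p.prediction_date - INTERVAL 30 DAY
        ) AS secs_prior_30d
    FROM panel AS p
    LEFT JOIN user_logs AS l
      ON l.msno = p.msno
     AND l.date < p.prediction_date      -- the contract: strictly past rows only
    GROUP BY p.msno, p.prediction_date
""").df()
```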

The diagnostic: if test AUC > 0.95 on this dataset, something is leaking. Observed test AUC is 0.866 — comfortably under the leakage tell.

Headline result — the churn model

Temporal CV with train = 21 months, valid = 2016-11, test = 2016-12 (6,852 rows).

| Model | Valid ROC AUC | Test ROC AUC | Test PR AUC | Test log loss |
|---|---|---|---|---|
| Logistic regression | 0.964 | 0.866 | 0.571 | 0.230 |
| LightGBM | 0.983 | 0.866 | 0.705 | 0.211 |

LightGBM and LR tie on ROC AUC, so the overall ranking quality is the same; LightGBM wins on PR AUC and log loss because its probabilities are more confidently right on the minority churn class, which is where those two metrics concentrate their weight.
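A minimal sketch of the split and the two baselines, assuming a feature frame `panel` with a string `month` column and a binary `is_churn` target; column names and hyperparameters are illustrative, not the repo's:

```python
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score, log_loss

# Strictly temporal split: everything before 2016-11 trains, 2016-11 validates,
# 2016-12 is the frozen holdout that gets scored exactly once.
train = panel[panel["month"] < "2016-11"]
valid = panel[panel["month"] == "2016-11"]   # used for model selection (not shown)
test  = panel[panel["month"] == "2016-12"]

feature_cols = [c for c in panel.columns if c not in ("msno", "month", "is_churn")]

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "lightgbm": LGBMClassifier(n_estimators=500, learning_rate=0.05),
}
for name, model in models.items():
    model.fit(train[feature_cols], train["is_churn"])
    p = model.predict_proba(test[feature_cols])[:, 1]
    print(name,
          f"ROC AUC={roc_auc_score(test['is_churn'], p):.3f}",
          f"PR AUC={average_precision_score(test['is_churn'], p):.3f}",
          f"log loss={log_loss(test['is_churn'], p):.3f}")
```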

Feature importance.

Calibration — well-calibrated raw, worse after post-hoc fix

| Model | Raw ECE | Platt-scaled ECE | Isotonic ECE |
|---|---|---|---|
| Logistic regression | 0.018 | 0.027 | 0.027 |
| LightGBM | 0.019 | 0.025 | 0.022 |

Both models are well-calibrated out of the box — predicted probability ≈ empirical frequency to within ~2pp. Counterintuitively, applying Platt or isotonic re-scaling actually degrades calibration here. The reason: the validation month (2016-11) is a high-churn month (19.7%); the test month (2016-12) is a normal-churn month. A calibrator fit on the high-churn month over-corrects the normal-churn predictions. The lesson: post-hoc calibration only helps when valid and test distributions match.
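The calibration check itself is a few lines. A minimal ECE sketch; the equal-width binning here is illustrative and may differ from the repo's:

```python
import numpy as np

def expected_calibration_error(y, p, n_bins=10):
    """Mean |predicted - empirical| churn rate per bin, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p, edges[1:-1])          # bin index for each prediction
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

# Illustrative call on synthetic scores (stand-ins for the real model outputs):
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 5_000)
y = (rng.uniform(0, 1, 5_000) < p).astype(int)   # calibrated by construction
print(f"ECE ≈ {expected_calibration_error(y, p):.3f}")
```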

We ship the raw scores downstream.

Calibration.

Business framing — risk-only targeting baseline

Assume $0.50 to email a user, $20 of value per retained subscriber, and a 30% baseline success rate on the discount. (The value-curve computation is sketched in code after the bullets.)

  • Optimal coverage K* = 20.4% of users, net value $2,878
  • Top-20% by risk captures 80% of all churners
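A minimal sketch of how the value curve and the optimal coverage K* are computed, assuming test-month churn labels `y_test` and model scores `p_churn`; the economics match the assumptions above, and the exact value model in the repo may differ slightly:

```python
import numpy as np

EMAIL_COST, RETAIN_VALUE, SUCCESS_RATE = 0.50, 20.0, 0.30

def value_curve(y_true, p_churn):
    """Net value of emailing the top-K% riskiest users, for K = 1..100."""
    order = np.argsort(-p_churn)                  # riskiest first
    y = np.asarray(y_true)[order]
    n = len(y)
    ks, values = [], []
    for k in range(1, 101):
        top = y[: max(1, int(n * k / 100))]
        # retained churners * $20 * 30% success, minus $0.50 per email sent
        value = top.sum() * RETAIN_VALUE * SUCCESS_RATE - len(top) * EMAIL_COST
        ks.append(k)
        values.append(value)
    return np.array(ks), np.array(values)

# ks, vals = value_curve(y_test, p_churn)
# print("optimal coverage:", ks[vals.argmax()], "%, net value:", vals.max())
```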

That’s the baseline. It’s a perfectly reasonable place for a product team to stop. We can do better.

Business value curve.

The uplift pivot

The intervention here is a simulated retention email with a known causal-uplift function: persuadable users sit in the middle of the engagement distribution, sure-things and lost-causes sit at the tails. Three uplift estimators, rolled by hand from scikit-learn and LightGBM (sketched in code after the list):

  • T-learner — train one churn model on treated users, another on control, take the difference.
  • X-learner — uses the T-learner's predictions to impute per-user treatment effects, then fits a second-stage model on them. Tends to do better than the T-learner in small samples.
  • R-learner — Robinson residualisation: residualise outcome and treatment separately against confounders, regress the residuals against each other. Cross-fitted with a gradient-boosted second stage.
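A minimal sketch of the T- and R-learner as rolled here, assuming numpy arrays X (features), w (0/1 treatment flag) and y (1 = retained); the X-learner follows the same pattern with an extra imputation stage. Hyperparameters and the simplified cross-fitting are illustrative, not the repo's:

```python
import numpy as np
from lightgbm import LGBMClassifier, LGBMRegressor

def t_learner(X, w, y):
    """Fit one outcome model per arm; uplift = P(y | treated) - P(y | control)."""
    m1 = LGBMClassifier().fit(X[w == 1], y[w == 1])
    m0 = LGBMClassifier().fit(X[w == 0], y[w == 0])
    return m1.predict_proba(X)[:, 1] - m0.predict_proba(X)[:, 1]

def r_learner(X, w, y, folds=3):
    """Robinson residualisation with cross-fitting (simplified sketch)."""
    n = len(y)
    y_hat, w_hat = np.zeros(n), np.zeros(n)
    idx = np.arange(n) % folds
    for f in range(folds):
        tr, te = idx != f, idx == f
        # Out-of-fold nuisance models: outcome and treatment vs confounders
        y_hat[te] = LGBMClassifier().fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]
        w_hat[te] = LGBMClassifier().fit(X[tr], w[tr]).predict_proba(X[te])[:, 1]
    y_res, w_res = y - y_hat, w - w_hat
    # Second stage: weighted regression of y_res / w_res on X with weights w_res^2,
    # which minimises sum_i (y_res_i - tau(x_i) * w_res_i)^2.
    w_res_safe = np.where(w_res == 0, 1e-6, w_res)
    tau = LGBMRegressor().fit(X, y_res / w_res_safe, sample_weight=w_res ** 2)
    return tau.predict(X)
```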

How well does each estimator recover the true uplift on the test month?

| Policy | Spearman vs τ | Top-decile τ̄ | Qini |
|---|---|---|---|
| T-learner | +0.286 | 0.152 | 55.4 |
| X-learner | +0.316 | 0.166 | 59.9 |
| R-learner | +0.436 | 0.170 | 54.4 |
| Risk (GBM) | -0.023 | 0.117 | 28.6 |
| Random | -0.015 | 0.118 | 8.9 |

The risk policy’s rank-correlation with true uplift is essentially zero (slightly negative). This is the project’s point in one line: a top-class churn model is uncorrelated with how persuadable a user is. Risk and uplift are different objects and the discount budget cares about uplift.
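A minimal sketch of the Qini construction used to score each policy, assuming test-month arrays `y` (1 = retained), `w` (treatment flag) and a per-user score to rank by; the repo's exact normalisation may differ:

```python
import numpy as np

def qini_curve(y, w, score):
    """Cumulative incremental retained subs when treating users in score order."""
    order = np.argsort(-score)
    y, w = np.asarray(y)[order], np.asarray(w)[order]
    n_t = np.cumsum(w)                    # treated users seen so far
    n_c = np.cumsum(1 - w)                # control users seen so far
    y_t = np.cumsum(y * w)                # retained among treated
    y_c = np.cumsum(y * (1 - w))          # retained among control
    return y_t - y_c * n_t / np.maximum(n_c, 1)

def qini_coefficient(y, w, score):
    """Area between the Qini curve and the random-targeting diagonal."""
    q = qini_curve(y, w, score)
    random_line = np.linspace(q[-1] / len(q), q[-1], len(q))
    return (q - random_line).sum() / len(q)
```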

The headline business number

At a 20% retention budget cap — the realistic decision regime:

| Policy | Net value | Retained subs |
|---|---|---|
| Risk (GBM) | -$1,270 | 142 |
| T-learner | -$548 | 178 |
| X-learner | -$321 | 190 |
| R-learner | -$96 | 201 |

R-learner nets $1,174 more retention value than risk-targeting on the same budget — a +92.4% lift — and retains +41% more subscriptions per dollar.

The absolute numbers are negative because, at the simulated parameters, the intervention is barely net-positive even at the optimum; the relative lift over the risk baseline is what generalises to a real product.
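A minimal sketch of the comparison at the cap, assuming per-user policy scores plus the simulated per-user true uplift (which only exists because the intervention is synthetic); the economics reuse the assumptions from the risk-only baseline:

```python
import numpy as np

EMAIL_COST, RETAIN_VALUE, BUDGET_FRAC = 0.50, 20.0, 0.20

def policy_value(score, true_uplift):
    """Net value of treating the top 20% of users ranked by `score`."""
    n_treat = int(len(score) * BUDGET_FRAC)
    treated = np.argsort(-score)[:n_treat]
    retained = true_uplift[treated].sum()      # expected incremental retained subs
    return retained * RETAIN_VALUE - n_treat * EMAIL_COST, retained

# value_risk, subs_risk = policy_value(p_churn, true_uplift)
# value_r,    subs_r    = policy_value(tau_r_learner, true_uplift)
```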

Uplift headline.

The most informative finding — a sign flip

The dataset has a feature called engagement decay: did the user’s 30-day listening time drop by >25% vs the prior 30 days? Naive reading: people losing interest should be more likely to churn.

The data, before adjustment, says they’re less likely to churn (-2.4pp). The instinct is to throw out the feature. Don’t.

| Estimator | ATE on P(churn) | 95% CI |
|---|---|---|
| Naive (raw) | -0.024 | [-0.028, -0.020] |
| Adjusted (g-formula) | +0.010 | [+0.006, +0.014] |
| IPW | -0.002 | [-0.006, +0.002] |
| E-value (adjusted) | 1.40 | |

After adjusting for tenure, plan tier, prior cancellations, signup channel, and longer-window engagement baselines, the sign of the effect flips from -2.4pp to +1.0pp. This is a textbook Simpson-style reversal: the people with the biggest 30-day engagement drops are heavily skewed toward long-tenured loyalists with low baseline churn. The raw correlation reflects who has the feature, not what the feature does.
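A minimal sketch of both adjusted estimators, assuming a dataframe `df` with the binary exposure `engagement_decay`, the outcome `is_churn`, and a list of confounder columns; column names and the outcome/propensity models are illustrative, not the repo's:

```python
import numpy as np
from lightgbm import LGBMClassifier

def g_formula_ate(df, confounders, exposure="engagement_decay", outcome="is_churn"):
    """Outcome regression: fit P(churn | A, X), then average over X with A set to 1 vs 0."""
    X = df[confounders + [exposure]].copy()
    model = LGBMClassifier().fit(X, df[outcome])
    X1, X0 = X.copy(), X.copy()
    X1[exposure], X0[exposure] = 1, 0
    return (model.predict_proba(X1)[:, 1] - model.predict_proba(X0)[:, 1]).mean()

def ipw_ate(df, confounders, exposure="engagement_decay", outcome="is_churn"):
    """Inverse-probability weighting: reweight outcomes by 1 / P(A | X)."""
    ps = LGBMClassifier().fit(df[confounders], df[exposure]) \
                         .predict_proba(df[confounders])[:, 1]
    ps = np.clip(ps, 0.01, 0.99)                   # trim extreme weights
    a, y = df[exposure].values, df[outcome].values
    return np.mean(a * y / ps) - np.mean((1 - a) * y / (1 - ps))
```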

The E-value of 1.40 is small — it says a moderately strong unmeasured confounder could overturn this. Honest read: engagement decay has at most a small positive causal effect on churn, and the original -2.4pp reading was almost entirely bias.
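The E-value itself is VanderWeele & Ding's one-line formula, applied after converting the adjusted effect to an approximate risk ratio (the conversion step is not shown here):

```python
import math

def e_value(rr: float) -> float:
    """Minimum strength of unmeasured confounding (on the risk-ratio scale)
    needed to fully explain away an observed risk ratio rr."""
    rr = max(rr, 1.0 / rr)                 # work with the side of the ratio above 1
    return rr + math.sqrt(rr * (rr - 1.0))
```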

Naive vs adjusted vs IPW.

What this leaves out

  • A real RCT on retention emails. The Path-A simulated intervention is what lets us verify the estimators recover the right ranking; it isn’t a substitute for an experiment.
  • Time-to-churn framing. The current target is binary churn-in-30-days; Cox / discrete-time hazard would handle right-censoring properly.
  • Multi-period budgeting. Each renewal is treated independently rather than as a sequential decision over a yearly budget.
  • LTV weighting. Optimises $/retained-subscription rather than $/retained-LTV.

What I’d build next

  • Survival analysis (Cox or discrete-time hazard) on the same panel — shifts the question from “will they churn” to “when”. Most cross-domain interview value.
  • LTV-weighted uplift — combine per-user uplift with tenure-based LTV.
  • Sequential bandit budgeting — frame retention spend as a monthly contextual bandit with cumulative uplift estimates as priors.

Stack

  • DuckDB (reservoir sampling + joins at 392M-row scale), pandas, scikit-learn, LightGBM
  • T/X/R-learners rolled from sklearn + LightGBM (causalml/econml have unreliable Python 3.13 wheels; pedagogically clearer this way)
  • 5-phase pipeline runs end-to-end via python scripts/run_all.py

What it demonstrates

  • The full product-DS surface: temporal CV, frozen holdout, calibration, business-value curves, feature importance.
  • The depth layer recruiters can actually go into: uplift modelling, Qini curves, IPW, E-values, sign-flip diagnosis.
  • An honest answer to “why uplift, not risk?” with a number attached.