Subscription Churn Prediction & Causal Uplift Modeling

Causal Inference
Uplift Modeling
Product DS
Calibration

KKBox subscription churn pipeline (real Kaggle data, 6.8M members): LightGBM at ROC AUC 0.866, then a causal uplift model that recovers +92% more retention value than risk-targeting on the same budget. An engagement-decay naive estimate flips sign under adjustment.

Published

May 14, 2026

In plain English

Every subscription business — Spotify, Netflix, your phone carrier — eventually has to answer the same question: who do we spend retention money on?

The default playbook is “target the people most likely to leave.” A data team trains a churn predictor, sorts users by predicted churn probability, and hands the top-K names to marketing for a discount email. Intuitive. Standard. Mostly wrong.

The flaw is subtle: predicting who will leave is not the same as predicting who can be convinced to stay. The highest-churn users include two groups the discount can’t help — people leaving for reasons no email will fix (moved countries, found a competitor, dead account) and people who would have stayed without the discount anyway. The dollars get spent. The retention barely budges.

This project rebuilds the pipeline around the right question: for each user, how much would the intervention actually change their retention? That quantity has a name in causal inference — uplift — and there are estimators (T-learner, X-learner, R-learner) that target it directly. On a held-out month, ranking by uplift instead of risk captured +41% more retained subscriptions per dollar on the same budget.
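In symbols (standard potential-outcomes notation, nothing project-specific): writing \(Y(1)\) and \(Y(0)\) for a user's outcome with and without the intervention, uplift is the conditional average treatment effect

\[
\tau(x) \;=\; \mathbb{E}\big[\,Y(1) - Y(0) \mid X = x\,\big],
\]

while a churn model estimates \(\mathbb{E}[Y \mid X = x]\). These are different functions of the data, which is why a model can be excellent at one and useless at the other.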

The project also includes a worked example of a much more common analyst trap: a user-engagement feature that looks like a clear churn signal in the raw data (−2.4pp) but **flips sign** (+1.0pp) once you adjust for the obvious confounders. Headline narratives that ignore this fail in production. Anti-overfit hygiene is woven in: strict temporal CV, frozen holdout month, leakage-proof feature contract, calibrated probabilities.

Data

KKBox is a Taiwanese music-streaming subscription service; the WSDM 2017 Kaggle release is 6.8M members, 21.5M transactions, and 392M user-log rows (~31GB unzipped). The full panel is too large to feature-engineer comfortably, so the pipeline reservoir-samples 50K members through DuckDB before joining the other tables.
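A minimal sketch of that sampling step, assuming the standard Kaggle file layout (`members_v3.csv`, `user_logs.csv`); everything beyond the 50K figure and the DuckDB reservoir sample is illustrative:

```python
import duckdb

con = duckdb.connect()

# DuckDB runs reservoir sampling natively, so 50K member ids can be drawn
# without ever materialising the 6.8M-row members table in pandas.
con.execute("""
    CREATE TABLE sample_members AS
    SELECT msno
    FROM read_csv_auto('data/members_v3.csv')
    USING SAMPLE reservoir(50000 ROWS) REPEATABLE (42)
""")

# Join the sample against the big tables inside DuckDB, so the 392M-row
# user log is filtered down before anything reaches pandas.
logs = con.execute("""
    SELECT l.*
    FROM read_csv_auto('data/user_logs.csv') AS l
    JOIN sample_members USING (msno)
""").df()
```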

After temporal clamping (drops a label-contaminated band at the tail of the data window), the model panel is 125,749 user × renewal-date rows over 25 months, with an 11.5% overall churn rate.

The final test month (2016-12) is split off as a frozen holdout before any modeling decisions are made.

Panel overview — monthly churn rate, sample sizes, and feature coverage.

Temporal leakage is the only thing that matters

Every feature is a SQL aggregate of the form `WHERE date < prediction_date`, so a feature computed at the user's renewal date can only see strictly past data. Getting this wrong is the single biggest source of fake-good model results in the wild ("my model has 0.99 AUC!" is almost always a leak).
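What that contract looks like in practice, as a sketch reusing the `con` handle from the sampling snippet; table and column names are illustrative (`renewal_panel` is hypothetical, `total_secs` follows the KKBox log schema):

```python
# Every aggregate is bounded by the prediction date. The join condition
# l.date < t.prediction_date is the entire leakage defence.
FEATURE_SQL = """
    SELECT
        t.msno,
        t.prediction_date,
        COUNT(l.date)                  AS active_days_30d,
        COALESCE(SUM(l.total_secs), 0) AS listen_secs_30d
    FROM renewal_panel t
    LEFT JOIN user_logs l
      ON  l.msno = t.msno
      AND l.date <  t.prediction_date                     -- strictly past
      AND l.date >= t.prediction_date - INTERVAL 30 DAY
    GROUP BY t.msno, t.prediction_date
"""
features = con.execute(FEATURE_SQL).df()
```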

The diagnostic: if test AUC > 0.95 on this dataset, something is leaking. Observed test AUC is 0.866 — comfortably under the leakage tell.

Headline result — the churn model

Temporal CV with train = 21 months, valid = 2016-11, test = 2016-12 (6,852 rows).

| Model | Valid ROC AUC | Test ROC AUC | Test PR AUC | Test log loss |
|---|---|---|---|---|
| Logistic regression | 0.964 | 0.866 | 0.571 | 0.230 |
| LightGBM | 0.983 | 0.866 | 0.705 | 0.211 |

LightGBM and LR tie on ROC AUC, but LightGBM is well ahead on PR AUC and log loss: its ranking is better exactly where the churners are concentrated, at the top of the score distribution, and its probabilities are sharper where it matters.
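As a sketch, the split logic itself (assuming `panel` is the 125,749-row feature frame; the `renewal_month` column name and `"YYYY-MM"` string encoding are illustrative):

```python
import pandas as pd

def temporal_split(panel: pd.DataFrame, month_col: str = "renewal_month"):
    # Chronological split: months strictly before 2016-11 train, 2016-11
    # selects the model, 2016-12 is the frozen holdout scored exactly once.
    train = panel[panel[month_col] <  "2016-11"]   # 21 training months
    valid = panel[panel[month_col] == "2016-11"]   # model selection only
    test  = panel[panel[month_col] == "2016-12"]   # 6,852 rows, frozen
    return train, valid, test
```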

Feature importance.

Calibration — well-calibrated raw, worse after post-hoc fix

| Model | Raw ECE | Platt-scaled ECE | Isotonic ECE |
|---|---|---|---|
| Logistic regression | 0.018 | 0.027 | 0.027 |
| LightGBM | 0.019 | 0.025 | 0.022 |

Both models are well-calibrated out of the box — predicted probability ≈ empirical frequency to within ~2pp. Counterintuitively, applying Platt or isotonic re-scaling actually degrades calibration here. The reason: the validation month (2016-11) is a high-churn month (19.7%); the test month (2016-12) is a normal-churn month. A calibrator fit on the high-churn month over-corrects the normal-churn predictions. The lesson: post-hoc calibration only helps when valid and test distributions match.

We ship the raw scores downstream.
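For reference, the ECE being reported: a sketch of the standard binned definition (the project's exact bin count is an assumption):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Occupancy-weighted mean |mean predicted prob - empirical churn rate|
    over equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_of = np.digitize(y_prob, edges[1:-1])      # bin index per prediction
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_of == b
        if in_bin.any():
            gap = abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
            ece += in_bin.mean() * gap             # weight by bin occupancy
    return ece
```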

Calibration.

Business framing — risk-only targeting baseline

Assume \(\$0.50\) to email a user, \(\$20\) value per retained subscriber, 30% baseline success rate on the discount.

  • Optimal coverage K* = 20.4% of users, net value \(\$2{,}878\)
  • Top-20% by risk captures 80% of all churners

That’s the baseline. It’s a perfectly reasonable place for a product team to stop. We can do better.
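A sketch of how the value curve and \(K^*\) fall out of those three parameters (`y_test` and `p_churn` stand in for the holdout labels and model scores):

```python
import numpy as np

def net_value_curve(y_true, risk_score, cost=0.50, value=20.0, success=0.30):
    """Net value of emailing the top-k users by predicted risk, for every k.
    Only an actual churner can be saved, and the discount lands 30% of the time."""
    order = np.argsort(-risk_score)                      # riskiest first
    is_churner = y_true[order].astype(float)
    saved = value * success * np.cumsum(is_churner)      # expected retained value
    spent = cost * np.arange(1, len(order) + 1)          # cumulative email cost
    return saved - spent

curve = net_value_curve(y_test, p_churn)
k_star = curve.argmax() + 1
print(k_star / len(curve), curve.max())   # reported above: 20.4% coverage, ~$2,878
```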

Business value curve.

The uplift pivot

The intervention here is a simulated retention email with a known causal-uplift function: persuadable users sit in the middle of the engagement distribution, sure-things and lost-causes sit at the tails. Three uplift estimators, rolled by hand from scikit-learn + LightGBM (the simplest, the T-learner, is sketched after the list):

  • T-learner — train one churn model on treated users, another on control, take the difference.
  • X-learner — uses the T-learner’s predictions to impute individual treatment effects for a second-stage model. Tends to do better than the T-learner at small sample sizes or when treated and control groups are imbalanced.
  • R-learner — Robinson residualisation: residualise outcome and treatment separately against confounders, then regress the outcome residuals on the treatment residuals. Cross-fitted with a gradient-boosted second stage.
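The T-learner is small enough to show whole; a minimal sketch under the project's stated stack, with illustrative variable names (`X` features, `y` churn label, `t` the 0/1 email flag, all NumPy arrays):

```python
from lightgbm import LGBMClassifier

def t_learner_uplift(X, y, t, X_new):
    """Fit separate churn models on treated and control users; the uplift
    estimate is the predicted *reduction* in churn under treatment."""
    m_treated = LGBMClassifier().fit(X[t == 1], y[t == 1])
    m_control = LGBMClassifier().fit(X[t == 0], y[t == 0])
    p_treated = m_treated.predict_proba(X_new)[:, 1]
    p_control = m_control.predict_proba(X_new)[:, 1]
    return p_control - p_treated    # positive = the email helps this user
```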

How well does each estimator recover the true uplift on the test month?

| Policy | Spearman vs τ | Top-decile τ̄ | Qini |
|---|---|---|---|
| T-learner | +0.286 | 0.152 | 55.4 |
| X-learner | +0.316 | 0.166 | 59.9 |
| R-learner | +0.436 | 0.170 | 54.4 |
| Risk (GBM) | −0.023 | 0.117 | 28.6 |
| Random | −0.015 | 0.118 | 8.9 |

The risk policy’s rank-correlation with true uplift is essentially zero (slightly negative). This is the project’s point in one line: a top-class churn model is uncorrelated with how persuadable a user is. Risk and uplift are different objects and the discount budget cares about uplift.
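For reference, the Qini column comes from a curve like the following sketch (standard construction; the project's exact normalisation is an assumption, and `tau_hat` / `t_test` stand in for estimator output and the treatment flag):

```python
import numpy as np

def qini_curve(uplift_score, y_retained, t):
    """Rank users by predicted uplift; at each depth k, compare cumulative
    retained among treated users vs a size-matched control estimate."""
    order = np.argsort(-uplift_score)
    y, tr = y_retained[order], t[order]
    ret_t = np.cumsum(y * tr)           # retained & treated in top-k
    ret_c = np.cumsum(y * (1 - tr))     # retained & control in top-k
    n_t, n_c = np.cumsum(tr), np.cumsum(1 - tr)
    scale = np.divide(n_t, n_c, out=np.zeros(len(y)), where=n_c > 0)
    return ret_t - ret_c * scale        # Qini coefficient = area vs random

qini = qini_curve(tau_hat, 1 - y_test, t_test)
```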

The headline business number

At a 20% retention budget cap — the realistic decision regime:

| Policy | Net value | Retained subs |
|---|---|---|
| Risk (GBM) | −$1,270 | 142 |
| T-learner | −$548 | 178 |
| X-learner | −$321 | 190 |
| R-learner | −$96 | 201 |

R-learner nets \(\$1{,}174\) more retention value than risk-targeting on the same budget — a +92.4% lift — and retains +41% more subscriptions per dollar.

The absolute numbers are negative because at simulated parameters the intervention is barely net-positive even at the optimum; the relative lift over the risk baseline is what generalises to a real product.

Uplift headline.

The most informative finding — a sign flip

The dataset has a feature called engagement decay: did the user’s 30-day listening time drop by >25% vs the prior 30 days? Naive reading: people losing interest should be more likely to churn.

The data, before adjustment, says they’re less likely to churn (−2.4pp). The instinct is to throw out the feature. Don’t.

| Estimator | ATE on P(churn) | 95% CI |
|---|---|---|
| Naive (raw) | −0.024 | [−0.028, −0.020] |
| Adjusted (g-formula) | +0.010 | [+0.006, +0.014] |
| IPW | −0.002 | [−0.006, +0.002] |
| E-value (adjusted) | 1.40 | |

After adjusting for tenure, plan tier, prior cancellations, signup channel, and longer-window engagement baselines, the sign of the effect flips from −2.4pp to +1.0pp. This is a textbook Simpson-style reversal: the people with the biggest 30-day engagement drops are heavily skewed toward long-tenured loyalists with low baseline churn. The raw correlation reflects who has the feature, not what the feature does.

The E-value of 1.40 is small — it says a moderately strong unmeasured confounder could overturn this. Honest read: engagement decay has at most a small positive causal effect on churn, and the original $-$2.4pp reading was almost entirely bias.
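A sketch of the g-formula (standardisation) step that produces the adjusted row; the confounder list follows the text, but the column names and the logistic outcome model are illustrative, and the confounders are assumed already numerically encoded:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

conf_cols = ["tenure", "plan_tier", "prior_cancels", "signup_channel", "engage_90d"]
A = df["engagement_decay"].to_numpy()     # 1 if 30d listening fell >25%
X = df[conf_cols].to_numpy()
y = df["churn"].to_numpy()

# Fit one outcome model P(churn | A, X)...
m = LogisticRegression(max_iter=1000).fit(np.column_stack([A, X]), y)

# ...then predict every user's churn under A=1 and A=0 and average:
# ATE = mean_x[ P(churn | A=1, x) - P(churn | A=0, x) ]
p1 = m.predict_proba(np.column_stack([np.ones_like(A), X]))[:, 1]
p0 = m.predict_proba(np.column_stack([np.zeros_like(A), X]))[:, 1]
ate_adjusted = (p1 - p0).mean()           # reported above: +0.010
```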

Naive vs adjusted vs IPW.

What this leaves out

  • A real RCT on retention emails. The Path-A simulated intervention is what lets us verify the estimators recover the right ranking; it isn’t a substitute for an experiment.
  • Time-to-churn framing. The current target is binary churn-in-30-days; Cox / discrete-time hazard would handle right-censoring properly.
  • Multi-period budgeting. Each renewal is treated independently rather than as a sequential decision over a yearly budget.
  • LTV weighting. Optimises $/retained-subscription rather than $/retained-LTV.

What I’d build next

  • Survival analysis (Cox or discrete-time hazard) on the same panel, shifting the question from “will they churn” to “when will they churn”. This has the most cross-domain interview value.
  • LTV-weighted uplift — combine per-user uplift with tenure-based LTV.
  • Sequential bandit budgeting — frame retention spend as a monthly contextual bandit with cumulative uplift estimates as priors.

Stack

  • DuckDB (reservoir sampling + joins at 392M-row scale), pandas, scikit-learn, LightGBM
  • T/X/R-learners rolled from sklearn + LightGBM (causalml/econml have unreliable Python 3.13 wheels; pedagogically clearer this way)
  • 5-phase pipeline runs end-to-end via `python scripts/run_all.py`

What it demonstrates

  • The full product-DS surface: temporal CV, frozen holdout, calibration, business-value curves, feature importance.
  • The depth layer recruiters can actually go into: uplift modelling, Qini curves, IPW, E-values, sign-flip diagnosis.
  • An honest answer to “why uplift, not risk?” with a number attached.