MAISON-LLF, Explained — Dataset · Forecasting Paper

2 What is being predicted

Five recovery outcomes per assessment. Three are summed Likert questionnaires; two are physical-performance measures. The forecasting paper targets the two clinically central ones — OHS (function/pain) and SIS (social isolation).

OHS / OKS

Oxford Hip / Knee Score

12 items · each 0–4 · total 0–48 · higher = better joint function / less pain. Hip patients get OHS, knee patients get OKS (structurally identical).

e.g. “Climbing stairs is painful” → 0 (always) … 4 (never)

SIS

Social Isolation Score

6 items · each 1–5 · total 6–30 · higher = less isolated.

e.g. “I have people I can talk to” → 1 (never) … 5 (always)

TUG · CHAIR-STAND

Physical performance

tug Timed-Up-and-Go (seconds, lower = better) · chairstand sit-stands in 30 s (count, higher = stronger). One continuous value each.

Worked example — three patients, one assessment day

Items are summed into the total. This is exactly what the model's task heads must reproduce.

Patient	OHS items (12 × 0–4)	OHS total	SIS items (6 × 1–5)	SIS total
A — good hip, well-connected	3,4,3,4,4,4,3,4,3,4,3,4	43	5,5,4,5,5,4	28
B — poor hip, some isolation	1,2,2,2,3,2,1,2,1,2,1,2	21	3,4,3,4,3,2	19
C — ceiling on both	4,4,4,4,4,4,4,4,4,4,4,4	48	5,5,5,5,5,5	30
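
The summation in the table can be checked with a few lines of Python. This is a minimal sketch: the item responses are copied from the rows above, while the helper name `total_score` and the range checks are illustrative assumptions, not part of the dataset tooling.

```python
# Item responses copied from the worked-example table above.
patients = {
    "A": {"ohs": [3, 4, 3, 4, 4, 4, 3, 4, 3, 4, 3, 4], "sis": [5, 5, 4, 5, 5, 4]},
    "B": {"ohs": [1, 2, 2, 2, 3, 2, 1, 2, 1, 2, 1, 2], "sis": [3, 4, 3, 4, 3, 2]},
    "C": {"ohs": [4] * 12, "sis": [5] * 6},
}


def total_score(items, n_items, lo, hi):
    """Sum Likert items after checking the item count and per-item range.

    Hypothetical helper for illustration only.
    """
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(not lo <= x <= hi for x in items):
        raise ValueError(f"item outside [{lo}, {hi}]")
    return sum(items)


# (OHS total, SIS total) per patient: OHS is 12 items of 0-4, SIS is 6 items of 1-5.
totals = {
    pid: (total_score(p["ohs"], 12, 0, 4), total_score(p["sis"], 6, 1, 5))
    for pid, p in patients.items()
}
print(totals)
```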

Per-item observed vs theoretical ranges — **Per-item ranges.** Some items barely move (`sis-02`: mean 4.5, σ≈0.6 — predicting the mean is near-perfect), others carry most of the variance (`ohs-11`: σ≈1.6). (real data)

Total-score distributions across 72 events — **Total-score distributions.** Stratifying by assessment number 1→4 gives nearly the same cohort distribution — variation is *between patients*, not across time. This is the seed of the whole ceiling story. (real data)

Scale	Between SD	Within SD	≥10-pt swingers
OHS (0–48)	7.81	3.10	4 / 18
SIS (6–30)	3.34	1.89	0 / 18
OKS (0–48)	9.44	3.16	5 / 18
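
As a rough reading aid, the between- and within-patient SDs in the table can be turned into a between-patient share of variance. The sketch below assumes the two components simply add (share = between² / (between² + within²)); that is a simplification for illustration, not the Shrout–Fleiss ICC estimator used in the companion work.

```python
# (between-patient SD, within-patient SD) copied from the table above.
sd_table = {"OHS": (7.81, 3.10), "SIS": (3.34, 1.89), "OKS": (9.44, 3.16)}


def between_share(between_sd, within_sd):
    """Fraction of total variance that is between-patient.

    Assumes the two variance components add; a simplification.
    """
    b2, w2 = between_sd ** 2, within_sd ** 2
    return b2 / (b2 + w2)


shares = {scale: between_share(b, w) for scale, (b, w) in sd_table.items()}
for scale, share in shares.items():
    print(f"{scale}: {share:.0%} of variance is between patients")
```

Under that assumption all three scales land well above one half, which is the pattern the caption describes: most variation is between patients, not across time.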

5 The imputation problem (don't trust the CSV)

The released sensor matrix has zero NaN cells — which is impossible for real wearables (people forget to wear/charge the watch). The upstream pipeline silently filled the gaps, usually by carrying the last value forward. Roughly 17% of all cells are filled constants, not measurements, and it's wildly uneven across patients.

Exhibit A — subject 8's step count

56 consecutive days, every value exactly 1504. σ=0, 1 unique value, longest run = 56. A worn step-counter on a recovering patient never does this.

day	step-count	acc-mean	sleep-deep	note
1	1504	0.243	124	fill begins
2	1504	0.218	112	constant
3	1504	0.301	138
…	1504	…	…	(52 more identical)
56	1504	0.239	124	end

Other channels jitter day-to-day; only step-count is frozen — the signature of per-cell forward-fill.

The 3 audit heuristics

A. Long constant run — flag_long_run: longest identical run ≥ 14 days.
B. Low unique count — flag_low_unique: ≤ 5 distinct values across all 56 days.
C. Zero variance — flag_zero_sd: σ < 1e-6 (channel is a constant).

A cell tripping any flag is treated as imputed. Cross-checked on high-trust patients to keep false positives low.
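
The three heuristics can be sketched for one patient-channel series as below. The flag names and thresholds come from the list above; the function names, the use of the population standard deviation, and the second example series are assumptions made for illustration.

```python
from itertools import groupby
from statistics import pstdev


def longest_run(values):
    """Length of the longest run of identical consecutive values (non-empty input)."""
    return max(len(list(group)) for _, group in groupby(values))


def audit_channel(values, run_days=14, max_unique=5, sd_eps=1e-6):
    """Apply the three audit heuristics to one channel of one patient."""
    return {
        "flag_long_run": longest_run(values) >= run_days,
        "flag_low_unique": len(set(values)) <= max_unique,
        "flag_zero_sd": pstdev(values) < sd_eps,
    }


# Subject 8's step count as described in Exhibit A: 56 days, all 1504.
frozen_steps = [1504] * 56
# Hypothetical channel that changes every day, for contrast.
varying = [1000 + 37 * day for day in range(56)]

frozen_flags = audit_channel(frozen_steps)
varying_flags = audit_channel(varying)
print(frozen_flags)
print(varying_flags)
```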

Most-imputed channels

motion-max (14/18 patients), position-duration (10/18), sleep-snoring (8/18). Zero of 46 channels are clean cohort-wide.

Per-subject trust score

% of a patient's 46 channels passing all 3 audits. Two training subjects (8, 18) are >70% imputed, so a sensor model is largely fitting constants for them.


Cleaning before/after — **Cleaning audit.** Left: the 10 most-flagged channels and how many subjects they hit. Right: per-subject trust after masking. Crucially, masking ~17% of cells leaves persistence NMAE essentially unchanged — **the ceiling is structural, not an imputation artifact.** (real data)

7 Inside the model, stage by stage

Stage I

CatBoost + SHAP feature ranking

Separate CatBoostRegressor for OHS and SIS over 46 sensor + 7 demographic features.
Rank each feature by mean |SHAP| (importance) — computed on training data only.
Surfaces which channels carry recovery signal (acceleration-mean, sleep, step/motion counts on top).

Stage II

Feature gating + GRU forecaster

Grid-search top-M sensor / top-D demographic features → best = 23 sensor, 0 demographic; union over OHS&SIS = 30 sensor features.
Learnable feature gate w_g⊙z, initialised from SHAP, softplus-normalised to unit mean.
Linear projection → LayerNorm → GELU → dropout → GRU encoder (packed, masked, variable length).
Task-specific recency-biased attention (separate for SIS & OHS) over GRU states.
Two heads: SIS → 6 item scores, OHS → 12 item scores; sum → totals.

Stage III

Evaluation

Held-out 3 test patients (12 examples), never seen in training/selection.
Per-item and total-score MAE, RMSE, R², Pearson r — separately for OHS and SIS.
Three training configs compared: joint (multi-task) vs OHS-only vs SIS-only.

Why each component is there

Component	Problem it targets	Mechanism
SHAP feature gate	46 noisy channels, most useless at N=18	Down-weights low-value features instead of hard-dropping; differentiable, trained end-to-end
Recency-biased attention	Recovery is non-stationary; recent days matter more	Learnable decay weights more recent time-steps; one pattern per outcome
Masked / packed sequences	Patients have different history lengths (expanding window)	Pad + mask so attention ignores absent steps
Multi-task heads	Few labels per outcome	Shared encoder ⇒ OHS supervision regularises SIS and vice-versa
Item-level Smooth-L1	Totals are sums of ordinal items	Predict items, sum to totals; robust loss vs outliers

Training: AdamW, lr 2e-3, hidden dim 24, dropout 0.3, ≤500 epochs, early-stop patience 45, grad-clip 1.0, seed 42.
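
To make the item-level Smooth-L1 row concrete, here is a minimal sketch of the loss on one SIS prediction. The transition point `beta=1.0` and the mean reduction are assumptions (the summary above does not state them), and the predicted items are hypothetical numbers.

```python
def smooth_l1(error, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it (beta=1.0 is an assumption)."""
    a = abs(error)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def item_level_loss(pred_items, true_items, beta=1.0):
    """Mean Smooth-L1 over the items of one questionnaire."""
    losses = [smooth_l1(p - t, beta) for p, t in zip(pred_items, true_items)]
    return sum(losses) / len(losses)


# Hypothetical SIS head output (6 items, scale 1-5) and the true responses.
pred_items = [4.2, 3.5, 1.0, 5.0, 2.0, 4.0]
true_items = [4, 4, 5, 5, 2, 4]

loss = item_level_loss(pred_items, true_items)
pred_total = sum(pred_items)  # the total is the sum of predicted items
true_total = sum(true_items)
print(loss, pred_total, true_total)
```

The one badly missed item (error of 4) contributes 3.5 rather than the 8.0 it would under half squared error, which is the sense in which the loss is robust to outliers.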

SHAP top features (which sensors the model leans on)

From the paper's SHAP summary plots — the consistently high-ranked sensor families. Demographics ranked low enough that the grid search kept zero of them.


Indicative ranking drawn from the paper's Fig. 2 (acceleration / movement-event bins, sleep, step & motion counts, heart-rate dominate). Heights are illustrative ordinal ranks, not raw SHAP magnitudes.

8 The ML toolkit, explained from scratch

The pipeline stacks five ideas that each deserve a plain-language explanation: gradient boosting (CatBoost), SHAP attribution, learnable feature gates, the GRU recurrent network, and attention. Here is what each one actually is and why it's used.


CatBoost = gradient-boosted decision trees

Decision tree: a flowchart of yes/no questions on features (“is mean step-count > 3000?”) ending in a numeric prediction at each leaf. One tree is weak and overfits.

Gradient boosting: build many small trees in sequence, where each new tree is trained to predict the residual error left by all previous trees. Add them up: prediction = tree₁ + tree₂ + … Each tree nudges the prediction toward the truth a little. This is the dominant method for tabular data — usually beating neural nets when rows are few and features are heterogeneous.
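
The residual-fitting loop can be sketched in pure Python with one-split trees ("stumps") on a single feature. This is plain gradient boosting for squared error, not CatBoost's ordered boosting; the shrinkage factor `lr`, the function names, and the toy data are assumptions added for illustration.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, minimising squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_trees=50, lr=0.3):
    """Each new stump is fit to the residuals left by the current ensemble."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(tree(x) for tree in trees)


# Hypothetical toy data: mean step count (thousands) vs. a recovery score.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [10.0, 12.0, 15.0, 15.0, 22.0, 30.0, 31.0, 35.0]

model = boost(xs, ys)
mean_y = sum(ys) / len(ys)
mse_mean = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boost = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_mean, mse_boost)
```

The comparison is on the training points only, so it shows that the ensemble fits, not that it generalises.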

What “CatBoost” adds (Yandex's variant): (i) native handling of categorical features via target statistics; (ii) ordered boosting — a permutation trick that computes each row's residual using only rows seen “before” it, which reduces the target-leakage/overfitting that plain boosting suffers on small data. That small-N robustness is exactly why the paper uses it.

Its role here

CatBoost is not the forecaster. It is used only in Stage I as a quick, strong tabular model whose predictions can be explained by SHAP — to rank which of the 46 sensor + 7 demographic features matter, before the GRU is built. A separate CatBoostRegressor is fit for OHS and for SIS.

SHAP = SHapley Additive exPlanations

A model like CatBoost is a black box: it gives a number, not a reason. SHAP answers “how much did each feature contribute to this prediction, vs the average prediction?”

The Shapley value comes from cooperative game theory (Lloyd Shapley, 1953). Treat each feature as a “player” in a game whose “payout” is the model's prediction. A feature's Shapley value is its average marginal contribution across every possible order in which features could be added to the model. It is the unique attribution satisfying a small set of fairness axioms (efficiency: contributions sum to the prediction minus the average prediction; symmetry; dummy features get zero; additivity). SHAP computes these efficiently for trees.
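
For a handful of features the definition can be computed by brute force over every ordering. The sketch below is an illustration under stated assumptions: "absent" features are replaced by a fixed baseline value (SHAP proper averages over background data), and `toy_model` with its three features is hypothetical. It is not the tree-specific algorithm SHAP uses.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    A feature that is "absent" takes its baseline value (a simplification).
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]          # add feature j to the coalition
            new = predict(current)
            phi[j] += new - prev       # its marginal contribution in this order
            prev = new
    return [p / len(orders) for p in phi]


def toy_model(features):
    """Hypothetical model: two interacting features and one the model ignores."""
    steps, sleep, _unused = features
    return 20 + 2.0 * steps + 1.0 * sleep + 0.5 * steps * sleep


x = [3.0, 2.0, 7.0]
baseline = [1.0, 1.0, 1.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

Two of the axioms are visible in the output: the values add up to the gap between the prediction for `x` and the prediction at the baseline, and the ignored feature receives exactly zero.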

Reading the plot (paper Fig. 2 below): each row is a feature, each dot is one patient-day. Dot position = that feature's SHAP value (push toward higher/lower predicted score); colour = the feature's value (red high, blue low). Features are sorted by mean |SHAP| — the average magnitude of influence. The paper takes that ranking to pick the top-M features.

Paper Figure 2: SHAP summary plots — Paper Fig. 2 — SHAP summaries: (a,b) top-20 sensor features for OHS / SIS; (c,d) the 7 demographic features. Acceleration / movement-event bins, sleep, step & motion counts, and heart-rate dominate; demographics rank low. (figure from the paper)

⚠ The leakage the rebuttal had to fix

In the original submission, SHAP was computed on the full dataset, so test-set information influenced which features were selected, contaminating the “leakage-safe” claim. Reviewer 2 caught this. The camera-ready recomputes SHAP on the training split only. The authors declined to compute SHAP per-fold (it would give different rankings per fold and need fold-specific pipelines); they use one fixed training-only ranking instead. See §11.

Learnable feature gates

With 46 noisy channels and ~50 training events, feeding everything to the network invites overfitting. Two crude options: drop low-SHAP features entirely (loses information, the cutoff is arbitrary) or weight them all equally (lets noise through). A gate is the soft middle.

Each input feature gets a learnable weight, and the gated input is the element-wise product: z̃ = w_g ⊙ z. So feature j enters the model scaled by w_g[j]. The network can learn to shrink a useless channel toward 0 and amplify a useful one — and because it's just multiplication, the weights are trained jointly with everything else by gradient descent.

Two design choices that matter:

SHAP initialisation. The gates start at the SHAP importances rather than random/uniform — a warm start that points the model at the features Stage I already found informative.
Softplus + unit-mean normalisation: w_g = softplus(θ) / mean(softplus(θ)). Softplus keeps weights positive; dividing by the mean fixes the average gate at 1, so the gate can re-weight features relative to each other but cannot just globally rescale the whole input (which would fight with the network's own scaling and destabilise training).
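
The normalisation can be sketched directly from the formula w_g = softplus(θ) / mean(softplus(θ)). The values of `theta` and `z` below are hypothetical; how the paper maps SHAP importances onto the initial θ is not specified here, so that step is left out.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t)); always positive."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def gate_weights(theta):
    """w_g = softplus(theta) / mean(softplus(theta)), so the weights average to 1."""
    raw = [softplus(t) for t in theta]
    mean_raw = sum(raw) / len(raw)
    return [r / mean_raw for r in raw]


def apply_gate(theta, z):
    """Element-wise product w_g * z."""
    return [w * v for w, v in zip(gate_weights(theta), z)]


theta = [2.0, 0.0, -3.0]   # hypothetical raw gate parameters
z = [0.5, 0.5, 0.5]        # hypothetical standardised feature values
weights = gate_weights(theta)
gated = apply_gate(theta, z)
print(weights, gated)
```

Adding the same constant to every θ changes the individual weights but never their mean, which is the "re-weight relative to each other, not globally rescale" property described above.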

Honest caveat

This is a sensible regulariser, but it is also an extra set of trained parameters on a tiny dataset, and the paper reports no ablation isolating the gate's contribution — so we can't tell how much it actually helps vs. SHAP-based hard selection alone.

GRU = Gated Recurrent Unit (a kind of RNN)

The problem: a patient's history is a sequence of daily vectors of varying length. You want one summary that respects order (recent days may matter more) and handles any length.

RNN (recurrent neural network): walk through the sequence one day at a time, carrying a “hidden state” h — a memory vector. At each day, new input + previous memory → updated memory. The final memory summarises the whole history. Plain RNNs forget long-range information (vanishing gradients).

GRU fixes this with two gates that learn what to keep vs. overwrite at each step:

Update gate z: how much of the old memory to replace with new info vs. carry forward. (In the update rule below, z near 0 → keep old memory unchanged across many steps → long-term memory; z near 1 → overwrite with the candidate.)
Reset gate r: how much of the old memory to forget when computing the candidate new memory.

So h_new = (1 − z)·h_old + z·h_candidate, with h_candidate built from the new input and a reset-gated copy of h_old. The GRU is a lighter cousin of the LSTM (2 gates vs 3, no separate cell state) — fewer parameters, which suits small data.
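
A scalar version of one GRU step makes the roles of the two gates easy to see. This sketch follows the update rule as written above, h_new = (1 − z)·h_old + z·h_candidate; some libraries swap the roles of z and 1 − z, so the direction of the gate is a convention. All parameter names and values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h_old, p):
    """One GRU step with scalar input and scalar hidden state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_old + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_old + p["br"])   # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_old) + p["bh"])
    h_new = (1.0 - z) * h_old + z * h_cand
    return h_new, z, h_cand


# Hypothetical parameters; only the update-gate bias differs between the two runs.
base = {"wz": 0.1, "uz": 0.1, "wr": 0.1, "ur": 0.1, "br": 0.0,
        "wh": 0.5, "uh": 0.5, "bh": 0.0}

keep = gru_step(0.5, 0.8, {**base, "bz": -20.0})     # z ~ 0: memory carried forward
replace = gru_step(0.5, 0.8, {**base, "bz": 20.0})   # z ~ 1: memory overwritten
print(keep[0], replace[0])
```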

How the paper uses it: selected features → feature gate → linear projection + LayerNorm + GELU + dropout → GRU reads the expanding-window day sequence → produces one hidden state per day. Variable lengths are handled by packing + masking (padded days are ignored). Rather than only using the last hidden state, all states are pooled by attention (next tab).

Task-specific recency-biased attention

The GRU emits one hidden state per day. Attention collapses them into a single context vector as a weighted average, where the model learns the weights: c = Σ_τ α_τ · h_τ, with weights α summing to 1 over the valid days. Important days get larger α; padded days are masked to 0.

Recency bias: a learnable term makes recent days count more by default — natural for recovery, where the latest fortnight is more informative than week one.

Task-specific: OHS and SIS get separate attention modules over the shared GRU, so each outcome can weight the timeline its own way. The context vector then feeds a task head (OHS → 12 item scores, SIS → 6 item scores), which are summed into totals.
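
The weighted average c = Σ_τ α_τ · h_τ can be sketched for a single task as follows. The specific form of the recency bias (a linear age penalty subtracted from each day's score before the softmax) is an assumption made for illustration; in the model the scores and the decay would be learned, whereas here they are fixed hypothetical numbers.

```python
import math


def recency_attention(states, scores, mask, decay):
    """Masked softmax over days with a penalty proportional to a day's age.

    states: one hidden vector per day; scores: one content score per day;
    mask: True for real days, False for padding; decay >= 0.
    """
    last_valid = max(t for t, ok in enumerate(mask) if ok)
    logits = [
        s - decay * (last_valid - t) if ok else -math.inf
        for t, (s, ok) in enumerate(zip(scores, mask))
    ]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]   # exp(-inf) is 0.0 for padded days
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(a * h[d] for a, h in zip(alpha, states)) for d in range(dim)]
    return alpha, context


# Hypothetical example: three real days plus one padded day.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
scores = [0.0, 0.0, 0.0, 0.0]   # equal content scores, so only recency differs
mask = [True, True, True, False]

alpha, context = recency_attention(states, scores, mask, decay=0.5)
print(alpha, context)
```

A task-specific setup would hold one such module (its own scores and decay) per outcome over the same shared states.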

Why this design is reasonable

Every component targets a real small-data problem: gating & SHAP fight feature noise, the GRU+masking handle variable-length histories, attention+recency model non-stationary recovery, and multi-task heads share statistical strength. The ideas are sound; the open question (§10) is whether N=18 can train them.

9 The evaluation metrics, explained

The paper reports four numbers per outcome — MAE, RMSE, R², Pearson r — at two granularities (per-item and total-score). Knowing exactly what each means is essential to reading the results honestly, because they can disagree.

MAE — Mean Absolute Error

MAE = mean(|predicted − actual|). The average size of the miss, in the score's own units. Lower is better; 0 is perfect. Easy to read: “OHS total MAE 5.23” = off by ~5.2 points on a 0–48 scale on average.

Robust to outliers (no squaring). But it has no built-in reference — “5.2” is only good or bad relative to a baseline, which is the whole §10 argument.

RMSE — Root Mean Squared Error

RMSE = sqrt(mean((predicted − actual)²)). Like MAE but squares errors first, so big misses are punished much more. Lower is better. RMSE ≥ MAE always; a large gap between them signals a few large errors (here: one atypical test patient).

R² — coefficient of determination

R² = 1 − SS_res / SS_tot = the fraction of variance the model explains relative to just predicting the mean. 1 = perfect; 0 = no better than predicting the mean; <0 = worse than the mean.

This is the key one. The paper's R² values are negative (e.g. OHS total −1.06). Negative R² means the GRU's predictions are worse than a flat line at the cohort average — it is not just imperfect, it actively underperforms the most trivial baseline on variance-explained.

Pearson r — correlation

r ∈ [−1, 1]: do predictions and truth move together linearly, regardless of offset or scale? +1 = perfect positive linear relationship, 0 = none, −1 = inverted.

r ignores systematic bias — you can have decent r and terrible R² if predictions track the trend but sit consistently too high/low. That is exactly the paper's OHS total: r=+0.53 but R²=−1.06 (right order, wrong level). And sis-02 has r=−0.79: predictions move the wrong way.
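
The "right order, wrong level" case is easy to reproduce. The sketch below implements the four metrics from their definitions and applies them to hypothetical totals where every prediction is exactly 10 points too high; the numbers are made up to isolate the effect and are not taken from the paper.

```python
import math


def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))


def r2(y, p):
    """1 - SS_res / SS_tot, relative to predicting the mean of y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def pearson_r(y, p):
    """Linear correlation; undefined if either series is constant."""
    my, mp = sum(y) / len(y), sum(p) / len(p)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, p))
    sy = math.sqrt(sum((a - my) ** 2 for a in y))
    sp = math.sqrt(sum((b - mp) ** 2 for b in p))
    return cov / (sy * sp)


truth = [20.0, 25.0, 30.0, 35.0]      # hypothetical true totals
pred = [t + 10.0 for t in truth]      # perfect ordering, constant +10 offset

print(mae(truth, pred), rmse(truth, pred), r2(truth, pred), pearson_r(truth, pred))
```

Here r is 1 because the predictions move in lockstep with the truth, while R² is −2.2 because the squared offset (400 in total) exceeds the spread of the truth around its own mean (125).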

Per-item vs total-score — and why they differ

Per-item: score each individual questionnaire item (12 OHS items 0–4; 6 SIS items 1–5) and pool metrics across all items. Fine-grained — “can the model get each question right?”

Total-score: sum the predicted items into the questionnaire total (OHS 0–48, SIS 6–30) and score that. This is the clinically meaningful number.

Why total looks better than per-item

Summing 12 (or 6) noisy item predictions lets independent item errors partially cancel (some too high, some too low), so the total's correlation rises even when items are individually shaky. In the paper, OHS total r (+0.53) ≫ OHS per-item r (−0.04). The flip side: a good total can hide that the model isn't really predicting the items — aggregation masks per-item failure rather than fixing it.

10 What the paper found

The headline is an internal comparison: joint multi-task training beats training each outcome alone on most error metrics. Absolute performance stays limited by the tiny cohort.

Multi-task vs single-task — total-score MAE (lower better)


OHS total MAE 5.693 → 5.233 (−8.1%); SIS total MAE 3.103 → 2.492 (−19.7%). SIS total Pearson r flips from −0.209 (single) to +0.278 (joint) — the clearest sign OHS supervision helps SIS.
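
The percentage changes quoted above follow from the MAE values with simple relative-change arithmetic; the helper name is illustrative.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


ohs_change = pct_change(5.693, 5.233)   # single-task -> joint, OHS total MAE
sis_change = pct_change(3.103, 2.492)   # single-task -> joint, SIS total MAE
print(f"OHS: {ohs_change:.1f}%  SIS: {sis_change:.1f}%")
```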

Full metric table (test, multi-task)

Outcome	level	MAE	R²	r
OHS	per-item	1.046	−0.43	−0.04
OHS	total	5.233	−1.06	+0.53
SIS	per-item	0.868	−0.17	+0.28
SIS	total	2.492	−0.30	+0.28

Total-score r > per-item r: summing 12/6 items cancels item noise and lets the coarse recovery trend show. R² stays negative — predictions track order but carry a systematic offset.

Per-item Pearson r — very uneven

A few items are learnable (ohs-12: r=0.84, the only positive-R² item); some are anti-correlated (sis-02: r=−0.79, i.e. sensors mislead for that social item).

Per-item forecasting — the detail behind the averages

Pooled item metrics hide enormous spread. Reading Table 3 item-by-item is what reveals whether the model learned anything transferable.

The few that work, the many that don't

ohs-12 is the one real success — the only item with positive R² (+0.353) and a strong r (+0.839). Some recovery dimension it captures is consistently reflected in the sensors.
ohs-07, ohs-08, ohs-09 have positive r but negative R² — the model tracks their trend but with a systematic offset (right direction, wrong level).
ohs-11 is the worst (MAE 2.15, R²=−3.94) — and recall from §2 it carries the most total-score variance, so the model fails hardest exactly where it matters most.
sis-02 is actively misleading — r=−0.792, predictions anti-correlate with truth. The sensor cues informative for other items point the wrong way for this social item, which has no clear behavioural correlate in a wearable.

Why per-item is so much weaker than totals

Items are quantised to 5 levels and individually noisy; the network hedges toward the middle. Only when 12/6 items are summed do independent errors cancel and the coarse recovery trend emerge — lifting total-score r well above per-item r.

This is the central tension: the clinically useful total looks decent (r≈0.5) because aggregation launders item-level noise, not because the model predicts the underlying items. SIS is the giveaway — better total metrics than OHS, yet its items are weaker; the SIS total is averaging away errors, not reflecting genuine item skill.

Prediction-distribution analysis (paper Figs. 3–4)

Plotting predicted vs. ground-truth (the dashed diagonal = perfect) exposes how the model fails, not just how much.

Train-set predicted vs ground truth — **Train set.** Points roughly follow the diagonal — the model *can* fit the patients it has seen. SIS is tighter than OHS (narrower 6–30 range is an easier in-sample target); OHS underpredicts at the top end (≥35), hedging toward the mean for the few high scorers. (paper Fig. 3)

Test-set predicted vs ground truth — **Test set (3 unseen patients).** Far more scattered. P14 tracks the diagonal; P2 and P9 are systematically over-predicted at low truth — outputs cluster near ~26 regardless of the real value, i.e. the model falls back to a population-average guess for patients unlike anyone in training. (paper Fig. 4)

Residual analysis (paper Fig. 5)

Residual distributions for OHS and SIS — Residual = (truth − prediction) / σ. Dashed lines = mean residual. (paper Fig. 5)

OHS: systematic over-prediction

Residuals are almost all negative (mean −1.18σ) — the model predicts OHS scores that are consistently too high for unseen patients. P2 dominates (−1.3σ to −2.0σ across the range). A consistent bias like this means the failure is a domain mismatch (test patients differ from train), not random noise — and bias is what destroys R² even when r looks okay.

SIS: a near-zero mean that lies

Mean residual is only −0.26σ — but that is cancellation, not calibration: P9 residuals are positive, P2's are negative (down to −2.3σ), and they average out. A small mean residual here must not be read as “well-calibrated”; per-patient it is failing in opposite directions.
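
The cancellation effect can be shown with a few made-up numbers. In this sketch the records, the patient labels' scores, and `sigma = 2.0` are all hypothetical; in particular, how σ is defined in the paper's Fig. 5 is not stated here, so it is treated as a given constant.

```python
from collections import defaultdict
from statistics import mean


def standardised_residuals(records, sigma):
    """Residual = (truth - prediction) / sigma, keyed by patient."""
    by_patient = defaultdict(list)
    for patient, truth, pred in records:
        by_patient[patient].append((truth - pred) / sigma)
    return dict(by_patient)


# Hypothetical (patient, true SIS total, predicted SIS total) records.
records = [
    ("P9", 24, 21), ("P9", 25, 22),   # under-predicted -> positive residuals
    ("P2", 14, 18), ("P2", 15, 17),   # over-predicted  -> negative residuals
]
sigma = 2.0  # assumed scaling constant

residuals = standardised_residuals(records, sigma)
per_patient_mean = {pid: mean(vals) for pid, vals in residuals.items()}
overall_mean = mean(r for vals in residuals.values() for r in vals)
print(per_patient_mean, overall_mean)
```

The pooled mean is zero even though each patient is off by 1.5σ on average, which is why the per-patient breakdown is the number to read.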

The honest read (the paper states much of this itself)

Benefits are not uniform: multi-task doesn't improve OHS total-score r (single-task OHS is higher there). R² is negative almost everywhere, so item-level prediction is unreliable; the residuals show the dominant error source is patient-level domain mismatch, with a single atypical test patient (P2) swinging the 3-patient metrics. The honest framing — which the rebuttal forced into the conclusion — is feasibility + a leakage-safe protocol + a multi-task regularisation effect, not clinically reliable prediction.

11 Flaws, limitations & the rebuttal

The paper is methodologically careful in its framing but constrained by the data and by several design choices worth scrutinising. Some issues were caught by reviewers; the authors' responses (rebuttal) are folded in below.

A. Evaluation-design issues

No cross-validation — a single 3-patient test set

The model is evaluated on one fixed split (12 train / 3 val / 3 test). It does not use leave-one-subject-out or patient-grouped k-fold CV. With only 3 test patients (12 events), every reported number is one draw from a very high-variance distribution — and the residual plots show a single atypical patient (P2) dominates the metrics. A different 3-patient draw could easily flip “multi-task wins”.

Why group k-fold would have been better

Patient-grouped k-fold or LOSO rotates every patient through the test fold, so the estimate averages over all 18 people instead of betting on 3. It is the standard fix for tiny clinical cohorts and would have produced a confidence interval rather than a single fragile point estimate. The cost the authors cite (below) is feature-selection complexity, not correctness.
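
A leave-one-subject-out splitter is only a few lines. The sketch below assumes the layout described earlier (18 patients with 4 assessments each, 72 events) and uses hypothetical integer subject IDs; any feature selection, such as the SHAP ranking, would have to be recomputed inside each fold on the training indices only.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), once per distinct subject."""
    for held_out in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical IDs 1..18, four assessment events per patient.
subject_ids = [s for s in range(1, 19) for _ in range(4)]
splits = list(loso_splits(subject_ids))

for held_out, train_idx, test_idx in splits[:2]:
    print(held_out, len(train_idx), len(test_idx))
```

Because the split is by subject, no patient ever contributes events to both sides of a fold.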

SHAP feature-selection leakage (caught in review)

Reviewer 2 noticed the central “leakage-safe” claim could be undermined by how features were chosen. The authors disclosed that in the original submission SHAP was computed on the full dataset — letting test-set information steer feature selection. The camera-ready recomputes SHAP on the training set only.

They declined per-fold SHAP (which pairs naturally with k-fold CV), arguing it yields different rankings per fold and needs fold-specific pipelines; instead they use one fixed training-only ranking. Defensible for simplicity, but it is also the reason the work stays on a single split rather than CV — the two limitations are linked.

B. Result-interpretation issues

Negative R² everywhere

Almost all R² values are below 0 — i.e. worse than predicting the cohort mean on variance-explained. Only ohs-12 is positive. The model captures order (positive r on some targets) but not level.

Beaten by persistence

As §12–13 show, the trivial “predict last/average score” baselines beat the GRU on the same metrics. The paper does not report total-score-vs-persistence; our companion work supplies it.

Conclusion overstated (caught)

The original conclusion sold “~one-point average error” as forecasting ability. Reviewer 2 noted that is ~25% of the Likert range, with R²<0 and a losing comparison to persistence. The authors agreed and revised to frame the work as a leakage-safe formulation, not reliable prediction.

C. Modelling / capacity issues (authors' own limitations)

Tiny cohort. 18 patients can't train a gated, attention-equipped recurrent net without overfitting — the large train-vs-test gap in the scatter plots is exactly that.
Unidirectional, low-dim GRU bottleneck. Hidden dim 24, one direction — long-range, multi-sensor interactions get squeezed into a small state.
Possible negative transfer. SIS and OHS share one encoder; when their temporal dynamics diverge, joint training can hurt one outcome (consistent with multi-task not helping OHS total-r).
Thin optimisation. Smooth-L1 only; no temporal-consistency or uncertainty-calibration loss. Dropout + weight decay alone may be too weak at this N.
No ablations. The contributions of the SHAP gate, recency attention, and multi-task heads are not isolated, so we can't attribute the (small) gains to any specific component.
Imputation unaddressed. The model trains on the released CSV, ~17% of which is filled constants (§5); for the two worst training subjects it is largely fitting constants rather than measurements.

The peer-review exchange, summarised

Reviewer point	Authors' response
R1: text implied 10 patients (old dataset version)	Clarified: 18-patient (newer) dataset used
R1: too many sections	Merged Discussion into Results; kept Conclusion / Future Work separate
R2: SHAP results not shown	Added as Figure 2
R2: was SHAP computed with test data? (leakage)	Conceded — original used full data; camera-ready uses training-only SHAP; per-fold declined
R2: conclusion overstates (1-pt error, R²<0, loses to persistence)	Agreed; revised to “leakage-safe formulation,” not reliable prediction
R2: SIS = “Scale” vs “Score” inconsistency	Standardised terminology
R2: SHAP/ICC never expanded; SIS citation wrong (Mick vs Nicholson)	Expanded SHAP; dropped ICC; corrected citation to Nicholson et al.

Net: reviewers did not dispute the method's design; they forced honesty about leakage and about how the headline number was sold. The revised paper is a sounder, more modest version of the same contribution.

12 Our companion characterisation work

In parallel we asked a different question: not “what's the best predictor” but “how much is even predictable, and what should the next team not waste time on?” That produced four claims that survive a strict honest-evaluation protocol.

① Variance budget = a hard ceiling

Shrout–Fleiss ICC + cluster bootstrap: 66–91% of score variance is between-patient. A patient-blind model can address at most a few % — persistence captures the rest for free.

② Sub-item battery can shrink

Pure psychometrics (Cronbach α + greedy selection + bootstrap stability): SIS 6→5 robustly, OHS 12→8 mostly, OKS does not shrink on this cohort. ~33% less patient burden, defensible.
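
For reference, Cronbach's α itself is a one-line formula: α = k/(k−1) · (1 − Σ item variances / variance of the total). The sketch below computes it on two hypothetical response sets; the greedy item selection and bootstrap stability steps are not shown.

```python
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (at least two items)."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: items that agree perfectly vs. items that mostly agree.
consistent = [[1, 1, 1], [3, 3, 3], [5, 5, 5], [2, 2, 2]]
noisy = [[5, 4, 5], [3, 3, 2], [1, 2, 1], [4, 4, 5]]

alpha_consistent = cronbach_alpha(consistent)
alpha_noisy = cronbach_alpha(noisy)
print(alpha_consistent, alpha_noisy)
```

A battery can shrink when dropping an item leaves α essentially unchanged, meaning the remaining items already carry the shared signal.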

③ Upstream-imputation discovery

0 NaN → ~17% of cells are filled constants; 2 training subjects <30% trusted. We ship a per-cell trust mask + a re-runnable clean pipeline.

④ Persistence ceiling holds

On clean data, persistence is unbeaten by all 8 pre-registered univariate tests and by a multivariate ridge across 5 targets (0/5 beat it; mean Δ = +0.0046 NMAE, i.e. worse).

The honest-evaluation protocol — 6 real catches

Each is a claim we nearly shipped before the protocol caught it. This is itself a contribution: a checklist for the next team.

#	Initial claim	Verdict after stress-test
1	chair-stand ← heart-rate, LOSO R²=0.154 (best of 552 scans)	test R²=−0.002 · withdrawn (multiple-testing artifact)
2	seq2seq GRU beats persistence +16% at 3rd visit	2/4 test subjects drove it, perm p=0.10 · reduced
3	1/8 pre-registered tests Bonferroni-significant on train+val	0/8 survive on frozen test
4	high ICC; sleep-composition; subject-8 “recovers without changing activity”	naive-ICC inflation; parts/total=0.83 not 1; imputation artifact · all withdrawn
5	univariate signal on clean data	best p 0.004 → 0.028; 0/8 survive correction
6	multivariate ridge/MLP “most generous fair test”	0/5 targets beat persistence

13 How the two works fit together

They are complementary, not contradictory. The forecasting paper builds a careful, leakage-safe method and shows multi-task learning helps relative to single-task. The characterisation work supplies the baseline the forecasting paper does not report — total-score vs persistence/subject-mean — and explains why any patient-blind sensor model is fighting uphill.

The missing baseline, side by side

When the GRU is put next to the trivial baselines on the same metrics, the persistence and subject-mean baselines win — exactly as the variance budget predicts. (Total-score rows were added by our work; the forecasting paper reports per-item only.)

Outcome · level	Method	MAE	R²	Pearson r
OHS per-item (0–4)	GRU + attn (paper)	1.060	−0.18	−0.02
	population mean	1.056	−0.09	−0.13
	persistence	0.618	+0.33	+0.63
	subject mean (ours)	0.633	+0.37	+0.64
OHS total (0–48)	population mean	6.718	−0.11	−0.48
	persistence	4.172	+0.47	+0.71
	subject mean (ours)	4.160	+0.46	+0.70
SIS total (6–30)	population mean	3.052	−0.02	−0.19
	persistence	2.180	+0.32	+0.61
	subject mean (ours)	2.180	+0.38	+0.63

The GRU's per-item OHS R² (−0.18) is below even the population mean; persistence and subject-mean sit at R²≈+0.33–0.47. This is the variance budget made concrete.
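
The three trivial baselines in the table need no model at all. The sketch below shows one plausible reading of each; the exact definitions used in the companion work are not spelled out here, so the choice of "mean of the patient's earlier scores" for subject mean is an assumption, and all scores are hypothetical.

```python
from statistics import mean


def persistence_forecast(history):
    """Predict the patient's most recent observed score."""
    return history[-1]


def subject_mean_forecast(history):
    """Predict the mean of the patient's earlier scores (assumed definition)."""
    return mean(history)


def population_mean_forecast(train_scores):
    """Predict the mean score of the training patients, ignoring who this is."""
    return mean(train_scores)


history = [30, 33, 35]              # hypothetical earlier OHS totals for one patient
next_true = 36                      # hypothetical next assessment
train_scores = [20, 41, 44, 25]     # hypothetical scores from other patients

forecasts = {
    "persistence": persistence_forecast(history),
    "subject mean": subject_mean_forecast(history),
    "population mean": population_mean_forecast(train_scores),
}
errors = {name: abs(next_true - f) for name, f in forecasts.items()}
print(forecasts, errors)
```

The first two use the patient's own history, which is exactly the information a patient-blind sensor model does not have.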

What the forecasting paper does well

Kills the leakage that made earlier numbers optimistic; defines a clean 7-day-ahead protocol; demonstrates a genuine multi-task regularisation gain; ships SHAP-guided gating and recency attention as reusable ideas.

What the characterisation work adds

Quantifies the ceiling (ICC), supplies the trivial baselines the forecasting paper omits, exposes 17% upstream imputation, and packages a 6-catch honest-evaluation checklist so future teams don't chase artifacts.

The one-paragraph takeaway

MAISON-LLF recovery scores are dominated by who the patient is, not by week-to-week sensor dynamics, and ~17% of the released sensor data is filled constants. So a patient-blind model — however carefully built — is bounded to a few percent of explainable variance and loses to “predict the patient's last/average score.” The right contributions on data like this are (1) leakage-safe protocols and honest baselines, (2) characterisations that tell the field where the ceiling is, and (3) reusable method ideas (multi-task regularisation, SHAP gating) that will pay off once a larger, cleaner cohort exists.

1 The dataset at a glance

The shape of one patient

Why this size matters

The split used by the forecasting paper