Interactive Explainer

MAISON-LLF: the data, the forecasting paper, and what actually wins

A visual walkthrough of the rehabilitation-monitoring dataset, the new multi-task forecasting model (Huo, Kyaw, Noh, Brown, Agarwal & Chan), and the companion characterisation work that measures the hard ceiling any sensor model runs into.

18 patients · 46 sensor channels/day · ~1,008 daily rows · 72 assessment events · OHS · SIS · OKS · TUG · chair-stand · ARIAL @ IJCAI 2026

1 The dataset at a glance

MAISON-LLF (“Multi-modal AI for Smart-hOme moNitoring — Lower-Limb Fracture”) pairs continuous home wearable/ambient sensing with sparse clinical questionnaires for older adults recovering from a broken hip or knee. The defining feature of this dataset is its size: rich per-day signals, but only a handful of patients and visits.

18 patients (subjects) · 46 daily sensor channels · ~1,008 patient-day rows · 72 clinical assessment events

The shape of one patient

~56 days of daily sensor rows, punctuated by ~4 clinical visits where the questionnaires are filled in. Everything between visits is sensor data only.

One patient's monitoring window with assessment markers
One patient's monitoring window. Red markers = the clinical assessment days where SIS and OHS (or OKS) are recorded. (real data, explainer figure)

Why this size matters

  • Daily aggregates, not raw traces. The watch samples every second, but you receive one row per patient per day (step totals, mean HR, sleep-stage minutes…).
  • Sparse labels. Questionnaires arrive only every ~2 weeks, so each patient yields just ~4 supervised targets per scale.
  • The unit of evidence is the patient, not the row. There are 18 independent people: tiny by deep-learning standards, normal by clinical-trial standards. Any claim must survive at this N.
  • Held-out by subject. Splits are patient-level so a model is judged on people it has never seen — the realistic deployment condition.

The split used by the forecasting paper

12 train · 3 validation · 3 test patients (patient-level, no overlap). The companion characterisation work used other patient-level splits (e.g. 14/4 with 4 frozen test IDs 6, 9, 13, 14) — the conclusions hold across splits.

2 What is being predicted

Five recovery outcomes per assessment. Three are summed Likert questionnaires; two are physical-performance measures. The forecasting paper targets the two clinically central ones — OHS (function/pain) and SIS (social isolation).

OHS / OKS

Oxford Hip / Knee Score

12 items · each 0–4 · total 0–48 · higher = better joint function / less pain. Hip patients get OHS, knee patients get OKS (structurally identical).

e.g. “Climbing stairs is painful” → 0 (always) … 4 (never)

SIS

Social Isolation Score

6 items · each 1–5 · total 6–30 · higher = less isolated.

e.g. “I have people I can talk to” → 1 (never) … 5 (always)

TUG · CHAIR-STAND

Physical performance

tug: Timed-Up-and-Go (seconds, lower = better) · chairstand: sit-stands in 30 s (count, higher = stronger). One continuous value each.

Worked example — three patients, one assessment day

Items are summed into the total. This is exactly what the model's task heads must reproduce.

Patient | OHS items (12 × 0–4) | OHS total | SIS items (6 × 1–5) | SIS total
A: good hip, well-connected | 3,4,3,4,4,4,3,4,3,4,3,4 | 43 | 5,5,4,5,5,4 | 28
B: poor hip, some isolation | 1,2,2,2,3,2,1,2,1,2,1,2 | 21 | 3,4,3,4,3,2 | 19
C: ceiling on both | 4,4,4,4,4,4,4,4,4,4,4,4 | 48 | 5,5,5,5,5,5 | 30
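
The summation in the worked example can be checked in a few lines. This is a minimal sketch: the helper name `total_score` and its range check are illustrative assumptions, not part of the released pipeline.

```python
def total_score(items, n_items, lo, hi):
    """Sum Likert items into a questionnaire total after checking length and range."""
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(not (lo <= x <= hi) for x in items):
        raise ValueError(f"every item must lie in [{lo}, {hi}]")
    return sum(items)

# Patients A, B and C from the worked example (OHS items, SIS items).
patients = {
    "A": ([3, 4, 3, 4, 4, 4, 3, 4, 3, 4, 3, 4], [5, 5, 4, 5, 5, 4]),
    "B": ([1, 2, 2, 2, 3, 2, 1, 2, 1, 2, 1, 2], [3, 4, 3, 4, 3, 2]),
    "C": ([4] * 12, [5] * 6),
}

totals = {
    name: (total_score(ohs, 12, 0, 4), total_score(sis, 6, 1, 5))
    for name, (ohs, sis) in patients.items()
}
print(totals)  # {'A': (43, 28), 'B': (21, 19), 'C': (48, 30)}
```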
Per-item observed vs theoretical ranges
Per-item ranges. Some items barely move (sis-02: mean 4.5, σ≈0.6 — predicting the mean is near-perfect), others carry most of the variance (ohs-11: σ≈1.6). (real data)
Total-score distributions across 72 events
Total-score distributions. Stratifying by assessment number 1→4 gives nearly the same cohort distribution — variation is between patients, not across time. This is the seed of the whole ceiling story. (real data)

3 The 46 sensor channels

Four behavioural/physiological groups, all at daily resolution.

Channel count by group

Real sensor channels for one patient
Four real channels for one patient over ~56 days; dashed lines = assessment days. Note the flat stretch in motion-mean — a foretaste of the imputation problem in §5. (real data)

4 The structure that decides everything

Before any model: where does the variance live? If most of it is between patients (some people are simply sicker than others) rather than within a patient over time, then the trivial “predict last value” baseline already captures the lion's share — and a patient-blind sensor model is fighting over the scraps.

Per-patient total-score trajectories
Each line = one patient across 4 visits. Lines sit at very different heights (large between-patient spread); most are near-flat, but a real minority swing a lot. Persistence wins on the flat majority — and is provably wrong for the swingers, if only you knew who they were in advance. (real data)

Between- vs within-patient spread

Standard deviation of patient means (between) vs mean of within-patient SD. Between dominates on every scale.

Scale | Between SD | Within SD | ≥10-pt swingers
OHS (0–48) | 7.81 | 3.10 | 4 / 18
SIS (6–30) | 3.34 | 1.89 | 0 / 18
OKS (0–48) | 9.44 | 3.16 | 5 / 18

The variance budget (ICC) — the ceiling, quantified

Shrout–Fleiss ICC(1,1) with cluster-bootstrap 95% CIs. %between is the share of total variance a patient-blind model can never recover; persistence captures it for free. The sensor-explainable headroom is at most a few percent on every target.
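
A minimal sketch of how such a variance budget can be computed, assuming a balanced patients × visits matrix of total scores. The toy numbers are invented for illustration, and the helper names are hypothetical; the companion work's actual code may differ (for example in how it handles unbalanced visits).

```python
import numpy as np

def icc_1_1(scores):
    """Shrout-Fleiss ICC(1,1) for a balanced (patients x visits) score matrix."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    patient_means = scores.mean(axis=1)
    # One-way ANOVA mean squares: between patients and within patients.
    msb = k * np.sum((patient_means - scores.mean()) ** 2) / (n - 1)
    msw = np.sum((scores - patient_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def cluster_bootstrap_ci(scores, n_boot=200, seed=0):
    """Percentile 95% CI, resampling whole patients (rows) with replacement."""
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    draws = [icc_1_1(scores[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    low, high = np.percentile(draws, [2.5, 97.5])
    return float(low), float(high)

# Invented toy data: 3 patients x 4 visits, very different levels, flat trajectories.
toy = np.array([[40, 41, 39, 40], [20, 22, 21, 21], [30, 29, 31, 30]])

between_sd = toy.mean(axis=1).std(ddof=1)   # SD of patient means
within_sd = toy.std(axis=1, ddof=1).mean()  # mean of within-patient SDs
icc = icc_1_1(toy)
ci_low, ci_high = cluster_bootstrap_ci(toy)
print(round(between_sd, 2), round(within_sd, 2), round(icc, 3))
```

Resampling patients rather than rows keeps each person's visits together, which is what makes the interval respect the "unit of evidence is the patient" rule.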

The hard implication

With 66–91% of variance between patients and Likert quantisation removing a little more, the sensor-explainable upper bound is ≈6% at best. On the clean data, honest test R² is ≤ 0 on every pre-registered hypothesis. This is a property of the data, not of any one model.

5 The imputation problem (don't trust the CSV)

The released sensor matrix has zero NaN cells — which is impossible for real wearables (people forget to wear/charge the watch). The upstream pipeline silently filled the gaps, usually by carrying the last value forward. Roughly 17% of all cells are filled constants, not measurements, and it's wildly uneven across patients.

Exhibit A — subject 8's step count

56 consecutive days, every value exactly 1504. σ=0, 1 unique value, longest run = 56. A worn step-counter on a recovering patient never does this.

day | step-count | acc-mean | sleep-deep | note
1 | 1504 | 0.243 | 124 | fill begins
2 | 1504 | 0.218 | 112 | constant
3 | 1504 | 0.301 | 138 |
… | 1504 | | | (50 more identical)
56 | 1504 | 0.239 | 124 | end

Other channels jitter day-to-day; only step-count is frozen — the signature of per-cell forward-fill.

The 3 audit heuristics

  • A. Long constant run (flag_long_run): longest identical run ≥ 14 days.
  • B. Low unique count (flag_low_unique): ≤ 5 distinct values across all 56 days.
  • C. Zero variance (flag_zero_sd): σ < 1e-6 (channel is a constant).

A cell tripping any flag is treated as imputed. Cross-checked on high-trust patients to keep false positives low.
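
The three heuristics can be sketched as follows. The flag names and thresholds come from the list above; the function names and the choice to audit one patient's single-channel series at a time are assumptions made for illustration.

```python
import numpy as np

def longest_run(values):
    """Length of the longest run of consecutive identical values."""
    best = run = 1
    for prev, cur in zip(values[:-1], values[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def audit_channel(values, min_run=14, max_unique=5, sd_eps=1e-6):
    """Apply the three audit flags to one patient's daily series for one channel."""
    values = np.asarray(values, dtype=float)
    flags = {
        "flag_long_run": bool(longest_run(values) >= min_run),
        "flag_low_unique": bool(len(np.unique(values)) <= max_unique),
        "flag_zero_sd": bool(np.std(values) < sd_eps),
    }
    # Any single flag is enough to treat the series as imputed.
    flags["imputed"] = any(flags.values())
    return flags

frozen = audit_channel([1504] * 56)               # subject 8's step count
jittery = audit_channel(np.linspace(0.2, 0.3, 56))  # hypothetical healthy channel
partial = audit_channel([1504] * 20 + list(range(2000, 2036)))
print(frozen, jittery, partial, sep="\n")
```

The third call shows why the flags are not redundant: a 20-day frozen stretch trips only the long-run flag, while the fully constant series trips all three.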

Most-imputed channels

motion-max (14/18 patients), position-duration (10/18), sleep-snoring (8/18). Zero of 46 channels are clean cohort-wide.

Per-subject trust score

% of a patient's 46 channels passing all 3 audits. Two training subjects (8, 18) are >70% imputed, so a sensor model is largely fitting constants for them.

Cleaning before/after
Cleaning audit. Left: the 10 most-flagged channels and how many subjects they hit. Right: per-subject trust after masking. Crucially, masking ~17% of cells leaves persistence NMAE essentially unchanged — the ceiling is structural, not an imputation artifact. (real data)

6 The new paper: multi-task recovery forecasting

“Multimodal Forecasting of Psychosocial and Functional Recovery in Older Adults After Lower-Limb Fracture” — Huo, Kyaw, Noh, Brown, Agarwal & Chan (University of Toronto · KMUTT). It reframes MAISON-LLF from concurrent estimation into a strict, leakage-safe forecasting task and proposes a multi-task model that jointly predicts future OHS and SIS.

The problem with the prior framing

Earlier MAISON-LLF analyses attached each clinical label to the sensor window preceding or around the assessment. That blurs into concurrent estimation and risks information leakage: data from on or after the assessment day ends up being used to “predict” it.

In deployment you must predict before the visit, from past data only.

The leakage-safe reformulation

Predict the score at assessment n using only observations up to day T_n − 7 (a 7-day-ahead forecast), where T_n is the day of assessment n. Histories are expanding windows: the first visit forecasts from the first ~7 days, the second from ~21 days, and so on.

Strict chronological input→target separation, patient-level test split.
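
A minimal sketch of the expanding-window construction. The assessment days 14, 28, 42 and 56 are an assumption chosen to match the "~7 days, ~21 days" description; the function name and the dictionary layout are illustrative, not the paper's code.

```python
def forecast_examples(assessment_days, horizon=7):
    """Build one example per assessment: inputs are days 1..(T_n - horizon)."""
    examples = []
    for t_n in assessment_days:
        cutoff = t_n - horizon
        if cutoff < 1:
            continue  # no history available a full horizon before this visit
        examples.append({
            "target_day": t_n,
            "input_days": list(range(1, cutoff + 1)),
        })
    return examples

# Hypothetical schedule: one visit every two weeks over a 56-day window.
examples = forecast_examples([14, 28, 42, 56])
history_lengths = [len(e["input_days"]) for e in examples]
print(history_lengths)  # [7, 21, 35, 49]
```

Because every input day is at least `horizon` days before its target, nothing recorded on or after the assessment day can reach the model.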

Pipeline overview figure from the paper
Paper Fig. 1 — the three-stage pipeline: CatBoost+SHAP feature ranking → feature-gated GRU forecasting model with task-specific attention → held-out evaluation. (figure from the paper)

7 Inside the model, stage by stage

Stage I

CatBoost + SHAP feature ranking

  • Separate CatBoostRegressor for OHS and SIS over 46 sensor + 7 demographic features.
  • Rank each feature by mean |SHAP| (importance) — computed on training data only.
  • Surfaces which channels carry recovery signal (acceleration-mean, sleep, step/motion counts on top).

Stage II

Feature gating + GRU forecaster

  • Grid-search top-M sensor / top-D demographic features → best = 23 sensor, 0 demographic; union over OHS & SIS = 30 sensor features.
  • Learnable feature gate w_g ⊙ z, initialised from SHAP, softplus-normalised to unit mean.
  • Linear projection → LayerNorm → GELU → dropout → GRU encoder (packed, masked, variable length).
  • Task-specific recency-biased attention (separate for SIS & OHS) over GRU states.
  • Two heads: SIS → 6 item scores, OHS → 12 item scores; sum → totals.

Stage III

Evaluation

  • Held-out 3 test patients (12 examples), never seen in training/selection.
  • Per-item and total-score MAE, RMSE, R², Pearson r — separately for OHS and SIS.
  • Three training configs compared: joint (multi-task) vs OHS-only vs SIS-only.

Why each component is there

Component | Problem it targets | Mechanism
SHAP feature gate | 46 noisy channels, most useless at N=18 | Down-weights low-value features instead of hard-dropping; differentiable, trained end-to-end
Recency-biased attention | Recovery is non-stationary; recent days matter more | Learnable decay weights more recent time-steps; one pattern per outcome
Masked / packed sequences | Patients have different history lengths (expanding window) | Pad + mask so attention ignores absent steps
Multi-task heads | Few labels per outcome | Shared encoder ⇒ OHS supervision regularises SIS and vice-versa
Item-level Smooth-L1 | Totals are sums of ordinal items | Predict items, sum to totals; robust loss vs outliers

Training: AdamW, lr 2e-3, hidden dim 24, dropout 0.3, ≤500 epochs, early-stop patience 45, grad-clip 1.0, seed 42.

SHAP top features (which sensors the model leans on)

From the paper's SHAP summary plots — the consistently high-ranked sensor families. Demographics ranked low enough that the grid search kept zero of them.

Indicative ranking drawn from the paper's Fig. 2 (acceleration / movement-event bins, sleep, step & motion counts, heart-rate dominate). Heights are illustrative ordinal ranks, not raw SHAP magnitudes.

8 The ML toolkit, explained from scratch

The pipeline stacks five ideas that each deserve a plain-language explanation: gradient boosting (CatBoost), SHAP attribution, learnable feature gates, the GRU recurrent network, and attention. Here is what each one actually is and why it's used.

CatBoost = gradient-boosted decision trees

Decision tree: a flowchart of yes/no questions on features (“is mean step-count > 3000?”) ending in a numeric prediction at each leaf. One tree is weak and overfits.

Gradient boosting: build many small trees in sequence, where each new tree is trained to predict the residual error left by all previous trees. Add them up: prediction = tree₁ + tree₂ + … Each tree nudges the prediction toward the truth a little. This is the dominant method for tabular data — usually beating neural nets when rows are few and features are heterogeneous.

What “CatBoost” adds (Yandex's variant): (i) native handling of categorical features via target statistics; (ii) ordered boosting — a permutation trick that computes each row's residual using only rows seen “before” it, which reduces the target-leakage/overfitting that plain boosting suffers on small data. That small-N robustness is exactly why the paper uses it.
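
The residual-fitting loop described above can be sketched with one-split trees (stumps) on a single feature. This is plain gradient boosting for squared error, not CatBoost's ordered variant; the learning-rate shrinkage, the function names and the toy data are all assumptions for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split on a 1-D feature, minimising squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is non-empty
        left, right = residual[x <= threshold], residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then let each stump fit the remaining residual."""
    prediction = np.full(len(y), y.mean(), dtype=float)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, prediction

# Invented toy data: a step in the target plus a small wiggle.
x = np.arange(20, dtype=float)
y = np.where(x < 10, 1.0, 3.0) + 0.1 * np.sin(x)

stumps, fitted = boost(x, y)
mse_mean = float(np.mean((y - y.mean()) ** 2))
mse_boost = float(np.mean((y - fitted) ** 2))
print(mse_mean > mse_boost)  # True
```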

Its role here

CatBoost is not the forecaster. It is used only in Stage I as a quick, strong tabular model whose predictions can be explained by SHAP — to rank which of the 46 sensor + 7 demographic features matter, before the GRU is built. A separate CatBoostRegressor is fit for OHS and for SIS.

SHAP = SHapley Additive exPlanations

A model like CatBoost is a black box: it gives a number, not a reason. SHAP answers “how much did each feature contribute to this prediction, vs the average prediction?”

The Shapley value comes from cooperative game theory (Lloyd Shapley, 1953). Treat each feature as a “player” in a game whose “payout” is the model's prediction. A feature's Shapley value is its average marginal contribution across every possible order in which features could be added to the model. It is the unique attribution satisfying the standard fairness axioms (efficiency: contributions sum to the prediction minus the average prediction; symmetry; dummy features get zero; additivity). SHAP computes these efficiently for trees.
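
The definition can be made concrete by brute force on a tiny model. This sketch is not how SHAP works on trees (that uses a much faster tree-specific algorithm); it also simplifies "feature absent" to "feature set to a single baseline value", which is an assumption. The toy model and its inputs are invented.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orders."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)          # start with every feature "absent"
        previous_output = model(current)
        for j in order:
            current[j] = x[j]             # switch feature j to its real value
            output = model(current)
            phi[j] += output - previous_output
            previous_output = output
    return [value / len(orders) for value in phi]

def toy_model(features):
    # Hypothetical scorer: two useful features with an interaction, one ignored.
    steps, sleep, unused = features
    return 2.0 * steps + 1.0 * sleep + 0.5 * steps * sleep

x = [3.0, 2.0, 7.0]
baseline = [1.0, 1.0, 1.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [5.5, 2.0, 0.0]
print(sum(phi), toy_model(x) - toy_model(baseline))  # 7.5 7.5
```

The two printed lines show the dummy property (the ignored feature gets exactly zero) and efficiency (the attributions sum to the gap between this prediction and the baseline prediction).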

Reading the plot (paper Fig. 2 below): each row is a feature, each dot is one patient-day. Dot position = that feature's SHAP value (push toward higher/lower predicted score); colour = the feature's value (red high, blue low). Features are sorted by mean |SHAP| — the average magnitude of influence. The paper takes that ranking to pick the top-M features.

Paper Figure 2: SHAP summary plots
Paper Fig. 2 — SHAP summaries: (a,b) top-20 sensor features for OHS / SIS; (c,d) the 7 demographic features. Acceleration / movement-event bins, sleep, step & motion counts, and heart-rate dominate; demographics rank low. (figure from the paper)

⚠ The leakage the rebuttal had to fix

In the original submission, SHAP was computed on the full dataset, so test-set information influenced which features were selected, contaminating the “leakage-safe” claim. Reviewer 2 caught this. The camera-ready recomputes SHAP on the training split only. The authors declined to compute SHAP per-fold (it would give different rankings per fold and need fold-specific pipelines); they use one fixed training-only ranking instead. See §11.

Learnable feature gates

With 46 noisy channels and ~50 training events, feeding everything to the network invites overfitting. Two crude options: drop low-SHAP features entirely (loses information, the cutoff is arbitrary) or weight them all equally (lets noise through). A gate is the soft middle.

Each input feature gets a learnable weight, and the gated input is the element-wise product: z̃ = w_g ⊙ z. So feature j enters the model scaled by w_g[j]. The network can learn to shrink a useless channel toward 0 and amplify a useful one — and because it's just multiplication, the weights are trained jointly with everything else by gradient descent.

Two design choices that matter:

  • SHAP initialisation. The gates start at the SHAP importances rather than random/uniform — a warm start that points the model at the features Stage I already found informative.
  • Softplus + unit-mean normalisation: w_g = softplus(θ) / mean(softplus(θ)). Softplus keeps weights positive; dividing by the mean fixes the average gate at 1, so the gate can re-weight features relative to each other but cannot just globally rescale the whole input (which would fight with the network's own scaling and destabilise training).
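
A minimal NumPy sketch of the gate formula. The importance values are invented, and mapping them to θ through an inverse softplus is only one plausible way to "initialise from SHAP"; the paper's exact initialisation is not given here.

```python
import numpy as np

def softplus(theta):
    # Numerically stable log(1 + exp(theta)); always positive.
    return np.logaddexp(0.0, theta)

def inverse_softplus(w):
    # Solves softplus(theta) = w for w > 0.
    return np.log(np.expm1(w))

def gate_weights(theta):
    """w_g = softplus(theta) / mean(softplus(theta)), so the average gate is 1."""
    raw = softplus(theta)
    return raw / raw.mean()

# Hypothetical mean |SHAP| importances for five channels.
shap_importance = np.array([0.40, 0.25, 0.20, 0.10, 0.05])

# Assumed warm start: choose theta so the initial gates equal importance / mean.
theta = inverse_softplus(shap_importance / shap_importance.mean())
w_g = gate_weights(theta)

z = np.ones(5)          # one day's (standardised) feature vector
z_gated = w_g * z       # element-wise product: the gated input
print(np.round(w_g, 2))  # [2.   1.25 1.   0.5  0.25]
```

In training, `theta` would be an ordinary learnable parameter; the normalisation means gradient updates can only trade weight between features, never inflate all of them at once.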

Honest caveat

This is a sensible regulariser, but it is also an extra set of trained parameters on a tiny dataset, and the paper reports no ablation isolating the gate's contribution — so we can't tell how much it actually helps vs. SHAP-based hard selection alone.

GRU = Gated Recurrent Unit (a kind of RNN)

The problem: a patient's history is a sequence of daily vectors of varying length. You want one summary that respects order (recent days may matter more) and handles any length.

RNN (recurrent neural network): walk through the sequence one day at a time, carrying a “hidden state” h — a memory vector. At each day, new input + previous memory → updated memory. The final memory summarises the whole history. Plain RNNs forget long-range information (vanishing gradients).

GRU fixes this with two gates that learn what to keep vs. overwrite at each step:

  • Update gate z: how much of the old memory to carry forward vs. replace with new info. (Near 1 → keep old memory unchanged across many steps → long-term memory.)
  • Reset gate r: how much of the old memory to forget when computing the candidate new memory.

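So h_new = z·h_old + (1 − z)·h_candidate, with h_candidate built from the new input and a reset-gated copy of h_old. (Some texts swap the roles of z and 1 − z; the form here matches the update-gate description above, where z near 1 keeps the old memory.) The GRU is a lighter cousin of the LSTM (2 gates vs 3, no separate cell state), with fewer parameters, which suits small data.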
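
One GRU step can be written out directly. This sketch uses the convention in which z near 1 keeps the old memory, as in the update-gate bullet; the weight names, sizes and random initialisation are assumptions, and deep-learning libraries place the reset gate slightly differently inside the candidate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_old, p):
    """One day of a GRU: gates decide how much old memory to keep or overwrite."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_old + p["b_z"])          # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_old + p["b_r"])          # reset gate
    h_candidate = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_old) + p["b_h"])
    return z * h_old + (1.0 - z) * h_candidate

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4   # hypothetical sizes
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    params[f"b_{gate}"] = np.zeros(n_hidden)

# Walk through five days of invented inputs, collecting one hidden state per day.
h = np.zeros(n_hidden)
states = []
for day in rng.normal(size=(5, n_in)):
    h = gru_step(day, h, params)
    states.append(h)

# A strongly positive update-gate bias pins z near 1, so memory is carried unchanged.
sticky = dict(params, b_z=np.full(n_hidden, 50.0))
h_kept = gru_step(rng.normal(size=n_in), h, sticky)
print(np.allclose(h_kept, h))  # True
```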

How the paper uses it: selected features → feature gate → linear projection + LayerNorm + GELU + dropout → GRU reads the expanding-window day sequence → produces one hidden state per day. Variable lengths are handled by packing + masking (padded days are ignored). Rather than only using the last hidden state, all states are pooled by attention (described next).

Task-specific recency-biased attention

The GRU emits one hidden state per day. Attention collapses them into a single context vector as a weighted average, where the model learns the weights: c = Σ_τ α_τ · h_τ, with weights α summing to 1 over the valid days. Important days get larger α; padded days are masked to 0.

Recency bias: a learnable term makes recent days count more by default — natural for recovery, where the latest fortnight is more informative than week one.

Task-specific: OHS and SIS get separate attention modules over the shared GRU, so each outcome can weight the timeline its own way. The context vector then feeds a task head (OHS → 12 item scores, SIS → 6 item scores), which are summed into totals.
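
A sketch of masked, recency-biased attention pooling. The exact form of the paper's recency term is not given here, so a linear penalty `decay × (days before the latest valid day)` is assumed; the score vectors, decay values and shapes are likewise hypothetical.

```python
import numpy as np

def recency_attention(states, mask, score_vector, decay):
    """states: (T, H); mask: (T,) bool, True = real day. Returns (context, alpha)."""
    n_steps = states.shape[0]
    last_valid = np.max(np.nonzero(mask)[0])
    age = last_valid - np.arange(n_steps)            # 0 for the most recent real day
    logits = states @ score_vector - decay * age     # content score minus recency penalty
    logits = np.where(mask, logits, -np.inf)         # padded days can never receive weight
    logits = logits - logits[mask].max()             # stabilise the softmax
    weights = np.exp(logits)
    alpha = weights / weights.sum()
    return alpha @ states, alpha

rng = np.random.default_rng(0)
states = rng.normal(size=(6, 4))                     # shared GRU states, 6 padded days
mask = np.array([True, True, True, True, False, False])

# Separate attention parameters per task over the same shared states.
ctx_ohs, alpha_ohs = recency_attention(states, mask, rng.normal(size=4), decay=0.5)
ctx_sis, alpha_sis = recency_attention(states, mask, rng.normal(size=4), decay=0.1)

# With a zero score vector only the recency term acts: weights rise toward the latest day.
_, alpha_recency = recency_attention(states, mask, np.zeros(4), decay=0.5)
print(np.round(alpha_recency, 3))
```

Each task's context vector (`ctx_ohs`, `ctx_sis`) would then feed its own item-score head.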

Why this design is reasonable

Every component targets a real small-data problem: gating & SHAP fight feature noise, the GRU+masking handle variable-length histories, attention+recency model non-stationary recovery, and multi-task heads share statistical strength. The ideas are sound; the open question (§10) is whether N=18 can train them.

9 The evaluation metrics, explained

The paper reports four numbers per outcome — MAE, RMSE, R², Pearson r — at two granularities (per-item and total-score). Knowing exactly what each means is essential to reading the results honestly, because they can disagree.

MAE — Mean Absolute Error

MAE = mean(|predicted − actual|). The average size of the miss, in the score's own units. Lower is better; 0 is perfect. Easy to read: “OHS total MAE 5.23” = off by ~5.2 points on a 0–48 scale on average.

Robust to outliers (no squaring). But it has no built-in reference: “5.2” is only good or bad relative to a baseline, which is the whole §11 argument.

RMSE — Root Mean Squared Error

RMSE = sqrt(mean((predicted − actual)²)). Like MAE but squares errors first, so big misses are punished much more. Lower is better. RMSE ≥ MAE always; a large gap between them signals a few large errors (here: one atypical test patient).

R² — coefficient of determination

R² = 1 − SS_res / SS_tot = the fraction of variance the model explains relative to just predicting the mean. 1 = perfect; 0 = no better than predicting the mean; <0 = worse than the mean.

This is the key one. The paper's R² values are negative (e.g. OHS total −1.06). Negative R² means the GRU's predictions are worse than a flat line at the cohort average — it is not just imperfect, it actively underperforms the most trivial baseline on variance-explained.

Pearson r — correlation

r ∈ [−1, 1]: do predictions and truth move together, regardless of offset or scale? +1 = perfect positive linear relationship, 0 = none, −1 = inverted.

r ignores systematic bias — you can have decent r and terrible R² if predictions track the trend but sit consistently too high/low. That is exactly the paper's OHS total: r=+0.53 but R²=−1.06 (right order, wrong level). And sis-02 has r=−0.79: predictions move the wrong way.
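
The "right order, wrong level" pattern is easy to reproduce. In this sketch the four truth values and the constant +8 offset are invented; they are chosen only to show that a perfect correlation can coexist with a negative R².

```python
import numpy as np

def mae(y, p):
    return float(np.mean(np.abs(p - y)))

def rmse(y, p):
    return float(np.sqrt(np.mean((p - y) ** 2)))

def r2(y, p):
    ss_res = np.sum((y - p) ** 2)          # error of the model
    ss_tot = np.sum((y - y.mean()) ** 2)   # error of predicting the mean
    return float(1.0 - ss_res / ss_tot)

def pearson_r(y, p):
    return float(np.corrcoef(y, p)[0, 1])

truth = np.array([20.0, 25.0, 30.0, 35.0])
pred = truth + 8.0   # tracks every change perfectly, but sits 8 points too high

print(mae(truth, pred), rmse(truth, pred))   # 8.0 8.0
print(round(r2(truth, pred), 3))             # -1.048
print(round(pearson_r(truth, pred), 3))      # 1.0
```

MAE and RMSE coincide here because every error has the same size; a gap between them would appear as soon as one prediction missed by much more than the others.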

Per-item vs total-score — and why they differ

Per-item: score each individual questionnaire item (12 OHS items 0–4; 6 SIS items 1–5) and pool metrics across all items. Fine-grained — “can the model get each question right?”

Total-score: sum the predicted items into the questionnaire total (OHS 0–48, SIS 6–30) and score that. This is the clinically meaningful number.

Why total looks better than per-item

Summing 12 (or 6) noisy item predictions lets independent item errors partially cancel (some too high, some too low), so the total's correlation rises even when items are individually shaky. In the paper, OHS total r (+0.53) ≫ OHS per-item r (−0.04). The flip side: a good total can hide that the model isn't really predicting the items — aggregation masks per-item failure rather than fixing it.
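
A small simulation illustrates the cancellation effect. Everything here is an assumption made for illustration: continuous rather than Likert items, a single latent patient level shared by all items, and predictions that capture only 30% of that level plus independent noise. Under these settings the expected per-item correlation is about 0.2 and the expected total-score correlation about 0.7.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_items = 5000, 12

# Shared latent level per patient; each item adds its own independent noise.
latent = rng.normal(size=(n_patients, 1))
true_items = latent + rng.normal(size=(n_patients, n_items))

# Weak predictor: a little real signal, swamped by independent per-item error.
pred_items = 0.3 * latent + rng.normal(size=(n_patients, n_items))

item_r = [
    np.corrcoef(true_items[:, j], pred_items[:, j])[0, 1] for j in range(n_items)
]
total_r = np.corrcoef(true_items.sum(axis=1), pred_items.sum(axis=1))[0, 1]

print(round(float(np.mean(item_r)), 2), round(float(total_r), 2))
```

The shared signal adds up across items while the independent errors partly cancel, so the total correlates far better than any single item, even though no item is predicted any more accurately.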

10 What the paper found

The headline is an internal comparison: joint multi-task training beats training each outcome alone on most error metrics. Absolute performance stays limited by the tiny cohort.

Multi-task vs single-task — total-score MAE (lower better)

OHS total MAE 5.693 → 5.233 (−8.1%); SIS total MAE 3.103 → 2.492 (−19.7%). SIS total Pearson r flips from −0.209 (single) to +0.278 (joint) — the clearest sign OHS supervision helps SIS.

Full metric table (test, multi-task)

Outcome | Level | MAE | R² | r
OHS | per-item | 1.046 | −0.43 | −0.04
OHS | total | 5.233 | −1.06 | +0.53
SIS | per-item | 0.868 | −0.17 | +0.28
SIS | total | 2.492 | −0.30 | +0.28

Total-score r > per-item r: summing 12/6 items cancels item noise and lets the coarse recovery trend show. R² stays negative — predictions track order but carry a systematic offset.

Per-item Pearson r — very uneven

A few items are learnable (ohs-12: r=0.84, the only positive-R² item); some are anti-correlated (sis-02: r=−0.79, where sensors mislead for that social item).

Per-item forecasting — the detail behind the averages

Pooled item metrics hide enormous spread. Reading Table 3 item-by-item is what reveals whether the model learned anything transferable.

The few that work, the many that don't

  • ohs-12 is the one real success — the only item with positive R² (+0.353) and a strong r (+0.839). Some recovery dimension it captures is consistently reflected in the sensors.
  • ohs-07, ohs-08, ohs-09 have positive r but negative R² — the model tracks their trend but with a systematic offset (right direction, wrong level).
  • ohs-11 is the worst (MAE 2.15, R²=−3.94) — and recall from §2 it carries the most total-score variance, so the model fails hardest exactly where it matters most.
  • sis-02 is actively misleading — r=−0.792, predictions anti-correlate with truth. The sensor cues informative for other items point the wrong way for this social item, which has no clear behavioural correlate in a wearable.

Why per-item is so much weaker than totals

Items are quantised to 5 levels and individually noisy; the network hedges toward the middle. Only when 12/6 items are summed do independent errors cancel and the coarse recovery trend emerge — lifting total-score r well above per-item r.

This is the central tension: the clinically useful total looks decent (r≈0.5) because aggregation launders item-level noise, not because the model predicts the underlying items. SIS is the giveaway — better total metrics than OHS, yet its items are weaker; the SIS total is averaging away errors, not reflecting genuine item skill.

Prediction-distribution analysis (paper Figs. 3–4)

Plotting predicted vs. ground-truth (the dashed diagonal = perfect) exposes how the model fails, not just how much.

Train-set predicted vs ground truth
Train set. Points roughly follow the diagonal — the model can fit the patients it has seen. SIS is tighter than OHS (narrower 6–30 range is an easier in-sample target); OHS underpredicts at the top end (≥35), hedging toward the mean for the few high scorers. (paper Fig. 3)
Test-set predicted vs ground truth
Test set (3 unseen patients). Far more scattered. P14 tracks the diagonal; P2 and P9 are systematically over-predicted at low truth — outputs cluster near ~26 regardless of the real value, i.e. the model falls back to a population-average guess for patients unlike anyone in training. (paper Fig. 4)

Residual analysis (paper Fig. 5)

Residual distributions for OHS and SIS
Residual = (truth − prediction) / σ. Dashed lines = mean residual. (paper Fig. 5)

OHS: systematic over-prediction

Residuals are almost all negative (mean −1.18σ) — the model predicts OHS scores that are consistently too high for unseen patients. P2 dominates (−1.3σ to −2.0σ across the range). A consistent bias like this means the failure is a domain mismatch (test patients differ from train), not random noise — and bias is what destroys R² even when r looks okay.

SIS: a near-zero mean that lies

Mean residual is only −0.26σ — but that is cancellation, not calibration: P9 residuals are positive, P2's are negative (down to −2.3σ), and they average out. A small mean residual here must not be read as “well-calibrated”; per-patient it is failing in opposite directions.

The honest read (the paper states much of this itself)

Benefits are not uniform: multi-task doesn't improve OHS total-score r (single-task OHS is higher there). R² is negative almost everywhere, so item-level prediction is unreliable; the residuals show the dominant error source is patient-level domain mismatch, with a single atypical test patient (P2) swinging the 3-patient metrics. The honest framing — which the rebuttal forced into the conclusion — is feasibility + a leakage-safe protocol + a multi-task regularisation effect, not clinically reliable prediction.

11 Flaws, limitations & the rebuttal

The paper is methodologically careful in its framing but constrained by the data and by several design choices worth scrutinising. Some issues were caught by reviewers; the authors' responses (rebuttal) are folded in below.

A. Evaluation-design issues

No cross-validation — a single 3-patient test set

The model is evaluated on one fixed split (12 train / 3 val / 3 test). It does not use leave-one-subject-out or patient-grouped k-fold CV. With only 3 test patients (12 events), every reported number is one draw from a very high-variance distribution — and the residual plots show a single atypical patient (P2) dominates the metrics. A different 3-patient draw could easily flip “multi-task wins”.

Why group k-fold would have been better

Patient-grouped k-fold or LOSO rotates every patient through the test fold, so the estimate averages over all 18 people instead of betting on 3. It is the standard fix for tiny clinical cohorts and would have produced a confidence interval rather than a single fragile point estimate. The cost the authors cite (below) is feature-selection complexity, not correctness.
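
A dependency-free sketch of patient-grouped splitting. The round-robin fold assignment and the function name are assumptions; scikit-learn's `GroupKFold` and `LeaveOneGroupOut` offer library equivalents of the same idea.

```python
def group_k_fold(patient_ids, n_splits):
    """Yield (train_idx, test_idx) so that no patient appears on both sides."""
    unique = sorted(set(patient_ids))
    folds = [unique[i::n_splits] for i in range(n_splits)]  # round-robin assignment
    for held_out in folds:
        held = set(held_out)
        test_idx = [i for i, p in enumerate(patient_ids) if p in held]
        train_idx = [i for i, p in enumerate(patient_ids) if p not in held]
        yield train_idx, test_idx

# 18 patients x 4 assessment events = 72 rows, as in the dataset summary.
patient_ids = [p for p in range(1, 19) for _ in range(4)]

six_fold = list(group_k_fold(patient_ids, n_splits=6))   # 3 test patients per fold
loso = list(group_k_fold(patient_ids, n_splits=18))      # leave-one-subject-out

print(len(six_fold), len(six_fold[0][1]))  # 6 12
print(len(loso), len(loso[0][1]))          # 18 4
```

Averaging a metric over the six (or eighteen) held-out folds gives every patient one turn in the test set, and the spread across folds gives a rough sense of how fragile any single split is.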

SHAP feature-selection leakage (caught in review)

Reviewer 2 noticed the central “leakage-safe” claim could be undermined by how features were chosen. The authors disclosed that in the original submission SHAP was computed on the full dataset — letting test-set information steer feature selection. The camera-ready recomputes SHAP on the training set only.

They declined per-fold SHAP (which pairs naturally with k-fold CV), arguing it yields different rankings per fold and needs fold-specific pipelines; instead they use one fixed training-only ranking. Defensible for simplicity, but it is also the reason the work stays on a single split rather than CV — the two limitations are linked.

B. Result-interpretation issues

Negative R² everywhere

Almost all R² values are below 0 — i.e. worse than predicting the cohort mean on variance-explained. Only ohs-12 is positive. The model captures order (positive r on some targets) but not level.

Beaten by persistence

As §12–13 show, the trivial “predict last/average score” baselines beat the GRU on the same metrics. The paper does not report total-score-vs-persistence; our companion work supplies it.

Conclusion overstated (caught)

The original conclusion sold “~one-point average error” as forecasting ability. Reviewer 2 noted that is ~25% of the Likert range, with R²<0 and a losing comparison to persistence. The authors agreed and revised to frame the work as a leakage-safe formulation, not reliable prediction.

C. Modelling / capacity issues (authors' own limitations)

The peer-review exchange, summarised

Reviewer point | Authors' response
R1: text implied 10 patients (old dataset version) | Clarified: 18-patient (newer) dataset used
R1: too many sections | Merged Discussion into Results; kept Conclusion / Future Work separate
R2: SHAP results not shown | Added as Figure 2
R2: was SHAP computed with test data? (leakage) | Conceded: original used full data; camera-ready uses training-only SHAP; per-fold declined
R2: conclusion overstates (1-pt error, R²<0, loses to persistence) | Agreed; revised to “leakage-safe formulation,” not reliable prediction
R2: SIS = “Scale” vs “Score” inconsistency | Standardised terminology
R2: SHAP/ICC never expanded; SIS citation wrong (Mick vs Nicholson) | Expanded SHAP; dropped ICC; corrected citation to Nicholson et al.

Net: reviewers did not dispute the method's design; they forced honesty about leakage and about how the headline number was sold. The revised paper is a sounder, more modest version of the same contribution.

12 Our companion characterisation work

In parallel we asked a different question: not “what's the best predictor” but “how much is even predictable, and what should the next team not waste time on?” That produced four claims that survive a strict honest-evaluation protocol.

① Variance budget = a hard ceiling

Shrout–Fleiss ICC + cluster bootstrap: 66–91% of score variance is between-patient. A patient-blind model can address at most a few % — persistence captures the rest for free.

② Sub-item battery can shrink

Pure psychometrics (Cronbach α + greedy selection + bootstrap stability): SIS 6→5 robustly, OHS 12→8 mostly, OKS does not shrink on this cohort. Up to ~33% less patient burden (OHS 12→8; SIS 6→5 is ~17%), defensible.
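
A minimal sketch of the Cronbach α building block. The "alpha if item deleted" step is one common greedy criterion and is assumed here; the companion work's exact selection and bootstrap procedure are not described in this explainer. The response matrix is invented.

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variance_sum = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return float(k / (k - 1) * (1.0 - item_variance_sum / total_variance))

def alpha_if_deleted(items):
    """Alpha of the remaining battery after dropping each item in turn."""
    items = np.asarray(items, dtype=float)
    return [cronbach_alpha(np.delete(items, j, axis=1)) for j in range(items.shape[1])]

# Invented responses: 5 respondents x 4 items; the last item is unrelated noise.
toy = np.array([
    [1, 1, 2, 5],
    [2, 2, 2, 1],
    [3, 3, 4, 4],
    [4, 4, 4, 2],
    [5, 5, 5, 3],
])

full_alpha = cronbach_alpha(toy)
candidates = alpha_if_deleted(toy)
drop = int(np.argmax(candidates))   # greedy step: remove the item that hurts least
print(round(full_alpha, 3), drop, round(candidates[drop], 3))
```

Repeating the greedy step until α falls below a chosen floor gives a shortened battery; checking that the same items are dropped across bootstrap resamples of patients is what "bootstrap stability" would add.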

③ Upstream-imputation discovery

0 NaN → ~17% of cells are filled constants; 2 training subjects <30% trusted. We ship a per-cell trust mask + a re-runnable clean pipeline.

④ Persistence ceiling holds

On clean data, persistence is unbeaten by all 8 pre-registered univariate tests and by a multivariate ridge across 5 targets (0/5 beat it; mean Δ = +0.0046 NMAE, i.e. worse).

The honest-evaluation protocol — 6 real catches

Each is a claim we nearly shipped before the protocol caught it. This is itself a contribution: a checklist for the next team.

# | Initial claim | Verdict after stress-test
1 | chair-stand ← heart-rate, LOSO R²=0.154 (best of 552 scans) | test R²=−0.002 · withdrawn (multiple-testing artifact)
2 | seq2seq GRU beats persistence +16% at 3rd visit | 2/4 test subjects drove it, perm p=0.10 · reduced
3 | 1/8 pre-registered tests Bonferroni-significant on train+val | 0/8 survive on frozen test
4 | high ICC; sleep-composition; subject-8 “recovers without changing activity” | naive-ICC inflation; parts/total=0.83 not 1; imputation artifact · all withdrawn
5 | univariate signal on clean data | best p 0.004 → 0.028; 0/8 survive correction
6 | multivariate ridge/MLP “most generous fair test” | 0/5 targets beat persistence
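
The multiple-testing arithmetic behind catches 3 and 5 can be spelled out. This sketch assumes a family-wise α of 0.05, which is not stated in the table; the two p-values are the ones quoted in row 5, and the variable names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    return alpha / n_tests

threshold = bonferroni_threshold(0.05, 8)   # 8 pre-registered tests

p_first = 0.004    # first best p-value quoted in row 5
p_second = 0.028   # second best p-value quoted in row 5

print(threshold)              # 0.00625
print(p_first < threshold)    # True: would count as Bonferroni-significant
print(p_second < threshold)   # False: does not survive correction
```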

13 How the two works fit together

They are complementary, not contradictory. The forecasting paper builds a careful, leakage-safe method and shows multi-task learning helps relative to single-task. The characterisation work supplies the baseline the forecasting paper does not report — total-score vs persistence/subject-mean — and explains why any patient-blind sensor model is fighting uphill.

The missing baseline, side by side

When the GRU is put next to the trivial baselines on the same metrics, the persistence and subject-mean baselines win, exactly as the variance budget predicts. (Total-score baseline rows were added by our work; the forecasting paper reports its baseline comparison per-item only.)

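Outcome · level | Method | MAE | R² | Pearson r
OHS per-item (0–4) | GRU + attn (paper) | 1.060 | −0.18 | −0.02
OHS per-item (0–4) | population mean | 1.056 | −0.09 | −0.13
OHS per-item (0–4) | persistence | 0.618 | +0.33 | +0.63
OHS per-item (0–4) | subject mean (ours) | 0.633 | +0.37 | +0.64
OHS total (0–48) | population mean | 6.718 | −0.11 | −0.48
OHS total (0–48) | persistence | 4.172 | +0.47 | +0.71
OHS total (0–48) | subject mean (ours) | 4.160 | +0.46 | +0.70
SIS total (6–30) | population mean | 3.052 | −0.02 | −0.19
SIS total (6–30) | persistence | 2.180 | +0.32 | +0.61
SIS total (6–30) | subject mean (ours) | 2.180 | +0.38 | +0.63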

The GRU's per-item OHS R² (−0.18) is below even the population mean; persistence and subject-mean sit at R²≈+0.33–0.47. This is the variance budget made concrete.
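
The three trivial baselines are simple enough to sketch in full. The definitions below are assumptions: "persistence" is taken as the patient's previous score, "subject mean" as the mean of that patient's earlier scores, and "population mean" as a fixed training-cohort average. The trajectories and the cohort mean of 30 are invented.

```python
def baseline_forecasts(history, population_mean):
    """Forecasts for the next visit given a patient's earlier totals (non-empty)."""
    return {
        "population_mean": population_mean,
        "persistence": history[-1],
        "subject_mean": sum(history) / len(history),
    }

def baseline_mae(trajectories, population_mean):
    """MAE of each baseline over visits 2..n of every patient."""
    errors = {"population_mean": [], "persistence": [], "subject_mean": []}
    for scores in trajectories:
        for visit in range(1, len(scores)):
            forecasts = baseline_forecasts(scores[:visit], population_mean)
            for name, forecast in forecasts.items():
                errors[name].append(abs(scores[visit] - forecast))
    return {name: sum(errs) / len(errs) for name, errs in errors.items()}

# Invented total-score trajectories: very different levels, little change over time.
trajectories = [
    [40, 41, 42, 41],
    [20, 22, 21, 23],
    [30, 29, 31, 30],
]

result = baseline_mae(trajectories, population_mean=30.0)
print({name: round(value, 3) for name, value in result.items()})
```

When patients differ far more from each other than from their own past, both patient-aware baselines beat the population mean by a wide margin without seeing a single sensor value.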

What the forecasting paper does well

Kills the leakage that made earlier numbers optimistic; defines a clean 7-day-ahead protocol; demonstrates a genuine multi-task regularisation gain; ships SHAP-guided gating and recency attention as reusable ideas.

What the characterisation work adds

Quantifies the ceiling (ICC), supplies the trivial baselines the forecasting paper omits, exposes 17% upstream imputation, and packages a 6-catch honest-evaluation checklist so future teams don't chase artifacts.

The one-paragraph takeaway

MAISON-LLF recovery scores are dominated by who the patient is, not by week-to-week sensor dynamics, and ~17% of the released sensor data is filled constants. So a patient-blind model — however carefully built — is bounded to a few percent of explainable variance and loses to “predict the patient's last/average score.” The right contributions on data like this are (1) leakage-safe protocols and honest baselines, (2) characterisations that tell the field where the ceiling is, and (3) reusable method ideas (multi-task regularisation, SHAP gating) that will pay off once a larger, cleaner cohort exists.

Built from the paper PDF, explanation.tex, presentation.tex, findings.md and experiments.md. Charts are data-driven from the reported numbers; embedded figures are the project's real explainer figures. Interactive explainer · MAISON-LLF.