A visual walkthrough of the rehabilitation-monitoring dataset, the new multi-task forecasting model (Huo, Kyaw, Noh, Brown, Agarwal & Chan), and the companion characterisation work that measures the hard ceiling any sensor model runs into.
MAISON-LLF (“Multi-modal AI for Smart-hOme moNitoring — Lower-Limb Fracture”) pairs continuous home wearable/ambient sensing with sparse clinical questionnaires for older adults recovering from a broken hip or knee. The defining feature of this dataset is its size: rich per-day signals, but only a handful of patients and visits.
~56 days of daily sensor rows, punctuated by ~4 clinical visits where the questionnaires are filled in. Everything between visits is sensor data only.
12 train · 3 validation · 3 test patients (patient-level, no overlap). The companion
characterisation work used other patient-level splits (e.g. 14/4 with 4 frozen test IDs
6, 9, 13, 14) — the conclusions hold across splits.
Five recovery outcomes per assessment. Three are summed Likert questionnaires; two are physical-performance measures. The forecasting paper targets the two clinically central ones — OHS (function/pain) and SIS (social isolation).
12 items · each 0–4 · total 0–48 · higher = better joint function / less pain. Hip patients get OHS, knee patients get OKS (structurally identical).
e.g. “Climbing stairs is painful” → 0 (always) … 4 (never)
6 items · each 1–5 · total 6–30 · higher = less isolated.
e.g. “I have people I can talk to” → 1 (never) … 5 (always)
tug: Timed-Up-and-Go (seconds, lower = better) ·
chairstand: sit-stands in 30 s (count, higher = stronger). One continuous value each.
Items are summed into the total. This is exactly what the model's task heads must reproduce.
| Patient | OHS items (12 × 0–4) | OHS total | SIS items (6 × 1–5) | SIS total |
|---|---|---|---|---|
| A — good hip, well-connected | 3,4,3,4,4,4,3,4,3,4,3,4 | 43 | 5,5,4,5,5,4 | 28 |
| B — poor hip, some isolation | 1,2,2,2,3,2,1,2,1,2,1,2 | 21 | 3,4,3,4,3,2 | 19 |
| C — ceiling on both | 4,4,4,4,4,4,4,4,4,4,4,4 | 48 | 5,5,5,5,5,5 | 30 |
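The item-to-total summation in the table above can be checked directly. The sketch below copies the three example rows and sums them; the helper name `total_score` and its range checks are illustrative additions, not part of either paper's code.

```python
# Item scores copied from the example table above (patients A, B, C).
patients = {
    "A": {"ohs": [3, 4, 3, 4, 4, 4, 3, 4, 3, 4, 3, 4], "sis": [5, 5, 4, 5, 5, 4]},
    "B": {"ohs": [1, 2, 2, 2, 3, 2, 1, 2, 1, 2, 1, 2], "sis": [3, 4, 3, 4, 3, 2]},
    "C": {"ohs": [4] * 12, "sis": [5] * 6},
}


def total_score(items, n_items, lo, hi):
    """Sum Likert items after checking the item count and per-item range."""
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(not lo <= x <= hi for x in items):
        raise ValueError(f"item outside [{lo}, {hi}]")
    return sum(items)


# (OHS total on 0-48, SIS total on 6-30) for each example patient.
totals = {
    name: (total_score(p["ohs"], 12, 0, 4), total_score(p["sis"], 6, 1, 5))
    for name, p in patients.items()
}
print(totals)  # {'A': (43, 28), 'B': (21, 19), 'C': (48, 30)}
```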
Items differ widely in spread: some barely vary (sis-02: mean 4.5, σ≈0.6, so
predicting the mean is near-perfect), others carry most of the variance (ohs-11: σ≈1.6).
(real data) Four behavioural/physiological groups, all at daily resolution.
motion-mean — a foretaste of the imputation problem in §5.
(real data) Before any model: where does the variance live? If most of it is between patients (some people are simply sicker than others) rather than within a patient over time, then the trivial “predict last value” baseline already captures the lion's share, and a patient-blind sensor model is fighting over the scraps.
Standard deviation of patient means (between) vs mean of within-patient SD. Between dominates on every scale.
| Scale | Between SD | Within SD | ≥10-pt swingers |
|---|---|---|---|
| OHS (0–48) | 7.81 | 3.10 | 4 / 18 |
| SIS (6–30) | 3.34 | 1.89 | 0 / 18 |
| OKS (0–48) | 9.44 | 3.16 | 5 / 18 |
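A minimal sketch of the two columns, using made-up scores rather than MAISON-LLF data. It assumes the sample standard deviation (n − 1 denominator); the source does not say which estimator was used, and the function name is hypothetical.

```python
from statistics import mean, stdev

# Hypothetical total scores per patient across visits (NOT MAISON-LLF data).
scores = {
    "p1": [20, 23, 25, 27],
    "p2": [38, 40, 41, 43],
    "p3": [30, 31, 35, 36],
}


def between_within_sd(scores_by_patient):
    """Between = SD of patient means; within = mean of per-patient SDs."""
    patient_means = [mean(v) for v in scores_by_patient.values()]
    between = stdev(patient_means)
    within = mean(stdev(v) for v in scores_by_patient.values())
    return between, within


between_sd, within_sd = between_within_sd(scores)
print(f"between={between_sd:.2f}  within={within_sd:.2f}")
```

When between dominates within, as in the table, knowing who the patient is tells you more than knowing which week it is.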
Shrout–Fleiss ICC(1,1) with cluster-bootstrap 95% CIs. %between is the share of total variance a patient-blind model can never recover; persistence captures it for free. The sensor-explainable headroom is at most a few percent on every target.
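For readers who want to see the estimator, here is a sketch of ICC(1,1) from a one-way ANOVA with a percentile cluster bootstrap that resamples whole patients. It assumes a balanced design (same number of visits per patient) and uses hypothetical scores; the function names, bootstrap count, and seed are illustrative choices, not the companion work's actual settings.

```python
import random
from statistics import mean


def icc_1_1(rows):
    """Shrout-Fleiss ICC(1,1), balanced design: rows[i] = the k scores of subject i."""
    n, k = len(rows), len(rows[0])
    if any(len(r) != k for r in rows):
        raise ValueError("this sketch assumes the same number of visits per subject")
    grand = mean(x for r in rows for x in r)
    subj_means = [mean(r) for r in rows]
    # Between-subject and within-subject mean squares from one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((x - m) ** 2 for r, m in zip(rows, subj_means) for x in r) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def cluster_bootstrap_ci(rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI that resamples whole subjects (clusters), not single visits."""
    rng = random.Random(seed)
    stats = sorted(icc_1_1([rng.choice(rows) for _ in rows]) for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical scores: 6 patients x 4 visits (NOT MAISON-LLF data).
rows = [
    [20, 23, 25, 27], [38, 40, 41, 43], [30, 31, 35, 36],
    [12, 15, 14, 18], [44, 45, 47, 46], [25, 24, 28, 30],
]
icc = icc_1_1(rows)
ci_lo, ci_hi = cluster_bootstrap_ci(rows)
print(f"ICC(1,1)={icc:.3f}  95% CI=({ci_lo:.3f}, {ci_hi:.3f})")
```

Resampling patients rather than visits keeps each patient's visits together, which is what makes the interval honest when visits within a patient are correlated.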
With 66–91% of variance between patients and Likert quantisation removing a little more, the sensor-explainable upper bound is ≈6% at best. On the clean data, honest test R² is ≤ 0 on every pre-registered hypothesis. This is a property of the data, not of any one model.
The released sensor matrix has zero NaN cells — which is impossible for real wearables (people forget to wear/charge the watch). The upstream pipeline silently filled the gaps, usually by carrying the last value forward. Roughly 17% of all cells are filled constants, not measurements, and it's wildly uneven across patients.
56 consecutive days, every value exactly 1504. σ=0, 1 unique value, longest run = 56. A worn step-counter on a recovering patient never does this.
| day | step-count | acc-mean | sleep-deep | note |
|---|---|---|---|---|
| 1 | 1504 | 0.243 | 124 | fill begins |
| 2 | 1504 | 0.218 | 112 | constant |
| 3 | 1504 | 0.301 | 138 | |
| … | 1504 | … | … | (50 more identical) |
| 56 | 1504 | 0.239 | 124 | end |
Other channels jitter day-to-day; only step-count is frozen — the
signature of per-cell forward-fill.
flag_long_run: longest identical run ≥ 14 days.
flag_low_unique: ≤ 5 distinct values across all 56 days.
flag_zero_sd: σ < 1e-6 (channel is a constant).
A cell tripping any flag is treated as imputed. Cross-checked on high-trust patients to keep false positives low.
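The three audits are simple enough to sketch. The version below works on one patient-channel series and uses the thresholds listed above; the function name `audit_channel`, the use of the population standard deviation, and the channel-level (rather than per-cell) granularity are simplifying assumptions, not the shipped pipeline.

```python
from itertools import groupby
from statistics import pstdev


def audit_channel(values, run_days=14, max_unique=5, sd_eps=1e-6):
    """Apply the three audits to one patient's daily series for one channel."""
    # Longest stretch of consecutive identical values.
    longest_run = max(len(list(group)) for _, group in groupby(values))
    flags = {
        "flag_long_run": longest_run >= run_days,
        "flag_low_unique": len(set(values)) <= max_unique,
        "flag_zero_sd": pstdev(values) < sd_eps,
    }
    # Tripping any one audit marks the series as imputed.
    flags["imputed"] = any(flags.values())
    return flags


frozen_steps = [1504] * 56                           # the frozen step-count example
varied_steps = [1000 + 37 * d for d in range(56)]    # hypothetical worn device

frozen_flags = audit_channel(frozen_steps)
varied_flags = audit_channel(varied_steps)
print(frozen_flags)
print(varied_flags)
```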
motion-max (14/18 patients), position-duration (10/18),
sleep-snoring (8/18). Zero of 46 channels are clean cohort-wide.
% of a patient's 46 channels passing all 3 audits. Two training subjects (8, 18) are >70% imputed, so a sensor model is largely fitting constants for them.
“Multimodal Forecasting of Psychosocial and Functional Recovery in Older Adults After Lower-Limb Fracture” — Huo, Kyaw, Noh, Brown, Agarwal & Chan (University of Toronto · KMUTT). It reframes MAISON-LLF from concurrent estimation into a strict, leakage-safe forecasting task and proposes a multi-task model that jointly predicts future OHS and SIS.
Earlier MAISON-LLF analyses attached each clinical label to the sensor window preceding or around the assessment. That blurs into concurrent estimation and risks information leakage — using data from on/after the assessment day to “predict” it.
In deployment you must predict before the visit, from past data only.
Predict the score at assessment n using only observations up to day Tn − 7 (a 7-day-ahead forecast). Histories are expanding windows: the first visit forecasts from the first ~7 days, the second from ~21 days, and so on.
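The cutoff rule can be sketched in a few lines. The visit days used here (14, 28, 42, 56) are hypothetical, chosen only because they reproduce the ~7-day and ~21-day histories mentioned above; the row layout and function name are likewise assumptions.

```python
def history_for_forecast(daily_rows, assessment_day, horizon=7):
    """Keep only rows observed on or before day T_n - horizon."""
    cutoff = assessment_day - horizon
    return [row for row in daily_rows if row["day"] <= cutoff]


# Hypothetical 56-day record with one sensor value per day.
daily_rows = [{"day": d, "step_count": 1000 + 10 * d} for d in range(1, 57)]
visit_days = [14, 28, 42, 56]  # assumed assessment days, not from the paper

history_lengths = [len(history_for_forecast(daily_rows, t)) for t in visit_days]
print(history_lengths)  # [7, 21, 35, 49]
```

Because the filter is applied per target visit, no row from the final 7 days before an assessment (or from the assessment day itself) can reach the model.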
Strict chronological input→target separation, patient-level test split.
CatBoostRegressor for OHS and SIS over 46 sensor + 7 demographic features.
| Component | Problem it targets | Mechanism |
|---|---|---|
| SHAP feature gate | 46 noisy channels, most useless at N=18 | Down-weights low-value features instead of hard-dropping; differentiable, trained end-to-end |
| Recency-biased attention | Recovery is non-stationary; recent days matter more | Learnable decay weights more recent time-steps; one pattern per outcome |
| Masked / packed sequences | Patients have different history lengths (expanding window) | Pad + mask so attention ignores absent steps |
| Multi-task heads | Few labels per outcome | Shared encoder ⇒ OHS supervision regularises SIS and vice-versa |
| Item-level Smooth-L1 | Totals are sums of ordinal items | Predict items, sum to totals; robust loss vs outliers |
Training: AdamW, lr 2e-3, hidden dim 24, dropout 0.3, ≤500 epochs, early-stop patience 45, grad-clip 1.0, seed 42.
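The item-level Smooth-L1 row in the table can be made concrete with a small sketch. The transition point `beta=1.0` is an assumed default (the value used in the paper is not stated here), and the item predictions are invented for illustration.

```python
def smooth_l1(error, beta=1.0):
    """Quadratic for small errors, linear for large ones (less outlier-sensitive than L2)."""
    a = abs(error)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def item_level_loss(pred_items, true_items, beta=1.0):
    """Average Smooth-L1 over questionnaire items."""
    losses = [smooth_l1(p - t, beta) for p, t in zip(pred_items, true_items)]
    return sum(losses) / len(losses)


# Hypothetical 6-item example: predictions are continuous, targets are Likert items.
pred_items = [2.5, 3.0, 1.0, 4.0, 3.5, 2.0]
true_items = [3, 3, 4, 4, 3, 2]

loss = item_level_loss(pred_items, true_items)
pred_total = sum(pred_items)  # items are summed into the total score
print(round(loss, 4), pred_total)
```

The one large miss (1.0 vs 4) contributes 2.5 rather than the 4.5 it would under a halved squared error, which is the robustness the table refers to.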
From the paper's SHAP summary plots — the consistently high-ranked sensor families. Demographics ranked low enough that the grid search kept zero of them.
Indicative ranking drawn from the paper's Fig. 2 (acceleration / movement-event bins, sleep, step & motion counts, heart-rate dominate). Heights are illustrative ordinal ranks, not raw SHAP magnitudes.
The pipeline stacks five ideas that each deserve a plain-language explanation: gradient boosting (CatBoost), SHAP attribution, learnable feature gates, the GRU recurrent network, and attention. Here is what each one actually is and why it's used.
Decision tree: a flowchart of yes/no questions on features (“is mean step-count > 3000?”) ending in a numeric prediction at each leaf. One tree is weak and overfits.
Gradient boosting: build many small trees in sequence, where each new tree is trained to predict the residual error left by all previous trees. Add them up: prediction = tree₁ + tree₂ + … Each tree nudges the prediction toward the truth a little. This is the dominant method for tabular data — usually beating neural nets when rows are few and features are heterogeneous.
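To make "each new tree predicts the residual" concrete, here is a from-scratch sketch with one-split trees (stumps) on a single feature. This is plain gradient boosting for squared error, not CatBoost: it has no ordered boosting or categorical handling, and it adds a learning rate (shrinkage) that the one-line formula above omits. The data are invented.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # dropping the max keeps the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_trees=50, lr=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = mean(ys)
    trees = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(tree(x) for tree in trees)


# Hypothetical data: mean daily step-count vs a recovery score.
xs = [500, 1000, 1500, 2000, 3000, 4000, 5000, 6000]
ys = [18, 20, 19, 25, 30, 33, 38, 40]

model = boost(xs, ys)
mse_mean = mean((y - mean(ys)) ** 2 for y in ys)
mse_boost = mean((y - model(x)) ** 2 for x, y in zip(xs, ys))
print(round(mse_mean, 2), round(mse_boost, 4))
```

Note that this reports training error only; on a cohort this small, a held-out check is what matters.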
What “CatBoost” adds (Yandex's variant): (i) native handling of categorical features via target statistics; (ii) ordered boosting — a permutation trick that computes each row's residual using only rows seen “before” it, which reduces the target-leakage/overfitting that plain boosting suffers on small data. That small-N robustness is exactly why the paper uses it.
CatBoost is not the forecaster. It is used only in Stage I as a quick, strong tabular model whose predictions can be explained by SHAP — to rank which of the 46 sensor + 7 demographic features matter, before the GRU is built. A separate CatBoostRegressor is fit for OHS and for SIS.
A model like CatBoost is a black box: it gives a number, not a reason. SHAP answers “how much did each feature contribute to this prediction, vs the average prediction?”
The Shapley value comes from cooperative game theory (Lloyd Shapley, 1953). Treat each feature as a “player” in a game whose “payout” is the model's prediction. A feature's Shapley value is its average marginal contribution across every possible order in which features could be added to the model. It is the unique attribution that is fair (efficiency: contributions sum to the prediction; symmetry; dummy features get zero). SHAP computes these efficiently for trees.
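The "average marginal contribution across every order" definition can be run literally for a tiny model. The sketch below enumerates all feature orderings; it treats an "absent" feature as set to a fixed baseline value, which is a simplification (SHAP averages over background data), and the toy model is invented. With this setup, the efficiency property reads: contributions sum to the prediction minus the baseline prediction.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every order of adding features."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for j in order:
            current[j] = x[j]          # "add" feature j to the coalition
            new = model(current)
            phi[j] += new - prev       # its marginal contribution in this order
            prev = new
    return [p / factorial(n) for p in phi]


def toy_model(v):
    """Hypothetical model: a main effect, an interaction, and an unused feature."""
    return 2 * v[0] + v[0] * v[1] + 0 * v[2]


x = [3, 2, 5]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [9.0, 3.0, 0.0]
```

The interaction term (worth 6) is split evenly between the two features that create it, and the unused feature gets exactly zero, matching the symmetry and dummy properties. Enumeration costs n! model calls, which is why tree-specific algorithms are used in practice.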
Reading the plot (paper Fig. 2 below): each row is a feature, each dot is one patient-day. Dot position = that feature's SHAP value (push toward higher/lower predicted score); colour = the feature's value (red high, blue low). Features are sorted by mean |SHAP| — the average magnitude of influence. The paper takes that ranking to pick the top-M features.
In the original submission, SHAP was computed on the full dataset — so test-set information influenced which features were selected, contaminating the “leakage-safe” claim. Reviewer 2 caught this. The camera-ready recomputes SHAP on the training split only. The authors declined to compute SHAP per-fold (it would give different rankings per fold and need fold-specific pipelines); they use one fixed training-only ranking instead. See §10.
With 46 noisy channels and ~50 training events, feeding everything to the network invites overfitting. Two crude options: drop low-SHAP features entirely (loses information, the cutoff is arbitrary) or weight them all equally (lets noise through). A gate is the soft middle.
Each input feature gets a learnable weight, and the gated input is the element-wise product:
z̃ = w_g ⊙ z. So feature j enters the model scaled by w_g[j]. The
network can learn to shrink a useless channel toward 0 and amplify a useful one — and because it's
just multiplication, the weights are trained jointly with everything else by gradient descent.
Two design choices that matter:
w_g = softplus(θ) / mean(softplus(θ)).
Softplus keeps weights positive; dividing by the mean fixes the average gate at 1, so the
gate can re-weight features relative to each other but cannot just globally rescale the whole
input (which would fight with the network's own scaling and destabilise training).
This is a sensible regulariser, but it is also an extra set of trained parameters on a tiny dataset, and the paper reports no ablation isolating the gate's contribution, so we can't tell how much it actually helps vs. SHAP-based hard selection alone.
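The normalised gate w_g = softplus(θ) / mean(softplus(θ)) is easy to check numerically. This is a forward-pass sketch only (no training), with hypothetical θ values and function names.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t)); always positive."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def gate_weights(theta):
    """w_g = softplus(theta) / mean(softplus(theta)), so the weights average to 1."""
    sp = [softplus(t) for t in theta]
    m = sum(sp) / len(sp)
    return [s / m for s in sp]


def apply_gate(theta, z):
    """Element-wise product of gate weights and the input features."""
    return [w * v for w, v in zip(gate_weights(theta), z)]


theta = [-2.0, 0.0, 1.5, 3.0]   # hypothetical learned gate parameters
z = [0.4, -1.2, 0.7, 2.0]       # hypothetical standardised features for one day

w_g = gate_weights(theta)
z_gated = apply_gate(theta, z)
print([round(w, 3) for w in w_g])
```

Because the mean is pinned at 1, raising one θ necessarily lowers the relative weight of the others, which is the "re-weight but not rescale" behaviour described above.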
The problem: a patient's history is a sequence of daily vectors of varying length. You want one summary that respects order (recent days may matter more) and handles any length.
RNN (recurrent neural network): walk through the sequence one day at a time, carrying a
“hidden state” h — a memory vector. At each day, new input + previous memory →
updated memory. The final memory summarises the whole history. Plain RNNs forget long-range
information (vanishing gradients).
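GRU fixes this with two gates that learn what to keep vs. overwrite at each step: an update gate z, which sets how much of the old memory is overwritten, and a reset gate r, which sets how much of the old memory feeds the candidate.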
So h_new = (1 − z)·h_old + z·h_candidate, with h_candidate built from the
new input and a reset-gated copy of h_old. The GRU is a lighter cousin of the LSTM
(2 gates vs 3, no separate cell state) — fewer parameters, which suits small data.
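A single GRU step following the update rule h_new = (1 − z)·h_old + z·h_candidate can be written with plain lists. The parameter names (`Wz`, `Uz`, `bz`, and so on), the random initialisation, and the tiny dimensions are illustrative assumptions; a real model would use a deep-learning library's GRU layer.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x, h_old, p):
    """One GRU update; p holds input weights W*, hidden weights U*, biases b*."""
    def affine(W, U, b, h):
        return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

    z = [sigmoid(a) for a in affine(p["Wz"], p["Uz"], p["bz"], h_old)]  # update gate
    r = [sigmoid(a) for a in affine(p["Wr"], p["Ur"], p["br"], h_old)]  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_old)]
    h_cand = [math.tanh(a) for a in affine(p["Wh"], p["Uh"], p["bh"], reset_h)]
    return [(1 - zi) * ho + zi * hc for zi, ho, hc in zip(z, h_old, h_cand)]


def random_params(n_in, n_hid, seed=0):
    """Hypothetical small random weights, for illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for g in "zrh":
        p["W" + g] = mat(n_hid, n_in)
        p["U" + g] = mat(n_hid, n_hid)
        p["b" + g] = [0.0] * n_hid
    return p


params = random_params(n_in=3, n_hid=2)
days = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, 0.1, 0.1]]  # 3 days x 3 features

h = [0.0, 0.0]
for x in days:
    h = gru_step(x, h, params)
print(h)
```

Pushing the update-gate bias strongly negative drives z toward 0, so the memory is carried forward almost unchanged; that is the "keep" end of the keep-vs-overwrite trade-off.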
How the paper uses it: selected features → feature gate → linear projection + LayerNorm + GELU + dropout → GRU reads the expanding-window day sequence → produces one hidden state per day. Variable lengths are handled by packing + masking (padded days are ignored). Rather than only using the last hidden state, all states are pooled by attention (next tab).
The GRU emits one hidden state per day. Attention collapses them into a single context
vector as a weighted average, where the model learns the weights:
c = Σ_τ α_τ · h_τ, with weights α summing to 1 over the valid days.
Important days get larger α; padded days are masked to 0.
Recency bias: a learnable term makes recent days count more by default — natural for recovery, where the latest fortnight is more informative than week one.
Task-specific: OHS and SIS get separate attention modules over the shared GRU, so each outcome can weight the timeline its own way. The context vector then feeds a task head (OHS → 12 item scores, SIS → 6 item scores), which are summed into totals.
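The pooling step c = Σ_τ α_τ · h_τ with masking can be sketched as follows. The exact form of the recency term is not given above, so this sketch assumes a content score (dot product with a vector `v`) minus `decay` times the day's age; both `v` and `decay` stand in for learnable parameters and are hypothetical.

```python
import math


def recency_attention(hidden, mask, v, decay):
    """Masked softmax pooling with an assumed linear-in-age recency penalty."""
    valid_days = [t for t, ok in enumerate(mask) if ok]
    if not valid_days:
        raise ValueError("need at least one valid day")
    last = valid_days[-1]
    scores = []
    for t, (h, ok) in enumerate(zip(hidden, mask)):
        if ok:
            content = sum(vi * hi for vi, hi in zip(v, h))
            scores.append(content - decay * (last - t))  # older days score lower
        else:
            scores.append(-math.inf)                     # padded day: weight 0
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(hidden[0])
    context = [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(dim)]
    return alpha, context


# Three real days of hidden states plus one padded day.
hidden = [[0.1, 0.4], [0.3, 0.2], [0.5, -0.1], [0.0, 0.0]]
mask = [True, True, True, False]

alpha, context = recency_attention(hidden, mask, v=[0.0, 0.0], decay=0.5)
print([round(a, 3) for a in alpha], [round(c, 3) for c in context])
```

With `v` set to zero the weights depend on recency alone, which shows the default bias; a separate (`v`, `decay`) pair per outcome would give OHS and SIS their own view of the shared timeline.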
Every component targets a real small-data problem: gating & SHAP fight feature noise, the GRU+masking handle variable-length histories, attention+recency model non-stationary recovery, and multi-task heads share statistical strength. The ideas are sound; the open question (§10) is whether N=18 can train them.
The paper reports four numbers per outcome — MAE, RMSE, R², Pearson r — at two granularities (per-item and total-score). Knowing exactly what each means is essential to reading the results honestly, because they can disagree.
MAE = mean(|predicted − actual|). The average size of the miss, in the score's own
units. Lower is better; 0 is perfect. Easy to read: “OHS total MAE 5.23” = off by ~5.2 points
on a 0–48 scale on average.
Robust to outliers (no squaring). But it has no built-in reference — “5.2” is only good or bad relative to a baseline, which is the whole §10 argument.
RMSE = sqrt(mean((predicted − actual)²)). Like MAE but squares errors first, so
big misses are punished much more. Lower is better. RMSE ≥ MAE always; a large gap
between them signals a few large errors (here: one atypical test patient).
R² = 1 − SS_res / SS_tot = the fraction of variance the model explains relative to
just predicting the mean. 1 = perfect; 0 = no better than predicting the mean; <0 = worse
than the mean.
This is the key one. The paper's R² values are negative (e.g. OHS total −1.06). Negative R² means the GRU's predictions are worse than a flat line at the cohort average — it is not just imperfect, it actively underperforms the most trivial baseline on variance-explained.
r ∈ [−1, 1]: do predictions and truth move together, regardless of offset or
scale? +1 = perfect ranking, 0 = none, −1 = inverted.
r ignores systematic bias — you can have decent r and terrible R² if predictions
track the trend but sit consistently too high/low. That is exactly the paper's OHS total: r=+0.53 but
R²=−1.06 (right order, wrong level). And sis-02 has r=−0.79: predictions move the
wrong way.
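The "decent r, terrible R²" situation is easy to reproduce. The sketch below implements the four metrics as defined above and applies them to invented predictions that follow the trend perfectly but sit 10 points too high.

```python
import math
from statistics import mean


def mae(y, p):
    return mean(abs(a - b) for a, b in zip(y, p))


def rmse(y, p):
    return math.sqrt(mean((a - b) ** 2 for a, b in zip(y, p)))


def r2(y, p):
    ybar = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def pearson_r(y, p):
    ybar, pbar = mean(y), mean(p)
    cov = sum((a - ybar) * (b - pbar) for a, b in zip(y, p))
    var_y = sum((a - ybar) ** 2 for a in y)
    var_p = sum((b - pbar) ** 2 for b in p)
    return cov / math.sqrt(var_y * var_p)


# Hypothetical totals: right order, wrong level (constant +10 offset).
actual = [30, 34, 38, 42]
predicted = [40, 44, 48, 52]

print(mae(actual, predicted), rmse(actual, predicted))
print(r2(actual, predicted), pearson_r(actual, predicted))
```

Here r is 1 while R² is −4: correlation is blind to the offset, whereas R² charges for every point of it.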
Per-item: score each individual questionnaire item (12 OHS items 0–4; 6 SIS items 1–5) and pool metrics across all items. Fine-grained — “can the model get each question right?”
Total-score: sum the predicted items into the questionnaire total (OHS 0–48, SIS 6–30) and score that. This is the clinically meaningful number.
Summing 12 (or 6) noisy item predictions lets independent item errors partially cancel (some too high, some too low), so the total's correlation rises even when items are individually shaky. In the paper, OHS total r (+0.53) ≫ OHS per-item r (−0.04). The flip side: a good total can hide that the model isn't really predicting the items — aggregation masks per-item failure rather than fixing it.
The headline is an internal comparison: joint multi-task training beats training each outcome alone on most error metrics. Absolute performance stays limited by the tiny cohort.
OHS total MAE 5.693 → 5.233 (−8.1%); SIS total MAE 3.103 → 2.492 (−19.7%). SIS total Pearson r flips from −0.209 (single) to +0.278 (joint) — the clearest sign OHS supervision helps SIS.
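The two percentages follow from the reported MAEs by a relative-change calculation, sketched here as a quick check (the function name is illustrative).

```python
def relative_change_pct(single_task, multi_task):
    """Percent change from the single-task to the joint (multi-task) value."""
    return (multi_task - single_task) / single_task * 100


ohs_change = relative_change_pct(5.693, 5.233)
sis_change = relative_change_pct(3.103, 2.492)
print(round(ohs_change, 1), round(sis_change, 1))  # -8.1 -19.7
```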
| Outcome | level | MAE | R² | r |
|---|---|---|---|---|
| OHS | per-item | 1.046 | −0.43 | −0.04 |
| OHS | total | 5.233 | −1.06 | +0.53 |
| SIS | per-item | 0.868 | −0.17 | +0.28 |
| SIS | total | 2.492 | −0.30 | +0.28 |
Total-score r > per-item r: summing 12/6 items cancels item noise and lets the coarse recovery trend show. R² stays negative — predictions track order but carry a systematic offset.
A few items are learnable (ohs-12: r=0.84, the only
positive-R² item); some are anti-correlated (sis-02: r=−0.79, meaning sensors mislead for that
social item).
Pooled item metrics hide enormous spread. Reading Table 3 item-by-item is what reveals whether the model learned anything transferable.
ohs-12 is the one real success: the only item with positive R² (+0.353) and a strong r (+0.839). Some recovery dimension it captures is consistently reflected in the sensors.
ohs-07, ohs-08, ohs-09 have positive r but negative R²: the model tracks their trend but with a systematic offset (right direction, wrong level).
ohs-11 is the worst (MAE 2.15, R²=−3.94), and recall from §2 it carries the most total-score variance, so the model fails hardest exactly where it matters most.
sis-02 is actively misleading: r=−0.792, predictions anti-correlate with truth. The sensor cues informative for other items point the wrong way for this social item, which has no clear behavioural correlate in a wearable.
Items are quantised to 5 levels and individually noisy; the network hedges toward the middle. Only when 12/6 items are summed do independent errors cancel and the coarse recovery trend emerge, lifting total-score r well above per-item r.
This is the central tension: the clinically useful total looks decent
(r≈0.5) because aggregation launders item-level noise, not because the model
predicts the underlying items. SIS is the giveaway — better total metrics than OHS, yet its items are
weaker; the SIS total is averaging away errors, not reflecting genuine item skill.
Plotting predicted vs. ground-truth (the dashed diagonal = perfect) exposes how the model fails, not just how much.
Residuals are almost all negative (mean −1.18σ) — the model predicts OHS scores that are consistently too high for unseen patients. P2 dominates (−1.3σ to −2.0σ across the range). A consistent bias like this means the failure is a domain mismatch (test patients differ from train), not random noise — and bias is what destroys R² even when r looks okay.
Mean residual is only −0.26σ — but that is cancellation, not calibration: P9 residuals are positive, P2's are negative (down to −2.3σ), and they average out. A small mean residual here must not be read as “well-calibrated”; per-patient it is failing in opposite directions.
Benefits are not uniform: multi-task doesn't improve OHS total-score r (single-task OHS is higher there). R² is negative almost everywhere, so item-level prediction is unreliable; the residuals show the dominant error source is patient-level domain mismatch, with a single atypical test patient (P2) swinging the 3-patient metrics. The honest framing — which the rebuttal forced into the conclusion — is feasibility + a leakage-safe protocol + a multi-task regularisation effect, not clinically reliable prediction.
The paper is methodologically careful in its framing but constrained by the data and by several design choices worth scrutinising. Some issues were caught by reviewers; the authors' responses (rebuttal) are folded in below.
The model is evaluated on one fixed split (12 train / 3 val / 3 test). It does not use leave-one-subject-out or patient-grouped k-fold CV. With only 3 test patients (12 events), every reported number is one draw from a very high-variance distribution — and the residual plots show a single atypical patient (P2) dominates the metrics. A different 3-patient draw could easily flip “multi-task wins”.
Patient-grouped k-fold or LOSO rotates every patient through the test fold, so the estimate averages over all 18 people instead of betting on 3. It is the standard fix for tiny clinical cohorts and would have produced a confidence interval rather than a single fragile point estimate. The cost the authors cite (below) is feature-selection complexity, not correctness.
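A patient-grouped split generator is short to write. The sketch below assigns whole patients to folds deterministically (no shuffling, for readability); setting `n_folds` to the number of patients gives leave-one-subject-out. The event layout (18 patients × 4 events) is a stand-in, not the actual dataset index.

```python
def patient_grouped_folds(patient_ids, n_folds):
    """Yield (train_idx, test_idx) so that no patient appears on both sides."""
    unique = sorted(set(patient_ids))
    folds = [unique[i::n_folds] for i in range(n_folds)]
    for test_patients in folds:
        test_set = set(test_patients)
        train_idx = [i for i, p in enumerate(patient_ids) if p not in test_set]
        test_idx = [i for i, p in enumerate(patient_ids) if p in test_set]
        yield train_idx, test_idx


# Hypothetical layout: 18 patients with 4 assessment events each.
patient_ids = [p for p in range(1, 19) for _ in range(4)]

six_fold = list(patient_grouped_folds(patient_ids, n_folds=6))
loso = list(patient_grouped_folds(patient_ids, n_folds=18))
print(len(six_fold), len(loso))  # 6 18
```

Any feature selection (such as a SHAP ranking) would need to be refit inside each training fold to stay leakage-safe, which is the pipeline cost the authors point to.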
Reviewer 2 noticed the central “leakage-safe” claim could be undermined by how features were chosen. The authors disclosed that in the original submission SHAP was computed on the full dataset — letting test-set information steer feature selection. The camera-ready recomputes SHAP on the training set only.
They declined per-fold SHAP (which pairs naturally with k-fold CV), arguing it yields different rankings per fold and needs fold-specific pipelines; instead they use one fixed training-only ranking. Defensible for simplicity, but it is also the reason the work stays on a single split rather than CV — the two limitations are linked.
Almost all R² values are below 0 — i.e. worse than predicting the cohort mean on
variance-explained. Only ohs-12 is positive. The model captures order (positive r on
some targets) but not level.
As §12–13 show, the trivial “predict last/average score” baselines beat the GRU on the same metrics. The paper does not report total-score-vs-persistence; our companion work supplies it.
The original conclusion sold “~one-point average error” as forecasting ability. Reviewer 2 noted that is ~25% of the Likert range, with R²<0 and a losing comparison to persistence. The authors agreed and revised to frame the work as a leakage-safe formulation, not reliable prediction.
| Reviewer point | Authors' response |
|---|---|
| R1: text implied 10 patients (old dataset version) | Clarified: 18-patient (newer) dataset used |
| R1: too many sections | Merged Discussion into Results; kept Conclusion / Future Work separate |
| R2: SHAP results not shown | Added as Figure 2 |
| R2: was SHAP computed with test data? (leakage) | Conceded — original used full data; camera-ready uses training-only SHAP; per-fold declined |
| R2: conclusion overstates (1-pt error, R²<0, loses to persistence) | Agreed; revised to “leakage-safe formulation,” not reliable prediction |
| R2: SIS = “Scale” vs “Score” inconsistency | Standardised terminology |
| R2: SHAP/ICC never expanded; SIS citation wrong (Mick vs Nicholson) | Expanded SHAP; dropped ICC; corrected citation to Nicholson et al. |
Net: reviewers did not dispute the method's design; they forced honesty about leakage and about how the headline number was sold. The revised paper is a sounder, more modest version of the same contribution.
In parallel we asked a different question: not “what's the best predictor” but “how much is even predictable, and what should the next team not waste time on?” That produced four claims that survive a strict honest-evaluation protocol.
Shrout–Fleiss ICC + cluster bootstrap: 66–91% of score variance is between-patient. A patient-blind model can address at most a few % — persistence captures the rest for free.
Pure psychometrics (Cronbach α + greedy selection + bootstrap stability): SIS 6→5 robustly, OHS 12→8 mostly, OKS does not shrink on this cohort. ~33% less patient burden, defensible.
0 NaN → ~17% of cells are filled constants; 2 training subjects <30% trusted. We ship a per-cell trust mask + a re-runnable clean pipeline.
On clean data, persistence is unbeaten by all 8 pre-registered univariate tests and by a multivariate ridge across 5 targets (0/5 beat it; mean Δ = +0.0046 NMAE, i.e. worse).
Each is a claim we nearly shipped before the protocol caught it. This is itself a contribution: a checklist for the next team.
| # | Initial claim | Verdict after stress-test |
|---|---|---|
| 1 | chair-stand ← heart-rate, LOSO R²=0.154 (best of 552 scans) | test R²=−0.002 · withdrawn (multiple-testing artifact) |
| 2 | seq2seq GRU beats persistence +16% at 3rd visit | 2/4 test subjects drove it, perm p=0.10 · reduced |
| 3 | 1/8 pre-registered tests Bonferroni-significant on train+val | 0/8 survive on frozen test |
| 4 | high ICC; sleep-composition; subject-8 “recovers without changing activity” | naive-ICC inflation; parts/total=0.83 not 1; imputation artifact · all withdrawn |
| 5 | univariate signal on clean data | best p 0.004 → 0.028; 0/8 survive correction |
| 6 | multivariate ridge/MLP “most generous fair test” | 0/5 targets beat persistence |
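The Bonferroni arithmetic behind rows 3 and 5 can be checked in a few lines. The sketch assumes a family-wise α of 0.05 (not stated in the table) across the 8 pre-registered tests, and only uses the two best p-values quoted in row 5.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(n_tests=8)  # alpha=0.05 is an assumption
best_p_before = 0.004   # best p before cleaning (row 5)
best_p_after = 0.028    # best p on clean data (row 5)

print(threshold, best_p_before < threshold, best_p_after < threshold)
```

Under that assumption the threshold is 0.00625: a best p of 0.004 would clear it, while 0.028 does not, which is consistent with 0/8 surviving on clean data.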
They are complementary, not contradictory. The forecasting paper builds a careful, leakage-safe method and shows multi-task learning helps relative to single-task. The characterisation work supplies the baseline the forecasting paper does not report — total-score vs persistence/subject-mean — and explains why any patient-blind sensor model is fighting uphill.
When the GRU is put next to the trivial baselines on the same metrics, the persistence and subject-mean baselines win, exactly as the variance budget predicts. (Total-score baseline rows were added by our work; the forecasting paper's baseline comparison is per-item only.)
| Outcome · level | Method | MAE | R² | Pearson r |
|---|---|---|---|---|
| OHS per-item (0–4) | GRU + attn (paper) | 1.060 | −0.18 | −0.02 |
| | population mean | 1.056 | −0.09 | −0.13 |
| | persistence | 0.618 | +0.33 | +0.63 |
| | subject mean (ours) | 0.633 | +0.37 | +0.64 |
| OHS total (0–48) | population mean | 6.718 | −0.11 | −0.48 |
| | persistence | 4.172 | +0.47 | +0.71 |
| | subject mean (ours) | 4.160 | +0.46 | +0.70 |
| SIS total (6–30) | population mean | 3.052 | −0.02 | −0.19 |
| | persistence | 2.180 | +0.32 | +0.61 |
| | subject mean (ours) | 2.180 | +0.38 | +0.63 |
The GRU's per-item OHS R² (−0.18) is below even the population mean; persistence and subject-mean sit at R²≈+0.33–0.47. This is the variance budget made concrete.
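The three trivial baselines are worth seeing in code, because they are what any sensor model has to beat. In this sketch, "subject mean" is taken to be the mean of the patient's earlier visits, which is an assumption about the definition; the scores and the training-cohort mean are invented.

```python
from statistics import mean

# Hypothetical test patients (total scores per visit) and training-cohort mean.
test_patients = {
    "p1": [20, 23, 25, 27],
    "p2": [38, 40, 41, 43],
}
train_population_mean = 31.0


def evaluate(predict):
    """MAE of a baseline that predicts visit n from the patient's visits before n."""
    errors = []
    for visits in test_patients.values():
        for n in range(1, len(visits)):
            errors.append(abs(visits[n] - predict(visits[:n])))
    return mean(errors)


mae_persistence = evaluate(lambda prior: prior[-1])            # last observed score
mae_subject_mean = evaluate(lambda prior: mean(prior))         # patient's own average
mae_population = evaluate(lambda prior: train_population_mean)  # patient-blind

print(mae_persistence, round(mae_subject_mean, 2), round(mae_population, 2))
```

When patients differ from each other far more than they change over time, the two patient-aware baselines win by a wide margin, mirroring the pattern in the table.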
Kills the leakage that made earlier numbers optimistic; defines a clean 7-day-ahead protocol; demonstrates a genuine multi-task regularisation gain; ships SHAP-guided gating and recency attention as reusable ideas.
Quantifies the ceiling (ICC), supplies the trivial baselines the forecasting paper omits, exposes 17% upstream imputation, and packages a 6-catch honest-evaluation checklist so future teams don't chase artifacts.
MAISON-LLF recovery scores are dominated by who the patient is, not by week-to-week sensor dynamics, and ~17% of the released sensor data is filled constants. So a patient-blind model — however carefully built — is bounded to a few percent of explainable variance and loses to “predict the patient's last/average score.” The right contributions on data like this are (1) leakage-safe protocols and honest baselines, (2) characterisations that tell the field where the ceiling is, and (3) reusable method ideas (multi-task regularisation, SHAP gating) that will pay off once a larger, cleaner cohort exists.
explanation.tex, presentation.tex,
findings.md and experiments.md. Charts are data-driven from the reported
numbers; embedded figures are the project's real explainer figures. Interactive explainer · MAISON-LLF.