Building augmented data for multi-state models

library(msmtools)
library(data.table)
library(msm)

Overview

msmtools prepares longitudinal data for multi-state models fitted with msm (Jackson 2011, 2016). The package exposes four public functions:

augment() builds transition-level data from repeated observations;
polish() removes subjects with incompatible transitions at the same time;
survplot() compares fitted and empirical survival curves;
prevplot() compares observed and expected state prevalences.

The examples below use the bundled hosp dataset. It contains synthetic hospital admissions for 10 subjects.

data(hosp)
hosp[1:6, .(subj, adm_number, gender, age, label_3, dateIN, dateOUT, dateCENS)]

##     subj adm_number gender   age label_3     dateIN    dateOUT   dateCENS
##    <int>      <int> <fctr> <int>  <char>     <Date>     <Date>     <Date>
## 1:     1          1      F    83 dead_in 2008-11-30 2008-12-12 2011-04-28
## 2:     1          2      F    83 dead_in 2009-01-26 2009-02-16 2011-04-28
## 3:     1          3      F    83 dead_in 2009-05-13 2009-05-15 2011-04-28
## 4:     1          4      F    83 dead_in 2009-05-20 2009-05-25 2011-04-28
## 5:     1          5      F    83 dead_in 2009-06-12 2009-06-16 2011-04-28
## 6:     1          6      F    83 dead_in 2009-06-20 2009-06-25 2011-04-28

Data Augmentation

augment() adds one row per transition endpoint and creates status variables that can be used directly in an msm model.

hosp_augmented <- augment(
  data = copy(hosp),
  data_key = subj,
  n_events = adm_number,
  pattern = label_3,
  t_start = dateIN,
  t_end = dateOUT,
  t_cens = dateCENS
)

## Warning in augment(data = copy(hosp), data_key = subj, n_events = adm_number, :
## no t_death has been passed. Assuming that dateCENS contains both censoring and
## death times

hosp_augmented[
  1:8,
  .(subj, adm_number, label_3, augmented, augmented_int, status, status_num)
]

## Key: <subj, adm_number>
##     subj adm_number label_3  augmented augmented_int status status_num
##    <int>      <int>  <char>     <Date>         <int> <char>      <int>
## 1:     1          1 dead_in 2008-11-30         14213     IN          1
## 2:     1          1 dead_in 2008-12-12         14225    OUT          2
## 3:     1          2 dead_in 2009-01-26         14270     IN          1
## 4:     1          2 dead_in 2009-02-16         14291    OUT          2
## 5:     1          3 dead_in 2009-05-13         14377     IN          1
## 6:     1          3 dead_in 2009-05-15         14379    OUT          2
## 7:     1          4 dead_in 2009-05-20         14384     IN          1
## 8:     1          4 dead_in 2009-05-25         14389    OUT          2

When the input time columns are Date values, augment() keeps the date-valued transition time and adds an integer version. This is useful because msm works with numeric time scales.

names(hosp_augmented)

##  [1] "subj"          "adm_number"    "gender"        "age"          
##  [5] "rehab"         "it"            "rehab_it"      "label_2"      
##  [9] "label_3"       "augmented"     "augmented_int" "dateIN"       
## [13] "dateOUT"       "dateCENS"      "status"        "status_num"   
## [17] "n_status"

Outcome Schema And Generated States

pattern and state describe different parts of the augmentation. pattern is the terminal outcome schema observed in the input data. It can have two values, alive and dead, or three values, alive, dead during a transition, and dead after a transition.

state is the generated transition-state vocabulary. It must always contain three labels: the state at t_start, the state at t_end, and the absorbing state. This is why a two-value pattern still needs three state labels: augment() uses the event times to infer whether death maps to the absorbing state inside or outside the transition window.

By default, augment() uses copy = FALSE and follows data.table by-reference semantics. This avoids unnecessary memory use on large longitudinal datasets, but the input object can have its key changed and n_events can be created when the argument is omitted. Use copy = TRUE when the original input must remain unchanged.

Duplicate Transition Cleanup

polish() removes entire subjects when different transitions occur at the same time. The bundled data do not contain such conflicts, so this call leaves the data unchanged. It also uses copy = FALSE by default; set copy = TRUE when the original augmented data should not be keyed or otherwise touched by reference.

hosp_clean <- polish(
  data = copy(hosp_augmented),
  data_key = subj,
  pattern = label_3
)

nrow(hosp_augmented)

## [1] 114

nrow(hosp_clean)

## [1] 114

Survival Plot

The plotting helpers work on fitted msm objects. This example uses a compact three-state transition matrix matching the default augment() state labels.

Qmat <- matrix(0, nrow = 3, ncol = 3, byrow = TRUE)
Qmat[1, 1:3] <- 1
Qmat[2, 1:3] <- 1
colnames(Qmat) <- c("IN", "OUT", "DEAD")
rownames(Qmat) <- c("IN", "OUT", "DEAD")

msm_model <- msm(
  status_num ~ augmented_int,
  subject = subj,
  data = hosp_augmented,
  exacttimes = TRUE,
  gen.inits = TRUE,
  qmatrix = Qmat,
  method = "BFGS",
  control = list(fnscale = 6e+05, trace = 0, REPORT = 1, maxit = 10000)
)

surv_p <- survplot(msm_model, km = TRUE, grid = 10)

surv_p

The fitted and Kaplan-Meier data tables are attached to the plot as named fields, accessible with the standard $ operator:

surv_p$fitted[1:6]

##      time      surv
##     <num>     <num>
## 1:    1.0 0.9957174
## 2:  252.7 0.8974193
## 3:  504.4 0.8343219
## 4:  756.1 0.7756608
## 5: 1007.8 0.7211242
## 6: 1259.5 0.6704220

surv_p$km[1:6]

##    subject  time time_exact anystate    km
##      <int> <int>      <int>    <num> <num>
## 1:       5 13985          0        1  0.75
## 2:       6 14962        977        1  0.50
## 3:       1 15092       1107        1  0.25
## 4:       7 15623       1638        1  0.00
## 5:      NA    NA         NA       NA    NA
## 6:      NA    NA         NA       NA    NA

Prevalence Plot

prevplot() uses the output of msm::prevalence.msm() and returns a ggplot object.

prev <- prevalence.msm(
  msm_model,
  covariates = "mean",
  ci = "normal",
  times = seq(
    min(hosp_augmented$augmented_int),
    max(hosp_augmented$augmented_int),
    length.out = 6
  )
)

prev_p <- prevplot(msm_model, prev, ci = TRUE, M = FALSE)

prev_p

The long-format prevalence data used to build the plot is attached as $prevalence:

prev_p$prevalence[1:6]

##     time  state   obs        hat         lwr        upr
##    <int> <fctr> <num>      <num>       <num>      <num>
## 1:     0     IN     1 1.00000000 1.000000000 1.00000000
## 2:   503     IN     0 0.02390625 0.014228910 0.03510729
## 3:  1006     IN     0 0.02066274 0.009834560 0.03072290
## 4:  1510     IN     0 0.01785929 0.006561754 0.02785672
## 5:  2013     IN     0 0.01543621 0.004145471 0.02546796
## 6:  2517     IN     0 0.01334188 0.002612084 0.02328773

Notes

The current 2.x series keeps the public API stable while modernizing dependencies, documentation, tests, and CI. Larger internal changes to augment() are intentionally deferred until after the maintenance releases.

Jackson, C. H. 2011. “Multi-State Models for Panel Data: The Msm Package for R.” Journal of Statistical Software 38 (8): 1–29.

Jackson, C. H. 2016. Msm: Multi-State Markov and Hidden Markov Models in Continuous Time. https://cran.r-project.org/package=msm.

The msmtools workflow

Francesco Grossetti
francesco.grossetti@unibocconi.it

2026-06-05

Overview

Data Augmentation

Outcome Schema And Generated States

Duplicate Transition Cleanup

Survival Plot

Prevalence Plot

Notes

Building augmented data for multi-state models

The msmtools workflow

Francesco Grossettifrancesco.grossetti@unibocconi.it

2026-06-05

Overview

Data Augmentation

Outcome Schema And Generated States

Duplicate Transition Cleanup

Survival Plot

Prevalence Plot

Notes

Francesco Grossetti
francesco.grossetti@unibocconi.it