## Monday, November 26, 2007

### Gibbs Sampling for Censored Regression

Gibbs sampling is very convenient for many "missing data" problems. To illustrate this situation, suppose we have the simple regression model

$z_i = \beta_0 + \beta_1 x_i + \epsilon_i$,

where the errors $\epsilon_1, ..., \epsilon_n$ are iid $N(0, \sigma^2)$. The problem is that some of the response variables $z_i$ are right-censored and we actually observe $y_i = \min(z_i, c_i)$, where $c_i$ is a known censoring value. We know which observations are uncensored ($y_i = z_i$) and which observations are censored ($y_i = c_i$).
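As a concrete (hypothetical) example of this setup, one can simulate data of this form in R. The sample size, true coefficients, and censoring values below are made-up choices for illustration only:

```r
# Hypothetical data illustrating right censoring; all numbers
# (n, true intercept/slope, censoring values) are assumptions.
set.seed(42)
n <- 50
x <- runif(n, 0, 10)
z <- 2 + 0.5 * x + rnorm(n)        # latent responses z_i
cen <- 2 + 0.5 * x + 0.5           # known censoring values c_i
y <- pmin(z, cen)                  # observed y_i = min(z_i, c_i)
is.censored <- (z > cen)           # TRUE where y_i = c_i
```

For the censored cases we see only $c_i$; the latent $z_i$ is larger, which is exactly the information the arrows in the picture convey.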

We illustrate this situation with a picture inspired by a similar one in Martin Tanner's book. We show a scatterplot of the covariate $x$ against the observed $y$. The points highlighted in red correspond to censored observations -- the actual responses (the $z_i$) exceed the censoring values, which we indicate by arrows.

To apply Gibbs sampling to this situation, we imagine a complete data set where all of the $z_i$'s are known. The unknowns in this problem are the regression coefficients $\beta_0, \beta_1$, the error variance $\sigma^2$, and the $z_i$'s corresponding to the censored observations.

The joint posterior of all unobservables (assuming the usual vague prior for regression) has the form
$$
\frac{1}{\sigma^2} \prod_{i=1}^n \phi(z_i; \beta_0+\beta_1 x_i, \sigma^2)\, I(z_i, y_i),
$$

where $I(z_i, y_i)$ is equal to 1 if the observation is not censored, and $I(z_i, y_i) = I(z_i > c_i)$ if the observation is censored at $c_i$.

With the introduction of the complete data set, Gibbs sampling is straightforward.

Suppose one has initial estimates of the regression coefficients and the error variance. Then

(1) One simulates from the distribution of the missing data (the $z_i$ for the censored observations) given the parameters. Specifically, $z_i$ is simulated from a normal($\beta_0 + \beta_1 x_i, \sigma^2$) distribution truncated below at $c_i$.

(2) Given the complete data, one simulates values of the parameters using the usual approach for a normal linear regression model.

To do this in R, I have a couple of useful tools. The function `rtruncated` (in the file rtruncated.R) will simulate draws from an arbitrary truncated distribution:

```r
rtruncated <- function(n, lo, hi, pf, qf, ...)
  qf(pf(lo, ...) + runif(n) * (pf(hi, ...) - pf(lo, ...)), ...)
```

For example, to simulate 20 draws from a normal(mean = 10, sd = 2) distribution truncated below at 4, one writes

```r
rtruncated(20, 4, Inf, pnorm, qnorm, 10, 2)
```
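As a quick sanity check of my own (repeating the function definition so the snippet runs on its own): every draw should exceed the truncation point, and since 4 sits three standard deviations below the mean, the truncation barely matters and the sample mean should stay near 10.

```r
rtruncated <- function(n, lo, hi, pf, qf, ...)
  qf(pf(lo, ...) + runif(n) * (pf(hi, ...) - pf(lo, ...)), ...)

set.seed(1)
# 1000 draws from a normal(10, 2) truncated below at 4
draws <- rtruncated(1000, 4, Inf, pnorm, qnorm, 10, 2)
```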

Also, the function `blinreg` (in the file blinreg.R) in the LearnBayes package will simulate draws of the regression coefficient vector $\beta$ and the error variance $\sigma^2$ for a normal linear model with a noninformative prior.
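Putting the pieces together, here is a minimal self-contained sketch of the whole Gibbs sampler on simulated data. The data-generating numbers and the iteration count are my own assumptions, and instead of calling blinreg I draw $(\beta, \sigma^2)$ directly from the standard noninformative-prior posterior for a normal linear model:

```r
rtruncated <- function(n, lo, hi, pf, qf, ...)
  qf(pf(lo, ...) + runif(n) * (pf(hi, ...) - pf(lo, ...)), ...)

# Hypothetical censored data (true beta = (2, 0.5), true sigma = 1)
set.seed(7)
n <- 50
x <- runif(n, 0, 10)
z.true <- 2 + 0.5 * x + rnorm(n)
cen <- 2 + 0.5 * x + 0.5
y <- pmin(z.true, cen)
is.censored <- (z.true > cen)

X <- cbind(1, x)
m <- 1000
beta.draws <- matrix(0, m, 2)
sigma.draws <- numeric(m)
beta <- c(0, 0); sigma <- 1
z <- y
for (j in 1:m) {
  # Step 1: impute each censored z_i from a normal truncated below at c_i
  mu <- X %*% beta
  z[is.censored] <- rtruncated(sum(is.censored), cen[is.censored], Inf,
                               pnorm, qnorm, mu[is.censored], sigma)
  # Step 2: draw (beta, sigma^2) given the complete data, vague prior:
  # sigma^2 ~ SSE / chisq(n - 2), then beta | sigma^2 ~ N(bhat, sigma^2 (X'X)^-1)
  fit <- lm(z ~ x)
  sse <- sum(residuals(fit)^2)
  sigma2 <- sse / rchisq(1, n - 2)
  V <- chol2inv(chol(crossprod(X)))
  beta <- as.vector(coef(fit) + t(chol(sigma2 * V)) %*% rnorm(2))
  sigma <- sqrt(sigma2)
  beta.draws[j, ] <- beta
  sigma.draws[j] <- sigma
}
```

After discarding a burn-in, the draws of the slope and of $\sigma$ should concentrate near the values used to generate the data, even though roughly a third of the responses were censored.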