## Thursday, November 29, 2007

### Comparing sampling models by Bayes factors

In the last posting, I illustrated fitting a t(4) sampling model to some baseball data where I suspected there was an outlier. To compare this model to other sampling models, we can use Bayes factors.

Here is a t sampling model with a convenient choice of prior.

1.$y_1, ..., y_n$is a random sample from $t(\mu, \sigma, df)$
2. $\mu$ and $\log \sigma$ are independent, with $\mu$ distributed $N(M_0, S_0)$ and $\log \sigma$ distributed $N(M_1, S_1)$

I define a R function tsampling.R that computes the logarithm of the posterior. Note that I am careful to include all of the normalizing constants, since I am primarily interested in computing the marginal density of $y$.

tsampling=function(theta,datapar)
{
mu=theta[1]; logsigma=theta[2]; sigma=exp(logsigma)

y=datapar\$y; df=datapar\$df
mu.mean=datapar\$mu.mean; mu.sd=datapar\$mu.sd
lsigma.mean=datapar\$lsigma.mean; lsigma.sd=datapar\$lsigma.sd

loglike=sum(dt((y-mu)/sigma,df,log=TRUE)-log(sigma))
logprior=dnorm(mu,mu.mean,mu.sd,log=TRUE)+
dnorm(logsigma,lsigma.mean,lsigma.sd,log=TRUE)

return(loglike+logprior)
}

We use this function together with the function laplace to compute the log marginal density for two models.

Model 1 -- t(4) sampling, $\mu$ is normal(90, 20), $\log \sigma$ is normal(1, 1).

Model 2 -- t(30) sampling, $\mu$ is normal(90, 20), $\log \sigma$ is normal(1, 1).

Note that I'm using the same prior for both models -- the only difference is the choice of degrees of freedom in the sampling density.

dataprior1=list(y=y, df=4, mu.mean=90, mu.sd=20,
lsigma.mean=1, lsigma.sd=1)
log.m1=laplace(tsampling,c(80, 3), dataprior1)\$int

dataprior2=list(y=y, df=30, mu.mean=90, mu.sd=20,
lsigma.mean=1, lsigma.sd=1)
log.m2=laplace(tsampling,c(80, 3), dataprior2)\$int

BF.12=exp(log.m1-log.m2)
BF.12
[1] 14.8463

We see that there is substantial support for the t(4) model over the "close to normal" t(30) model.

### Robust Modeling using Gibbs Sampling

To illustrate the use of Gibbs sampling for robust modeling, here are the batting statistics for the "regular" players on the San Francisco Giants for the 2007 baseball season:

`Pos Player              Ag   G   AB    R    H   2B 3B  HR  RBI  BB  SO   BA    OBP   SLG  SB  CS  GDP HBP  SH  SF IBB  OPS+---+-------------------+--+----+----+----+----+---+--+---+----+---+----+-----+-----+-----+---+---+---+---+---+---+---+----+C   Bengie Molina       32  134  497   38  137  19  1  19   81  15   53  .276  .298  .433   0   0  13   2   1   2   2   861B *Ryan Klesko         36  116  362   51   94  27  3   6   44  46   68  .260  .344  .401   5   1  14   1   1   1   2   922B #Ray Durham          35  138  464   56  101  21  2  11   71  53   75  .218  .295  .343  10   2  18   2   0   9   6   653B  Pedro Feliz         32  150  557   61  141  28  2  20   72  29   70  .253  .290  .418   2   2  15   1   0   3   2   81SS #Omar Vizquel        40  145  513   54  126  18  3   4   51  44   48  .246  .305  .316  14   6  14   1  14   3   6   62LF *Barry Bonds         42  126  340   75   94  14  0  28   66 132   54  .276  .480  .565   5   0  13   3   0   2  43  170CF *Dave Roberts        35  114  396   61  103  17  9   2   23  42   66  .260  .331  .364  31   5   4   0   4   0   1   80RF #Randy Winn          33  155  593   73  178  42  1  14   65  44   85  .300  .353  .445  15   3  12   7   4   5   3  105    Rich Aurilia        35   99  329   40   83  19  2   5   33  22   45  .252  .304  .368   0   0   8   4   0   3   1   73    Kevin Frandsen      25  109  264   26   71  12  1   5   31  21   24  .269  .331  .379   4   3  17   5   3   3   3   84   *Fred Lewis          26   58  157   34   45   6  2   3   19  19   32  .287  .374  .408   5   1   4   3   1   0   0  103   #Dan Ortmeier        26   62  157   20   45   7  4   6   16   7   41  .287  .317  .497   2   1   2   1   0   2   1  107    Rajai Davis         26   51  142   26   40   9  1   1    7  14   25  .282  .363  .380  17   4   0   4   2   0   1   93   *Nate Schierholtz    23   39  112    9   34   5  3   0   10   2   19  .304  .316  .402   3   1   0   1   0   2   0   85   *Mark Sweeney        37   76   90   18   23   8  0   2   10  13   18  .256  .368  .411   2   0   0   3   1   0   0  102`
We'll focus on the last batting measure OPS that is a summary of a player's batting effectiveness.

We read the OPS values for the 15 players into R into the vector y.

> y
[1] 86 92 65 81 62 170 80 105 73 84 103 107 93 85 102

We assume that the observations y1, ..., y15 are iid from a t distribution with location $\mu$, scale $\sigma$ and 4 degrees of freedom. We place the usual noninformative prior $1/\sigma$ on $(\mu, \sigma)$.

We implement 10,000 iterations of Gibbs sampling by use of the function robustt.R in the LearnBayes package. This function is easy to use -- we just input the data vector y, the degrees of freedom, and the number of iterations.

fit=robustt(y,4,10000)

The object fit is a list with components mu, sigma2, and lam -- mu is a vector of simulated draws of $\mu$, sigma2 is a vector of simulated draws of $\sigma^2$, and lam is a matrix of simulated draws of the scale parameters $\lambda_1, ..., \lambda_n$. (The $\lambda_i$ are helpful for identifying outliers in the data -- here the outlier is pretty obvious.)

Below I have graphed the data as solid dots and placed a density estimate of the posterior of $\mu$ on top. We see that the estimate of $\mu$ appears to ignore the one outlier (Barry Bonds) in the data.

plot(y,0*y,cex=1.5,pch=19,ylim=c(0,.1),ylab="DENSITY",xlab="MU")
lines(density(fit\$mu),lwd=3,col="red")

By the way, we have assumed that robust modeling was suitable for this dataset since I knew that the Giants had one unusually good hitter on their team. Can we formally show that robust modeling the t(4) distribution is better than normal modeling for these data?

Sure -- we can define two models (with the choice of proper prior distributions) and compare the models by use of a Bayes factor. I'll illustrate this in my next posting.

## Monday, November 26, 2007

### Gibbs Sampling for Censored Regression

Gibbs sampling is very convenient for many "missing data" problems. To illustrate this situation, suppose we have the simple regression model

$z_i = \beta_0 + \beta_1 x_i + \epsilon_i$,

where the errors $\epsilon_1, ..., \epsilon_n$ are iid $N(0, \sigma^2)$. The problem is that some of the response variables $z_i$ are right-censored and we actually observe $y_i = \min(z_i, c_i)$, where $c_i$ is a known censoring value. We know which observations are uncensored ($y_i = z_i$) and which observations are censored ($y_i = c_i$).

We illustrate this situation by use of a picture inspired by a similar one in Martin Tanner's book. We show a scatterplot of the covariate $x$ against the observed $y$. The points highlighted in red correspond to censored observations -- the actual observations (the $z_i$) exceed the censored values, which we indicate by arrows.

To apply Gibbs sampling to this situation, we imagine a complete data set where all of the $z_i$'s are known. The unknowns in this problem are the regression coefficients $\beta_0, \beta_1$, the error variance $\sigma^2$, and the $z_i$'s corresponding to the censored observations.

The joint posterior of all unobservables (assuming the usual vague prior for regression) has the form
$
\frac{1}{\sigma^2} \prod_{i=1}^n \phi(z_i, \beta_0+\beta_1 x_i, \sigma^2) I(z_i, y_i),
$

where $I(z_i, y_i)$ is equal to 1 if the observation is not censored, and $I(z_i, y_i) = I(z_i > c_i)$ if the observation is censored at $c_i$.

With the introduction of the complete data set, Gibbs sampling is straightforward.

Suppose one has initial estimates at the regression coefficients and the error variance. Then

(1) One simulates from the distribution of the missing data (the $z_i$ for the censored observations) given the parameters. Specifically $z_i$ is simulated from a normal($\beta_0 + \beta_1 x_i, \sigma^2$) distribution censored below by $c_i$.

(2) Given the complete data, one simulates values of the parameters using the usual approach for a normal linear regression model.

To do this on R, I have a couple of useful tools. The function rtruncated.R will simulate draws from an arbitrary truncated distribution

rtruncated=function(n,lo,hi,pf,qf,...)
qf(pf(lo,...)+runif(n)*(pf(hi,...)-pf(lo,...)),...)

For example, if one wishes to simulate 20 draws of a normal(mean = 10, sd = 2) distribution that is truncated below by 4, one writes

rtruncated(20, 4, Inf, pnorm, qnorm, 10, 2)

Also the function blinreg.R in the LearnBayes package will simulate draws of a regression coefficient $\beta$ and the error variance $\sigma^2$ for a normal linear model with a noninformative prior.