Thursday, November 29, 2007

Comparing sampling models by Bayes factors

In the last posting, I illustrated fitting a t(4) sampling model to some baseball data where I suspected there was an outlier. To compare this model to other sampling models, we can use Bayes factors.

Here is a t sampling model with a convenient choice of prior.

1. y1, ..., yn is a random sample from a t density with location mu, scale sigma, and a fixed number of degrees of freedom df.
2. mu and log sigma are independent, with mu distributed normal(mu.mean, mu.sd) and log sigma distributed normal(lsigma.mean, lsigma.sd).

I define an R function tsampling.R that computes the logarithm of the posterior. Note that I am careful to include all of the normalizing constants, since I am primarily interested in computing the marginal density of the data y.

tsampling=function(theta,datapar)
{
# parameter vector theta = (mu, log sigma)
mu=theta[1]; logsigma=theta[2]; sigma=exp(logsigma)

# data and prior hyperparameters
y=datapar$y; df=datapar$df
mu.mean=datapar$mu.mean; mu.sd=datapar$mu.sd
lsigma.mean=datapar$lsigma.mean; lsigma.sd=datapar$lsigma.sd

# log likelihood: scaled t density (1/sigma)*dt((y - mu)/sigma, df)
loglike=sum(dt((y-mu)/sigma,df,log=TRUE)-log(sigma))
# log prior: independent normal priors on mu and log sigma
logprior=dnorm(mu,mu.mean,mu.sd,log=TRUE)+
  dnorm(logsigma,lsigma.mean,lsigma.sd,log=TRUE)

return(loglike+logprior)
}

We use this function together with the function laplace from the LearnBayes package to compute the log marginal density for two models.

Model 1 -- t(4) sampling, mu is normal(90, 20), log sigma is normal(1, 1).

Model 2 -- t(30) sampling, mu is normal(90, 20), log sigma is normal(1, 1).

Note that I'm using the same prior for both models -- the only difference is the choice of degrees of freedom in the sampling density.

dataprior1=list(y=y, df=4, mu.mean=90, mu.sd=20,
lsigma.mean=1, lsigma.sd=1)
log.m1=laplace(tsampling,c(80, 3), dataprior1)$int

dataprior2=list(y=y, df=30, mu.mean=90, mu.sd=20,
lsigma.mean=1, lsigma.sd=1)
log.m2=laplace(tsampling,c(80, 3), dataprior2)$int

BF.12=exp(log.m1-log.m2)
BF.12
[1] 14.8463

We see that there is substantial support for the t(4) model over the "close to normal" t(30) model.
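
As a quick follow-up (not part of the original calculation), the Bayes factor can be converted into a posterior probability of the t(4) model under equal prior odds on the two models:

post.prob.1 = BF.12/(1 + BF.12)   # posterior probability of Model 1; about 0.94 with the Bayes factor above
post.prob.1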

Robust Modeling using Gibbs Sampling

To illustrate the use of Gibbs sampling for robust modeling, here are the batting statistics for the "regular" players on the San Francisco Giants for the 2007 baseball season:

Pos Player              Ag   G   AB    R    H   2B 3B  HR  RBI  BB  SO   BA    OBP   SLG  SB  CS  GDP HBP  SH  SF IBB  OPS+
---+-------------------+--+----+----+----+----+---+--+---+----+---+----+-----+-----+-----+---+---+---+---+---+---+---+----+
C Bengie Molina 32 134 497 38 137 19 1 19 81 15 53 .276 .298 .433 0 0 13 2 1 2 2 86
1B *Ryan Klesko 36 116 362 51 94 27 3 6 44 46 68 .260 .344 .401 5 1 14 1 1 1 2 92
2B #Ray Durham 35 138 464 56 101 21 2 11 71 53 75 .218 .295 .343 10 2 18 2 0 9 6 65
3B Pedro Feliz 32 150 557 61 141 28 2 20 72 29 70 .253 .290 .418 2 2 15 1 0 3 2 81
SS #Omar Vizquel 40 145 513 54 126 18 3 4 51 44 48 .246 .305 .316 14 6 14 1 14 3 6 62
LF *Barry Bonds 42 126 340 75 94 14 0 28 66 132 54 .276 .480 .565 5 0 13 3 0 2 43 170
CF *Dave Roberts 35 114 396 61 103 17 9 2 23 42 66 .260 .331 .364 31 5 4 0 4 0 1 80
RF #Randy Winn 33 155 593 73 178 42 1 14 65 44 85 .300 .353 .445 15 3 12 7 4 5 3 105
Rich Aurilia 35 99 329 40 83 19 2 5 33 22 45 .252 .304 .368 0 0 8 4 0 3 1 73
Kevin Frandsen 25 109 264 26 71 12 1 5 31 21 24 .269 .331 .379 4 3 17 5 3 3 3 84
*Fred Lewis 26 58 157 34 45 6 2 3 19 19 32 .287 .374 .408 5 1 4 3 1 0 0 103
#Dan Ortmeier 26 62 157 20 45 7 4 6 16 7 41 .287 .317 .497 2 1 2 1 0 2 1 107
Rajai Davis 26 51 142 26 40 9 1 1 7 14 25 .282 .363 .380 17 4 0 4 2 0 1 93
*Nate Schierholtz 23 39 112 9 34 5 3 0 10 2 19 .304 .316 .402 3 1 0 1 0 2 0 85
*Mark Sweeney 37 76 90 18 23 8 0 2 10 13 18 .256 .368 .411 2 0 0 3 1 0 0 102

We'll focus on the last column, OPS+, which is a summary measure of a player's batting effectiveness.

We read the OPS+ values for the 15 players into the R vector y.

> y
[1] 86 92 65 81 62 170 80 105 73 84 103 107 93 85 102

We assume that the observations y1, ..., y15 are iid from a t distribution with location mu, scale sigma, and 4 degrees of freedom. We place the usual noninformative prior, proportional to 1/sigma^2, on (mu, sigma^2).

We implement 10,000 iterations of Gibbs sampling by use of the function robustt.R in the LearnBayes package. This function is easy to use -- we just input the data vector y, the degrees of freedom, and the number of iterations.

fit=robustt(y,4,10000)

The object fit is a list with components mu, sigma2, and lam -- mu is a vector of simulated draws of mu, sigma2 is a vector of simulated draws of sigma^2, and lam is a matrix of simulated draws of the scale parameters lambda_1, ..., lambda_15. (The lambda_i are helpful for identifying outliers in the data -- here the outlier is pretty obvious.)
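
As a small illustration (not in the original post), one can scan the posterior means of the lambda_i to see which observation the t(4) model downweights; this assumes the columns of fit$lam correspond, in order, to the 15 observations in y.

# posterior mean of each scale parameter lambda_i (one column of fit$lam per observation)
lam.means = apply(fit$lam, 2, mean)
round(lam.means, 2)
# the smallest posterior mean flags the outlying observation (Barry Bonds' 170)
which.min(lam.means)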

Below I have graphed the data as solid dots and placed a density estimate of the posterior of mu on top. We see that the estimate of mu appears to ignore the one outlier (Barry Bonds) in the data.

plot(y,0*y,cex=1.5,pch=19,ylim=c(0,.1),ylab="DENSITY",xlab="MU")
lines(density(fit$mu),lwd=3,col="red")
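
As another quick check (again, not in the original post), one can compare the ordinary sample mean of y with the posterior mean of mu from the robust fit; the robust estimate is essentially unaffected by the value 170.

mean(y)        # sample mean, pulled up by the outlier
mean(fit$mu)   # posterior mean of mu under the t(4) model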

By the way, I assumed that robust modeling was suitable for this dataset since I knew that the Giants had one unusually good hitter on their team. Can we formally show that robust modeling with the t(4) distribution is better than normal modeling for these data?

Sure -- we can define two models (with the choice of proper prior distributions) and compare the models by use of a Bayes factor. I'll illustrate this in my next posting.

Monday, November 26, 2007

Gibbs Sampling for Censored Regression

Gibbs sampling is very convenient for many "missing data" problems. To illustrate this situation, suppose we have the simple regression model

y_i = beta0 + beta1*x_i + e_i,

where the errors e_i are iid N(0, sigma^2). The problem is that some of the response variables are right-censored: instead of y_i we observe z_i = min(y_i, c_i), where c_i is a known censoring value. We know which observations are uncensored (y_i = z_i) and which observations are censored (y_i > c_i).

We illustrate this situation by use of a picture inspired by a similar one in Martin Tanner's book. We show a scatterplot of the covariate x against the observed response z. The points highlighted in red correspond to censored observations -- the actual observations (the y_i) exceed the censored values, which we indicate by arrows.


To apply Gibbs sampling to this situation, we imagine a complete data set where all of the y_i's are known. The unknowns in this problem are the regression coefficients beta0 and beta1, the error variance sigma^2, and the y_i's corresponding to the censored observations.

The joint posterior of all unobservables (assuming the usual vague prior proportional to 1/sigma^2 for regression) has the form

g(beta0, beta1, sigma^2, {y_i}) is proportional to (1/sigma^2) * product over i of f_N(y_i; beta0 + beta1*x_i, sigma) * I_i,

where f_N(.; m, s) denotes the normal density with mean m and standard deviation s, I_i is equal to 1 if the observation is not censored, and I_i = I(y_i > c_i) if the observation is censored at c_i.

With the introduction of the complete data set, Gibbs sampling is straightforward.

Suppose one has initial estimates of the regression coefficients and the error variance. Then

(1) One simulates from the distribution of the missing data (the y_i for the censored observations) given the parameters. Specifically, y_i is simulated from a normal(beta0 + beta1*x_i, sigma) distribution truncated below at the censoring value c_i.

(2) Given the complete data, one simulates values of the parameters using the usual approach for a normal linear regression model.

To do this in R, I have a couple of useful tools. The function rtruncated.R will simulate draws from an arbitrary truncated distribution:

# simulate n draws from a distribution with cdf pf and quantile function qf,
# truncated to the interval (lo, hi), using the inverse-cdf method
rtruncated=function(n,lo,hi,pf,qf,...)
  qf(pf(lo,...)+runif(n)*(pf(hi,...)-pf(lo,...)),...)

For example, if one wishes to simulate 20 draws of a normal(mean = 10, sd = 2) distribution that is truncated below by 4, one writes

rtruncated(20, 4, Inf, pnorm, qnorm, 10, 2)

Also, the function blinreg.R in the LearnBayes package will simulate draws of the regression coefficients and the error variance for a normal linear model with a noninformative prior.
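
Putting the two pieces together, here is a minimal sketch of how the full Gibbs sampler for the censored regression problem might look. The objects x (covariate), z (observed responses), c.censor (censoring values), and cens (a logical vector marking the right-censored cases) are hypothetical -- they are not defined in this post -- and the starting values and number of iterations are only illustrative.

library(LearnBayes)

X = cbind(1, x)             # design matrix for the simple regression
y = z                       # start the complete data at the observed values
m = 5000                    # number of Gibbs iterations
BETA = matrix(0, m, 2)      # storage for (beta0, beta1)
SIGMA = numeric(m)          # storage for the error standard deviation

# initial draw of the parameters from the observed (censored) responses
fit = blinreg(y, X, 1)
beta = as.vector(fit$beta); sigma = fit$sigma

for (j in 1:m)
{
  # step (1): impute each censored y from a normal(beta0 + beta1*x, sigma)
  # distribution truncated below at its censoring value
  mu.cens = beta[1] + beta[2]*x[cens]
  y[cens] = rtruncated(sum(cens), c.censor[cens], Inf, pnorm, qnorm, mu.cens, sigma)

  # step (2): given the completed data, simulate (beta, sigma) as in an
  # ordinary normal linear regression with a noninformative prior
  fit = blinreg(y, X, 1)
  beta = as.vector(fit$beta); sigma = fit$sigma

  BETA[j,] = beta; SIGMA[j] = sigma
}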