Thursday, November 29, 2007

Robust Modeling using Gibbs Sampling

To illustrate the use of Gibbs sampling for robust modeling, here are the batting statistics for the "regular" players on the San Francisco Giants for the 2007 baseball season:

Pos Player             Ag   G  AB   R   H 2B 3B HR RBI  BB SO   BA  OBP  SLG SB CS GDP HBP SH SF IBB OPS+
--- ----------------- -- --- --- --- --- -- -- -- --- --- -- ---- ---- ---- -- -- --- --- -- -- --- ----
C   Bengie Molina     32 134 497  38 137 19  1 19  81  15 53 .276 .298 .433  0  0  13   2  1  2   2   86
1B  *Ryan Klesko      36 116 362  51  94 27  3  6  44  46 68 .260 .344 .401  5  1  14   1  1  1   2   92
2B  #Ray Durham       35 138 464  56 101 21  2 11  71  53 75 .218 .295 .343 10  2  18   2  0  9   6   65
3B  Pedro Feliz       32 150 557  61 141 28  2 20  72  29 70 .253 .290 .418  2  2  15   1  0  3   2   81
SS  #Omar Vizquel     40 145 513  54 126 18  3  4  51  44 48 .246 .305 .316 14  6  14   1 14  3   6   62
LF  *Barry Bonds      42 126 340  75  94 14  0 28  66 132 54 .276 .480 .565  5  0  13   3  0  2  43  170
CF  *Dave Roberts     35 114 396  61 103 17  9  2  23  42 66 .260 .331 .364 31  5   4   0  4  0   1   80
RF  #Randy Winn       33 155 593  73 178 42  1 14  65  44 85 .300 .353 .445 15  3  12   7  4  5   3  105
    Rich Aurilia      35  99 329  40  83 19  2  5  33  22 45 .252 .304 .368  0  0   8   4  0  3   1   73
    Kevin Frandsen    25 109 264  26  71 12  1  5  31  21 24 .269 .331 .379  4  3  17   5  3  3   3   84
    *Fred Lewis       26  58 157  34  45  6  2  3  19  19 32 .287 .374 .408  5  1   4   3  1  0   0  103
    #Dan Ortmeier     26  62 157  20  45  7  4  6  16   7 41 .287 .317 .497  2  1   2   1  0  2   1  107
    Rajai Davis       26  51 142  26  40  9  1  1   7  14 25 .282 .363 .380 17  4   0   4  2  0   1   93
    *Nate Schierholtz 23  39 112   9  34  5  3  0  10   2 19 .304 .316 .402  3  1   0   1  0  2   0   85
    *Mark Sweeney     37  76  90  18  23  8  0  2  10  13 18 .256 .368 .411  2  0   0   3  1  0   0  102
We'll focus on the last column, OPS+, a summary measure of a player's overall batting effectiveness (scaled so that 100 is league average).

We read the OPS+ values for the 15 players into R as the vector y.
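
One way to enter them, reading down the OPS+ column of the table above:

# OPS+ values for the 15 Giants regulars, in the order listed in the table
y = c(86, 92, 65, 81, 62, 170, 80, 105, 73, 84, 103, 107, 93, 85, 102)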

> y
[1] 86 92 65 81 62 170 80 105 73 84 103 107 93 85 102

We assume that the observations y1, ..., y15 are iid from a t distribution with location μ, scale σ, and 4 degrees of freedom, and we place the usual noninformative prior g(μ, σ²) ∝ 1/σ² on (μ, σ²). Writing the t model as a scale mixture of normals -- yi given λi is N(μ, σ²/λi), with λi ~ Gamma(2, 2), that is, Gamma(ν/2, ν/2) with ν = 4 -- gives standard conditional posterior distributions, which is what makes Gibbs sampling convenient here.
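
To make the algorithm concrete, here is a minimal hand-rolled sketch of the sampler under this mixture representation. The name gibbs.t and the details are illustrative only; the robustt function used below packages up this computation.

# Minimal Gibbs sampler for the t(nu) location-scale model, using the
# scale-mixture-of-normals representation and the prior g ~ 1/sigma2.
gibbs.t = function(y, nu, m)
{
  n = length(y)
  mu = mean(y); sigma2 = var(y); lam = rep(1, n)
  out = list(mu = rep(0, m), sigma2 = rep(0, m), lam = matrix(0, m, n))
  for (j in 1:m)
  {
    # mu | rest: normal, precision-weighted by the lambda_i
    mu = rnorm(1, sum(lam * y) / sum(lam), sqrt(sigma2 / sum(lam)))
    # sigma2 | rest: inverse gamma
    sigma2 = 1 / rgamma(1, n / 2, rate = sum(lam * (y - mu)^2) / 2)
    # lambda_i | rest: independent gammas
    lam = rgamma(n, (nu + 1) / 2, rate = (nu + (y - mu)^2 / sigma2) / 2)
    out$mu[j] = mu; out$sigma2[j] = sigma2; out$lam[j, ] = lam
  }
  out
}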

We implement 10,000 iterations of Gibbs sampling by use of the function robustt in the LearnBayes package. This function is easy to use -- we just input the data vector y, the degrees of freedom, and the number of iterations.

library(LearnBayes)
fit=robustt(y,4,10000)

The object fit is a list with components mu, sigma2, and lam -- mu is a vector of simulated draws of μ, sigma2 is a vector of simulated draws of σ², and lam is a matrix of simulated draws of the scale parameters λ1, ..., λ15. (The λi are helpful for identifying outliers in the data -- an outlying observation gets a small λi, since a small λi inflates that observation's variance σ²/λi -- here the outlier is pretty obvious.)
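
We can see this numerically from the posterior means of the λi (this assumes fit$lam stores the draws for observation i in column i):

# posterior means of the scale parameters lambda_1, ..., lambda_15;
# the smallest value should flag the outlying observation (Barry Bonds, y[6])
apply(fit$lam, 2, mean)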

Below I have graphed the data as solid dots and placed a density estimate of the posterior of μ on top. We see that the estimate of μ appears to ignore the one outlier (Barry Bonds) in the data.

# plot the data as solid dots along the horizontal axis
plot(y, 0*y, cex=1.5, pch=19, ylim=c(0,.1), ylab="DENSITY", xlab="MU")
# overlay a density estimate of the simulated draws of mu
lines(density(fit$mu), lwd=3, col="red")
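
A quick numerical check tells the same story: the posterior for μ concentrates near the center of the other fourteen players, while the ordinary sample mean is pulled upward by the 170.

# posterior summary of mu versus the outlier-sensitive sample mean
quantile(fit$mu, c(.025, .5, .975))
mean(y)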

By the way, I assumed that robust modeling was suitable for this dataset because I knew the Giants had one unusually good hitter on their team. Can we formally show that robust modeling with the t(4) distribution is better than normal modeling for these data?

Sure -- we can define two models (each with proper prior distributions) and compare them by use of a Bayes factor. I'll illustrate this in my next posting.
