Tuesday, October 30, 2007

A Prediction Contest

Concluding our baseball example, recall that we observed the home run rates for 20 players in the month of April and we were interested in predicting their home run rates for the next month. Since we have collected data for both April and May, we can check the accuracy of three prediction methods.

1. The naive method would be to simply use the April rates to predict the May rates. Recall that the data matrix is d, whose first column contains the at-bats and whose second column contains the home run counts.

pred1=d[,2]/d[,1]

2. A second method, which we called the "pooled" method, predicts each player's home run rate by the pooled home run rate for all 20 players in April.

pred2=sum(d[,2])/sum(d[,1])

3. The Bayesian method predicts a player's May rate by the posterior mean of his true rate lambda_j.

pred3=post.means

One measure of accuracy is the sum of absolute prediction errors.

The May home run rates are stored in the vector may.rates. We first compute the individual prediction errors for all three methods.

error1=abs(pred1-may.rates)
error2=abs(pred2-may.rates)
error3=abs(pred3-may.rates)

We use the apply function to sum the absolute errors for each method.

errors=cbind(error1,error2,error3)
apply(errors,2,sum) # sum of absolute errors for all methods

error1 error2 error3
0.3393553 0.3111552 0.2622523

By this criterion, Bayes beats Pooled beats Naive.

Finally, suppose we are interested in predicting the number of home runs hit by the first player Chase Utley in May. Suppose we know that he'll have 115 at-bats in May.

1. We have already simulated 10,000 draws from the posterior of Utley's true home run rate lambda1 -- these are stored in the first column of the matrix lam.

2. Then 10,000 draws from Utley's posterior predictive distribution are obtained by use of the rpois function.

AB=115
ys1=rpois(10000,AB*lam[,1])

We graph this predictive distribution by the command

plot(table(ys1),xlab="NUMBER OF MAY HOME RUNS")

The most likely number of May home runs is 3, but a 90% prediction interval is [1, 9].
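These summaries can be read directly off the simulated draws; here is one reasonable way (the quantile-based interval is a sketch, not necessarily how the interval above was formed):

names(which.max(table(ys1)))   # most likely number of May home runs
quantile(ys1,c(.05,.95))       # rough 90% prediction interval from the 5th and 95th percentiles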

Saturday, October 27, 2007

Fitting an Exchangeable Model

Continuing our baseball example, we observe yi home runs in ei at-bats for the ith player. We assume

1. y1, ..., y20 are independent, yi is Poisson(ei lambda_i)
2. the true rates lambda_1,..., lambda_20 are independent Gamma(alpha, alpha/mu)
3. the hyperparameters alpha, mu are independent, mu is distributed 1/mu, alpha is distributed according to the proper prior z0/(alpha+z0)^2, where z0 is the prior median

The data is stored as a matrix d, whose first column contains the ei and whose second column contains the yi. In our example, we let z0 = 1; that is, the prior median of alpha is one.
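Before computing, the extra argument passed to poissgamexch needs to be defined. A minimal setup -- assuming, as in the LearnBayes documentation, that the function expects a list with the data matrix and the prior median z0 -- is:

datapar=list(data=d, z0=1)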

Here is our computing strategy.

1. First we learn about the hyperparameters (alpha, mu). The posterior of (log alpha, log mu) is programmed in the LearnBayes function poissgamexch. Here is a contour plot.

mycontour(poissgamexch,c(-1,8,-4.2,-2.5),datapar)
title(xlab="LOG ALPHA",ylab="LOG MU")


We use the function laplace to find the posterior mode and associated variance-covariance matrix. The output from laplace is used to find a proposal variance and scale parameter to use in a random walk Metropolis chain. We simulate 10,000 draws from the posterior of (log alpha, log mu). (This chain has approximately a 30% acceptance rate.)

fit=laplace(poissgamexch,c(2,-3.2),datapar)
proposal=list(var=fit$var,scale=2)
mcmcfit=rwmetrop(poissgamexch,proposal,c(1,-3),10000,datapar)

By exponentiating the simulated draws from mcmcfit, we get draws from the marginal posteriors of alpha and mu. We draw a density plot of alpha and superimpose the prior density of alpha.

alpha=exp(mcmcfit$par[,1])
mu=exp(mcmcfit$par[,2])
plot(density(alpha,adjust=2),xlim=c(0,20),col="blue",lwd=3,xlab="ALPHA",
main="POSTERIOR OF ALPHA")

prior=function(alpha,z0) z0/(alpha+z0)^2
theta=seq(.0001,20,length=200)
lines(theta,prior(theta,1),col="red",lwd=3)


2. Now we can learn about the true rate parameters. Conditional on hyperparameter values, lambda_1, ..., lambda_20 have independent gamma posteriors.

We write a short function to simulate draws of lambda_j for a particular player j.

trueratesim=function(j,data,alpha,mu)
{
e=data[,1]; y=data[,2]
rgamma(length(alpha),shape=alpha+y[j],rate=alpha/mu+e[j])
}

Then we can simulate draws of all the true rates by use of the sapply function.

lam=sapply(1:20,trueratesim,d,alpha,mu)

The output lam is a 10,000 by 20 matrix. We can compute posterior means by the apply function.

post.means=apply(lam,2,mean)

To show the behavior of the posterior means, we draw a plot where
(1) we show the observed rates yi/ei as red dots
(2) we show the posterior means as blue dots
(3) a horizontal line is drawn at the mean home run rate for all 20 players (a sketch of such a plot is given below).
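Here is one way such a plot could be drawn (a sketch, not the exact code behind the original figure), using the objects d and post.means defined above:

obs.rates=d[,2]/d[,1]                 # observed April rates y_i / e_i
pooled.rate=sum(d[,2])/sum(d[,1])     # pooled April rate for all 20 players
plot(1:20,obs.rates,col="red",pch=19,xlab="PLAYER",ylab="HOME RUN RATE")
points(1:20,post.means,col="blue",pch=19)   # posterior means, shrunk toward the pooled rate
abline(h=pooled.rate)                       # one choice for the horizontal reference line
legend("topright",c("OBSERVED","POSTERIOR MEAN"),col=c("red","blue"),pch=19)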

Thursday, October 25, 2007

Predicting home run rates

Here is a simple prediction problem. Suppose we observe the number of home runs and the number of at-bats for 20 baseball players during the first month of the baseball season (April).

We observe corresponding home run rates: Utley 4/88, Weeks 1/77, Aurilia 4/71, and so on.

From this data, we want to predict the home run rates for the same 20 players for the next month (May).

How can we best do this? Here are some opening comments.

1. One idea would be to estimate the May rates just by using the April rates. So we predict Utley's May rate to be 4/88, Weeks' rate to be 1/77, etc. This may not be a good idea since we have limited data for each player.

2. Or maybe a better strategy would be to combine the data. Collectively in April, this group hit 60 home runs in 1697 at-bats for a rate of 60/1697 = 0.035. We could estimate each player's May rate by 0.035. But this would ignore the fact that players have different abilities to hit home runs.

3. Actually, the best prediction strategy is a compromise between the first two ideas. A good plan is to predict a player's May rate by a "shrinkage" estimate that shrinks a player's individual rate towards the combined rate.

In the following postings, we'll illustrate how to fit a Bayesian exchangeable model to these data that gives a good prediction strategy.

Tuesday, October 23, 2007

View of an Exchangeable Prior

In Chapter 7, we consider the following exchangeable prior for Poisson rates lam_1, ..., lam_k that is described in two stages.

Stage I. Conditional on parameters alpha, mu, lam_1, ..., lam_k are independent Gamma(alpha, alpha/mu)

Stage II. The parameters (alpha, mu) come from a specified prior g(alpha, mu).

Here mu is the prior mean of lam_i and alpha is a precision parameter. This structure induces the following prior on lam_1, ..., lam_k:

g(lam_1, ..., lam_k) = integral prod P(lam_j | alpha, mu) g(alpha, mu) dalpha dmu.

To see how this prior reflects dependence between the parameters, suppose we fix alpha to the value alpha_0 and let mu be distributed inverse gamma(a, b). Then one can show the prior on lam_1,..., lam_k is given (up to a proportionality constant) by

g(lam_1, ..., lam_k) = P^(alpha_0-1)/(alpha_0 S + b)^(k alpha_0 + a),

where P is the product of lam_j and S is the sum of lam_j.

To see this prior, we program a simple function pgexchprior that computes the logarithm of the prior of lam_1 and lam_2 given parameter values (alpha_0, a, b).

pgexchprior=function(lambda,pars)
{
alpha=pars[1]
a=pars[2]
b=pars[3]
(alpha-1)*log(prod(lambda))-(2*alpha+a)*log(alpha*sum(lambda)+b)
}

The following R commands construct contour plots of the prior for lam_1 and lam_2 for the precision parameters alpha_0 = 5, 20, 80, and 400. (In each case, we assign mu an inverse-gamma (10, 10) prior.)

alpha=c(5,20,80,400)
par(mfrow=c(2,2))
for (j in 1:4)
{
mycontour(pgexchprior,c(.001,5,.001,5),c(alpha[j],10,10))
title(main=paste("ALPHA = ",alpha[j]),xlab="LAMBDA1",ylab="LAMBDA2")
}

These plots clearly show that, as alpha increases, the prior induces stronger correlation between the two Poisson rates.

Monday, October 22, 2007

Illustration of posterior predictive checking

To illustrate posterior predictive checking, suppose we observe data y1,...,yn that we assume are N(mu, sigma^2). We place a noninformative prior on (mu, sigma) and learn about the parameters based on the posterior distribution. We use the posterior predictive distribution to check our model -- to see if our sample is consistent with samples predicted from our fitted model.

In practice, we construct a checking function d that is a function of the future sample y*. In this example (66 measurements of the speed of light), we suspect that the smallest observation is an outlier. So we use a checking function d(y) = y_min.

Here's an R procedure for checking our model using this diagnostic.

We load in the speed of light data:

> y=scan("speedoflight.txt")
Read 66 items
> y
[1] 28 22 36 26 28 28 26 24 32 30 27 24 33 21 36 32 31 25 24
[20] 25 28 36 27 32 34 30 25 26 26 25 -44 23 21 30 33 29 27 29
[39] 28 22 26 27 16 31 29 36 32 28 40 19 37 23 32 29 -2 24 25
[58] 27 24 16 29 20 28 27 39 23

We simulate 1000 draws from the usual normal/scaled inverse chi-square posterior using the function normpostsim.

parameters=normpostsim(y,1000)

The output is a list -- here parameters$mu contains draws from mu, and parameters$sigma2 contains draws of sigma2.

The simulation of samples from the posterior predictive distribution is done by the function normpostpred. By use of comments, we explain how this function works.

normpostpred=function(parameters,sample.size,f=min)
{
# the function normalsample simulates a single sample given parameter values mu and sigma2.
# the index j is the index of the simulation number

normalsample=function(j,parameters,sample.size)
rnorm(sample.size,mean=parameters$mu[j],sd=sqrt(parameters$sigma2[j]))

# we use the sapply command to simulate many samples, where the number of samples
# corresponds to the number of parameter simulations

m=length(parameters$mu)
post.pred.samples=sapply(1:m,normalsample,parameters,sample.size)

# on each posterior predictive sample we compute a stat, the default stat is min

stat=apply(post.pred.samples,2,f)

# the function returns all of the posterior predictive samples and a vector of values of the
# checking diagnostic.

return(list(samples=post.pred.samples,stat=stat))
}

Here the size of the original sample is 66. To generate 1000 samples of size 66 from the posterior predictive distribution and store the minimum of each sample, we type

post.pred=normpostpred(parameters,66)

The minimum of each sample is stored in the vector post.pred$stat. We display these minimums in a histogram and draw a vertical line showing the location of the minimum of the observed data.

> hist(post.pred$stat,xlim=c(-50,15))
> lines(min(y)*c(1,1),c(0,350),lwd=3,col="red")

Clearly the smallest observation is inconsistent with the minimums generated from the posterior predictive distribution. This casts doubt on the normality assumption for these data.
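A numerical companion to the graph (my addition, not in the original post) is a posterior predictive p-value: the fraction of predictive minimums at least as small as the observed minimum.

# proportion of predictive samples whose minimum is <= the observed minimum
mean(post.pred$stat <= min(y))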

Saturday, October 20, 2007

No more loops!

One aspect of the LearnBayes package that I've been uncomfortable with is the use of loops in my definition of the functions defining log posteriors. I had a reason for using the loops (the theta argument to the function could be a matrix), but it is generally bad programming practice. The worst aspect of the looping is that it makes the process of writing a posterior more difficult than it really is. Looping is bad from the user's perspective. After all, we are teaching statistics, not programming, and I want to encourage people to code the posteriors for their problems using R.

Here is a simple example. Suppose your model is that y1,..., yn are independent Cauchy with location mu and scale sigma. The log posterior (assuming a flat prior on the parameters) is given by

log g = sum (log f),

where log f is the log of the Cauchy density conditional on parameters. My old way of programming the posterior had the loop

for (i in 1:length(data))
val = val + log(dt((data[i] - mu)/sigma, df = 1)/sigma)

I think this new way is preferable. First you define the function logf for a single observation y:

logf=function(y,mu,sigma) log(dt((y-mu)/sigma,df=1)/sigma)

Then the log posterior is given by

sum(logf(data,mu,sigma))
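Putting the pieces together, the complete log posterior might look like the function below. This is just a sketch -- the name cauchy.logpost and the flat prior on (mu, log sigma) are my own illustrative choices, not part of the LearnBayes package.

logf=function(y,mu,sigma) log(dt((y-mu)/sigma,df=1)/sigma)

cauchy.logpost=function(theta,data)
{
mu=theta[1]; sigma=exp(theta[2])   # theta = (mu, log sigma)
sum(logf(data,mu,sigma))           # flat prior on (mu, log sigma), so no extra term
}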

Anyway, I think that by avoiding loops, the function for the log posterior becomes more transparent.

The new version of the LearnBayes package will contain fewer loops.

Wednesday, October 17, 2007

Fitting a Mixture Sampling Model, Part II

In the previous post, we introduced the mixture sampling model and showed that the posterior had an interesting bimodal shape. Here we illustrate the use of two Metropolis random walk algorithms to sample from this posterior.

Recall that the definition of the log posterior is in poisson.mix.R and the data is contained in the vector y.

We begin with a random walk algorithm using an identity var-cov matrix and a scale constant of 0.01. We start at the value (3, 3), run for 10,000 iterations, and store the fit in the variable fit1.

proposal=list(var=diag(c(1,1)),scale=.01)
start=array(c(3,3),c(1,2))
fit1=rwmetrop(poisson.mix,proposal,start,10000,list(y=y,p=.4))

In our second algorithm, we also use an identity var-cov matrix, but use the large scale constant of 0.2.

proposal=list(var=diag(c(1,1)),scale=.2)
start=array(c(3,3),c(1,2))
fit2=rwmetrop(poisson.mix,proposal,start,10000,list(y=y,p=.4))

The acceptance rates of the two algorithms are:

Acceptance rate
fit1 94%
fit2 23%

One basic method of MCMC output analysis uses trace plots which are time series plots of the simulated draws. We show trace plots for each random walk below.
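Here is one way to draw these trace plots from the fits above (a sketch; the simulated draws are stored in the par component returned by rwmetrop):

par(mfrow=c(2,1))
plot(fit1$par[,1],type="l",ylab="THETA1",main="RANDOM WALK (SCALE = 0.01)")
plot(fit2$par[,1],type="l",ylab="THETA1",main="RANDOM WALK (SCALE = 0.2)")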

Random walk (scale = 0.01)

Note the snake-like appearance of this graph. This is typical of MCMC random walk runs with high acceptance rates.

Random walk (scale = 0.2)

This plot displays better mixing and it visits both areas of concentration of the posterior.

To check the accuracy of each chain in simulating the marginal posterior of theta1 = log(lambda1), we first use the function simcontour, which simulates draws directly from a grid that covers the actual posterior. A density estimate of the theta1 draws from this "exact" algorithm is displayed as a red line, and a density estimate of the theta1 draws from the MCMC algorithm is displayed in blue.

We first show the random walk with scale 0.01 and then the random walk with scale 0.2.
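Here is a sketch of this comparison for the scale = 0.2 chain, assuming the poisson.mix function and data list above; simcontour is the LearnBayes function mentioned in the text:

exact=simcontour(poisson.mix,c(2.2,3.8,2.2,3.8),list(y=y,p=.4),10000)
plot(density(exact$x),col="red",lwd=2,main="MARGINAL POSTERIOR OF THETA1",xlab="THETA1")
lines(density(fit2$par[,1]),col="blue",lwd=2)   # draws from the scale = 0.2 random walk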



In this example, it seems that the second MCMC run produced a better approximation to the marginal posterior density of theta1. The first run with scale 0.01 actually missed the important area of the posterior.

Fitting a Mixture Sampling Model

To illustrate the application of an MCMC random walk algorithm, consider the following mixture model. Suppose we observe season home run counts y1, ..., y20 from a group of baseball players. Some of these players are "sluggers" who average many home runs (per season) and other players are "light hitters" who don't hit as many home runs. We know that 40% of all players are sluggers, but we don't know the identity (slugger or non-slugger) for each player. The ith player's home run count yi has the sampling density

f(yi | lambda1, lambda2) = p f(yi | lambda1) + (1- p) f(yi | lambda2),

where we assume f(yi | lambda) is Poisson with mean lambda and we assume we know p = .4.

Suppose we assume that (lambda1, lambda2) has the prior 1/(lambda1 lambda2). Then the posterior is given (up to a proportionality constant) by

g(lambda1, lambda2 | y) = 1/(lambda1 lambda2) prod from i=1 to 20 [p f(yi | lambda1) + (1- p) f(yi | lambda2)]

Here is some simulated data

25 42 29 29 28 25 21 34 19 11 16 13 15 17 10 16 21 17 16 20
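For reference, these counts can be entered into the vector y that is used in the code below:

y=c(25,42,29,29,28,25,21,34,19,11,16,13,15,17,10,16,21,17,16,20)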

We first transform the parameters to the real-valued parameters

theta1 = log(lambda1), theta2 = log(lambda2)

and write the following function that computes the logarithm of the posterior of (theta1, theta2). (The Jacobian of this transformation cancels the prior 1/(lambda1 lambda2), so no extra prior term appears in the code.) Here the input variable data.p is a list that contains two elements: p, the known value of p, and y, the vector of observations.

poisson.mix=function(theta, data.p)
{
lambda1=exp(theta[,1])
lambda2=exp(theta[,2])
y=data.p$y
p=data.p$p
val=0*lambda1
for (i in 1:length(y))
val=val+log(p*dpois(y[i],lambda1)+(1-p)*dpois(y[i],lambda2))
return(val)
}

The data is stored in the vector y. We use the mycontour function to graph the bivariate posterior. We also show a perspective plot of this posterior.

mycontour(poisson.mix,c(2.2,3.8,2.2,3.8),list(y=y,p=.4))
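Here is one way the perspective plot could be produced on the same region (a sketch; the 50 x 50 grid and the viewing angles are my own choices):

theta1=seq(2.2,3.8,length=50)
theta2=seq(2.2,3.8,length=50)
tgrid=as.matrix(expand.grid(theta1,theta2))                 # all grid points
z=matrix(poisson.mix(tgrid,list(y=y,p=.4)),50,50)           # log posterior on the grid
persp(theta1,theta2,exp(z-max(z)),xlab="THETA1",ylab="THETA2",zlab="DENSITY",theta=30,phi=20)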
You might be surprised to see that the posterior is bimodal. This happens because the two mixture components are interchangeable, so the model as written is not identified. In the next posting, we'll ignore this identifiability problem and see whether a random walk Metropolis algorithm can successfully sample from this posterior.

Sunday, October 14, 2007

Choosing the scale for a Metropolis RW algorithm

Let's return to the example where we are sampling from a Cauchy density with unknown median theta and known scale parameter 1 and we place a uniform prior on theta. We observe the data

1.3 1.9 3.2 5.1 1.4 5.8 2.5 4.7 2.9 5.2 11.6 8.8 8.5 10.7 10.2
9.8 12.9 7.2 8.1 9.5

An attractive method of obtaining a simulated sample from this posterior is the Metropolis random walk algorithm. If theta^(t-1) represents the current simulated draw, then a candidate value is generated as

theta* = theta^(t-1) + c Z,

where Z is standard normal, and the candidate is accepted or rejected according to the Metropolis rule. The main issue here is the selection of the scale constant c.

I did a simple experiment. I simulated samples of 10,000 draws using the scale values 0.2, 1, 5, 25. Here's the R code for the Metropolis random walk with the choice of scale parameter 0.2.

cpost=function(theta,y)
{
val=0*theta
for (j in 1:length(y))
val=val+dt(y[j]-theta,df=1,log=TRUE)
return(val)
}
proposal=list(var=1,scale=.2)
fit1=rwmetrop(cpost,proposal,20,10000,data)

The acceptance rates for these four values are given by

scale acceptance rate
0.2 93%
1.0 73%
5.0 31%
25 7%
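The whole experiment can be run in a short loop. This is a sketch that assumes the 20 observations above are stored in the vector data, as in the rwmetrop call:

for (scale in c(0.2, 1, 5, 25))
{
proposal=list(var=1,scale=scale)
fit=rwmetrop(cpost,proposal,20,10000,data)
cat("scale =",scale," acceptance rate =",round(fit$accept,2),"\n")
}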

To assess the accuracy of a particular sample, we draw the exact posterior in red, and superimpose a density estimate of the simulated draws in blue.


Comparing these four figures, the scale values of 1 and 5 seem to do pretty well.

Wednesday, October 10, 2007

Summarizing a posterior

Continuing our selected data example, remember that we have programmed the posterior of the transformed parameters theta1 and theta2 in the R function kaminsky.

To find a normal approximation to the posterior, we apply the function laplace in the LearnBayes package. The inputs are (1) the function defining the log posterior, (2) a starting guess at the mode, (3) the number of iterations of the Newton algorithm, and (4) the data vector (the 5th and 15th order statistics).

> start=array(c(-2,-1),c(1,2))
> fit.laplace=laplace(kaminsky,start,10,data)

The output is a list containing mode, the value of the posterior mode, and var, the estimate of the variance-covariance matrix.

> fit.laplace$mode
[,1] [,2]
[1,] -2.367745 -1.091989
> fit.laplace$var
[,1] [,2]
[1,] 0.3201467 0.1191059
[2,] 0.1191059 0.1191059

We can get more accurate summaries of the posterior by means of a Metropolis random walk algorithm. The function rwmetrop implements this algorithm for an arbitrary posterior. To use this function, we define "proposal", a list containing the variance and scale parameter for the normal proposal density, and we supply the starting value for the MCMC chain, the number of simulated draws, and the data vector as further arguments. Note that we are using the approximate variance-covariance matrix from laplace in the proposal density for rwmetrop.

> proposal=list(var=fit.laplace$var,scale=2)
> fit.mcmc=rwmetrop(kaminsky,proposal,start,10000,data)

The output of rwmetrop is a list containing accept, the acceptance rate for the chain, and par, the matrix of simulated draws.

At this point, we should run some convergence diagnostics to see if the simulated draws show sufficient mixing and don't display unusually high autocorrelations. For this example, the acceptance rate is about 29%, which is within the acceptable range for this algorithm.
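For instance, a quick look at mixing and autocorrelation could use standard R tools (a sketch, not part of the original analysis):

par(mfrow=c(2,2))
plot(fit.mcmc$par[,1],type="l",ylab="THETA1")   # trace plots of the simulated draws
plot(fit.mcmc$par[,2],type="l",ylab="THETA2")
acf(fit.mcmc$par[,1],main="THETA1")             # autocorrelation plots
acf(fit.mcmc$par[,2],main="THETA2")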

We display the simulated draws on top of the contour plot of theta1 and theta2 -- it seems that most of the simulated draws fall within the first contour line.

> mycontour(kaminsky,c(-5,0,-2.5,1),data)
> title(xlab="LOG(Y5-MU)",ylab="LOG BETA")
> points(fit.mcmc$par[,1],fit.mcmc$par[,2])


We are interested in the parameters mu and beta. We first compute vectors of simulated draws of mu and beta by transforming back the simulated draws of theta1 and theta2.

> MU=data[1]-exp(fit.mcmc$par[,1])
> BETA=exp(fit.mcmc$par[,2])

We display the marginal posteriors of mu and beta.

> par(mfrow=c(2,1))
> plot(density(MU),lwd=3,main="POSTERIOR OF MU",xlab="MU")
> plot(density(BETA),lwd=3,main="POSTERIOR OF BETA",xlab="BETA")



We construct 90% interval estimates by extracting quantiles from the collection of simulated draws.

> quantile(MU,c(.05,.95))
5% 95%
9.864888 10.065765
> quantile(BETA,c(.05,.95))
5% 95%
0.2012635 0.6520858

Last, suppose we are interested in predicting the 5th order statistic ys5 from a future sample of 20 observations.

To simulate from the distribution of ys5, we (1) simulate (mu, beta) from the posterior, (2) simulate a future sample y1,...,y20 from the exponential distribution with parameters mu and beta, and (3) store the 5th ordered observation from the simulated sample. We repeat this process 1000 times, obtaining a simulated sample of ys5. We display this predictive distribution by a histogram.

ys5=rep(0,1000)
for (j in 1:1000)
{
ys=rexp(20,rate=1/BETA[5000+j])+MU[5000+j]
ys5[j]=sort(ys)[5]
}
hist(ys5,col="orange")

Learning from selected order statistics

To illustrate some computational methods for summarizing a posterior, I describe a missing data problem motivated from my undergraduate days at Bucknell. Ken Kaminsky and Paul Nelson, two of my Bucknell professors, were interested in learning about populations based on selected order statistics. (I wrote an undergraduate honors thesis on this topic. ) Here is a simple illustration of the problem.

Suppose a sample y1, ..., y20 is taken from the two-parameter exponential distribution of the form f(y | mu, beta) =1/beta exp(-(y-mu)/beta), y > mu. But you don't observe the complete dataset -- all you observe are the two order statistics y(5) and y(15) (the order statistics are the observations arranged in ascending order).

Based on this selected data, we wish to (1) estimate the parameters mu and beta by 90% interval estimates and (2) predict the value of the order statistics y*(5) and y*(20) from a future sample taken from the same population.

Here's the plan:

1. First, we write the likelihood, which is the density of the observed data (y(5) and y(15)) given values of the exponential parameters mu and beta. One can show that this likelihood is given by

L(mu, beta) = f(y(5)) f(y(15)) F(y(5))^4 (F(y(15)) - F(y(5)))^9 (1 - F(y(15)))^5,  mu > 0, beta > 0.

2. Assuming a flat (uniform) prior on (mu, beta), the posterior density is proportional to the likelihood. We write an R function kaminsky0.R that computes the logarithm of the posterior -- here the parameters are (mu, beta) and the data is (y(5), y(15)).

kaminsky0=function(theta,data)
{

f=function(y,mu,beta)
return(dexp(y-mu,rate=1/beta))
F=function(y,mu,beta)
return(pexp(y-mu,rate=1/beta))

y5=data[1]; y15=data[2]
mu=theta[,1]
beta=theta[,2]

loglike=log(f(y5,mu,beta))+log(f(y15,mu,beta))+
4*log(F(y5,mu,beta))+9*log(F(y15,mu,beta)-F(y5,mu,beta))+
5*log(1-F(y15,mu,beta))

return(loglike)
}

3. Graphing the posterior of (mu, beta), we see strong skewness in both parameters.
It is usually helpful to transform to real-valued parameters

theta1 = log(y(5) - mu) , theta2 = log(beta).

We write the following function kaminsky.R that computes the log posterior of (theta1, theta2).

kaminsky=function(theta,data)
{

f=function(y,mu,beta)
return(dexp(y-mu,rate=1/beta))
F=function(y,mu,beta)
return(pexp(y-mu,rate=1/beta))

y5=data[1]; y15=data[2]
mu=y5-exp(theta[,1])
beta=exp(theta[,2])

loglike=log(f(y5,mu,beta))+log(f(y15,mu,beta))+
4*log(F(y5,mu,beta))+9*log(F(y15,mu,beta)-F(y5,mu,beta))+
5*log(1-F(y15,mu,beta))

logjack=theta[,1]+theta[,2]   # log Jacobian of the transformation

return(loglike+logjack)
}

Here's a graph of the posterior of the reexpressed parameters -- note that it is much more normal-shaped.


4. We'll use several functions in the next posting to summarize the posterior.

(a) The laplace function is useful in finding the posterior mode and normal approximation to the posterior.

(b) By use of the rwmetrop function, we construct a random-walk Metropolis algorithm to simulate from the joint posterior.

Sunday, October 7, 2007

The SIR algorithm

There is a simple method, called the SIR algorithm, of taking a simulated sample of draws from one distribution p, and using these draws to produce a sample from a different distribution g. We illustrate this method for the Cauchy sampling model example introduced in the last post.

Suppose that we have a proposal density p(theta) that we believe is a rough approximation to the posterior (in terms of location and spread). Here we suppose that a t density with mean 7, variance 9, and 3 degrees of freedom is a rough approximation to our bimodal posterior density for theta.

There are three steps in the SIR algorithm.

1. (S) We Sample 1000 draws from the proposal density p and store them in the vector theta.p (here MEAN = 7, VAR = 9, and DF = 3, as defined below).

theta.p=sqrt(VAR)*rt(1000,DF)+MEAN

2. (I) We compute Importance sampling weights for this sample equal to the ratios of the target density (g) to the proposal density (p).

p.theta=dt((theta.p-MEAN)/sqrt(VAR),DF)/sqrt(VAR)
g.theta=exp(cpost(theta.p,y))
weights=g.theta/p.theta

The following figure plots the simulated draws from the proposal density against the weights.

plot(theta.p,weights,xlim=c(2,12),xlab="THETA",ylab="POST/PROPOSAL")


3. (R) We resample 1000 draws with replacement from the simulated draws theta.p, where the sampling probabilities are proportional to the weights.

probs=weights/sum(weights)
theta.sample=sample(theta.p,size=1000,prob=probs,replace=TRUE)

The values in the vector theta.sample should be (approximately) a sample from the posterior density g.

The SIR algorithm with a t proposal density is implemented in the function sir.R in the LearnBayes package.

To illustrate this function, remember the definition of the log posterior is in cpost and the data is stored in the vector y. We create a list tpar that contains the components of the t proposal density.

MEAN=7; VAR=9; DF=3
tpar=list(m=MEAN,var=VAR,df=DF)

Then we implement the algorithm using the sir function -- the output is a vector of simulated draws from the posterior.

s=sir(cpost,tpar,1000,y)
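As a quick check (my addition, not in the original post), the draws returned by sir and the hand-coded resample theta.sample above should produce similar density estimates:

plot(density(s),col="blue",lwd=2,main="SIR DRAWS",xlab="THETA")
lines(density(theta.sample),col="red",lwd=2)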

Tuesday, October 2, 2007

Robust Modeling

To illustrate some of the computational methods to summarize a posterior, suppose that we observe a sample y1, ..., yn from a Cauchy density with location theta and scale parameter 1. If we assign a uniform prior to theta, then the posterior density of theta is proportional to

product from i=1 to n [ (1 + (yi - theta)^2)^{-1} ].

Suppose we observe the following 20 values:

1.3 1.9 3.2 5.1 1.4 5.8 2.5 4.7 2.9 5.2 11.6 8.8 8.5 10.7 10.2
9.8 12.9 7.2 8.1 9.5

In R, we define a new function cpost.R that computes the logarithm of the posterior density.

cpost=function(theta,y)
{
val=0*theta
for (j in 1:length(y))
val=val+dt(y[j]-theta,df=1,log=TRUE)
return(val)
}

We compute and display the posterior on the interval [2, 12].
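Here is one way to do this on a grid -- a sketch that assumes the 20 observations above are stored in the vector y (as in the rejectsampling call below); it also defines the theta and post objects used in the bounding-constant calculation.

theta=seq(2,12,by=.01)
post=exp(cpost(theta,y))                 # unnormalized posterior on the grid
plot(theta,post/sum(post)/.01,type="l",lwd=2,xlab="THETA",ylab="POSTERIOR DENSITY")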

Note the interesting bimodal shape of the posterior. Clearly a normal approximation to this posterior will not be an accurate representation.

Suppose we wish to simulate a sample from this posterior. A general method for generating a sample is the reject algorithm. To construct a reject algorithm, we find a suitable density p that is easy to simulate from and that covers the posterior g in the sense that g(theta) <= c p(theta) for all theta. Then one simulates a candidate theta from p and a variate u from a uniform(0, 1) distribution; if u < g(theta)/(c p(theta)), then we accept theta as a draw from the target distribution g. Suppose we let p be a t density with mean mu, variance v, and degrees of freedom df. Here we let

MEAN=7; VAR=9; DF=3

To find the bounding constant, we plot the logarithm of the ratio, log g(theta) - log p(theta) over the range of theta.

p.density=dt(theta-MEAN,DF)/sqrt(VAR)
plot(theta,log(post/p.density),type="l")

From inspection of the graph (and a little calculation), it appears that the log of the ratio is bounded above by -62.98.

In the R code, we draw the posterior density and the proposal density multiplied by the bounding constant, which covers the posterior.

The function rejectsampling.R will implement reject sampling with a t covering density. The inputs to this function are (1) the definition of the log target density, (2) a list giving the parameters of the t covering density (mean, variance, df), (3) the value of the log of the bounding constant, (4) the number of values simulated from the proposal density, and (5) the data used in the target function. The output of this function is a vector of values from the target distribution.

tpar=list(m=MEAN,var=VAR,df=DF)
dmax=-62.98
n=10000
theta.sim=rejectsampling(cpost,tpar,dmax,n,y)

One can compute the acceptance rate of this algorithm by dividing the length of the output vector theta.sim by the number of simulated draws from the proposal. The acceptance rate for this example is about 12%. To demonstrate that this algorithm works, we draw in the following figure (1) the exact posterior in red and (2) a density estimate of the accepted draws in blue. I'm convinced that the algorithm has indeed produced a sample from the posterior distribution.

Monday, October 1, 2007

Normal approximations to posteriors

Continuing our Phillies example, I'm going to change the example somewhat and consider the relationship between the number of runs the Phillies score and the game outcome. Here is the table:

                         runs scored
game outcome     low (four or less)     high (five or more)
Win                      15                     74
Loss                     53                     20

There appears to be a notable positive relationship here and we're interested in estimating the underlying correlation coefficient of the bivariate normal.

We first construct a function polycorr.R that computes the logarithm of the posterior of the correlation.

polycorr=function(rho,data)
{
# needs package mvtnorm
n11=data[1,1]; n12=data[1,2]; n21=data[2,1]; n22=data[2,2]
nA=n11+n12; nB=n11+n21; n=n11+n12+n21+n22
c=qnorm(nA/n); d=qnorm(nB/n); pc=nA/n; pd=nB/n
val=0*rho
for (j in 1:length(rho))
{
C=matrix(c(1,rho[j],rho[j],1),c(2,2))
p00=pmvnorm(lower=-Inf*c(1,1),upper=c(c,d),corr=C)
val[j]=n11*log(pc-p00)+n12*log(1-pc-pd+p00)+n21*log(p00)+n22*log(pd-p00)
}
return(val)
}

We input the data as a two by two matrix.

data=matrix(c(15,53,74,20),c(2,2))

We find the normal approximation by use of the laplace function in the LearnBayes package. The inputs are the function, the starting value for the Newton algorithm, the number of iterations of the algorithm, and the data used in the function.


fit=laplace(polycorr,.6,10,data)

From the output of this function, we get that rho is approximately N(.694, .00479). In the R code below we plot the exact and approximate posteriors. In the figure, we see some inaccuracy in the normal approximation.

rho=seq(.3,1,by=.01)
gpost=exp(polycorr(rho,data))

plot(rho,gpost/sum(gpost)/.01,type="l",lwd=3,col="red",
ylab="DENSITY",xlab="RHO")
lines(rho,dnorm(rho,fit$mode,sqrt(fit$var)),lwd=3,col="blue")

legend(locator(1),c("EXACT","NORMAL APPROX"),col=c("red","blue"),lwd=2)



One way of improving the accuracy of the normal approximation is to transform rho to the real-valued parameter theta = log [(rho+1)/(1-rho)]. We write a function to compute the log posterior of theta.

polycorr2=function(theta,data)
{
# needs package mvtnorm
rho=(exp(theta)-1)/(exp(theta)+1)
n11=data[1,1]; n12=data[1,2]; n21=data[2,1]; n22=data[2,2]
nA=n11+n12; nB=n11+n21; n=n11+n12+n21+n22
c=qnorm(nA/n); d=qnorm(nB/n); pc=nA/n; pd=nB/n
val=0*rho
for (j in 1:length(rho))
{
C=matrix(c(1,rho[j],rho[j],1),c(2,2))
p00=pmvnorm(lower=-Inf*c(1,1),upper=c(c,d),corr=C)
val[j]=n11*log(pc-p00)+n12*log(1-pc-pd+p00)+n21*log(p00)+n22*log(pd-p00)
}
return(val+log(1-rho)+log(1+rho))
}


We find the normal approximation using the function laplace. Here the approximation is that the transformed rho is normal with mean 1.662 and variance .0692.

fit1=laplace(polycorr2,0,10,data)

We plot the exact and approximate posteriors -- here the normal approximation appears very accurate.

theta=seq(0.4,3.0,by=.01)
gpost=exp(polycorr2(theta,data))
plot(theta,gpost/sum(gpost)/.01,type="l",lwd=3,col="red",
ylab="DENSITY",xlab="THETA")
lines(theta,dnorm(theta,fit1$mode,sqrt(fit1$var)),lwd=3,col="blue")
legend(locator(1),c("EXACT","NORMAL APPROX"),col=c("red","blue"),lwd=2)
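As a usage note (my addition), the normal approximation on the theta scale can be converted back into an approximate 90% interval for rho:

m=c(fit1$mode); s=sqrt(c(fit1$var))
theta.int=qnorm(c(.05,.95),m,s)            # 90% interval on the theta scale
(exp(theta.int)-1)/(exp(theta.int)+1)      # transformed back to the rho scale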






Tribute to the Phillies

As some of you might know, the Philadelphia Phillies are in the Major League Baseball playoffs which is pretty amazing. So we'll have to fit a model to some Phillies data. For each game of the 2007 season, we'll record

(1) if they won or lost the game
(2) the margin of victory, which is equal to the winner's score minus the loser's score

We are interested in exploring the relationship between these two variables. Suppose we classify the margin of victory as "close" (3 runs or less) or a "blowout" (4 runs or more). Here is a 2 x 2 contingency table classifying all games by result and margin of victory

            margin
       close     blowout
L        44         29
W        51         38

One of the oldest approaches to estimating the relationship between two ordinal variables is the polychoric correlation coefficient. One assumes that there is an underlying bivariate normal distribution with zero means, unit variances, and correlation rho.

The observed counts are obtained by categorizing this continuous measure at the cutpoints c (on the x scale) and d (on the y scale). One can estimate the cutpoints from the data (here one solves Phi(c) = 73/162 and Phi(d) = 95/162), and the likelihood of the correlation coefficient rho is given by

L(rho) = p1^44 p2^29 p3^51 p4^38,

where p1, p2, p3, p4 are the probabilities (dependent on rho) that the bivariate normal falls in the four regions determined by the cutpoints c and d. If we place a uniform prior on rho, then the posterior density will be proportional to the likelihood.

We'll use this example to illustrate different computational approaches to summarizing the posterior distribution.