Yesterday Chris Rump at BGSU gave an interesting presentation about simulating the 2008 Presidential Election. He was explaining the methodology used by Nate Silver on the fivethirtyeight.com site.
Here is a relatively simple Bayesian approach for estimating the number of electoral votes that Barack Obama will get in the election on Tuesday.
First, using the polling data on cnn.com, I collected the percentages for McCain and Obama in the latest poll in each state. The web site only gives the survey percentages and not the sample sizes. A typical sample size in an election poll is 1000, but I will assume that each sample size is 500. This is conservative, and it allows for some changes in voting behavior in the weeks before Election Day.
Suppose 500 voters in Ohio are sampled and 47% are for McCain and 51% are for Obama. This means that 235 voters were for McCain and 255 were for Obama, leaving 10 for other candidates. Let p.M and p.O denote the proportions of the voting population in Ohio for the two candidates, so that 1 - p.M - p.O is the proportion of the population for someone else. Assuming a vague prior on (p.M, p.O, 1 - p.M - p.O), the posterior distribution for the proportions is proportional to
p.M^235 p.O^255 (1 - p.M - p.O)^10
which is a Dirichlet distribution. The probability that McCain wins Ohio is simply the posterior probability
P(p.M > p.O)
For each state, I can easily estimate this probability by simulation. One simulates 5000 draws from a Dirichlet distribution and computes the proportion of draws where p.M > p.O.
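Here is a minimal sketch of that simulation for the Ohio numbers. It uses only base R, so the helper rdirichlet.base is my own illustrative function (not part of any package); it builds Dirichlet draws by normalizing independent gamma draws. I am also assuming that the posterior kernel above corresponds to Dirichlet parameters (236, 256, 11), that is, the counts plus one.

```r
set.seed(2008)

# Hypothetical helper: n draws from a Dirichlet(alpha) distribution,
# obtained by normalizing independent gamma draws (one column per category).
rdirichlet.base <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  g / rowSums(g)
}

# Ohio poll counts under the assumed sample size of 500
counts <- c(McCain = 235, Obama = 255, Other = 10)

# Posterior kernel p.M^235 p.O^255 (1 - p.M - p.O)^10 is Dirichlet(236, 256, 11)
draws <- rdirichlet.base(5000, counts + 1)

# Proportion of draws where McCain's share exceeds Obama's
prob.M.wins <- mean(draws[, 1] > draws[, 2])
prob.M.wins
```

The result should land near the Ohio entry in the table below, with some simulation noise from using only 5000 draws.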
The following table summarizes my calculations. For each state, I give the percentage of voters for McCain and Obama in the latest poll and my computed probability that McCain wins the state based on this data.
State          M.pct O.pct prob.M.wins EV
Alabama           58    36       1.000  9
Alaska            55    37       1.000  3
Arizona           53    46       0.946 10
Arkansas          53    41       0.997  6
California        33    56       0.000 55
Colorado          45    53       0.032  9
Connecticut       31    56       0.000  7
Delaware          38    56       0.000  3
D.C.              13    82       0.000  3
Florida           47    51       0.189 27
Georgia           52    47       0.869 15
Hawaii            32    63       0.000  4
Idaho             68    26       1.000  4
Illinois          35    59       0.000 21
Indiana           45    46       0.416 11
Iowa              42    52       0.009  7
Kansas            63    31       1.000  6
Kentucky          55    39       1.000  8
Louisiana         50    43       0.949  9
Maine             35    56       0.000  4
Maryland          39    54       0.000 10
Massachusetts     34    53       0.000 12
Michigan          36    58       0.000 17
Minnesota         38    57       0.000 10
Mississippi       46    33       1.000  6
Missouri          50    48       0.675 11
Montana           48    44       0.825  3
Nebraska          43    45       0.329  5
Nevada            45    52       0.058  5
New Hampshire     39    55       0.000  4
New Jersey        36    59       0.000 15
New Mexico        40    45       0.117  5
New York          31    62       0.000 31
North Carolina    46    52       0.088 15
North Dakota      43    45       0.318  3
Ohio              47    51       0.182 20
Oklahoma          61    34       1.000  7
Oregon            34    48       0.000  7
Pennsylvania      43    55       0.004 21
Rhode Island      31    45       0.000  4
South Carolina    59    37       1.000  8
South Dakota      48    41       0.951  3
Tennessee         55    39       1.000 11
Texas             57    38       1.000 34
Utah              55    32       1.000  5
Vermont           36    57       0.000  3
Virginia          44    53       0.022 13
Washington        34    55       0.000 11
West Virginia     53    44       0.978  5
Wisconsin         42    53       0.007 10
Wyoming           58    32       1.000  3
Once we have these win probabilities for all states, it is easy to simulate the election. Essentially one flips 51 biased coins, where the probability that McCain wins each state is given by that state's win probability. Once the state winners have been simulated, one can accumulate the electoral votes for the two candidates. I'll focus on the electoral count for Obama since he is predicted to win the election.
I repeated this process for 5000 simulated elections. Here is a histogram of the Obama electoral count. Note that all of the counts exceed 300. Since 270 electoral votes are needed to win, the simulated probability that Obama wins the election is 1.
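The coin-flipping step can be sketched as follows. This is a rough base R version, not necessarily the code I ran: the vectors prob.M.wins and EV are copied from the table above (states in the same alphabetical order), and the variable names and the seed are just illustrative choices.

```r
set.seed(538)

# McCain win probabilities and electoral votes, in the order of the table
prob.M.wins <- c(1.000, 1.000, 0.946, 0.997, 0.000, 0.032, 0.000, 0.000, 0.000,
                 0.189, 0.869, 0.000, 1.000, 0.000, 0.416, 0.009, 1.000, 1.000,
                 0.949, 0.000, 0.000, 0.000, 0.000, 0.000, 1.000, 0.675, 0.825,
                 0.329, 0.058, 0.000, 0.000, 0.117, 0.000, 0.088, 0.318, 0.182,
                 1.000, 0.000, 0.004, 0.000, 1.000, 0.951, 1.000, 1.000, 1.000,
                 0.000, 0.022, 0.000, 0.978, 0.007, 1.000)
EV <- c(9, 3, 10, 6, 55, 9, 7, 3, 3, 27, 15, 4, 4, 21, 11, 7, 6, 8, 9, 4, 10,
        12, 17, 10, 6, 11, 3, 5, 5, 4, 15, 5, 31, 15, 3, 20, 7, 7, 21, 4, 8,
        3, 11, 34, 5, 3, 13, 11, 5, 10, 3)

n.sims <- 5000

# Each column is one simulated election: 1 means McCain wins that state.
# The 51 probabilities are recycled down each column of the matrix.
mccain.wins <- matrix(rbinom(51 * n.sims, size = 1, prob = prob.M.wins),
                      nrow = 51)

# Obama's electoral votes: sum EV over the states McCain does not win
obama.ev <- colSums((1 - mccain.wins) * EV)

summary(obama.ev)
mean(obama.ev >= 270)   # estimated probability that Obama reaches 270
```

Calling hist(obama.ev) on the result would give a histogram like the one described above. Note that this treats the states as independent coin flips, which is the simplification the approach relies on.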
Friday, October 31, 2008
Monday, June 23, 2008
Variance components model
Here is a simple illustration of a variance components model given by "Dyes" in the WinBUGS 1.4 Examples, volume 1:
******************************************************
Box and Tiao (1973) analyse data first presented by Davies (1967) concerning batch to batch variation in yields of dyestuff. The data (shown below) arise from a balanced experiment whereby the total product yield was determined for 5 samples from each of 6 randomly chosen batches of raw material.
Batch Yield (in grams)
_______________________________________
1 1545 1440 1440 1520 1580
2 1540 1555 1490 1560 1495
3 1595 1550 1605 1510 1560
4 1445 1440 1595 1465 1545
5 1595 1630 1515 1635 1625
6 1520 1455 1450 1480 1445
*******************************************************
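Let $y_{ij}$ denote the jth observation in batch i. To determine the relative importance of between batch variation versus sampling variation, we fit the multilevel model.
1. $y_{ij}$ is N($\mu + b_i$, $\sigma_y^2$)
2. $b_1, ..., b_6$ are iid N(0, $\sigma_b^2$)
3. $(\mu, \sigma_y, \sigma_b)$ assigned a uniform prior
In this situation, the focus is on the marginal posterior distribution of $(\mu, \sigma_y, \sigma_b)$. It is possible to analytically integrate out the random effects $b_1, ..., b_6$, resulting in the marginal posterior density
$$g(\mu, \sigma_y, \sigma_b \mid y) \propto \prod_{i=1}^{6} \sigma_y^{-n} \exp\left(-\frac{S_i}{2\sigma_y^2}\right) \left(\frac{\sigma_y^2}{n} + \sigma_b^2\right)^{-1/2} \exp\left(-\frac{(\bar{y}_i - \mu)^2}{2(\sigma_y^2/n + \sigma_b^2)}\right)$$
where $\bar{y}_i$ is the mean of the $n = 5$ yields in batch i and $S_i$ is the "within batch" sum of squares for the ith batch. To use the computational algorithms in LearnBayes, we consider the log posterior distribution of $(\mu, \log \sigma_y, \log \sigma_b)$
that is programmed in the function logpostnorm1: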
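logpostnorm1=function(theta,y)
{
# theta = (mu, log sigma.y, log sigma.b); y has one row per batch
mu = theta[1]; sigma.y = exp(theta[2]); sigma.b = exp(theta[3])
# batch means and number of observations per batch
p.means=apply(y,1,mean); n=dim(y)[2]
# within-batch sum of squares contribution for each batch
like1=-(apply(sweep(y,1,p.means)^2,1,sum))/2/sigma.y^2-n*log(sigma.y)
# contribution of each batch mean, with the random effects integrated out
like2=-(p.means-mu)^2/2/(sigma.y^2/n+sigma.b^2)-.5*log(sigma.y^2/n+sigma.b^2)
# add the log Jacobian of the log transformations of sigma.y and sigma.b
return(sum(like1+like2)+theta[2]+theta[3])
}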
In the following R code, I load the LearnBayes package and read in the function logpostnorm1.R and the Dyes dataset stored in "dyes.txt".
Then I summarize the posterior by use of the laplace function. The mode of (log sigma.y, log sigma.b) is (3.80, 3.79).
> library(LearnBayes)
> source("logpostnorm1.R")
> y=read.table("dyes.txt")
> fit=laplace(logpostnorm1,c(1500,3,3),y)
> fit$mode
[,1] [,2] [,3]
[1,] 1527.5 3.804004 3.787452
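For readers without the two files, here is a self-contained sketch of the same mode search. The data matrix is typed in from the Dyes table above, and I am assuming that base R's optim (with fnscale = -1 so that it maximizes) is an acceptable stand-in for laplace; it is not the LearnBayes function itself and it does not return the normal approximation's variance.

```r
# Dyes data: one row per batch, five yields per batch
y <- matrix(c(1545, 1440, 1440, 1520, 1580,
              1540, 1555, 1490, 1560, 1495,
              1595, 1550, 1605, 1510, 1560,
              1445, 1440, 1595, 1465, 1545,
              1595, 1630, 1515, 1635, 1625,
              1520, 1455, 1450, 1480, 1445),
            nrow = 6, byrow = TRUE)

# Log posterior of (mu, log sigma.y, log sigma.b), as defined above
logpostnorm1 <- function(theta, y) {
  mu <- theta[1]; sigma.y <- exp(theta[2]); sigma.b <- exp(theta[3])
  p.means <- apply(y, 1, mean); n <- dim(y)[2]
  like1 <- -(apply(sweep(y, 1, p.means)^2, 1, sum)) / 2 / sigma.y^2 -
    n * log(sigma.y)
  like2 <- -(p.means - mu)^2 / 2 / (sigma.y^2 / n + sigma.b^2) -
    .5 * log(sigma.y^2 / n + sigma.b^2)
  sum(like1 + like2) + theta[2] + theta[3]
}

# Assumption: optim() replaces laplace(); fnscale = -1 turns the default
# minimizer (Nelder-Mead) into a maximizer. Same starting values as above.
fit.optim <- optim(c(1500, 3, 3), logpostnorm1, y = y,
                   control = list(fnscale = -1))

fit.optim$par            # mode of (mu, log sigma.y, log sigma.b)
exp(fit.optim$par[2:3])  # back-transform to the standard deviation scale
```

Back-transforming the mode reported above, exp(3.804) and exp(3.787) are roughly 44.9 and 44.1, so at the mode the within batch and between batch standard deviations are of similar size.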
Sunday, January 6, 2008
Modeling airline on-time arrival rates
I am beginning to teach a new course on multilevel modeling using a new book by Gelman and Hill.
Here is a simple example of multilevel modeling. The Department of Transportation in May 2007 issued the Air Travel Consumer Report designed to give information to consumers regarding the quality of services of the airlines. For 290 airports across the U.S., this report gives the on-time percentage for arriving flights. Below I've plotted the on-time percentage against the log of the number of flights for these airports.
What do we notice in this figure? There is a lot of variation in the on-time percentages. Also, the variation in the on-time percentages seems to decrease as the number of flights increases.
What explains this variation? There are a couple of causes. First, there are genuine differences in the quality of service at the airports that would cause differences in on-time performance. But also one would expect some natural binomial variability. Even if a particular airport's planes will be on time 80% of the time in the long run, one would expect some variation in the on-time performance of the airport in a short time interval.
In multilevel modeling, we are able to isolate the two types of variation. We are able to model the binomial variability and also model the differences between the true on-time performances of the airports.
To show how the multilevel model estimates behave, I've graphed the estimates in red in the following graph.
I call these multilevel estimates "bayes" in the figure. Note that there are substantial differences between the basic estimates and the multilevel estimates for small airports with a relatively small number of flights.
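To see why the small airports move the most, here is a toy sketch with simulated data. It does not use the Air Travel Consumer Report numbers. I am assuming, purely for illustration, that the true on-time rates follow a Beta(16, 4) distribution (mean 0.8) and that this distribution is known, whereas a real multilevel fit would estimate it from the data. All names and values in the block are hypothetical.

```r
set.seed(2007)

n.airports <- 290
a <- 16; b <- 4    # hypothetical Beta(a, b) for true on-time rates, mean 0.8

# Simulated flight counts spread evenly on the log scale (20 to 20000)
flights <- round(exp(runif(n.airports, log(20), log(20000))))

# True rates differ between airports; observed counts add binomial noise
true.rate <- rbeta(n.airports, a, b)
on.time <- rbinom(n.airports, size = flights, prob = true.rate)

# Basic estimate: the observed on-time proportion
raw <- on.time / flights

# "bayes" estimate: posterior mean under the Beta(a, b) prior, which pulls
# each proportion toward 0.8 with weight (a + b) / (flights + a + b)
bayes <- (on.time + a) / (flights + a + b)

# Average size of the adjustment for small and large airports
shrink <- abs(raw - bayes)
small <- flights < 100
c(small = mean(shrink[small]), large = mean(shrink[!small]))
```

The weight on the prior mean is (a + b) / (flights + a + b), so it is close to one half for an airport with 20 flights and negligible for one with thousands, which is the pattern in the figure.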