Sunday, November 11, 2007

Looking for True Streakiness

There is a lot of interest in streaky behavior in sports. One observes players or teams that appear streaky, with the implicit conclusion that the streakiness says something about the character of the athlete.

Eric Byrnes had 412 opportunities to hit during the 2005 baseball season. Here is his sequence of hits (successes) and outs (failures) during the season.

[1] 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0
[38] 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0
[75] 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0
[112] 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0
[149] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0
[186] 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0
[223] 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1 1 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0
[260] 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
[297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0
[334] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
[371] 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 0 0
[408] 0 0 0 0 0

One way of seeing the streaky behavior in this sequence is by a moving average graph where one plots the success rate (batting average) for windows of 40 at-bats. I wrote a short program mavg.R to compute the moving averages. The following R code plots the moving averages and plots a lowess smooth on top to help see the pattern.

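MAVG=mavg(byrne$x,k=40)
plot(MAVG,type="l",lwd=2,col="red",xlab="GAME",ylab="AVG",
main="ERIC BYRNES")
# lowess smooth on top (assumes MAVG is a plain numeric vector with no NAs)
lines(lowess(MAVG),lwd=2,col="blue")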

We see some interesting patterns. It seems that Byrnes had a cold spell in the first part of the season, followed by a hot period, and then a very cold period.
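For readers who do not have mavg.R, here is a minimal sketch of a 40-at-bat moving average with a lowess smooth on top. The function name moving_avg, the simulated sequence sim, and the hit probability 0.25 are assumptions made for illustration; this is not the mavg.R program used above. The simulated hitter has a constant hit probability, so any hot and cold spells in this plot come from chance alone.

```r
# Hypothetical helper (not the author's mavg.R): success rate over
# every window of k consecutive at-bats, computed from cumulative sums.
moving_avg <- function(x, k) {
  cs <- cumsum(c(0, x))
  (cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]) / k
}

# Simulated hitter with a constant hit probability (0.25 is an assumed value).
set.seed(1)
sim <- rbinom(412, 1, 0.25)
ma <- moving_avg(sim, 40)

# Moving averages in red, lowess smooth in blue.
plot(ma, type = "l", lwd = 2, col = "red",
     xlab = "WINDOW START (AT-BAT)", ylab = "AVG",
     main = "SIMULATED HITTER, CONSTANT P")
lines(lowess(seq_along(ma), ma), lwd = 2, col = "blue")
```

Rerunning this with different seeds gives a sense of how much a moving average can wander when nothing about the hitter is changing.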

The interesting question is: is this streaky pattern "real" or is it just a byproduct of Bernoulli chance variation?

We answer this question by means of a Bayes factor. Suppose we partition Byrnes' 412 at-bats into groups of 20 at-bats. We observe counts y_1, ..., y_n, where y_i is the number of hits in the ith group. Suppose y_i is binomial(20, p_i) where p_i is the probability of a hit in the ith period.

We define two hypotheses:

H (not streaky) the probabilities across periods are equal, p_1 = ... = p_n = p

A (streaky) the probabilities across periods vary according to a beta distribution with mean eta and precision K. This model is indexed by the parameter K.
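To make the comparison of H and A concrete, here is a minimal sketch that computes a log10 Bayes factor for a fixed K by brute-force integration over eta on a grid. It assumes a uniform prior on the common p under H and a uniform prior on eta under A; those priors, the function name log10_bf_grid, and the two toy data sets are my assumptions for illustration, and the result need not match what the LearnBayes functions return.

```r
# Sketch under stated assumptions: uniform prior on p (model H) and on eta (model A).
# y = hits per period, n = at-bats per period, K = precision of the beta distribution.
log10_bf_grid <- function(y, n, K, eta = (seq_len(1000) - 0.5) / 1000) {
  # log marginal likelihood under H (binomial coefficients cancel in the ratio)
  log_mH <- lbeta(sum(y) + 1, sum(n - y) + 1)
  # log of the beta-binomial likelihood under A at each eta, relative to H
  log_f <- sapply(eta, function(e)
    sum(lbeta(K * e + y, K * (1 - e) + n - y) - lbeta(K * e, K * (1 - e))) - log_mH)
  # midpoint rule on (0, 1), done on the log scale for numerical stability
  m <- max(log_f)
  (m + log(mean(exp(log_f - m)))) / log(10)
}

n <- rep(20, 20)
y_streaky <- rep(c(1, 12), 10)  # counts that swing widely between periods
y_flat    <- rep(5, 20)         # identical counts in every period

log10_bf_grid(y_streaky, n, K = 10)  # should come out positive (favors A)
log10_bf_grid(y_flat, n, K = 10)     # should come out negative (favors H)
```

A positive value says the data are more likely under the streaky model A, a negative value says they are more likely under H, and as K grows large the two models become indistinguishable, so the value approaches 0.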

The functions bfexch and laplace in the LearnBayes package can be used to compute a Bayes factor in support of A over H. Here is how we do it.

1. The raw data is in the matrix BRYNE -- the first column contains the data (0's and 1's) and the second column contains the attempts (column of 1's). We regroup the data into periods of 20 at-bats using the regroup function.

regroup(BRYNE, 20)
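As an illustration of what this regrouping does, here is a base R sketch that sums a two-column matrix over consecutive blocks of rows. The name regroup_sketch is hypothetical, and keeping the leftover rows as a shorter final group is my assumption; the regroup function used above may treat a remainder differently. The input is the first 25 values of the Byrnes sequence shown earlier, grouped by 10 so that the result is easy to check by hand.

```r
# Hypothetical stand-in for regroup: sum the columns over consecutive blocks of g rows.
# Assumption: leftover rows form a shorter final group.
regroup_sketch <- function(data, g) {
  grp <- ceiling(seq_len(nrow(data)) / g)
  rowsum(data, grp)
}

# First 25 at-bats from the sequence above: column 1 = hit (0/1), column 2 = attempt (1).
x <- c(0,0,0,1,0,0,1,0,0,0, 1,0,0,0,1,0,0,0,0,0, 1,0,0,1,0)
small <- cbind(x, 1)

out <- regroup_sketch(small, 10)
out  # each row: hits and at-bats in one period
```

Each row of the result is one period: 2 hits in 10 at-bats, 2 hits in 10 at-bats, and 2 hits in the remaining 5 at-bats.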

2. The following R function laplace.int will compute the log (base 10) of the Bayes factor in support of streakiness for a fixed value of log(K).

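# log10 Bayes factor in support of A (streaky) for a fixed value of log(K).
# laplace()$int is the Laplace estimate of the log of the integral of bfexch,
# which we exponentiate and then convert to base 10.
laplace.int=function(logK,data=data1)
log10(exp(laplace(bfexch,0,list(data=data,K=exp(logK)))$int))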

To illustrate, suppose we want to compute the log10 Bayes factor for our data for logK = 3:

> laplace.int(3,regroup(BRYNE,20))
[,1]
[1,] 1.386111

This indicates support for streakiness -- the log10 Bayes factor is 1.39, so the Bayes factor is about 10^1.39, or roughly 24. The data support A over H by a factor of more than 10.

3. Generally we'd like to compute the log10 Bayes factor for a sequence of values of log K. I first write a simple function that does this:

s.laplace.int=function(logK,data)
list(x=logK,y=sapply(logK,laplace.int,data))

and then I use this function to compute the Bayes factor for values of log K from 2 to 6 in steps of 0.2. I use the plot command to graph these values. I draw a line at the value log10 BF = 0 -- this corresponds to the case where neither model is supported.

plot(s.laplace.int(seq(2,6,by=.2),regroup(BRYNE,20)),type="l",
xlab="LOG K", ylab="LOG 10 BAYES FACTOR", lwd=3, col="red", ylim=c(-3,2))
lines(c(1,7),c(0,0),lwd=3,col="blue")
title(main="ERIC BYRNES")

What we see is that, for a range of values of log K, the Bayes factor favors model A by a factor of 10 or more.

Actually we only looked at Eric Byrnes since he exhibited unusually streaky behavior during the 2005 season. What if we look at other players? Here are the Bayes factor graphs for the hitting data of two other players, Chase Utley and Damian Miller (we are grouping the data in the same way).


For both players, note that the log10 Bayes factors are entirely negative over the range of K values, so there is support for the non-streaky model H. One distinctive feature of Bayes factors is that they can provide support for either the null or the alternative hypothesis.
