Last Updated: October 23, 2008
Results are computed for each calendar date, from January 1 to the present. For any chosen date "SimDate", the probability of Obama and McCain winning each individual state is computed from the polls as follows:
Polls for each state are selected from the data available at Pollster.
Only polls that were completed on or before SimDate are included. Polls that were completed after SimDate are ignored.
Polls that do not report the number of respondents "N" are not used. Polls by Zogby Interactive, Columbus Dispatch and certain other pollsters are also excluded due to their questionable or unknown polling practices and a track record of inaccurate results.
Subject to the exceptions listed above (none from Zogby, etc. and must report an "N"), the most recent poll for the state is selected, based on the poll end date.
All polls that were completed within six days of the most recent poll are also included, again subject to the exceptions listed above.
If the number of polls is fewer than 6, the time window is expanded one day at a time until either 6 or more polls are included or the total time span reaches 75 days.
The following dates are considered potential inflection points in the race: the end of the DNC, the end of the RNC, each of the Presidential debates, the Vice-Presidential debate, and the start of the last week of the campaign. If the most recent poll was completed after one of the inflection points, then polls completed before the inflection point are only used when there are fewer than 3 polls after the inflection point.
The last date on which a poll is used is reported in the "Used Thru" column.
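The window-expansion rule above can be sketched as follows. This is a minimal illustration rather than the site's actual code: the function name `select_polls`, the dictionary layout, and the parameter names are hypothetical; it assumes the exclusions (missing "N", excluded pollsters) have already been applied, it reads "time span" as days counted back from the most recent poll, and it leaves out the inflection-point rule.

```python
from datetime import date, timedelta

def select_polls(polls, sim_date, base_window=6, min_polls=6, max_span=75):
    """Return the polls used for sim_date.

    polls: list of dicts, each with an 'end' key holding the poll end date.
    Assumes excluded pollsters and polls without N were removed beforehand.
    """
    # Only polls completed on or before SimDate are eligible.
    eligible = [p for p in polls if p["end"] <= sim_date]
    if not eligible:
        return []
    latest = max(p["end"] for p in eligible)
    window = base_window
    while True:
        cutoff = latest - timedelta(days=window)
        chosen = [p for p in eligible if p["end"] >= cutoff]
        # Stop once enough polls are included or the span limit is reached.
        if len(chosen) >= min_polls or window >= max_span:
            return chosen
        window += 1  # expand the window one day at a time
```

The last "Used Thru" date in this sketch would simply be the `latest` value.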
Let Ni = # of reported respondents in one poll; Pi = % respondents for Obama; Qi = % respondents for McCain.
The polls are first normalized by computing ni=Ni*(Pi+Qi) and pi=Pi/(Pi+Qi). Note that qi=1-pi and does not need to be computed.
A weighting factor is determined for each poll based on the poll age relative to the most recent poll. If the poll is no more than 5 days older than the most recent poll, its weighting factor is unity. Older polls are weighted to have a half-life of 21 days according to the following formula: wi = 0.5^((a-5)/21) where a is the number of days the poll is older than the most recent. In addition, if the included polls straddle an inflection point, then the polls prior to the inflection point are weighted by an additional factor of 2/3 when there is one poll after the inflection point, or an additional factor of 1/3 when there are two polls after the inflection point.
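As an illustration of the normalization and age weighting just described, here is a short sketch. The function names and arguments are hypothetical, and it assumes the Obama and McCain shares are expressed as fractions (0.48 rather than 48).

```python
def normalize(N, P, Q):
    """Return (n_i, p_i): two-party sample size and Obama's two-party share.

    Assumes P and Q are fractions of respondents, e.g. 0.48 and 0.44.
    """
    two_party = P + Q
    return N * two_party, P / two_party

def poll_weight(age_days, before_inflection=False, polls_after=0):
    """Weight for a poll that is age_days older than the most recent poll."""
    # Full weight up to 5 days, then a 21-day half-life.
    w = 1.0 if age_days <= 5 else 0.5 ** ((age_days - 5) / 21)
    # Extra discount for polls that predate a straddled inflection point.
    if before_inflection:
        if polls_after == 1:
            w *= 2 / 3
        elif polls_after == 2:
            w *= 1 / 3
    return w
```

For example, a poll 26 days older than the most recent one gets a weight of 0.5, and a poll 47 days older gets 0.25.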
The following weighted statistics are computed. Let nn = # polls averaged. N=sum(wi*Ni); n=sum(wi*ni); p=sum(wi*ni*pi)/n; se^2=p*(1-p)/n; pd^2=sum(wi*ni*(pi-p)^2)/n*nn/(nn-1). p represents the weighted average share of respondents favoring Obama out of the total respondents favoring either Obama or McCain. This is an unbiased estimator of the same quantity in the population as a whole. "se" represents the sampling error. It forms a lower bound on the estimate variation. "pd^2" represents the weighted variance and is a good (and possibly unbiased) estimator of the poll variance. The poll variance includes both the sample variance and any other unbiased errors. These figures are reported on the state details page as "Polls" = nn; "Sum(wN)" = N; "McCain" = (1-p)*N/n; "Obama" = p*N/n; "Std_Err" = se; and "Poll_Dev" = pd.
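The weighted statistics can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: the function name and tuple layout are hypothetical, the divisor in the pd^2 expression is taken to be n (the weighted two-party sample size), and pd is set to zero when only one poll is available, a case the text does not cover.

```python
import math

def weighted_stats(polls):
    """polls: list of (w_i, N_i, n_i, p_i) tuples for the included polls."""
    nn = len(polls)
    N = sum(w * Ni for w, Ni, ni, pi in polls)
    n = sum(w * ni for w, Ni, ni, pi in polls)
    p = sum(w * ni * pi for w, Ni, ni, pi in polls) / n
    se = math.sqrt(p * (1 - p) / n)
    if nn > 1:
        # Weighted variance with the nn/(nn-1) small-sample correction.
        # Assumption: the divisor is n = sum(w_i * n_i).
        pd2 = sum(w * ni * (pi - p) ** 2 for w, Ni, ni, pi in polls) / n
        pd = math.sqrt(pd2 * nn / (nn - 1))
    else:
        pd = 0.0  # assumption: no spread can be estimated from a single poll
    return {"nn": nn, "N": N, "n": n, "p": p, "se": se, "pd": pd}
```

With two equally weighted polls of 1000 respondents each, 900 of them two-party respondents, and Obama shares of 0.50 and 0.54, this gives p = 0.52, se of about 0.0118 and pd of about 0.0283.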
The probability of each candidate winning the state is computed as P(Obama)=cdf((p-q)/nd) and P(McCain)=1-P(Obama), where cdf is the cumulative normal distribution function, q=1-p, nd=sqrt((2*sd)^2 + 0.0175^2), and sd=max(se,min(pd,0.04)). This measures the probability that p (respondents for Obama) is greater than q (respondents for McCain). The standard deviation in p is estimated by sd. In computing sd, the poll deviation pd is capped at 0.04, since larger values tend to be caused by actual shifts in opinion rather than polling errors. The standard deviation in the difference p-q is 2*sd, since p-q=p-(1-p)=2p-1. The figure of 0.0175 represents the biased error in the poll results and is estimated by computing the amount by which the Real Clear Politics 2004 Battleground State poll averages differed from the actual results. Because the unbiased and biased errors are independent, their variances add. These figures are reported on the state details page as "Net_Dev" = nd and "P(Obama)" = P(Obama).
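A short sketch of this probability calculation follows. The function names are hypothetical. The sketch divides p-q by nd, the net deviation that combines 2*sd with the 0.0175 bias term, which is how it reads the description above.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def win_probability(p, se, pd, pd_cap=0.04, bias=0.0175):
    """Return (P(Obama), nd) from the weighted statistics p, se and pd."""
    sd = max(se, min(pd, pd_cap))             # poll deviation capped at 0.04
    nd = math.sqrt((2 * sd) ** 2 + bias ** 2)  # independent variances add
    q = 1 - p
    return normal_cdf((p - q) / nd), nd
```

For p = 0.52, se = 0.0118 and pd = 0.0283, this gives a net deviation of about 0.059 and P(Obama) of roughly 0.75.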
Note that the above methodology works well for real-world poll data. In states where the race is competitive, pollsters tend to do a lot of polling, providing many recent polls that can be averaged. For less competitive states, there are fewer polls, and therefore it is necessary to use polling data that goes further back. The polls are weighted based on their age relative to the most recent poll (rather than their age relative to SimDate) so that the results do not change over time when there are no polls.
The above methodology works well except for one "pathological" case. Suppose, for example, that McCain is ahead by a reasonable margin, and the computed probabilities of winning the state are McCain 80%, Obama 20%. A new poll comes out that shows McCain ahead by an even larger margin than the prior polls. This new poll causes the average poll results for McCain to improve. However, since it differs from the prior polls (it is essentially an outlier), it also increases the poll variance, resulting in a mathematical increase in the poll uncertainty and a decrease in the estimated probability of McCain winning. This is a non-intuitive result: a poll that is even better for the candidate than all of the prior polls actually causes his computed probability of winning the state to drop. This may happen because the new poll is in fact an outlier, or it may happen when there has been an actual shift in public opinion that has not yet been captured in enough recent polls to be clear.
In order to avoid this result, each of the polls that make up the average is examined individually. If the poll result should only help the leading candidate (it is above the average), the probabilities of the candidates are recomputed without that poll (it is dropped from both the average and the variance). If including the poll decreases rather than increases the leading candidate's probability, then the poll is dropped from the computations on that date. If there is more than one outlier, then only one is dropped at a time and the outlier check is repeated with the new probabilities. The total number of data points dropped is reported on the state details page as "Dropped". The reported "Used Thru" dates of the polls are not changed.
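The outlier check can be sketched as a loop that removes one poll at a time. This is a rough illustration under several assumptions: the function names and the (w, n, p) tuple layout are hypothetical, a poll is dropped when the leader's probability is higher without it than with it, and when several polls qualify the first one in list order is dropped. The helper repeats the statistics and probability steps described earlier so the block is self-contained.

```python
import math

def p_obama(polls):
    """polls: list of (w, n, p). Returns (weighted p, P(Obama))."""
    nn = len(polls)
    n = sum(w * ni for w, ni, pi in polls)
    p = sum(w * ni * pi for w, ni, pi in polls) / n
    se = math.sqrt(p * (1 - p) / n)
    pd = 0.0
    if nn > 1:
        var = sum(w * ni * (pi - p) ** 2 for w, ni, pi in polls) / n
        pd = math.sqrt(var * nn / (nn - 1))
    sd = max(se, min(pd, 0.04))
    nd = math.sqrt((2 * sd) ** 2 + 0.0175 ** 2)
    return p, 0.5 * (1 + math.erf((2 * p - 1) / nd / math.sqrt(2)))

def drop_outliers(polls):
    """Return (kept polls, number dropped)."""
    polls = list(polls)
    dropped = 0
    while len(polls) > 1:
        p, prob = p_obama(polls)
        obama_leads = p >= 0.5
        lead_prob = prob if obama_leads else 1 - prob
        for i, (w, ni, pi) in enumerate(polls):
            # Only polls that are better for the leader than the average.
            helps_leader = pi > p if obama_leads else pi < p
            if not helps_leader:
                continue
            rest = polls[:i] + polls[i + 1:]
            _, prob_rest = p_obama(rest)
            lead_rest = prob_rest if obama_leads else 1 - prob_rest
            if lead_rest > lead_prob:  # the poll hurt the leader: drop it
                polls = rest
                dropped += 1
                break
        else:
            break  # no poll qualified, so the check is finished
    return polls, dropped
```

With four polls clustered around an Obama share of 0.47 and a fifth at 0.40, the fifth poll is dropped, because McCain's computed probability is higher without it.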
The winner in each state is determined by a random number generator. The potential for a systematic error in the poll averages is modeled by adding a state-to-state correlation of 0.10 to the random number generator using the formula Ri = r0 * corr + r1i * (1 - corr), where r0 and r1i are both uniform random numbers from [0,1], r0 is the same for all states in each simulation iteration, r1i is unique for each state, and corr = 0.2437. Ri is compared to PAi = F(pi), where pi is the probability of Obama winning the state and F is the inverse of the convolution of the joint probability density of r0 and r1i. The function F(pi) is necessary to keep the probability of Obama winning the state equal to pi.
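One way to realize this step is sketched below. It is an interpretation, not the original code: the function names are hypothetical, F is taken to be the inverse of the cumulative distribution of Ri (Ri is a weighted sum of two uniform variables, so its density is a trapezoid), and Obama is assumed to win a state when Ri falls below F(pi). Here pi means the state win probability, as in the paragraph above.

```python
import math
import random

CORR = 0.2437  # mixing weight from the text

def threshold(prob, corr=CORR):
    """Inverse CDF of R = corr*r0 + (1-corr)*r1 for uniform r0, r1 on [0,1]."""
    a, b = min(corr, 1 - corr), max(corr, 1 - corr)
    edge = a / (2 * b)  # probability mass in each sloped tail of the trapezoid
    if prob <= edge:
        return math.sqrt(2 * a * b * prob)
    if prob >= 1 - edge:
        return 1 - math.sqrt(2 * a * b * (1 - prob))
    return prob * b + a / 2  # flat middle section

def simulate_once(state_probs, rng, corr=CORR):
    """state_probs: {state: P(Obama)}. Returns {state: True if Obama wins}."""
    r0 = rng.random()  # shared by all states in this iteration
    result = {}
    for state, prob in state_probs.items():
        r = corr * r0 + (1 - corr) * rng.random()
        # Assumption: Obama wins when R falls below the threshold F(prob).
        result[state] = r < threshold(prob, corr)
    return result
```

Because the threshold is the inverse CDF of Ri, each state is still won by Obama with probability pi, while the shared r0 term makes states move together within an iteration.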
The BattleRank is equal to P[McCain wins state] * (P[McCain wins election when McCain wins state] - P[McCain wins election]) + P[Obama wins state] * (P[Obama wins election when Obama wins state] - P[Obama wins election]).
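As a worked illustration of the BattleRank formula, the sketch below estimates each probability by counting simulation iterations. The source does not say how these probabilities are obtained, so the counting approach, the function name, and the input layout are all assumptions; tied elections are ignored for simplicity.

```python
def battle_rank(outcomes):
    """outcomes: list of (obama_wins_state, obama_wins_election) booleans,
    one pair per simulation iteration."""
    total = len(outcomes)
    rank = 0.0
    for is_obama in (True, False):  # False stands for McCain
        # Election results in the iterations where this candidate won the state.
        elections = [e for s, e in outcomes if s == is_obama]
        if not elections:
            continue
        p_state = len(elections) / total
        p_election = sum(1 for s, e in outcomes if e == is_obama) / total
        p_conditional = sum(1 for e in elections if e == is_obama) / len(elections)
        rank += p_state * (p_conditional - p_election)
    return rank
```

If the state result tells nothing about the election result, the rank is 0. If the state always goes to the election winner and each candidate wins half the time, it is 0.5.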
The interactive simulation uses the same methodology described above.