11 Statistical definition of reliability
In CTT, reliability is defined as the proportion of variability in \(X\) that is due to variability in true scores \(T\):
\[\begin{equation} r = \frac{\sigma^2_T}{\sigma^2_X}. \tag{11.1} \end{equation}\]
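For intuition, Equation (11.1) can be demonstrated with simulated data. The sketch below uses made-up population values (a true score SD of 10 and an error SD of 5, not estimates from PISA09), so the variance ratio should land near \(10^2/(10^2 + 5^2) = 0.80\).
# Simulate true scores and independent errors, then
# combine them into observed scores X = T + E
set.seed(42)
t_true <- rnorm(10000, mean = 50, sd = 10)
e_obs <- rnorm(10000, mean = 0, sd = 5)
x_obs <- t_true + e_obs
# Proportion of observed variance due to true scores,
# which should come out close to 0.80
var(t_true) / var(x_obs)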
11.1 Estimating reliability
One indirect estimate made possible by CTT is the correlation between scores on two forms of the same test:
\[\begin{equation} r = \rho_{X_1 X_2} = \frac{\sigma_{X_1 X_2}}{\sigma_{X_1} \sigma_{X_2}}. \tag{11.2} \end{equation}\]
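The same simulation idea applies to Equation (11.2). Two hypothetical parallel forms share the same true scores but have independent errors, so their correlation should recover roughly the same 0.80 as above (again, illustrative values rather than PISA09 estimates).
# Two parallel forms: same true scores, independent errors
set.seed(42)
t_true <- rnorm(10000, mean = 50, sd = 10)
x_form1 <- t_true + rnorm(10000, mean = 0, sd = 5)
x_form2 <- t_true + rnorm(10000, mean = 0, sd = 5)
# The between-form correlation estimates reliability
cor(x_form1, x_form2)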
11.2 Split-half method
The split-half method takes scores on a single test form and separates them into scores on two halves of the test, which are treated as separate test forms. The correlation between these two halves then represents an indirect estimate of reliability, based on Equation (11.2). Because each half is only half as long as the full test, however, this correlation underestimates the reliability of the full-length form, a problem addressed by the Spearman-Brown formula below.
# Split half correlation, assuming we only had scores on
# one test form
# With an odd number of reading items, one half has 5
# items and the other has 6
library(epmr)
xsplit1 <- rowSums(PISA09[PISA09$cnt == "BEL",
  rsitems[1:5]])
xsplit2 <- rowSums(PISA09[PISA09$cnt == "BEL",
  rsitems[6:11]])
cor(xsplit1, xsplit2, use = "complete")
#> [1] 0.625
11.3 Spearman-Brown
The Spearman-Brown formula was originally used to correct for the reduction in reliability that occurred when correlating two test forms that were only half the length of the original test. In theory, reliability will increase as we add items to a test. Thus, Spearman-Brown is used to estimate, or predict, what the reliability would be if the half-length tests were made into full-length tests.
# sb_r() in the epmr package uses the Spearman-Brown
# formula to estimate how reliability would change when
# test length changes by a factor k
# If test length were doubled, k would be 2
sb_r(r = cor(xsplit1, xsplit2, use = "complete"), k = 2)
#> [1] 0.769
The Spearman-Brown reliability, \(r_{new}\), is estimated as a function of what’s labeled here as the old reliability, \(r_{old}\), and the factor by which the length of \(X\) is predicted to change, \(k\):
\[\begin{equation} r_{new} = \frac{kr_{old}}{(k - 1)r_{old} + 1}. \tag{11.3} \end{equation}\]
Again, \(k\) is the factor by which the test length is increased or decreased. It is equal to the number of items in the new test divided by the number of items in the original test. Multiply \(k\) by the old reliability, and then divide the result by \((k - 1)\) times the old reliability, plus 1. The epmr package contains sb_r(), a simple function for estimating the Spearman-Brown reliability.
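As a quick check on the arithmetic, Equation (11.3) can be applied directly to the split-half correlation from above; the result matches the sb_r() output.
# Spearman-Brown by hand, with r_old as the split-half
# correlation and k = 2 for a test twice as long
r_old <- cor(xsplit1, xsplit2, use = "complete")
(2 * r_old) / ((2 - 1) * r_old + 1)
#> [1] 0.769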
11.4 Standard error of measurement
Typically, we’re more interested in expressing the unreliability of a test on the scale of the observed scores. The SEM estimates the standard deviation of the measurement error in \(X\): we multiply the standard deviation of \(X\) by the square root of the unreliable proportion of variance to obtain the SEM:
\[\begin{equation} SEM = \sigma_X\sqrt{1 - r}. \tag{11.4} \end{equation}\]
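As a sketch of Equation (11.4), suppose a reliability of 0.76 and an observed-score SD of 3; both are made-up values, not the Belgium estimates used below.
# SEM by hand for hypothetical values of r and SD(X)
r_hyp <- 0.76
sd_x <- 3
sd_x * sqrt(1 - r_hyp)
#> [1] 1.47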
Confidence intervals for PISA09 can be estimated in the same way. First, we choose a measure of reliability, find the SD of observed scores, and obtain the corresponding SEM. Then, we can find the CI, which gives us the expected amount of uncertainty in our observed scores due to random measurement error. Here, we’re calculating SEM and the CI using alpha, but other reliability estimates would work as well. Figure 11.1 shows the 11 possible PISA09 reading scores in order, with error bars based on SEM for students in Belgium.
# Get alpha and SEM for students in Belgium
bela <- coef_alpha(PISA09[PISA09$cnt == "BEL", rsitems])
bela_alpha <- bela$alpha
# The sem function from epmr also overlaps with sem from
# another package so we're spelling it out here in long
# form
belsem <- epmr::sem(r = bela_alpha, sd = sd(scores$x1,
  na.rm = TRUE))
# Plot the 11 possible total scores against themselves
# Error bars are shown for 1 SEM, giving a 68% confidence
# interval and 2 SEM, giving the 95% confidence interval
# x is converted to factor to show discrete values on the
# x-axis
library(ggplot2)
beldat <- data.frame(x = 1:11, sem = belsem)
ggplot(beldat, aes(factor(x), x)) +
geom_errorbar(aes(ymin = x - sem * 2,
ymax = x + sem * 2), col = "violet") +
geom_errorbar(aes(ymin = x - sem, ymax = x + sem),
col = "yellow") +
geom_point()
Figure 11.1 helps us visualize the impact of unreliable measurement on score comparisons. For example, note that the top of the 95% confidence interval for an \(X\) of 2 extends nearly to 5 points, and thus overlaps with the CIs for the adjacent scores 3 through 7. It isn’t until an \(X\) of 8 that the CIs no longer overlap. With an SEM of 1.425 (belsem), we’re 95% confident that students whose observed scores differ by at least 5.7 points (belsem * 4, the full width of the 95% confidence interval) have different true scores. Students with observed scores closer than 5.7 points may actually have the same true scores.
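To see where these numbers come from, the interval bounds for a single score can be computed directly from belsem. The sketch below uses an observed score of 2, as in the example above.
# 95% CI for an observed score of 2, using 2 SEM on each
# side; the upper bound reaches nearly 5 points
c(2 - belsem * 2, 2 + belsem * 2)
# Full width of the 95% CI, about 5.7 points
belsem * 4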