12 Interpreting reliability and unreliability
There are no agreed-upon standards for interpreting reliability coefficients. Reliability is bounded by 0 on the lower end and 1 on the upper end because, by definition, the amount of true variability can never be less than zero or more than the total available variability in \(X\). Higher reliability is clearly better, but cutoffs for acceptable levels of reliability vary for different fields, situations, and types of tests. The stakes of a test are an important consideration when interpreting reliability coefficients. The higher the stakes, the higher we expect reliability to be. Otherwise, cutoffs depend on the particular application.
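The bounds can be sketched with the classical test theory decomposition. The symbols \(T\) (true score), \(E\) (error), and \(\rho_{XX'}\) (reliability) are notation assumed here for illustration, with \(E\) taken to be uncorrelated with \(T\):

\[
\sigma^2_X = \sigma^2_T + \sigma^2_E,
\qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X},
\qquad
0 \leq \sigma^2_T \leq \sigma^2_X \;\Rightarrow\; 0 \leq \rho_{XX'} \leq 1 .
\]

Under these assumptions, reliability is 0 when all observed variability is error and 1 when none of it is.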
In general, reliabilities for educational and psychological tests can be interpreted using scales like the ones presented in Table 12.1. With medium-stakes tests, a reliability of 0.70 is sometimes considered minimally acceptable, 0.80 is decent, 0.90 is quite good, and anything above 0.90 is excellent. High-stakes tests should have reliabilities at or above 0.90. Low-stakes tests, which are often simpler and shorter than higher-stakes ones, often have reliabilities as low as 0.70. These are general guidelines, and interpretations can vary considerably by test. Remember that the cognitive measures in PISA would be considered low-stakes at the student level.
A few additional considerations are necessary when interpreting coefficient alpha. First, alpha assumes that all items measure the same single construct. Items are also assumed to be equally related to this construct, that is, to be parallel measures of it. When the items are not parallel measures of the construct, alpha is considered a lower-bound estimate of reliability, meaning the true reliability for the test is expected to be higher than alpha indicates. Finally, although it is frequently claimed that a strong coefficient alpha supports the unidimensionality of a measure, alpha does not index dimensionality. It is impacted by the extent to which all of the test items measure a single construct, but it does not necessarily go up or down as a test becomes more or less unidimensional.
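The point about dimensionality can be illustrated with a small sketch. It uses the standard covariance-based formula for alpha, \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma^2_i}{\sigma^2_X}\right)\), applied to two hypothetical correlation matrices. The function names and the matrices are assumptions made up for this example, not data from any real test.

```python
def coefficient_alpha(cov):
    """Coefficient alpha from a k x k item covariance (or correlation) matrix."""
    k = len(cov)
    if k < 2:
        raise ValueError("alpha requires at least two items")
    item_var = sum(cov[i][i] for i in range(k))   # sum of item variances
    total_var = sum(sum(row) for row in cov)      # variance of the total score
    return k / (k - 1) * (1 - item_var / total_var)


def block_matrix(sizes, within, between=0.0):
    """Hypothetical correlation matrix with one cluster of items per entry in sizes.

    Items in the same cluster correlate `within`; items in different
    clusters correlate `between`.
    """
    labels = [c for c, n in enumerate(sizes) for _ in range(n)]
    k = len(labels)
    return [
        [
            1.0 if i == j else (within if labels[i] == labels[j] else between)
            for j in range(k)
        ]
        for i in range(k)
    ]


# Six items measuring one construct, all correlated 0.3.
one_dim = block_matrix([6], within=0.3)

# Twelve items in two unrelated clusters of six, correlated 0.6 within a cluster.
two_dim = block_matrix([6, 6], within=0.6, between=0.0)

alpha_one = coefficient_alpha(one_dim)
alpha_two = coefficient_alpha(two_dim)

print(round(alpha_one, 2))  # 0.72
print(round(alpha_two, 2))  # 0.82
```

In this constructed case the clearly two-dimensional test has the higher alpha, so a strong alpha on its own does not show that a test is unidimensional.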
Reliability | High Stakes Interpretation | Low Stakes Interpretation |
---|---|---|
\(r \geq 0.90\) | Excellent | Excellent |
\(0.80 \leq r < 0.90\) | Good | Excellent |
\(0.70 \leq r < 0.80\) | Acceptable | Good |
\(0.60 \leq r < 0.70\) | Borderline | Acceptable |
\(0.50 \leq r < 0.60\) | Low | Borderline |
\(0.20 \leq r < 0.50\) | Unacceptable | Low |
\(0.00 \leq r < 0.20\) | Unacceptable | Unacceptable |
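As a minimal sketch, the bands in Table 12.1 can be encoded as a lookup. The function name `interpret_reliability` and the `stakes` argument are hypothetical, chosen for this example, and the labels are only the general guidelines from the table.

```python
# Rows of Table 12.1 as (lower bound, high-stakes label, low-stakes label),
# ordered from the highest band to the lowest.
BANDS = [
    (0.90, "Excellent", "Excellent"),
    (0.80, "Good", "Excellent"),
    (0.70, "Acceptable", "Good"),
    (0.60, "Borderline", "Acceptable"),
    (0.50, "Low", "Borderline"),
    (0.20, "Unacceptable", "Low"),
    (0.00, "Unacceptable", "Unacceptable"),
]


def interpret_reliability(r, stakes="high"):
    """Return the Table 12.1 label for reliability r under the given stakes."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if stakes not in ("high", "low"):
        raise ValueError("stakes must be 'high' or 'low'")
    column = 1 if stakes == "high" else 2
    for band in BANDS:
        if r >= band[0]:          # first band whose lower bound r reaches
            return band[column]


print(interpret_reliability(0.85, "high"))  # Good
print(interpret_reliability(0.85, "low"))   # Excellent
```

The same coefficient can read differently depending on the stakes, which is why the stakes are passed explicitly rather than assumed.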