17 Criterion validity

17.1 Definition

Criterion validity is the degree to which test scores correlate with, predict, or inform decisions regarding another measure or outcome. If you think of content validity as the extent to which a test corresponds to the content domain, criterion validity is similar, except that it is the extent to which a test correlates with or corresponds to another measure. So, in content validity we compare our test to the content domain, hoping for a strong relationship, and in criterion validity we compare our test to a criterion variable, again hoping for a strong relationship.

Validity by association

The key word in this definition of criterion validity is correlate, which is synonymous with relate or predict. The assumption here is that the construct we are hoping to measure with our test is known to be measured well by another test or observed variable. This other test or variable is often referred to as a “gold standard,” a label presumably given to it because it is based on strong validity evidence. So, in a way, criterion validity is a form of validity by association. If our test correlates with a known measure of the construct, we can be more confident that our test measures the same construct.

The equation for a validity coefficient is the same as the equations for correlation that we encountered in previous modules. Here we denote our test as \(X\) and the criterion variable as \(Y\). The validity coefficient is the correlation between the two, which can be obtained as the covariance divided by the product of the individual standard deviations.

\[\begin{equation} \rho_{XY} = \frac{\sigma_{XY}}{\sigma_{X}\sigma_{Y}} \tag{17.1} \end{equation}\]
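To make Equation (17.1) concrete, here is a minimal sketch in R with simulated scores; the variables x and y below are purely illustrative and are not part of PISA09. Dividing the covariance by the product of the standard deviations returns the same value as the built-in cor() function.

# Simulate scores on a test x and a related criterion y
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)
y <- 0.6 * x + rnorm(100, mean = 0, sd = 8)
# Validity coefficient by hand, per Equation (17.1)
cov(x, y) / (sd(x) * sd(y))
# Same result from the built-in correlation function
cor(x, y)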

Criterion validity is sometimes distinguished further as concurrent validity, where our test and the criterion are administered at about the same time, or predictive validity, where our test is administered first and is then used to predict the criterion measured later. The distinction rests on when the criterion is obtained relative to our test and on whether scores from our test are intended to be used for prediction.

Criterion validity is limited because it does not actually require that our test be a reasonable measure of the construct, only that it relate strongly with another measure of the construct. Nunnally and Bernstein (1994) clarify this point with a hypothetical example:

If it were found that accuracy in horseshoe pitching correlated highly with success in college, horseshoe pitching would be a valid measure for predicting success in college.

The scenario is silly, but it highlights the fact that, on its own, criterion validity is insufficient. The take-home message is that you should never use or trust a criterion relationship as your sole source of validity evidence.

There are two other challenges associated with criterion validity. First, finding a suitable criterion can be difficult, especially if your test targets a new or poorly defined construct. Second, a correlation coefficient is attenuated, or reduced in strength, by any unreliability present in the two measures being correlated. So, if your test and the criterion test are unreliable, a low validity coefficient (the correlation between the two tests) does not necessarily represent a lack of relationship between the two tests. It may instead reflect a lack of reliable information with which to estimate the criterion validity coefficient.

Attenuation

Here’s a demonstration of how attenuation works, based on PISA09. Suppose we find a gold standard criterion measure of reading ability and administer it to the students in the US who took the reading items in PISA09. First, we calculate a total score on the PISA09 reading items, then we compare it to simulated scores on our criterion test. Scores have been simulated to correlate at 0.80.

# Get the vector of reading items names
ritems <- c("r414q02", "r414q11", "r414q06", "r414q09", 
  "r452q03", "r452q04", "r452q06", "r452q07", "r458q01", 
  "r458q07", "r458q04")
rsitems <- paste0(ritems, "s")
# Calculate total reading scores
pisausa <- PISA09[PISA09$cnt == "USA", rsitems]
rtotal <- rowSums(pisausa, na.rm = TRUE)
# Simulate a criterion
# using rsim
criterion <- rsim(rho = .8, x = rtotal, meany = 24,
  sdy = 6)
# Check the correlation
cor(rtotal, criterion$y)
#> [1] 0.804

Suppose the internal consistency reliability for our criterion is 0.86. We know from Module 38.3 that internal consistency for the PISA09 reading items is about 0.77. With a simple formula, we can estimate what the validity coefficient would be for our two measures if each were perfectly reliable. Here, we denote this disattenuated correlation as the correlation between true scores on \(X\) and \(Y\); it is obtained by dividing the observed correlation by the square root of the product of the two reliabilities, denoted \(\rho_{X}\) and \(\rho_{Y}\).

\[\begin{equation} \rho_{T_X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_{X}\rho_{Y}}} \tag{17.2} \end{equation}\]

Correcting for attenuation due to measurement error produces a validity coefficient of 0.99. This is a noteworthy increase from the original correlation of 0.80.

# Internal consistency for the PISA items
epmr::coef_alpha(pisausa)
#> $alpha
#> [1] 0.774
#> 
#> $q
#> NULL
#> 
#> $se
#> NULL
#> 
#> $ci
#> NULL
#> 
#> $sigma
#>          r414q02s r414q11s r414q06s r414q09s r452q03s r452q04s r452q06s
#> r414q02s   0.2499   0.0464   0.0777   0.0560   0.0325   0.0492   0.0736
#> r414q11s   0.0464   0.2454   0.0526   0.0359   0.0165   0.0287   0.0421
#> r414q06s   0.0777   0.0526   0.2492   0.0712   0.0299   0.0696   0.0971
#> r414q09s   0.0560   0.0359   0.0712   0.2112   0.0210   0.0447   0.0747
#> r452q03s   0.0325   0.0165   0.0299   0.0210   0.1092   0.0257   0.0393
#> r452q04s   0.0492   0.0287   0.0696   0.0447   0.0257   0.2371   0.0777
#> r452q06s   0.0736   0.0421   0.0971   0.0747   0.0393   0.0777   0.2486
#> r452q07s   0.0740   0.0489   0.0820   0.0506   0.0431   0.0648   0.0831
#> r458q01s   0.0627   0.0356   0.0805   0.0524   0.0325   0.0554   0.0728
#> r458q07s   0.0634   0.0395   0.0856   0.0686   0.0304   0.0628   0.0825
#> r458q04s   0.0532   0.0396   0.0505   0.0347   0.0246   0.0497   0.0703
#>          r452q07s r458q01s r458q07s r458q04s
#> r414q02s   0.0740   0.0627   0.0634   0.0532
#> r414q11s   0.0489   0.0356   0.0395   0.0396
#> r414q06s   0.0820   0.0805   0.0856   0.0505
#> r414q09s   0.0506   0.0524   0.0686   0.0347
#> r452q03s   0.0431   0.0325   0.0304   0.0246
#> r452q04s   0.0648   0.0554   0.0628   0.0497
#> r452q06s   0.0831   0.0728   0.0825   0.0703
#> r452q07s   0.2466   0.0649   0.0678   0.0508
#> r458q01s   0.0649   0.2478   0.0765   0.0462
#> r458q07s   0.0678   0.0765   0.2410   0.0512
#> r458q04s   0.0508   0.0462   0.0512   0.2501
#> 
#> $n
#> [1] 1611
#> 
#> $ni
#> [1] 11
# Correction for attenuation
cor(rtotal, criterion$y)/sqrt(.77 * .86)
#> [1] 0.988
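For reuse, the correction in Equation (17.2) can be wrapped in a small helper function. This is just a sketch, and disattenuate() is not a function in epmr; it simply takes the observed correlation and the two reliability estimates.

# Correction for attenuation as a reusable function (a sketch,
# not part of epmr)
disattenuate <- function(rxy, rxx, ryy) {
  rxy / sqrt(rxx * ryy)
}
# Reproduces the corrected coefficient of about 0.99 from above
disattenuate(rxy = 0.804, rxx = 0.77, ryy = 0.86)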

In summary, the steps for establishing criterion validity evidence are relatively simple. After defining the purpose of the test, a suitable criterion is identified. The two tests are administered to the same sample of individuals from the target population, and a correlation is obtained. If reliability estimates are available, we can then estimate the disattenuated coefficient, as shown above.

Note that a variety of other statistics are available for establishing the predictive power of a test \(X\) for a criterion variable \(Y\). Two popular examples are regression models, which provide more detailed information about the bivariate relationship between our test and criterion, and contingency tables, which describe predictions in terms of categorical or ordinal outcomes. In each case, criterion validity can be maximized by writing items for our test that predict or correlate with the criterion.
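As a rough sketch of the regression approach, we could regress the simulated criterion from the attenuation demonstration on the PISA09 reading total scores; crmod below is simply an illustrative model object, reusing rtotal and criterion from above. The slope and intercept describe the prediction equation, and the R-squared is the squared validity coefficient.

# Regress the simulated criterion on the PISA09 reading total scores
# (reusing rtotal and criterion from the attenuation example)
crmod <- lm(criterion$y ~ rtotal)
# Intercept and slope for predicting the criterion from our test
coef(crmod)
# Proportion of criterion variance explained by our test
summary(crmod)$r.squared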

17.2 Criterion examples

A popular example of criterion validity is the GRE, which has come up numerous times in this book. The GRE is designed to predict performance in graduate school. Admissions programs use it as one indicator of how well you are likely to do as a graduate student. Given this purpose, what is a suitable criterion variable that the GRE should predict? And how strong of a correlation would you expect to see between the GRE and this graduate performance variable?

The simplest criterion for establishing criterion-related validity evidence for the GRE would be some measure of performance or achievement in graduate school. First-year graduate GPA is a common choice. The GRE has been shown across numerous studies to correlate around 0.30 with first-year graduate GPA. A correlation of 0.30 is evidence of a small positive relationship. Because the squared correlation gives the proportion of variance explained, it also tells us that most of the variability in GPA, our criterion, is not predicted by the GRE (\(1 - 0.30^2 = 0.91\), or 91%, to be precise). In other words, many students score high or low on the GRE and do not have a similarly high or low graduate GPA.

Although this modest correlation may at first seem disappointing, a few different considerations suggest that it is actually pretty impressive. First, GPA is likely not a reliable measure of graduate performance. It’s hardly a “gold standard.” Instead, it’s the best we have. It’s one of the few quantitative measures available for all graduate students. The correlation of 0.30 is likely attenuated due to, at least, measurement error in the criterion. Second, there is likely some restriction of range happening in the relationship between GRE and GPA. People who score low on the GRE are less likely to get into graduate school, so their data are not represented. Restriction of range tends to reduce correlation coefficients. Third, what other measure of pre-graduate school performance correlates at 0.30 with graduate GPA? More importantly, what other measure of pre-graduate school performance that only takes a few hours to obtain correlates at 0.30 with graduate GPA? In conclusion, the GRE isn’t perfect, but as far as standardized predictors go, it’s currently the best we’ve got. In practice, admissions programs need to make sure they don’t rely too much on it in admissions decisions, as discussed in Module 36.
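To get a feel for the second consideration, restriction of range, we can revisit the simulated data from the attenuation demonstration earlier in this module. In the sketch below, only examinees above the median on the predictor are retained, as if lower scorers were never admitted; this will tend to shrink the observed correlation relative to the full-sample value of 0.80.

# Restriction of range: keep only cases above the median on the
# predictor, as if low scorers were never admitted
keep <- rtotal > median(rtotal)
# The correlation in the restricted group is typically smaller
cor(rtotal[keep], criterion$y[keep])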

Note that a substantial amount of research has been conducted documenting predictive validity evidence for the GRE. See Kuncel, Hezlett, and Ones (2001) for a meta-analysis of results from this literature.