27 Preparing for item analysis

27.1 Item quality

As noted above, item analysis lets us examine the quality of individual test items. Information about individual item quality can help us determine whether an item is measuring the content and construct that it was written to measure, and whether it is doing so at the appropriate ability level. Because we are discussing item analysis here in the context of CTT, we’ll assume that there is a single construct of interest, perhaps being assessed across multiple related content areas, and that individual items can contribute to or detract from our measurement of that construct by limiting or introducing construct-irrelevant variance in the form of bias and random measurement error.

Bias represents a systematic error with an influence on item performance that can be attributed to an interaction between examinees and some feature of the test. Bias in a test item leads examinees with a particular background characteristic, aside from their ability, to perform better or worse on the item simply because of that characteristic. For example, bias sometimes results from the use of scenarios or examples in an item that are more familiar to certain gender or ethnic groups. Differential familiarity with item content can make an item more relevant, more engaging, and more easily understood, which can then lead to differential performance, even for examinees of the same ability level. We identify such item bias primarily by using measures of item difficulty and differential item functioning (DIF), discussed below and again in Module 31.

Bias in a test item indicates that the item is measuring some other construct besides the construct of interest, where systematic differences on the other construct are interpreted as meaningful differences on the construct of interest. The result is a negative impact on the validity of test scores and the corresponding inferences and interpretations. Random measurement error, on the other hand, is not attributed to a specific identifiable source, such as a second construct. Instead, measurement error is inconsistency of measurement at the item level. An item that introduces measurement error detracts from the overall internal consistency of the measure, and this is detected in CTT, in part, using item analysis statistics.

27.2 Piloting

The goal in developing an instrument or scale is to identify bias and inconsistent measurement at the item level prior to administering a final version of our instrument. As we talk about item analysis, remember that the analysis itself is typically carried out in practice using pilot data. Pilot data are gathered before or during the development of an instrument or scale. These data require at least a preliminary version of the educational or psychological measure: we’ve written some items for our measure, and we want to see how well they work.

Ferketich (1991) and others recommend that the initial pilot “pool” of candidate test items should be at least twice as large as the final number of items needed. So, if you’re dreaming up a test with 100 items on it, you should pilot at least 200 items. That may not be feasible, but it is a best-case scenario, and it should at least be followed in large-scale testing. By collecting data on twice as many items as we intend to actually use, we’re acknowledging that, despite our best efforts, many of our preliminary test items may be of low quality, for example, biased or internally inconsistent, or may address different ability levels or content than intended.

Ferketich (1991) also recommends that data should be collected on at least 100 individuals from the population of interest. This too may not be feasible; however, it is essential if we hope to obtain results that will generalize to other samples of individuals. When our sample is not representative, for example, when it is a convenience sample or when it contains fewer than 100 people, our item analysis results must be interpreted with caution. This goes back to the inferences made based on any type of statistic: small samples can lead to erroneous results. Keep in mind that every statistic discussed here has a standard error and confidence interval associated with it, whether it is directly examined or not. Note also that bias and measurement error arise in addition to this standard error or sampling error, and we cannot identify bias in our test questions without representative data from our intended population. Thus, adequate sampling in the pilot study phase is critical.
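
As a rough illustration of this sampling error, consider the standard error of the proportion-correct for a single item. The values below are hypothetical, and the formula assumes simple random sampling.

# Hypothetical example: standard error of an item
# proportion-correct p based on n examinees, assuming simple
# random sampling
p <- 0.60
n <- 100
sqrt(p * (1 - p) / n)
#> [1] 0.04898979

Even with 100 examinees, a difficulty estimate near 0.60 can easily vary by about 0.05 from one sample to the next.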

The item analysis statistics discussed here are based on the CTT model of test performance. In Module 31 we’ll discuss the more complex item response theory (IRT) and its applications in item analysis.

27.3 Data entry

After piloting a set of items, raw item responses are organized into a data frame with test takers in rows and items in columns. The str() function is used here to summarize the structure of the unscored items on the PISA09 reading test. Each unscored item is coded in R as a factor with four to eight factor levels. Each factor level represents different information about a student’s response.

# Recreate the item name index and use it to check the 
# structure of the unscored reading items
# The strict.width argument is optional, making sure the
# results fit in the console window
ritems <- c("r414q02", "r414q11", "r414q06", "r414q09",
  "r452q03", "r452q04", "r452q06", "r452q07", "r458q01",
  "r458q07", "r458q04")
str(PISA09[, ritems], strict.width = "cut")
#> 'data.frame':    44878 obs. of  11 variables:
#>  $ r414q02: Factor w/ 7 levels "1","2","3","4",..: 2 1 4 1 1 2 2 2 3 2 ...
#>  $ r414q11: Factor w/ 7 levels "1","2","3","4",..: 4 1 1 1 3 1 1 3 1 1 ...
#>  $ r414q06: Factor w/ 5 levels "0","1","8","9",..: 1 4 2 1 2 2 2 4 1 1 ...
#>  $ r414q09: Factor w/ 8 levels "1","2","3","4",..: 3 7 4 3 3 3 3 5 3 3 ...
#>  $ r452q03: Factor w/ 5 levels "0","1","8","9",..: 1 4 1 1 1 2 2 1 1 1 ...
#>  $ r452q04: Factor w/ 7 levels "1","2","3","4",..: 4 6 4 3 2 2 2 1 2 2 ...
#>  $ r452q06: Factor w/ 4 levels "0","1","9","r": 1 3 2 1 2 2 2 2 1 2 ...
#>  $ r452q07: Factor w/ 7 levels "1","2","3","4",..: 3 6 3 1 2 4 4 4 2 4 ...
#>  $ r458q01: Factor w/ 7 levels "1","2","3","4",..: 4 4 4 3 4 4 3 4 3 3 ...
#>  $ r458q07: Factor w/ 4 levels "0","1","9","r": 1 3 2 1 1 2 1 2 2 2 ...
#>  $ r458q04: Factor w/ 7 levels "1","2","3","4",..: 2 3 2 3 2 2 2 3 3 4 ...

In addition to checking the structure of the data, it’s good practice to run frequency tables on each variable. An example is shown below for a subset of PISA09 reading items. The frequency distribution for each variable will reveal any data entry errors that resulted in incorrect codes. Frequency distributions should also match what we expect to see for correct and incorrect response patterns and missing data.

PISA09 items that include a code or factor level of “0” are constructed-response items, scored by raters. The remaining factor levels for these CR items are coded “1” for full credit, “7” for not administered, “9” for missing, and “r” for not reached, where the student ran out of time before responding to the item. Selected-response items do not include a factor level of “0.” Instead, they contain levels from “1” up to “5,” which correspond to multiple-choice options one through five, along with codes of “7” for not administered, “8” for an ambiguous selected response, “9” for missing, and “r” again for not reached.

# Subsets of the reading item index for constructed and
# selected items
# Check frequency tables by item (hence the 2 in apply)
# for CR items
critems <- ritems[c(3, 5, 7, 10)]
sritems <- ritems[c(1:2, 4, 6, 8:9, 11)]
apply(PISA09[, critems], 2, table, exclude = NULL)
#>   r414q06 r452q03 r452q06 r458q07
#> 0    9620   33834   10584   12200
#> 1   23934    5670   22422   25403
#> 9   10179    4799   11058    6939
#> r    1145     575     814     336
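
The SR items can be summarized in the same way. Because the SR items do not necessarily share the same set of observed codes, the result of the following may be a list of tables rather than a single matrix.

# Check frequency tables by item for SR items
apply(PISA09[, sritems], 2, table, exclude = NULL)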

In the piloting and data collection processes, response codes or factor levels should be chosen carefully to represent all of the required response information. Responses should always be entered in a data set in their most raw form. Scoring should then happen after data entry, through the creation of new variables, whenever possible.

27.4 Scoring

In Module ??, which covered measurement, scales, and scoring, we briefly discussed the difference between dichotomous and polytomous scoring. Each involves the assignment of some value to each possible observed response to an item. This value is taken to indicate a difference in the construct underlying our measure. For dichotomous items, we usually assign a score of 1 to a correct response and 0 otherwise. Polytomous items involve responses that are correct to differing degrees, for example, 0, 1, and 2 for incorrect, somewhat correct, and completely correct.
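
As a minimal sketch of dichotomous scoring as the creation of a new variable, consider the hypothetical raw responses and answer key below, with “9” for missing and “r” for not reached, as in PISA09.

# Hypothetical raw responses and answer key
rawresp <- c("1", "3", "9", "3", "r")
key <- "3"
# Score 1 for the keyed response and 0 otherwise, then set not
# reached to NA, mirroring the PISA09 scoring shown below
scored <- as.numeric(rawresp == key)
scored[rawresp == "r"] <- NA
scored
#> [1]  0  1  0  1 NA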

In noncognitive testing, we replace “correctness” from the cognitive context with “amount” of the trait or attribute of interest. So, a dichotomous item might involve a yes/no response, where “yes” is taken to mean the construct is present in the individual, and it is given a score of 1, whereas “no” is taken to mean the construct is not present, and it is given a score of 0. Polytomous items then allow for different amounts of the construct to be present.

Although it seems standard to use dichotomous 0/1 scoring, and polytomous scoring of 0, 1, 2, etc., these values should not be taken for granted. The score assigned to a particular response determines how much a given item will contribute to any composite score that is later calculated across items. In educational testing, the typical scoring schemes are popular because they are simple. Other scoring schemes could also be used to give certain items more or less weight when calculating the total.

For example, a polytomous item could be scored using partial credit, where incorrect is scored as 0, completely correct is given 1, and levels of correctness are assigned decimal values in between. In psychological testing, the center of the rating scale could be given a score of 0, and the tails could decrease and increase from there. For example, if a rating scale is used to measure levels of agreement, 0 could be assigned to a “neutral” rating, and -2 and -1 might correspond to “strongly disagree” and “disagree,” with 1 and 2 corresponding to “agree” and “strongly agree.” Changing the values assigned to item responses in this way can help improve the interpretation of summary results.
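
For instance, a five-point agreement scale entered as 1 through 5 could be shifted so that the neutral midpoint becomes 0. The ratings below are hypothetical.

# Hypothetical ratings on a 1 to 5 agreement scale, shifted so
# that the neutral point, 3, becomes 0
ratings <- c(1, 3, 5, 2, 4)
ratings - 3
#> [1] -2  0  2 -1  1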

Scoring item responses also requires that a direction, that is, decreasing or increasing, be given to the correctness or to the amount of the trait involved. Thus, at the item level, we are at least using an ordinal scale. In cognitive testing, the direction is simple: increases in points correspond to increases in correctness. In psychological testing, reverse scoring may also be necessary.
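
Reverse scoring can be accomplished with simple arithmetic. For a four-point scale entered as 1 through 4, subtracting each response from 5 flips the direction, which is what we will need for two of the PISA09 attitude items below. The ratings here are again hypothetical.

# Reverse scoring a four-point rating scale by hand, with
# hypothetical ratings entered as 1 through 4
x <- c(1, 4, 2, 3)
5 - x
#> [1] 4 1 3 2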

PISA09 contains examples of scoring for both educational and psychological measures. First, we’ll check the scoring for the CR and SR reading items. A crosstab of the raw and scored versions of an item shows how each code was converted to a score. Note that students who did not reach an item, coded with the unscored factor level “r,” were given an NA for their score.

# Indices for scored reading items
rsitems <- paste0(ritems, "s")
crsitems <- paste0(critems, "s")
srsitems <- paste0(sritems, "s")
# Tabulate unscored and scored responses for the first CR
# item
# exclude = NULL shows us NAs as well
# raw and scored are not arguments to table, but are used
# simply to give labels to the printed output
table(raw = PISA09[, critems[1]],
  scored = PISA09[, crsitems[1]],
  exclude = NULL)
#>    scored
#> raw     0     1  <NA>
#>   0  9620     0     0
#>   1     0 23934     0
#>   8     0     0     0
#>   9 10179     0     0
#>   r     0     0  1145
# Create the same type of table for the first SR item
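# One way to do this, assuming the scored SR items follow the
# same naming pattern used above for the CR items
table(raw = PISA09[, sritems[1]],
  scored = PISA09[, srsitems[1]],
  exclude = NULL)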

For a psychological example, we revisit the attitude toward school items presented in Figure ??. In PISA09, these items were coded during data entry with values of 1 through 4 for “Strongly Disagree” to “Strongly Agree.” We could use these as the scored responses in the item analyses that follow. However, we first need to rescore the two items that were worded in the opposite direction from the others. Then, higher scores on all four items will represent more positive attitudes toward school.

# Check the structure of raw attitude items
str(PISA09[, c("st33q01", "st33q02", "st33q03",
  "st33q04")])
#> 'data.frame':    44878 obs. of  4 variables:
#>  $ st33q01: num  3 3 2 1 2 2 2 3 2 3 ...
#>  $ st33q02: num  2 2 1 1 2 2 2 2 1 2 ...
#>  $ st33q03: num  2 1 3 3 3 3 1 3 1 3 ...
#>  $ st33q04: num  3 3 4 3 3 3 3 2 1 3 ...
# Rescore two items
PISA09$st33q01r <- rescore(PISA09$st33q01)
PISA09$st33q02r <- rescore(PISA09$st33q02)
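
As a quick check on the rescoring, we can crosstab the raw and rescored versions of an attitude item, as we did for the scored reading items above.

# Compare raw and rescored responses for the first attitude item
table(raw = PISA09$st33q01, rescored = PISA09$st33q01r,
  exclude = NULL)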