BIP 6024

Thursday, 1 May 2014

STATISTICS III - TEST ITEM ANALYSIS



TEST ITEM ANALYSIS


One of the tools used in the evaluation process is an item analysis. It is used to "test the test": it ensures that testing instruments measure the behaviors learners need in order to perform a task to standard. When evaluating tests we need to ask: do the scores on the test provide information that is really useful and accurate in evaluating student performance? Item analysis provides information about the reliability and validity of test items and about learner performance. Item analysis has two purposes: first, to identify defective test items, and second, to pinpoint the learning material (content) the learners have and have not mastered, particularly what skills they lack and what material still causes them difficulty (Brown & Frederick, 1971).

Item analysis is performed by comparing the proportion of learners who pass a test item in contrasting criterion groups. That is, for each question on a test, how many learners with the highest test scores (U) answered the question correctly or incorrectly, compared with the learners who had the lowest test scores (L)?


NOTE: With the large and normally distributed samples used in the development of standardized tests, it is customary to work with the upper and lower 27 percent of the criterion distribution. Many of the tables used for the computation of item validity indices assume that this "27 percent rule" has been followed. Also, if the total sample contains 370 cases, the U and L groups will each include exactly 100 cases, which eliminates the need to compute percentages. For this reason it is desirable in a large test item analysis to use a sample of 370 persons.
Because item analysis is often done with small classroom-size groups, a simple procedure will be used here. This simple analysis uses a percentage of 33 percent to divide the class into three groups: Upper (U), Middle (M), and Lower (L). An example will be used for this discussion. In a class of 30 students, we choose the 10 students (33 percent) with the highest scores and the 10 students (33 percent) with the lowest scores. We now have three groups: U, M, and L. The test has 10 items in it.
Next, we tally the correct responses to each item given by the students in the three groups. This can easily be done by listing the item numbers in one column and preparing three other columns, named U, M, and L. As we go through each student's paper, we place a tally mark next to each item that was answered correctly. This is done for each of the ten test papers in the U group, then each of the ten test papers in the M group, and finally for each of the ten papers in the L group. The tallies are then counted and recorded for each group as shown in the table below.

Item Analysis Table
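
For readers who want to automate the tally, here is a minimal sketch in Python. It assumes a hypothetical data layout in which `papers` maps each student to a list of 0/1 item scores (1 = answered correctly); the names are illustrative, not part of the original procedure.

def tally_by_group(papers, fraction=1/3):
    """Split students into Upper/Middle/Lower thirds by total score and
    count the correct responses to each item within each group."""
    # Rank students from highest to lowest total score.
    ranked = sorted(papers, key=lambda s: sum(papers[s]), reverse=True)
    n = len(ranked)
    k = round(n * fraction)                 # 10 students when n = 30
    groups = {
        "U": ranked[:k],                    # upper third
        "M": ranked[k:n - k],               # middle third
        "L": ranked[n - k:],                # lower third
    }
    n_items = len(next(iter(papers.values())))
    # tallies[item][group] = number in that group answering the item correctly
    tallies = [{g: 0 for g in groups} for _ in range(n_items)]
    for g, members in groups.items():
        for student in members:
            for item, score in enumerate(papers[student]):
                tallies[item][g] += score
    return tallies, groups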

A measure of item difficulty is obtained by adding the number passing each item in all three criterion groups (U + M + L), as shown in the fifth column. A rough index of the validity or discriminative value of each item is found by subtracting the number of persons answering it correctly in the L group from the number answering it correctly in the U group (U - L), as shown in the sixth column.
  • Item 2 shows a low difficulty level. It might be too easy, having been passed by 29 out of 30 learners. If the test item is measuring a valid performance standard, then it could still be an excellent test item.
  • Item 4 shows a negative value. Apparently, something about the question or one of the distracters confused the U group, since a greater number of them marked it wrong than in the L group. Some of the elements to look for are: wording of the question, double negatives, incorrect terms, distracters that could be considered correct, or text that differs from the instructional material.
  • Item 5 shows a zero discriminative value. A test item of this nature with a good difficulty rating might still be a valid test item, but other factors should be checked, e.g., was a large number of the U group missing from training when this point was taught? Was the L group given additional training that could also benefit the U group?
  • Item 7 shows a high difficulty level. The training program should be checked to see if this point was sufficiently covered by the trainers or if a different type of learning presentation should be developed.
  • Item 9 shows a negative value. The high value of the negative number probably indicates a test item that was incorrectly keyed.
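Given the tallies above, the two indices are simple sums and differences. A hypothetical continuation of the earlier sketch, using the same illustrative names:

def item_indices(tallies):
    """Difficulty and discrimination for each item, per the definitions
    above: difficulty = number passing (U + M + L); discrimination = U - L."""
    for item, t in enumerate(tallies, start=1):
        difficulty = t["U"] + t["M"] + t["L"]
        discrimination = t["U"] - t["L"]
        print(f"Item {item:2d}: passed by {difficulty:2d}, U - L = {discrimination:+d}")

# e.g. an easy item passed by 10 U, 9 M, and 10 L students prints
# "Item  1: passed by 29, U - L = +0".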
As you can see, the item analysis identifies deficiencies either in the test or in the instruction. Discussing questionable items with the class is often sufficient to diagnose the problem. In narrowing down the source of difficulty, it is often helpful to carry out a further analysis of each test item. The table below shows the number of learners in the three groups who chose each option in answering particular items. For brevity, only the first three test items are shown. The correct answers are marked with an *.

Item Analysis Table

This analysis could be done with just the items chosen for further examination, or with the complete test. You might wonder why we would perform another analysis for the complete test if most of the test items proved valid in the first one. The answer is to see how well the distracters performed their job. To illustrate this, look at the distracters chosen for item 1. Although the first analysis showed it to be a valid test item, only distracters B and C were actually used: nine learners chose distracter B, seven learners chose distracter C, while none chose distracter D. Distracter D needs to be made more realistic or be eliminated from the test item. This type of analysis helps us to further refine the testing instrument.
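
The option-level tally can be sketched in the same style. Here `answers` maps each student to a list of chosen options ("A" through "D") and `key` lists the correct answer for each item; both are hypothetical stand-ins for the actual test papers, and `groups` comes from the earlier sketch.

def option_counts(answers, key, groups):
    """For each item and group, count how many students chose each option,
    so unused distracters (like D in item 1 above) stand out."""
    n_items = len(key)
    counts = [{g: {} for g in groups} for _ in range(n_items)]
    for g, members in groups.items():
        for student in members:
            for item, choice in enumerate(answers[student]):
                counts[item][g][choice] = counts[item][g].get(choice, 0) + 1
    return counts

# Usage sketch: counts = option_counts(answers, key, groups)
# counts[0]["U"] might look like {"A": 8, "B": 1, "C": 1} for item 1.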

Statistics III - Index of Discrimination


Interpreting the Index of Discrimination

The index of discrimination is a useful measure of item quality whenever the purpose of a test is to produce a spread of scores, reflecting differences in student achievement, so that distinctions may be made among the performances of examinees. This is likely to be the purpose of norm-referenced tests.

For the subset of criterion-referenced tests known as mastery model tests, we desire that all examinees score as high as possible. We do not wish to distinguish among examinees who score at mastery level and therefore are not interested in maximizing test score variance. In such cases the index of discrimination is not useful and other measures, such as sensitivity to instruction, are used to judge item quality.

A basic consideration in evaluating the performance of a normative test item is the degree to which the item discriminates between high achieving students and low achieving students. Literally dozens of indices have been developed to express the discriminating ability of test items. Most empirical studies have shown that nearly identical sets of items are selected regardless of the indices of discrimination used. A common conclusion is to use the index which is the easiest to compute and interpret.

Such an index of discrimination is shown on the item analysis reports available from the Scoring Office. This index of discrimination is simply the difference between the percentage of high achieving students who got an item right and the percentage of low achieving students who got the item right. The high and low achieving students are usually defined as the upper and lower twenty-seven percent of the students based on the total examination score. This difference in percentages is expressed as a whole number as a matter of convenience.
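
As a sketch, this index can be computed from a list of (total score, item result) pairs; the 27 percent split and the whole-number convention follow the description above, while the data layout is an assumption of mine.

def index_of_discrimination(scores, tail=0.27):
    """Percent of the top 27% answering the item correctly minus the
    percent of the bottom 27%, rounded to a whole number."""
    ranked = sorted(scores, key=lambda s: s[0], reverse=True)
    k = max(1, round(len(ranked) * tail))
    pct = lambda group: 100 * sum(result for _, result in group) / len(group)
    return round(pct(ranked[:k]) - pct(ranked[-k:]))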

A useful rule of thumb in interpreting the index of discrimination is to compare it with the maximum possible discrimination for an item. The maximum possible discrimination is a function of item difficulty. When half or less of the sum of the upper group plus the lower group answered the item correctly, the maximum possible discrimination is the sum of the proportions of the upper and lower groups who answered the item correctly. For example, if 30% of the upper group and 10% of the lower group answered the item correctly, the maximum possible discrimination is 30 plus 10, or 40. This maximum possible discrimination would occur when 40% of the upper group and none of the lower group answered the item correctly.

Note that the actual discrimination of the example is 20. It might be said that the discriminating efficiency of the item, which is the ratio of the actual discrimination to the possible discrimination, is 50%. See Item A in Table 1.

When more than half of the sum of the upper group plus the lower group answer an item correctly, the maximum possible discrimination is 200 minus the sum of the proportions of the upper and lower groups who answered the item correctly. For example, if 96% of the upper group and 84% of the lower group answered the item correctly, the maximum possible discrimination for the item would be 200 minus 180 (96 plus 84), or 20. Since the actual index of discrimination for the item is 96 minus 84, or 12, the discriminating efficiency of the item is 12/20 or 60%. See Item B in Table 1.

It is important to recognize that an item which half of the students answer correctly has the highest possible discriminating potential. Consider an item which 80% of the upper group and 20% of the lower group answer correctly. According to the rule of thumb for items answered by half or less of the students, the maximum discriminating ability of the item is 80 plus 20, or 100. Since the index of discrimination of the item is 60, the discriminating efficiency is 60%. See Item C in Table 1. As the difficulty of an item varies so that more than half of the combined upper and lower groups answer the item correctly, the discriminating ability will decrease from 100. The lower limit of the maximum discriminating ability is zero when all of the combined upper and lower groups, or none of them, answer an item correctly.


TABLE 1

                                                   Item A   Item B   Item C
Proportion Right - Upper Group                       30%      96%      80%
Proportion Right - Lower Group                       10%      84%      20%
Index of Difficulty
  (Average proportion wrong)                          80       10       50
Index of Discrimination
  (Proportion Right - High) minus
  (Proportion Right - Low)                            20       12       60
Maximum Discrimination                                40       20      100
Discriminating Efficiency
  (Index of Disc. / Max. Disc.)                      50%      60%      60%
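
The rule of thumb and the Table 1 values can be checked with a few lines of code. A sketch, using the whole-number percentage convention from the text:

def max_discrimination(p_upper, p_lower):
    """Maximum possible discrimination given the percent correct in the
    upper and lower groups."""
    total = p_upper + p_lower
    # Half or less answered correctly: max = sum of the two percentages;
    # more than half: max = 200 minus that sum.
    return total if total <= 100 else 200 - total

def discriminating_efficiency(p_upper, p_lower):
    """Actual discrimination as a percentage of the maximum possible."""
    return 100 * (p_upper - p_lower) / max_discrimination(p_upper, p_lower)

for item, p_u, p_l in [("A", 30, 10), ("B", 96, 84), ("C", 80, 20)]:
    print(item, max_discrimination(p_u, p_l),
          f"{discriminating_efficiency(p_u, p_l):.0f}%")
# Prints: A 40 50% / B 20 60% / C 100 60%, matching Table 1.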

The techniques discussed above enable one to determine the upper limit of the index of discrimination. In most practical situations, determining a lower limit for the index of discrimination is not a problem, since the most discriminating items are selected from the available item pool. The practical rule is the higher the discrimination, the better.

However, there are a number of techniques which may be used to determine a lower limit below which the index of discrimination is not significantly different from zero. The first, and most tedious, would be to determine the statistical significance of the difference between two proportions, that is, the difference between the proportion of the upper group who answered the item correctly and the proportion of the lower group who answered the item correctly.
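
For this first method, one conventional choice is the normal-approximation z-test for the difference between two proportions. The sketch below is a standard formulation of that test, not necessarily the exact procedure the author had in mind.

import math

def two_proportion_z(correct_u, n_u, correct_l, n_l):
    """z statistic for the difference between the upper- and lower-group
    proportions answering an item correctly (pooled standard error)."""
    p_u, p_l = correct_u / n_u, correct_l / n_l
    pooled = (correct_u + correct_l) / (n_u + n_l)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_u + 1 / n_l))
    return (p_u - p_l) / se

def one_tailed_p(z):
    """Upper-tail p-value under the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# e.g. 8 of 10 correct in U versus 3 of 10 in L:
# two_proportion_z(8, 10, 3, 10) is about 2.24, with p of roughly 0.012.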

A second method would be to use a specially prepared table, such as the one in Appendix A of Julian Stanley's Measurement in Today's Schools (4th ed., Prentice-Hall, 1964). Table A-5 (pp. 353-355) indicates the level at which an item can be considered sufficiently discriminating, in terms of numbers of persons. The number of persons must be converted to a proportion before relating it to the index of discrimination given on the item analysis report. This table is convenient to use and gives values appropriate for 2-, 3-, 4-, or 5-option items.


A third method of determining the statistical significance of the index of discrimination is to compute its standard error. This might be accomplished by doing an item analysis on two samples drawn from a large group. The reliability of the index of discrimination may be determined by correlating the pairs of values from the two item analyses. The rule may then be applied that the index of discrimination must be more than twice as large as its standard error in order for the index to be statistically different from zero at the 2.5 percent level of significance. Experience with University College final examinations has shown that the standard error technique and the use of Stanley's table establish almost identical criteria for testing the significance of the index of discrimination when item analyses are based on 500 students. Comparable criteria also result from applying the technique of determining the statistical significance of the difference between two proportions, when the items have difficulty indices of approximately 50.
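
A sketch of the standard-error technique, under an assumption of mine about the estimator: with the index of discrimination computed for every item on two random half-samples, the spread of the paired differences gives an error estimate. The text describes the two-sample design and the twice-the-standard-error rule but does not spell out a formula, so the sd(differences)/sqrt(2) estimate below is illustrative.

import math

def discrimination_standard_error(d_first, d_second):
    """Estimate the standard error of the index of discrimination from
    paired per-item values computed on two half-samples. If each value is
    the true index plus independent noise, the variance of the difference
    is twice the noise variance, hence the division by 2."""
    diffs = [a - b for a, b in zip(d_first, d_second)]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1)
    return math.sqrt(var / 2)

# Per the rule above, an item's index of discrimination should exceed
# twice this standard error to be considered nonzero.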