Why is it so important to review every item on a test? One may speculate that as long as the majority of the items on a test are good, there may not be much impact if a few items are problematic. However, the presence of even a few problematic items reduces overall test reliability and validity.
While Question Difficulty is easy to calculate (it is simply the proportion of students who answered the item correctly), Question Quality is a more involved calculation. To calculate Question Quality, OnTarget evaluates each student's response to each item and compares that response with the student's performance on the overall test, item by item, against every other student's response and overall performance. This statistic is called a point-biserial correlation. A large positive point-biserial value indicates that students with high scores on the overall test tend to get the item right (which we would expect) and that students with low scores on the overall test tend to get the item wrong (which we would also expect). A point-biserial near zero indicates little or no relationship between performance on the item and performance on the test, while a negative value implies that students who get the item correct tend to do poorly on the overall test (an anomaly) and that students who get the item wrong tend to do well on the test (also an anomaly).
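To make the computation concrete, here is a minimal sketch of how the two statistics could be derived from a matrix of scored responses. This is an illustration in Python using NumPy, not OnTarget's actual implementation; the data layout (one row per student, one column per item, 1 = correct, 0 = incorrect) and the function name are assumptions made for the example.

```python
import numpy as np

def item_statistics(responses):
    """Return (difficulty, point-biserial) for each item.

    `responses` is assumed to be a 2-D array of 0/1 item scores:
    one row per student, one column per item (hypothetical layout).
    """
    responses = np.asarray(responses, dtype=float)
    totals = responses.sum(axis=1)      # each student's total test score
    stats = []
    for j in range(responses.shape[1]):
        item = responses[:, j]
        p = item.mean()                 # Question Difficulty: proportion correct
        # Point-biserial = Pearson correlation between the 0/1 item score
        # and the total score. (Here the item is left in the total; some
        # analyses subtract it out first. If every student answered the
        # item the same way, the correlation is undefined and NumPy
        # returns NaN.)
        r_pb = np.corrcoef(item, totals)[0, 1]
        stats.append((p, r_pb))
    return stats
```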
For example, let us consider two students: Student A, who scored well on the assessment, and Student B, who scored poorly on the same assessment. Question Quality is looking for what is expected: that each individual question score is consistent between and within students, so that the resulting data for that question can be trusted. So how is this done? Consider Question 1. Student A, who scored well on the test, should have gotten Q1 correct; if he/she got it incorrect while Student B, who scored poorly, got Q1 correct, then Q1 is not performing as expected, its results are not consistent, and it does not correlate with overall performance. A perfect correlation score is 1, which means the question performed exactly as expected and is consistent: those who scored well got it correct and those who scored poorly got it incorrect. The farther the score falls from 1 toward 0, or below 0 into negative values, the worse the question performed, meaning the question was not consistent and did not perform as expected. Ultimately, this means that the results derived from that particular question cannot be trusted and require further investigation. A point-biserial value of at least 0.15 is recommended, though our experience has shown that "good" items have point-biserials above 0.25.
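Continuing the sketch above, the snippet below runs the hypothetical item_statistics helper on a small made-up score matrix and flags each item against the 0.15 and 0.25 guidelines. The data and the flag labels are illustrative only.

```python
# Hypothetical scores for 6 students on 4 items: 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]

for i, (p, r_pb) in enumerate(item_statistics(scores), start=1):
    # Below 0.15: flag for review; above 0.25: behaving as a "good" item.
    flag = "review" if r_pb < 0.15 else ("good" if r_pb > 0.25 else "ok")
    print(f"Q{i}: difficulty={p:.2f}  point-biserial={r_pb:.2f}  ({flag})")
```

In this toy data set, Question 3 tends to be answered correctly by the lower-scoring students and incorrectly by the higher-scoring ones, so its point-biserial comes out negative and it is flagged for review.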
While educators may reuse items over several years, it is advisable to remove bad items from the item pool. Besides lowering the reliability of the test, bad items also confuse students during the test-taking process. Students cannot be expected to recognize a flawed item, set it aside, and simply move on to the next question; instead, they spend time and energy trying to respond to poorly written items.