A qualitative review is a method of reviewing the questions on a test to ensure that they all meet minimum quality control criteria. The objective of a qualitative review is to identify problematic items on a test. Items may be problematic for one or more of the following reasons:
- Items may be poorly written, causing students to be confused when responding to them.
- Graphs, pictures, diagrams, or other information accompanying the items may not be clearly depicted or may be misleading.
- Items may not have a single clearly correct response, and a distractor could potentially qualify as a correct answer.
- Items may contain distractors that most students can see are obviously wrong, increasing the odds of a student guessing the correct answer.
- Items may represent a different content area than that measured by the test.
- Bias for or against a gender, ethnic or other student group may be present in the item or distractors.
One concept of item analysis concerns the difficulty of items relative to the population of persons who took the test. Item difficulty, simply stated, is the proportion of persons who answered the item correctly; this proportion is called the facility of an item and is usually denoted by the letter p. The left-hand side depicts the individual p-value of each question. The right-hand side depicts the frequency distribution of the p-values across the entire assessment.
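The facility (p-value) of each item can be computed directly from a matrix of scored responses. A minimal Python sketch, using illustrative data (the response matrix below is an assumption, not taken from any real assessment):

```python
# Scored responses: 1 = correct, 0 = incorrect.
# Rows are students, columns are items (illustrative data only).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
]

def p_values(scores):
    """Return the facility (p-value) of each item: the proportion
    of students who answered that item correctly."""
    n_students = len(scores)
    n_items = len(scores[0])
    return [sum(row[j] for row in scores) / n_students for j in range(n_items)]

print(p_values(responses))  # -> [0.8, 0.6, 0.4, 0.8]
```

A frequency distribution of these p-values across all items, as described above, can then be built by binning the returned list.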
The point-biserial correlation describes the relationship between a student’s performance on a multiple-choice or gridded-response item (scored correct or incorrect) and performance on the assessment as a whole. A high point-biserial correlation indicates that students who answered the item correctly tended to score higher on the entire test than those who missed the item.
Evidence of Validity
Validity refers to the extent to which a test measures what it is intended to measure. When test scores are used to make inferences about student achievement, it is important that the assessment supports those inferences. In other words, the assessment should measure what it was intended to measure in order for any uses and interpretations of test results to be valid. Validity evidence for an assessment can come from a variety of sources, including test content, response processes, internal structure, relationships with other variables, and analysis of the consequences of testing (TEA, 2018-2019 Technical Digest). Validity evidence based on test content supports the assumption that the content of the test adequately reflects the intended construct. It is imperative that each test item be reviewed for alignment, appropriateness, adequacy of student preparation, and any potential bias (TEA, 2018-2019 Technical Digest). Evidence of validity will be measured using the following guidelines:
- TEKS Alignment
- Bias and Sensitivity
- Language and Vocabulary
- Structure and Context
- Answer Choices
Common Assessment Data File Upload (required)
- Question Difficulty (p-value): The p-value of an item is the proportion of students who answered the item correctly, serving as a proxy for item difficulty (or, more precisely, item easiness). The higher the p-value, the easier the item; low p-values indicate difficult items.
- Optimum Question Difficulty Range: 0.3 – 0.7
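Items can be flagged automatically against the optimum difficulty range. A small sketch, assuming the 0.3 – 0.7 bounds stated above (the labels "too hard" / "too easy" are illustrative, not official terminology):

```python
def flag_difficulty(p, low=0.3, high=0.7):
    """Classify an item's p-value against the optimum difficulty
    range of 0.3 - 0.7; items outside it warrant review."""
    if p < low:
        return "too hard"
    if p > high:
        return "too easy"
    return "in range"

for p in [0.15, 0.45, 0.92]:
    print(p, flag_difficulty(p))
# 0.15 too hard / 0.45 in range / 0.92 too easy
```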
- Question Quality (point-biserial correlation): The correlation between the right/wrong scores that students receive on a given item and the total scores they receive when summing their scores across the remaining items. A low or negative point-biserial implies that students who answered the item incorrectly tended to score high on the test overall, while students who answered it correctly tended to score low.
- 0.09 and Below – Poor Questions
- 0.1 – 0.19 – Marginal Questions
- 0.2 – 0.29 – Fairly Good Questions
- 0.3 – 0.39 – Good Questions
- 0.4 and Above – Very Good Questions
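The corrected point-biserial (item score correlated with the total over the remaining items, as defined above) and the quality bands can be sketched in Python. The example data below is illustrative, and the first band is taken as 0.09 and below, consistent with the 0.1 – 0.19 band that follows it:

```python
from statistics import mean, pstdev

def point_biserial(item_scores, rest_scores):
    """Pearson correlation between right/wrong item scores (1/0)
    and each student's total score on the remaining items."""
    mx, my = mean(item_scores), mean(rest_scores)
    cov = mean((x - mx) * (y - my) for x, y in zip(item_scores, rest_scores))
    return cov / (pstdev(item_scores) * pstdev(rest_scores))

def quality_label(r):
    """Map a point-biserial value to the quality bands listed above."""
    if r < 0.10:
        return "Poor"
    if r < 0.20:
        return "Marginal"
    if r < 0.30:
        return "Fairly Good"
    if r < 0.40:
        return "Good"
    return "Very Good"

# Illustrative data: 5 students' scores on one item, and their
# totals across the remaining items.
item = [1, 1, 0, 0, 1]
rest = [8, 9, 3, 5, 7]
r = point_biserial(item, rest)
print(round(r, 2), quality_label(r))  # r ~ 0.91 -> "Very Good"
```

Here students who answered the item correctly also posted the highest totals on the rest of the test, which is why the correlation lands in the "Very Good" band.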
- Evidence of Validity: Evidence that the content of the test adequately reflects the intended construct.
- Action Taken: Define an action for each individual question.
- Notes: Open space for notes to be entered and saved.