Reliability tells us whether a test is likely to yield the same results if administered to the same group of test-takers multiple times. Another indication of reliability is that the test items should behave the same way with different populations of test-takers: the items should have approximately the same ranking when sorted by their p-values, which are indicators of "item difficulty." The items on a math test administered to fifth graders in Arizona should show similar p-value rankings when administered to fifth graders in California. The same test administered the next year to another cohort of fifth graders should again show similar rankings. Large fluctuations in the rank of an item (one year it was the easiest item on the test; the next year it was the hardest) would indicate a problem with both the item and, by implication, the test itself.
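The p-value check described above can be sketched in a few lines of Python. This is an illustrative example, not a standard psychometric tool: the response data, cohort names, and helper functions below are all made up for the sake of the demonstration. Each item's p-value is simply the proportion of students who answered it correctly, and the difficulty ranking is the list of items sorted from easiest to hardest.

```python
# Hypothetical scored responses: each inner list is one student,
# each position is one item (1 = correct, 0 = incorrect).
def p_values(responses):
    """Proportion of students answering each item correctly (item p-value)."""
    n = len(responses)
    return [sum(student[i] for student in responses) / n
            for i in range(len(responses[0]))]

def difficulty_ranking(responses):
    """Item indices ordered from easiest (highest p-value) to hardest."""
    ps = p_values(responses)
    return sorted(range(len(ps)), key=lambda i: ps[i], reverse=True)

# Two made-up cohorts taking the same three-item test.
cohort_a = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 1, 0]]
cohort_b = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 0]]

print(p_values(cohort_a))            # [1.0, 0.75, 0.25]
print(difficulty_ranking(cohort_a))  # [0, 1, 2]
print(difficulty_ranking(cohort_b))  # [0, 1, 2] — same ranking, as hoped
```

If the two rankings diverged sharply (say, item 0 dropped from easiest to hardest), that would flag the item for review, per the reasoning above.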
A test is said to be reliable when a student would receive nearly the same score upon retaking it. This concept of reliability is based on the idea that repeated administrations of the same assessment should generate consistent results. However, we seldom have time to test and retest students on the same assessment to see whether their scores are consistent. Reliability based on a single test administration is known as internal consistency because it measures the consistency with which students respond to the items within the test. The most common measure of internal consistency is Cronbach's alpha, which ranges from 0 to 1, with 0 indicating no reliability and 1 indicating the highest reliability.
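For readers who want to see the calculation, here is a minimal sketch of Cronbach's alpha in plain Python, using the standard formula α = (k / (k − 1)) × (1 − Σ item variances / total-score variance). The function names and the sample scores are made up for illustration; real analyses would typically use an established statistics package.

```python
# Cronbach's alpha from scored item responses (students x items).
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):
    """Alpha = (k / (k - 1)) * (1 - sum of item variances / total variance)."""
    k = len(scores[0])  # number of items on the test
    item_vars = [variance([student[i] for student in scores]) for i in range(k)]
    total_var = variance([sum(student) for student in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Made-up responses: four students, three items, 1 = correct, 0 = incorrect.
scores = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
print(cronbach_alpha(scores))  # 0.75
```

Values near 1 mean the items hang together (students who do well on one item tend to do well on the others); values near 0 mean responses to the items are essentially unrelated.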