1. Why use tests when you can make judgments without them?
2. What are the three most important characteristics of a good test?
3. What does "reliability" mean?
4. What does "validity" mean, when used to describe a test?
5. What makes a test "useful"?
6. What is a normal distribution?
7. What is a percentile score?
8. What is a T-Score?
9. What is a correlation coefficient?
10. How can I learn about the reliability, validity and usefulness of the tests offered at this web site?
Well-designed tests have several advantages over human judgment based on less objective information. Tests can measure many traits more quickly and accurately. School teachers use tests to measure how much students learn, as it would be too time-consuming for the teacher to interview each student about the subject matter. Also, human judgment is often biased by factors of which the person judging is unaware. For example, people tend to overestimate their own intelligence, perhaps because being "stupid" is grounds for social rejection. When we rate another person's personality traits, as when assessing job applicants, we tend to give people higher ratings if they share our ethnic background or if we see them as handsome or beautiful. We do this without being aware of it. Well-designed tests are not contaminated by such biases. Thus, they provide more accurate and fair measures.
Example: In lectures on this topic I ask a group of people to imagine they are screening applicants for a job that requires past national government leadership experience, a college education, high intelligence and good control of personal sexual feelings on the job. I ask them to rate a candidate on two of these traits: verbal intelligence and control of personal feelings. They are to use a scale from 1 (low) to 10 (high). The job candidate I have them rate is William J. Clinton, former President of the United States. The ratings for intelligence that I get typically range from about 3 to 10. For control of personal feelings they range from 1 to 7. Clinton was a Rhodes scholar and almost certainly had verbal intelligence above the 90th percentile (above an I.Q. of 120). Therefore, on a reliable test of verbal intelligence he would earn a score equivalent to a "10" on our rating scale. A good test of control of sexual feelings on the job might have given a score of 4. The test scores would not vary from one examination to another, being more reliable than simple human judgment which, in this example, varies widely, from 3 to 10 and from 1 to 7. For most raters, Clinton's sexual indiscretions probably seemed "stupid". This impression probably lowers their estimates of his intelligence, distorting their rating of him on this trait. To be fair to Mr. Clinton in a hiring situation, tests would be more appropriate than the judgment of one or another of our raters.
Reliability, validity and usefulness.
Test reliability is the accuracy with which a given score for a given person measures the trait in question. For a score to be reliable, the person must take the test carefully and conscientiously. As far as the test itself is concerned, good reliability can be assured if the test questions are carefully written and there are enough of them. For some traits, 30 questions are desirable. For other traits, 6 to 10 questions are enough. For gender, age, years of education and high school grade point average, one question each is enough.
If we wish to measure several aspects of a trait, such as verbal, spatial and memory aspects of intelligence, we may need 30 questions for each aspect. If we plan to score a test using different norms for each of several age levels, then more than 30 questions may be necessary to obtain reliable measures for the full range of the trait at each age level.
Well-designed tests include items which have been carefully crafted and have passed one or more statistical tests to assure that they are contributing well to the total test score.
Reliability is indicated by a statistic, such as an "alpha coefficient". A reliability of .70 is sometimes adequate, .80 is good, and .90 or above is excellent.
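The "alpha coefficient" mentioned above is usually taken to mean Cronbach's alpha. The following is a minimal sketch of how it can be computed, assuming each person's answers are stored as a row of numeric item scores. The function names and the sample answers are hypothetical and serve only as an illustration; the labels simply restate the .70, .80 and .90 guidelines given above.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, one row of item scores per person."""
    k = len(rows[0])                      # number of test questions (items)
    columns = list(zip(*rows))            # regroup the scores item by item
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


def describe_reliability(alpha):
    """Apply the rough guidelines stated in the text above."""
    if alpha >= 0.90:
        return "excellent"
    if alpha >= 0.80:
        return "good"
    if alpha >= 0.70:
        return "sometimes adequate"
    return "low"


# Hypothetical answers of five people to four questions, each scored 1 to 5.
answers = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(answers)
print(round(alpha, 2), describe_reliability(alpha))
```

Alpha rises when the questions agree with one another and when there are more of them, which is why a trait may need 30 questions rather than 6.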
Test validity is the accuracy with which a test measures the trait it claims to measure. The test must first have adequate reliability, as described above. To be valid, the content of the test questions should look right; a test claiming to measure arithmetic addition skills should consist of addition problems, not subtraction or division problems. A test of the personality trait of Extroversion should contain items about social interactions with people, not feelings of depression or anxiety. Another way to document validity is to see whether scores on the test are related as expected to other information gathered at the same time, such as scores on other tests that are trusted to measure the same thing, or other measures to which the trait is known to be related. For example, verbal intelligence is known to be positively related to school grades; persons with higher intelligence tend to get higher grades. Therefore, any test of verbal intelligence should show such a positive relationship.
A test is useful if it helps someone make decisions more effectively than without it. Tests are found useful to measure progress in school classes, how much teenagers and adults know about state driving rules, and how much intelligence and background knowledge persons applying for college and for the armed services have. They are found useful by employers when hiring for private industry and by the government, such as the Postal Service, which uses the Civil Service Examination to screen postal worker applicants.
Some tests are more useful than others. For example, one test for depression may be quite reliable and valid but provide only one score for overall depression. Another test may also provide separate scores for aspects of depression, such as suicidal tendencies and personal problem areas. The second test may be more useful because it provides this added detail. I built my test for depression to provide many scores for separate aspects of depression, including suicidal tendencies and causes. I find this test more useful because it provides more of the information that is important when assessing depressed clients.
A normal distribution is a pattern of test scores arranged from lowest to highest. It shows the frequency of scores at each level. A normal distribution of scores is highest in the middle and tapers smoothly to each end, in a bell shape. Most complex biological and psychological traits are approximately normally distributed. Height, weight, intelligence, Extroversion, depression and business management aptitude are all examples. These traits are all "complex" in that the underlying factors contributing to them are numerous. For example, many facets make up intelligence as a global trait describing a person's aptitude for understanding and solving problems in general.
Some psychological traits are not normally distributed, but are skewed to one side. Homicide endorsement is an example. Most persons get very low scores on a measure of this trait; most persons do not endorse murder as a way to solve personal problems. A few persons do, and their scores trail off in a rather thin stream to the right of the majority in a typical frequency distribution graph or chart.
A percentile score is a standard score which tells where a given raw score falls relative to other persons who have taken the test. Percentile scores range from 1 to 100. If on a test of 30 questions a raw score of 16 is at the 50th percentile, then 50 out of 100 people who take the test are likely to get raw scores of 15 or lower and the rest 16 or higher. A percentile score of 90 means that 90 of 100 persons who have taken the test have gotten raw scores lower than the one corresponding to the 90th percentile. Percentile scores help tested persons understand what their test scores mean by telling them how they did on the test compared to other people.
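The definition above can be sketched in a few lines of code. This is an illustration only: it assumes the raw scores of a comparison ("norm") group are available as a list, and it counts the share of those scores that fall below the given raw score. The function name and the sample scores are hypothetical.

```python
def percentile_rank(raw_score, norm_scores):
    """Percent of persons in the norm group who scored lower than raw_score."""
    lower = sum(1 for score in norm_scores if score < raw_score)
    return 100 * lower / len(norm_scores)


# Hypothetical raw scores of eight persons on a 30-question test.
norm_group = [13, 14, 15, 15, 16, 17, 18, 20]

# Four of the eight persons scored 15 or lower, so 16 is at the 50th percentile.
print(percentile_rank(16, norm_group))
```

Published tests usually look percentiles up in a norm table built from a much larger group, but the idea is the same.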
A T-score is another standard score, like the percentile score in some respects. T-scores are typically set with a mean (average) of 50 and about two thirds of all scores falling between 40 and 60. T-scores can also be set with a mean of 50 and about two thirds of scores falling between 22 and 78. In this second system, most T-scores will fall between 1 and 100, approximating percentile scores. T-scores are more appropriate than percentile scores for research purposes, so they are often included in test reports.
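A short sketch may make the two T-score systems concrete. It assumes the usual conversion, in which a raw score is first expressed in standard deviations from the norm group mean and then rescaled. The spreads of 10 and 28 points are inferred from the ranges 40 to 60 and 22 to 78 given above; the function names and the example norm values (mean 16, standard deviation 4) are hypothetical.

```python
from statistics import NormalDist


def t_score(raw, norm_mean, norm_sd, t_mean=50.0, t_sd=10.0):
    """Rescale a raw score so the norm group has mean t_mean and spread t_sd."""
    z = (raw - norm_mean) / norm_sd       # distance from the mean in SD units
    return t_mean + t_sd * z


def share_between(low, high, t_mean=50.0, t_sd=10.0):
    """Share of a normal distribution of T-scores falling between low and high."""
    dist = NormalDist(t_mean, t_sd)
    return dist.cdf(high) - dist.cdf(low)


# A raw score one standard deviation above the mean in each system.
print(t_score(20, norm_mean=16, norm_sd=4))             # typical system
print(t_score(20, norm_mean=16, norm_sd=4, t_sd=28.0))  # wider system

# In both systems roughly two thirds of scores lie within one spread of 50.
print(round(share_between(40, 60), 2))
print(round(share_between(22, 78, t_sd=28.0), 2))
```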
A correlation coefficient is a statistic widely used in psychological research, including test design. It shows the degree of relationship between two measures. It can range from -1.00 to +1.00. If high scores on one measure (e.g. intelligence) are associated with high scores on the other measure (e.g. school grades), then the correlation is positive, e.g. .52. If high scores on one measure are associated with low scores on the other, then it is negative, e.g. -.68 between a measure of warmongering disposition and intelligence. When correlations are so high that they are very unlikely to have occurred by chance alone, researchers can be confident that the two traits are significantly related to each other and can advise persons to make decisions based on the test scores for those traits. If correlations are based on large samples of persons, e.g. 100 or more, then correlations even as low as .20 can be significant (not due to chance) and provide valuable information.
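The sketch below shows one common way to compute a correlation coefficient (the Pearson coefficient) and to check whether it is larger than chance alone would readily produce. It assumes paired scores for the same persons, and it uses the standard t statistic for a correlation. The function names and sample scores are hypothetical, and the cutoff of about 1.98 is the usual two-tailed 5 percent value for a sample of 100.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for testing whether a correlation r from n persons differs from 0."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# Hypothetical intelligence scores and grade averages for six students.
intelligence = [95, 100, 105, 110, 120, 130]
grades = [2.1, 2.8, 2.6, 3.0, 3.4, 3.7]
print(round(pearson_r(intelligence, grades), 2))

# With 100 persons, a correlation of .20 just clears the cutoff of about 1.98;
# with only 50 persons the same correlation does not.
print(t_statistic(0.20, 100) > 1.98)
print(t_statistic(0.20, 50) > 1.98)
```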
You may read the description of each test in the Tests section. If you are a professional, you can register as one and then read the manual for each test on the web site.