Friday, October 19, 2007

The Metrics are Running the Insane Asylum

James Watson really stuck his foot in his mouth. Much ink and many pixels have been spilled over what he allegedly said and meant (he says he was misquoted or quoted out of context).

Even if he did say the quotes in the Sunday Times, I think the professional pundits are missing something important.
He says that he is “inherently gloomy about the prospect of Africa” because “all our social policies are based on the fact that their intelligence is the same as ours – whereas all the testing says not really”, and I know that this “hot potato” is going to be difficult to address. His hope is that everyone is equal, but he counters that “people who have to deal with black employees find this not true”. He says that you should not discriminate on the basis of colour, because “there are many people of colour who are very talented, but don’t promote them when they haven’t succeeded at the lower level”. He writes that “there is no firm reason to anticipate that the intellectual capacities of peoples geographically separated in their evolution should prove to have evolved identically. Our wanting to reserve equal powers of reason as some universal heritage of humanity will not be enough to make it so”.
A test score gap is not the same thing as an intelligence gap. On what basis do we believe that these tests actually measure "intelligence"? What is intelligence anyway, and is it measurable? Why blame the people who score low and not the test makers?

When a professor at Berkeley gave an exam on which most of the scores clustered around a lower point than he usually aimed for, he apologized to the class. He said that it showed he had not taught us well and had not designed the test well. We re-covered some of the material that the class had struggled with on the exam. Then he wrote another exam and had several graduate students take it to make sure that the wording on the second test was clear. After refinement, he administered the second test to the undergraduates.

What makes an exam good in his eyes? First, it should cover understanding of the material. Second, he believed that the test should be that happy medium of difficulty that created a spread in the test scores. If the scores were all clustered around 30-40%, he couldn't sort the strong from the weak students. He aimed for a Gaussian (aka normal aka bell curve) distribution with an average of 70% and about 10% of the class above 90%.
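As a rough check on what those targets imply, here is a minimal sketch in Python. It assumes the scores follow an idealized normal curve exactly, which real exam scores (bounded at 0% and 100%) only approximate. The function name `implied_std_dev` is made up for this illustration.

```python
from statistics import NormalDist

def implied_std_dev(mean, cutoff, fraction_above):
    """Standard deviation a normal curve needs so that
    `fraction_above` of the scores exceed `cutoff`."""
    # z-score of the cutoff on a standard normal curve
    z = NormalDist().inv_cdf(1.0 - fraction_above)
    return (cutoff - mean) / z

# Stated targets: average of 70%, about 10% of the class above 90%.
sigma = implied_std_dev(mean=70.0, cutoff=90.0, fraction_above=0.10)
target = NormalDist(mu=70.0, sigma=sigma)

print(f"implied standard deviation: {sigma:.1f} points")
print(f"share scoring below 50%: {target.cdf(50.0):.1%}")
```

Under these assumptions the spread works out to roughly 16 percentage points, and by symmetry about 10% of the class would land below 50%. A test whose scores all sit between 30% and 40% has nowhere near that spread.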

That got me thinking about exams as a reflection of the test maker instead of the test takers. If a test does not sufficiently spread out the scores, does it mean that the test is not sufficiently exploring the differences between the test takers?

This is of particular interest to me as a woman in science. Much has been made of the gender differences in standardized test scores, especially in the mathematical part of the tests. Women score lower than men on the math part of the SAT, yet earn higher grades in college. If the SAT is supposed to be a predictor of success in college, then something is wrong with the test.*

On IQ tests, both genders have about the same mean score, but the spread is much larger for men than for women. This is used as a justification for the low representation of women in math and science. In general, people in these fields come from the upper 'tail' of the distribution. If there are fewer women in that tail, the argument goes, then the low representation of women in math and science is a reflection of a natural disparity in aptitude instead of subtle (or not so subtle) discrimination in the culture. (Remember Larry Summers?)
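To see how the arithmetic of that argument works, here is a small sketch in Python. The numbers are hypothetical: two normal curves with the same mean of 100 and standard deviations of 15 and 14, chosen only for illustration and not taken from any study. The function name `tail_ratio` is likewise invented.

```python
from statistics import NormalDist

def tail_ratio(cutoff, mean=100.0, wide_sd=15.0, narrow_sd=14.0):
    """Ratio of the shares above `cutoff` for two normal curves with
    the same mean but different spreads (all values hypothetical)."""
    wide = NormalDist(mu=mean, sigma=wide_sd)
    narrow = NormalDist(mu=mean, sigma=narrow_sd)
    share_wide = 1.0 - wide.cdf(cutoff)
    share_narrow = 1.0 - narrow.cdf(cutoff)
    return share_wide / share_narrow

# The ratio is 1 at the shared mean and grows farther out in the tail.
for cutoff in (100.0, 115.0, 130.0, 145.0):
    print(f"cutoff {cutoff:.0f}: ratio {tail_ratio(cutoff):.2f}")
```

The sketch only reproduces the statistical step: a modest difference in spread becomes a larger difference in the far tail. It says nothing about whether the spread in the scores reflects the test takers or the way the test was built.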

Have you ever read a movie or book review that said the female characters were "flat"? That is, the male writer or movie maker had too weak an understanding of women to create three-dimensional female characters? Why isn't the same criticism leveled at standardized test makers who flatten out the differences among females in their scoring schema?

The Atlantic Monthly has run several excellent articles about the history and culture of testing. Read The Structure of Success, The Great Sorting and Lemann on Testing.

Historically, the tests were designed by white men for other white men. If non-whites and females do not score as well as white men, is that the fault of the test-takers or the test-makers? Recall that, not so long ago, people used to say that Asians are genetically inferior because we did not score as well as whites on IQ tests. A generation later, Asians outscore whites. Did our genetics change in just one generation? No, the tests were translated so we could take the tests in our native languages. ;-) OK, there were some other changes in nutrition, schooling and economic gains.

I have written quite a bit about school testing under the Education tag. In particular, Who's keeping score and why? has many links about school testing.

* One argument for the GPA difference is that women more frequently major in the humanities than in engineering or the sciences. Average GPAs in the humanities are significantly higher than in science and engineering. Well, the differences in GPA persist even when controlled for differences in average GPA between majors. In fact, an MIT study in the 1980s showed that, for MIT students with the same high school GPA to earn the same MIT GPA, the boy needed to score 150 points higher on the SAT. It is safe to assume that the girls (and boys) at MIT are all majoring in science and engineering.

Back to why metrics can lead us astray. Let's look at college rankings.

It is October, which means it is time to rank colleges. When I look at their metrics, I can't help but think about the ways that higher education has changed in the US in the twenty years or so since ranking mania hit. Look at the methodology for the U.S. News & World Report rankings.
  1. Student selectivity accounts for 15% of the score. The more applicants schools attract, the more they can reject, making them appear more selective. The 25th to 75th percentile SAT scores are also counted. That is why schools are increasingly obsessed with applicants' SAT scores. Many schools are buying them by giving "merit" scholarships to students with high SAT scores. Thus, a test score that correlates most highly with parental income is being used as a basis to reward the students who are least likely to face financial barriers to college. This reduces the amount available for need-based financial aid, further widening the affordability gap for students from low and middle income families.
  2. Faculty resources account for another 20% of the score. The two largest factors are faculty compensation (35% of the 20%) and class size (30% of the 20%). The percentage of full-time faculty accounts for only 5% of the 20%. A school can increase its ranking by using more part-time, non-tenure-track faculty at slave wages to bring down its class sizes. It can use part of that cost savings to increase the amount it pays "super-star" faculty who perform research and teach very little. Voilà! It has become a "better school". No wonder the nationwide percentage of college faculty that is tenure track has fallen below 50%. Do part-time "freeway flyer" faculty teaching 2-3 classes at several schools provide a better experience for the students? It doesn't matter at all in this ranking system.
  3. Last, but not least, let's look at $. Financial resources and alumni giving account for 10% and 5% of the total score respectively. The larger the endowment, the higher the score. A generation ago, school endowments were used to subsidize the cost of an education, either by lowering the tuition for all students, or by giving scholarships to low and middle income students. The ranking gives no reward for financial accessibility. Is it any wonder that college endowments and tuition have soared in tandem? The income alone from Harvard's endowment, ~$30 Billion, is sufficient to provide free tuition for all of its students. Will they do that? Of course not. Having the largest endowment gives bragging rights and boosts their rankings. Furthermore, arch-rival Yale's endowment, ~$22 Billion, might surpass theirs. (Managing college endowments is now a significant industry, covered by the press as if it were a horse race. Why doesn't the press ever question why the endowments have grown to such staggering size as annual tuition at selective colleges rises above 100% of the median family income?)
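The nested percentages in the list above are easier to compare once they are converted to shares of the total score. Here is a minimal sketch in Python that uses only the weights quoted in this post. The dictionary and function names are invented for the illustration, and the categories not itemized here are lumped together as a remainder.

```python
# Top-level weights quoted above, in percent of the total score.
top_level = {
    "student selectivity": 15,
    "faculty resources": 20,
    "financial resources": 10,
    "alumni giving": 5,
}

# Sub-factors of faculty resources, in percent of that category.
faculty_sub = {
    "faculty compensation": 35,
    "class size": 30,
    "percent full-time faculty": 5,
}

def effective_weight(category_pct, sub_pct):
    """Share of the total score (in percent) carried by one sub-factor."""
    return category_pct * sub_pct / 100

effective = {
    name: effective_weight(top_level["faculty resources"], pct)
    for name, pct in faculty_sub.items()
}

for name, pct in effective.items():
    print(f"{name}: {pct:g}% of the total score")

# Weight carried by categories that this post does not itemize.
unlisted = 100 - sum(top_level.values())
print(f"not itemized here: {unlisted}% of the total score")
```

By this arithmetic, faculty compensation is worth 7% of the total score and class size 6%, while the share of full-time faculty is worth 1%. The incentive to trade full-time positions for smaller classes and higher star salaries is built into the weights.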
Further reading about college admissions:
The Big Picture: The Atlantic Monthly covers college admissions in Fall 2004.
Atlantic Unbound interview with James Fallows

We give in and go with the flow.
So, if we hold such metrics in such contempt, why did we recently sign Iris up for a standardized test? We want to send her to the Center for Talented Youth camp next summer. It is a camp for gifted children. We wanted her to be challenged intellectually and to meet other kids like herself. The only way for her to get into the program is for her to take one of several standardized "intelligence" tests and to score high. Can you believe that the application deadline for camp in June 2008 is December 1, 2007?

I don't want to give the impression that I am against metrics in general. I spend my professional life exploring data and devising metrics to help pull out the important stories in the data. Because of my experiences, I have become acutely aware that poorly selected metrics can skew or miss the story. We need to be careful about how we use metrics. And we should change the metrics now and then to stay one step ahead of those who would game them.

I must get to sleep. I am taking a class with Cindy Rinne tomorrow.

Addendum: read Are Metrics Blinding Our Perception? I am not advocating using psychics over metrics. But I would like us to think a bit more carefully about what the metrics tell us. What are we really measuring? Are there alternate interpretations?

1 comment:

  1. One idea (not original to me, though I've forgotten the source) that might explain males' greater score spread on standardized tests is that of genetics: XY chromosome pairing may lead to greater inherent variation than XX.

    There are some XX males, so perhaps a well-designed study could measure a sufficient number of their standardized scores to give a hint of the right answer.
