Of Ranges and Rankings

A weather report that gives you plainspoken advice ("Take your umbrella today") may seem more helpful than a forecast that offers only probabilities ("There's a 60 percent chance of showers this afternoon"). But having a measure of a statement's uncertainty can be useful. It lets you decide for yourself whether the risk of getting wet outweighs the bother of carrying an umbrella you might not need.

The same kinds of issues come up in assessments of graduate-education programs. The latest such assessment from the National Research Council is especially scrupulous about acknowledging the sources of uncertainty in its findings. Earlier NRC reports, published in the 1980s and 90s, listed doctoral programs in rank order, from best to worst. That simple scheme made comparisons easy, but not necessarily reliable. The new report gives a range of rankings for each program. For example, a program might be described as ranking somewhere between 5th place and 11th place.

What do these ranges mean, and why did the NRC choose to present its results in this "fuzzy" way? We answer these questions briefly here. For those who seek a more detailed mathematical treatment, the NRC has published "A Guide to the Methodology of the National Research Council Assessment of Doctorate Programs," available at http://www.nap.edu/catalog/12676.html. (But note that some procedures changed after this document was written.) Note also that the NRC has compiled two distinct sets of overall rankings, based on different ways of deciding which factors matter most to the quality of a program; the differences between the two sets are outlined in a separate document and are not discussed here.

Who's on First?

The NRC rankings are based on responses to questionnaires filled out by faculty, students and administrators at cooperating institutions. In most academic disciplines the doctoral programs are scored in 20 categories. All of these criteria are traits that can be measured or counted, such as the number of graduate students in a program and the number of research publications per faculty member per year. A simple procedure for ranking the programs is outlined in the diagram below.

[Diagram: normalize, weight and sum the program variables]

After collecting the data, the first step is to "normalize" it, so that quantities measured in various units and ranges are all reduced to the same numerical interval, say, values between 0 and 1. Then the normalized values are multiplied by weights, or coefficients, that indicate the relative importance attributed to each factor. These weighted values are summed to yield a composite score for each program. Sorting the scores from highest to lowest gives the ranking of the programs in that discipline.
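The normalize-weight-sum recipe can be sketched in a few lines of Python. The numbers here are invented for illustration, and only three variables are used instead of the survey's 20:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw scores: 6 programs x 3 variables (the real survey uses 20).
raw = rng.uniform(10, 100, size=(6, 3))

# Normalize each variable to the interval [0, 1].
lo, hi = raw.min(axis=0), raw.max(axis=0)
normalized = (raw - lo) / (hi - lo)

# Illustrative weights expressing the relative importance of each variable.
weights = np.array([0.5, 0.3, 0.2])

# Weighted sum gives a composite score; sorting gives the ranking.
scores = normalized @ weights
ranking = np.argsort(-scores)  # program indices, best first
```

Any composite scheme of this kind stands or falls on the choice of weights, which is exactly the issue taken up below.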

With a ranking of this kind you can see at a glance which programs are best and which are worst, and that's exactly why the NRC has changed its procedures. The trouble is, what you see at a glance is often misleading. Small differences in rank arise more from statistical noise than from genuine differences between programs. The program at the top of the list looks like it ought to be the best of the bunch, but it might well come in second or third if the whole process were repeated under slightly different circumstances.


The idea of repeating the evaluation procedure many times with minor variations is at the heart of the NRC's new methods for coping with uncertainty. Actually running the survey again and again is not a practical option; university personnel can't be asked to fill out 30-page questionnaires repeatedly. So, instead, the NRC adopts a statistical technique called resampling, which might be described as a kind of simulated repetition.

Suppose you're considering a department where research productivity fluctuates from year to year. When a long-term project is just getting under way, the number of published papers per faculty member is low; several years later, as the project concludes, there's a surge of publications. A rating based on the output in any one year would be an unreliable indicator of program quality. Taking an average over several years gives a better estimate, but the average still fails to capture important information about the variability itself. If you became a doctoral candidate in that program, it would make a big difference whether you arrived during a productive phase or during a fallow period.

For its new survey, the NRC recorded publication rates over several years and then calculated both the mean number of publications and the standard deviation (a measure of variability). These two parameters define a normal, or Gaussian, probability distribution-a bell-shaped curve with a peak at the mean value and a width determined by the standard deviation. The resampling procedure then generated 500 randomly selected values from this distribution. Because of the shape of the normal curve, values near the mean were the most likely to appear in the random set, but at least a few widely dispersed values would also be expected. The result of this computation was a collection of 500 replicas of the original measured data, with roughly the same properties on average but with variations much like those that would have been observed if the survey could have been repeated 500 times. The same algorithm was applied to the other 19 program variables, each with its own mean and standard deviation.
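This resampling step is easy to simulate. The sketch below draws 500 replicas of a single variable from a normal distribution; the mean and standard deviation are made-up numbers, not figures from the survey:

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose a program averaged 2.4 publications per faculty member per year,
# with a standard deviation of 0.8 over the years surveyed (invented values).
mean, sd = 2.4, 0.8

# Draw 500 simulated "re-measurements" from the fitted normal distribution.
# Values near the mean are most likely, but outliers appear as well.
replicas = rng.normal(mean, sd, size=500)
```

In the actual survey, the same draw would be made independently for each of the 20 variables of each program, using that variable's own mean and standard deviation.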

[Figure: normal curves]

Weighty Matters

Apart from fluctuations in the measured values of the 20 program variables, there is another source of uncertainty in the doctoral program rankings. When the 20 variables are combined to create a composite score for each program, they are first multiplied by weight coefficients, which determine the relative importance of each factor in the overall score. The weight coefficients are also subject to uncertainty, which the NRC wanted to incorporate into the rankings.

The weights derive ultimately from the judgments of faculty members who were asked to evaluate doctoral programs in their own field. As mentioned above, the NRC devised two methods for extracting weight coefficients from the judgments of the evaluators, leading to two separate rankings (described in a separate document). But both sets of weight coefficients involve uncertainties, which were treated in similar ways.

Unsurprisingly, the evaluators did not all agree on what traits contribute most to the overall quality of a doctoral program. One evaluator might emphasize publication frequency and citation counts for the faculty, another the percentage of students who earn their degree within a given number of years. To account for this variability, the NRC adopted a computational technique called the random-halves method. Suppose there are N evaluators altogether. The algorithm selects a random subset of N/2 evaluators and calculates the average weight they assign to each of the 20 variables. Then the same calculation is done again with a new random subset, also of size N/2, and the process continues for a total of 500 trials. In this way the random-halves method generates an ensemble of 500 sets of weight coefficients that should reflect both the average judgments of the evaluators and the variability in those judgments.
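Here is a minimal sketch of the random-halves idea, again with invented numbers (40 hypothetical evaluators and three variables rather than 20):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical weight judgments: one row per evaluator, one column per variable.
n_evaluators, n_vars = 40, 3
judgments = rng.uniform(0, 1, size=(n_evaluators, n_vars))

weight_sets = []
for _ in range(500):
    # Pick a random half of the evaluators (without replacement)...
    half = rng.choice(n_evaluators, size=n_evaluators // 2, replace=False)
    # ...and average their judgments to get one set of weight coefficients.
    weight_sets.append(judgments[half].mean(axis=0))
weight_sets = np.array(weight_sets)  # 500 sets of weights
```

Because each trial averages a different half of the evaluators, the 500 weight sets cluster around the overall average judgment while still varying from trial to trial.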

[Figure: random halves]

We now have all the ingredients for a complete ranking of the graduate programs in a discipline; the ingredients go together according to the following recipe. First, for each measured variable in each program, generate 500 random replicas drawn from the normal distribution described above. This yields 500 sets of 20 numbers for each program. Next generate 500 sets of weight coefficients via the random-halves algorithm, producing another list of 500 sets of 20 numbers. Now combine these two data sets, multiplying each replica variable by one of the weight coefficients and summing each set of 20 products. The result is a list of 500 composite scores for every doctoral program. Sorting the lists yields 500 distinct rankings of the programs. Finally, the NRC discards the highest and the lowest 5 percent of the rankings for each program; the remaining interval (from the 5th to the 95th percentile) is the range of rankings reported for the program.
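Putting the pieces together, the whole recipe fits in a short simulation. Everything here is illustrative: nine hypothetical programs, three variables instead of 20, and invented means, spreads and evaluator judgments:

```python
import numpy as np

rng = np.random.default_rng(3)
n_programs, n_vars, n_trials = 9, 3, 500  # the real survey uses 20 variables

# Invented per-program means and standard deviations of each normalized
# variable, plus hypothetical weight judgments from 40 evaluators.
means = rng.uniform(0, 1, size=(n_programs, n_vars))
sds = np.full((n_programs, n_vars), 0.05)
judgments = rng.uniform(0, 1, size=(40, n_vars))

ranks = np.empty((n_trials, n_programs), dtype=int)
for t in range(n_trials):
    # One replica of every measured variable, drawn from its normal curve.
    replicas = rng.normal(means, sds)
    # One random-halves set of weight coefficients.
    half = rng.choice(40, size=20, replace=False)
    weights = judgments[half].mean(axis=0)
    # Composite scores, then ranks (1 = best) for this trial.
    scores = replicas @ weights
    order = np.argsort(-scores)
    ranks[t, order] = np.arange(1, n_programs + 1)

# Discard the top and bottom 5 percent of each program's 500 rankings;
# the interval that remains is the reported range of rankings.
low = np.percentile(ranks, 5, axis=0)
high = np.percentile(ranks, 95, axis=0)
```

A program's `low` and `high` values play the role of the "5th place to 11th place" ranges quoted in the NRC report.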

What Do the Ranges Mean?

If you're the sort of person who favors a simple imperative weather forecast ("Take the umbrella!"), then you may also long for a graduate-school ranking system that just tells you which program to pick. Unfortunately, the information gathered by the NRC will not support such an unequivocal declaration. There's no choice but to cope with uncertainty.

Given a set of rankings (or, rather, ranges of rankings) like those shown below, which program should you prefer?

[Figure: ranking ranges]

It's easy to see that programs A and B are considered superior to programs G, H and I, since there is no overlap in their rankings. But making finer distinctions can be tricky. Is program B better than C? The graph shows that C's range of rankings extends down to fifth place, whereas B cannot be lower than fourth place. Thus there's a tendency to assume that B must be better. Nevertheless, it is entirely consistent with the data that the "true" ranking-if only we could know it-would put C in the No. 1 spot and B in fourth place.

Supplemental data available at PhDs.org can provide some further guidance in disentangling the ranges of rankings. The web page for each individual program displays a histogram like those shown below. The histograms are created from the 500 sets of resampled scores (with uncertainty in both the measurements of program variables and the weight coefficients). As explained above, the 500 sets of scores are sorted into rank order, and both the top and bottom 5 percent of the rankings are discarded, leaving a total of 475 rankings for each program. The histogram records the number of times a program ranked first, second, third, and so on in the resamplings.
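Building such a histogram from the retained rankings is a one-liner. The rank data below are invented stand-ins for a single program's 475 retained rankings:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical retained rankings for one program (475 of 500 trials survive
# after the top and bottom 5 percent are discarded); ranks run from 1 to 5.
retained = rng.integers(1, 6, size=475)

# Count how often the program landed at each rank, from 1 up to 5.
counts = np.bincount(retained, minlength=6)[1:]  # index 0 is unused
```

Plotting `counts` against rank reproduces the kind of histogram shown on each program's page.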


For the hypothetical set of nine programs shown here, the histograms show that program B has its tallest peak at rank 2. The rankings for program C cover a broader range, with a peak at 3. The histograms still do not reveal the ever-elusive "true" rank of a program. But perhaps they come as close as we can get, given the emphasis on uncertainty in the NRC survey.
