On its news website, the journal *Nature* recently published a graphic illustrating scientific research activity around the world as measured by the number of papers published from January to October 2011. Nature obligingly arranged the data by country.

A glance at the graphic tells you that the U.S., China, the U.K. and certain European countries are the big producers of scientific research. But does that mean that these countries are doing a good job educating their citizens and encouraging research? A closer examination of the data says no.

### Here are the data

I’ve sorted them in descending order of number of papers.

True, the United States, China, Britain and Germany are the big contributors to scientific progress. Those countries lead the pack, while small, developing countries, such as Thailand, trail.

### But there’s more

But there’s more here than meets the eye. Of course, the US tops the list: as well as being developed economically, it’s got a substantial population, and consequently a large number of scientists. As you might expect, countries flush with people, like India and China, are also near the top.

But are those top publishers really pulling their weight in terms of production per person?

One way of adjusting for population size is to compute the number of papers per hundred thousand inhabitants. Taking Wikipedia as the source of population figures, we get the following. (I used Excel to compute papers per hundred thousand and then sorted the data in descending order of this ratio.)
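The per-capita adjustment is plain arithmetic, and the same calculation done in Excel can be sketched in Python. This is a minimal illustration only: the rows are made-up placeholders, not the Nature or Wikipedia figures, and the function name `papers_per_100k` and the country labels are assumptions introduced here.

```python
# Papers per 100,000 inhabitants, sorted in descending order of that ratio.
# NOTE: the rows below are hypothetical placeholders, not the Nature data.

def papers_per_100k(papers: int, population: int) -> float:
    """Return the number of papers per 100,000 inhabitants."""
    return papers * 100_000 / population

# (country label, papers, population) -- illustrative values only
rows = [
    ("Country B", 300_000, 300_000_000),
    ("Country C", 100_000, 1_300_000_000),
    ("Country A", 20_000, 8_000_000),
]

# Compute the ratio for every row, then sort from highest to lowest.
ranked = sorted(
    ((name, papers_per_100k(papers, pop)) for name, papers, pop in rows),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, rate in ranked:
    print(f"{name}: {rate:.1f} papers per 100K")
```

With these placeholder rows the smallest country moves to the top once output is divided by population, which is the kind of reshuffling the adjustment is meant to expose.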

Now Switzerland heads the list, followed by the Scandinavian four and the Netherlands. Of these, only the Netherlands has more than 10 million inhabitants. The US has been topped by 16 other countries, including the United Kingdom and three British Commonwealth countries, Canada, Australia and New Zealand. China and India have plunged.

### But there’s still more

Is there a benefit to being a small country? It’s worth plotting the data on a graph with the population of the country as the horizontal axis and the “papers per 100K” on the vertical axis.

Those two outliers, India and China, with their large populations, mask any relationships hiding in the rest of the data. The US is a bit of an outlier, too.

In order to pick apart the data, I’ll delete China and India as special cases. This limits how far the analysis can be generalized. So noted. Here’s the new graph.

Now a relationship is beginning to emerge. Small countries do have an advantage. But the US — that point out to the right — is still anomalous; and the plot is not linear — it’s got an L-shape.

### Time for some statistical pyrotechnics

Skip over this section if you are uncomfortable with logarithms.

In a situation like this, statisticians usually reach for a transformation or two. After fiddling about with the data, I chose two.

I replaced the “papers per 100K” with the logarithm of that variable; and I did the same with the population variable. (“What kind of logarithm?” you may ask. I used natural logs (base e), but base 10 works just as well, so long as you remain consistent.)

I then fit a regression line (the best fitting straight line) to the data.

The statistical sophisticate will note the heteroskedasticity here; I’m ignoring it. The plot looks nice and linear. The US is still an outlier.

The regression equation, with w = ln(papers per 100K) and z = ln(population in millions), is

$$w = -0.5719\,z + 6.0707$$

The F-value is 25.04 with (1, 36) degrees of freedom; the significance is p < 0.0001.

(Log base 10 transformations of both variables would have yielded the same slope but a different intercept, with identical F and p statistics.)
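For readers who want to see the mechanics, here is a hedged Python sketch of a log-log least-squares fit and its F statistic, run once with natural logs and once with base-10 logs. The data points are hypothetical placeholders, not the Nature figures, and the helper names `fit_line` and `f_statistic` are assumptions made for this example. In a log-log fit, switching the base rescales the intercept by 1/ln 10 and leaves the slope and the F statistic unchanged.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def f_statistic(xs, ys, slope, intercept):
    """F statistic for a one-predictor regression, with (1, n - 2) df."""
    n = len(xs)
    mean_y = sum(ys) / n
    fitted = [slope * x + intercept for x in xs]
    ss_reg = sum((f - mean_y) ** 2 for f in fitted)            # explained
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # residual
    return ss_reg / (ss_res / (n - 2))

# Hypothetical (population in millions, papers per 100K) pairs.
population = [5.0, 8.0, 17.0, 35.0, 60.0, 80.0, 130.0, 310.0]
rate = [190.0, 140.0, 95.0, 70.0, 38.0, 45.0, 22.0, 30.0]

# Fit using natural logs of both variables.
z_e = [math.log(x) for x in population]
w_e = [math.log(y) for y in rate]
slope_e, intercept_e = fit_line(z_e, w_e)
f_e = f_statistic(z_e, w_e, slope_e, intercept_e)

# Fit again using base-10 logs of both variables.
z_10 = [math.log10(x) for x in population]
w_10 = [math.log10(y) for y in rate]
slope_10, intercept_10 = fit_line(z_10, w_10)
f_10 = f_statistic(z_10, w_10, slope_10, intercept_10)

print(f"base e : slope={slope_e:.4f} intercept={intercept_e:.4f} F={f_e:.2f}")
print(f"base 10: slope={slope_10:.4f} intercept={intercept_10:.4f} F={f_10:.2f}")
```

The p-value follows from the F statistic and its degrees of freedom, so it is the same in either base as well.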

It is informative to transform the equation back into terms of x and y, where y = papers per 100K and x = population in millions. Writing the slope as a = -0.5719 and the intercept as b = 6.0707, our equation is

$$\ln y = a \ln x + b$$

Taking exponents (base e) of both sides leads to

$$y = e^{b}\, x^{a}$$

Putting in values for a and b leads to

$$y = e^{6.0707}\, x^{-0.5719} = \frac{e^{6.0707}}{x^{0.5719}}$$

The numerator is 433.0, rounded to 1 decimal place.
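The back-transformation can be checked numerically. The sketch below uses only the values of a and b quoted in the text; the function name `predicted_rate` is an assumption introduced for illustration.

```python
import math

a = -0.5719  # coefficient of ln(population in millions), from the text
b = 6.0707   # constant term of the regression, from the text

# Exponentiating the constant term gives the numerator of the power law.
numerator = math.exp(b)

def predicted_rate(x_millions: float) -> float:
    """Papers per 100K predicted for a population of x_millions million."""
    return numerator * x_millions ** a

print(round(numerator, 1))  # → 433.0
```

Because a is negative, `x_millions ** a` is the same as dividing by `x_millions ** 0.5719`, which is why the constant ends up as a numerator.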

### Math phobes, you may return

The relationship between the “papers per 100K” (designated by y) and “population in millions” (x) is summarized in the equation

$$y = \frac{433.0}{x^{0.5719}}$$

We can plot this as a line on the graph we looked at before.

### Now, which countries are producing above their weight?

The dots above the line represent the countries whose output exceeds what is explained by the size of their populations. They are: Switzerland, Denmark, Sweden, Norway, the Netherlands, Finland, Australia, Belgium, UK, Canada, Israel, Austria, Taiwan, Germany, US, Spain, France, South Korea, Italy and Japan.
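Deciding whether a dot sits above the line amounts to comparing each country's observed rate with the rate the fitted curve predicts for its population. A rough Python sketch follows, using the a and b quoted earlier; the rows are hypothetical placeholders rather than the actual country data, and `expected_rate` and `above_the_line` are names invented for this example.

```python
import math

A = -0.5719  # fitted coefficient of ln(population in millions), from the text
B = 6.0707   # fitted constant term, from the text

def expected_rate(pop_millions: float) -> float:
    """Papers per 100K that the fitted curve predicts for this population."""
    return math.exp(B) * pop_millions ** A

def above_the_line(rows):
    """Return labels whose observed rate exceeds the fitted rate."""
    return [
        name
        for name, pop_millions, observed in rows
        if observed > expected_rate(pop_millions)
    ]

# (label, population in millions, observed papers per 100K) -- hypothetical
rows = [
    ("Country A", 8.0, 250.0),
    ("Country B", 60.0, 30.0),
    ("Country C", 300.0, 95.0),
]

print(above_the_line(rows))
```

Dividing the observed rate by the expected rate instead of just comparing them would also give a ranking of how far above or below the line each country falls.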

Now all these are well-developed nations with relatively high education standards. If we wanted to delve into the data more deeply, we could include, along with population size, such factors as public education, economics, health and standard of living.

But I’ve said and done enough.

### Caveat

Unhappily, Nature News does not say how it classified the papers, but presumably that is based on the location of the lead author of each paper. Happily, the graphic includes real data. Unhappily, Nature News does not give a source for the data.

A lovely piece of analysis, Alan. Your quickness to settle on a linear model for the logarithmic data will of course be criticized but I notice a brilliance in the way you have used this model: regardless of the model chosen a curve looking roughly like the one you have drawn and in that approximate position will, by necessity, result, or it is an obvious poor fit, no analysis necessary. And, with any such curve, the collection of countries “punching above their weight” in the corresponding analysis will remain almost the same. I think a statistician might call this a robust analysis, in that the specifics of coefficients of variables and model chosen really won’t change one’s conclusions much. I will note that India and China, inserted after the model is given, fall into the “punching above one’s weight” category, though not by as much as the U.S. I don’t know about India, but in educational assessments one generally treats China by regions, and it is seen that the cities Beijing, Shanghai and Hong Kong display population characteristics similar to that of countries but are off the scale educationally (and their omission from the national Chinese data significantly depresses conclusions about the country overall). If Nature or someone produces such data you might use them to refine these outliers a bit.

You might have some fun with the PISA, CPAC and TIMSS assessments in mathematics. I’m curious what a competent statistician with time to hit the data with a few 2x4s will find. In particular, those outside the field are not good at evaluating the significance of the drop in Western Canadian scores since 2003, and I think there is a fair amount of nonsense about this cropping up in the educational literature.