## Which countries are research giants?

On its news website, the journal Nature recently published a graphic illustrating scientific research activity around the world as measured by the number of papers published from January to October 2011. Nature obligingly arranged the data by country.

A glance at the graphic tells you that the U.S., China, the U.K. and certain European countries are the big producers of scientific research. But does that mean that these countries are doing a good job educating their citizens and encouraging research? A closer examination of the data says no.

### Here are the data

I’ve sorted them in descending order of the number of papers.

True, the United States, China, Britain and Germany emerge as the big contributors to scientific progress. Those countries lead the pack, while small, developing countries, such as Thailand, trail.

### But there’s more

But there’s more here than meets the eye. Of course, the US tops the list: as well as being developed economically, it’s got a substantial population, and consequently a large number of scientists. As you might expect, countries flush with people, like India and China, are also near the top.

But are those top publishers really pulling their weight in terms of production per person?

One way of adjusting for population size is to compute the number of papers per hundred thousand inhabitants. Taking Wikipedia as the source of population figures, we get the following. (I used Excel to compute papers per hundred thousand and then sorted the data in descending order of this ratio.)
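For readers who would rather script this step than use a spreadsheet, here is a minimal Python sketch of the same calculation. The function name `papers_per_100k` and the two imaginary countries are assumptions made purely for illustration; the figures are placeholders, not values from the Nature graphic or from Wikipedia.

```python
def papers_per_100k(papers: int, population: int) -> float:
    """Return papers published per 100,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return papers * 100_000 / population


# Placeholder figures for two imaginary countries (NOT real data).
rows = [
    {"country": "Examplia", "papers": 20_000, "population": 50_000_000},
    {"country": "Samplestan", "papers": 5_000, "population": 5_000_000},
]
for row in rows:
    row["per_100k"] = papers_per_100k(row["papers"], row["population"])

# Sort in descending order of the ratio, mirroring the spreadsheet step.
ranked = sorted(rows, key=lambda r: r["per_100k"], reverse=True)
```

In this toy example the smaller country comes out on top despite publishing fewer papers, which is exactly the kind of reshuffling the per-capita adjustment produces.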

Now Switzerland heads the list, followed by the four Nordic countries and the Netherlands. Of these, only the Netherlands has more than 10 million inhabitants. The US has been topped by 16 other countries, including the United Kingdom and three British Commonwealth countries: Canada, Australia and New Zealand. China and India have plunged.

### But there’s still more

Is there a benefit to being a small country? It’s worth plotting the data on a graph with the population of the country as the horizontal axis and the “papers per 100K” on the vertical axis.

Those two outliers, India and China, with their large populations, mask any relationships hiding in the rest of the data. The US is a bit of an outlier, too.

To pick apart the data, I’ll delete China and India as special cases. This limits how far the analysis can be generalized. So noted. Here’s the new graph.

Now a relationship is beginning to emerge. Small countries do have an advantage. But the US, that point out to the right, is still anomalous, and the plot is not linear: it’s got an L-shape.

### Time for some statistical pyrotechnics

Skip over this section if you are uncomfortable with logarithms.

In a situation like this, statisticians usually reach for a transformation or two. After fiddling about with the data, I chose two.

I replaced the “papers per 100K” with the logarithm of that variable; and I did the same with the population variable. (“What kind of logarithm?” you may ask. I used natural logs (base e), but base 10 works just as well, so long as you remain consistent.)

I then fit a regression line (the best fitting straight line) to the data.
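The log-log fit can be sketched in a few lines of Python using only the standard library. This is not the procedure used for the numbers below (those came from the real data); the helper names `fit_line` and `fit_log_log` are hypothetical, and the sample points are synthetic, placed exactly on a known curve so that the right answer is obvious.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_log_log(populations, rates):
    """Fit ln(rate) = slope * ln(population) + intercept."""
    z = [math.log(p) for p in populations]
    w = [math.log(r) for r in rates]
    return fit_line(z, w)


# Synthetic points lying exactly on rate = 100 / sqrt(population),
# chosen only so the expected slope and intercept are known in advance.
populations = [1.0, 4.0, 16.0, 64.0]
rates = [100.0 / math.sqrt(p) for p in populations]
slope, intercept = fit_log_log(populations, rates)
```

Because the synthetic points follow a power law with exponent -0.5, the fit should recover a slope close to -0.5 and an intercept close to ln(100).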

The statistical sophisticate will note the heteroskedasticity here; I’m ignoring it. The plot looks nice and linear. The US is still an outlier.

The regression equation, with w = ln(papers per 100k) and z = ln(population in millions), is

$w = -0.5719z + 6.0707$

The F-value is 25.04 with (1, 36) degrees of freedom, which is significant at p < 0.0001.

(Log base 10 transformations would have yielded the same slope but a different intercept, namely 6.0707 / ln(10), or about 2.6365, with identical F and p statistics.)
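To see what a change of base does, divide the natural-log equation through by ln(10): the slope is untouched and the intercept is rescaled. The short Python check below uses only the two reported coefficients; the variable and function names are assumptions for illustration.

```python
import math

# Coefficients reported for the natural-log fit.
slope_ln = -0.5719
intercept_ln = 6.0707

# ln(y) = a*ln(x) + b, divided through by ln(10), becomes
# log10(y) = a*log10(x) + b/ln(10): same slope, rescaled intercept.
slope_log10 = slope_ln
intercept_log10 = intercept_ln / math.log(10)


def predict_ln(x):
    """Prediction from the base-e form of the equation."""
    return math.exp(slope_ln * math.log(x) + intercept_ln)


def predict_log10(x):
    """Prediction from the base-10 form of the equation."""
    return 10 ** (slope_log10 * math.log10(x) + intercept_log10)
```

Both functions return the same prediction for any positive x, which is why the choice of base does not affect the fit itself.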

It is informative to transform the equation back into terms of x and y where y = papers per 100k and x = population in millions. We have a = -0.5719 and b = 6.0707. So our equation is

$w = az + b$

$\log(y) = a\log(x) + b$

Taking exponents (base e) of both sides leads to

$y= \exp(a\log(x)+b) = x^a e^b$

Putting in values for a and b leads to

$y = \frac{e^{6.0707}}{x^{0.5719}}$

The numerator is $e^{6.0707} = 433.0$, rounded to 1 decimal place.
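The algebra above is easy to check numerically. The sketch below evaluates the equation both ways, once as exp(a log(x) + b) and once as the power law, and recomputes the numerator; the function names are hypothetical and the constants are the coefficients quoted above.

```python
import math

A = -0.5719  # slope from the log-log fit
B = 6.0707   # intercept from the log-log fit


def predict_from_logs(x_millions):
    """Evaluate exp(a*ln(x) + b) directly."""
    return math.exp(A * math.log(x_millions) + B)


def predict_power_law(x_millions):
    """Evaluate the equivalent form e^b / x^(-a)."""
    return math.exp(B) / x_millions ** (-A)


# The numerator e^b, rounded to one decimal place.
numerator = round(math.exp(B), 1)
```

At x = 1 (a population of one million) the denominator is 1, so the prediction is just the numerator.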

### Math phobes, you may return

The relationship between the “papers per 100k” (designated by y) and “population in millions” (x) is summarized in the equation

$y = \frac{433.0}{x^{0.57}}$

We can plot this as a line on the graph we looked at before.

### Now, which countries are producing above their weight?

The dots above the line represent the countries whose output exceeds what is explained by the size of their populations. They are: Switzerland, Denmark, Sweden, Norway, the Netherlands, Finland, Australia, Belgium, UK, Canada, Israel, Austria, Taiwan, Germany, US, Spain, France, South Korea, Italy and Japan.
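Deciding whether a dot sits above the line amounts to comparing a country's actual rate with the rate the equation predicts for its population. A minimal Python sketch, assuming the rounded equation y = 433.0 / x^0.57; the function names and the two sample inputs are invented for illustration and are not figures from the graphic.

```python
def expected_per_100k(population_millions: float) -> float:
    """Papers per 100k predicted by y = 433.0 / x**0.57."""
    if population_millions <= 0:
        raise ValueError("population must be positive")
    return 433.0 / population_millions ** 0.57


def punches_above_weight(per_100k: float, population_millions: float) -> bool:
    """True when a country sits above the fitted curve."""
    return per_100k > expected_per_100k(population_millions)


# Made-up inputs (NOT real data): a small, productive country
# and a large, less productive one.
small_high = punches_above_weight(200.0, 8.0)
large_low = punches_above_weight(10.0, 100.0)
```

The comparison is against the fitted curve, not against other countries, so a large country can qualify with a lower per-capita rate than a small one.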

Now all these are well-developed nations with relatively high education standards. If we wanted to delve into the data more deeply, we could include, along with population size, such factors as public education, economics, health and standard of living.

But I’ve said and done enough.

### Caveat

Unhappily, Nature News does not say how it assigned papers to countries, but presumably the assignment is based on the location of the lead author of each paper. Happily, the graphic includes real data. Unhappily, Nature News does not give a source for the data.