In the current issue of Science, Li and Agha present an analysis of the ability of the NIH peer review system to predict subsequent productivity (in terms of publications, citations, and patents linked to particular grants). These economists obtained access to the major NIH databases in a manner that allowed them to associate publications, citations, and patents with particular R01 grants and their priority scores. They analyzed R01 grants from 1980 to 2008, a total of 137,215 grants. This follows on studies (here and here) that I did while I was at NIH with a much smaller data set from a single year and a single institute as well as a publication from NHLBI staff.

The authors' major conclusions are that peer review scores (percentiles) do predict subsequent productivity metrics in a statistically significant manner at a population level. Because of the large data set, the authors are able to examine other potentially confounding factors (including grant history, institutional affiliation, degree type, and career stage), and they conclude that the statistically significant result persists even when correcting for these factors.

Taking a step back, how did they perform the analysis?

(1) They assembled lists of funded R01 grants (both new (Type 1) and competing renewal (Type 2) grants) from 1980 to 2006.

(2) They assembled publications (within 5 years of grant approval) and citations (through 2013) linked to each grant.

(3) They assembled patents linked either directly (cited in patent application) or indirectly (cited in publication listed in application) for each grant.

There are certainly challenges in assembling this data set and some of these are discussed in the supplementary material to the paper. For example, not all publications cite grant support and other methods must be used. Also, some publications are supported by more than one grant and, in this case, the publication was linked to both grants.

The assembled data set (for publications) is shown below:

By eye, this shows a drop in the number of linked publications with increasing percentile score. But this is due primarily to the fact that more grants were funded with lower (better) percentile scores over this period. What does this distribution look like?

I had assembled an NIH-wide funding curve for FY2007 as part of the Enhancing Peer Review study (shown below):

To estimate this curve for the full period, I used success rates and numbers of grants funded to produce the following:

Of course, after constructing this graph, I noticed that Figure 1 in the supplementary material for the paper includes the actual data on this distribution. While the agreement is satisfying, I was reminded of a favorite saying from graduate school: A week in the lab can save you at least an hour in the library. This curve accounts (at least partially) for the overall trend observed in the data. The ability of peer review scores to predict outcomes lies in more subtle aspects of the data.

To extract the information about the role of peer review, the authors used Poisson regression methods. These methods assume that the distribution of values (i.e. publications or citations) at each x-coordinate (i.e. percentile score) can be approximated as a Poisson distribution. The occurrence of such distributions in these data makes sense since they are based on counting numbers of outputs. The Poisson distribution has the characteristic that its expected value is the same as its variance, so that only a single parameter is necessary to fit the trends in an entire curve that follows such a distribution. The formula for a Poisson distribution at a point k (an integer) is f = (λ^k * e^-λ)/k!. Here, λ corresponds to the expected value and k corresponds to the observed count (the value on the x axis of the distribution plots below).
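As a minimal numerical illustration (my own sketch, not the paper's code), the Poisson probability mass function and its equal-mean-and-variance property can be checked directly:

```python
import math

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the expected count is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

# For lam = 4, the pmf sums to 1 and the mean and variance both equal lam.
lam = 4.0
ks = range(100)  # tail beyond k = 99 is negligible for lam = 4
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(round(total, 6), round(mean, 6), round(var, 6))
```

This is why a single fitted parameter (λ at each percentile score) pins down the whole distribution of expected outputs at that score.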

Table 1 in the paper presents "the coefficient of regression on scores for a single Poisson regression of grant outcomes on peer review scores." These coefficients have values from -0.0076 to -0.0215. These values are the β coefficients in a fit of the form ln(λ) = α + βk where k is the percentile score from 1 to 100 and λ is the expected value for the grant outcome (e.g. number of publications).

In the model from the paper that includes corrections for five additional factors (subject-year, PI publication history, PI career characteristics, PI grant history, and PI institution/demographics; see below and the supplementary material for how these corrections are included), the coefficient of regression for both publications and citations is β = -0.0158. A plot of the value of λ as a function of percentile score (k) for publications (with α estimated to be 3.7) is shown below:

The shape of this curve is determined primarily by the value of β.
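Using β = -0.0158 from the paper and my eyeball estimate of α = 3.7, this curve can be reproduced in a few lines:

```python
import math

# Model parameters: beta from the paper's regression; alpha is my eyeball
# estimate for publications (revised later in the post).
alpha, beta = 3.7, -0.0158

def expected_pubs(percentile):
    """Expected publication count: lambda = exp(alpha + beta * percentile)."""
    return math.exp(alpha + beta * percentile)

# Expected publications at a few percentile scores.
for k in (1, 10, 25, 50):
    print(k, round(expected_pubs(k), 2))
```

At the 1st percentile this gives λ of approximately 39.8, which is the value used in the next figure.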

The value of λ at each point determines the Poisson distribution at the point. For example, in this model at k=1, λ=39.81 and the expected Poisson distribution is shown below:

There will be a corresponding Poisson distribution at each percentile score (value of k). These distributions for k=1 and k=50 superimposed on the overall curve of λ as a function of k (from above) are shown below:

This represents the model of the distributions. However, this does not take into account the number of grants funded at each percentile score shown above. Including this distribution results in an overall distribution of the expected number of publications as a function of percentile score corresponding to this model shown as a contour plot below (where the contours represent 75%, 50%, 25%, 10%, and 1% of the maximum density of publications):

This figure can be compared with the first figure above with the data from the paper. The agreement appears reasonable although there appear to be more grants with a smaller number of publications than would be expected from this Poisson regression model. This may reflect differences in publication patterns between fields, the unequal value of different publications, and differences between the productivity of PIs.
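The construction of such a contour plot can be sketched by simulation: for each funded grant at each percentile, draw a publication count from the model's Poisson distribution. The funding curve below is a hypothetical exponential stand-in, not the actual NIH distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 3.7, -0.0158

# Hypothetical funding curve: many more awards at better (lower) percentiles.
# This stands in for the actual NIH distribution, which is not reproduced here.
percentiles = np.arange(1, 101)
n_funded = (1500 * np.exp(-percentiles / 15)).astype(int)

# One simulated grant per award; publication counts drawn from the Poisson model.
scores = np.repeat(percentiles, n_funded)
pubs = rng.poisson(np.exp(alpha + beta * scores))

# The simulated cloud: expected count falls with percentile, with wide scatter.
print(len(scores), pubs[scores == 1].mean().round(1))
```

Binning this simulated cloud and drawing density contours yields a plot of the same general shape as the one described above.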

With this (longwinded) description of the analysis methods, what conclusions can be drawn from the paper?

First, there does appear to be a statistically significant relationship between peer review percentile scores and subsequent productivity metrics for this population. This relationship was stronger for citations than it was for publication numbers.

Second, the authors studied the effects of correcting for various potential confounding factors. These included:

(i) "Subject-year" determined by correcting for differences in metrics by study section and by year as well as by funding institute. This should at least partially account for differences in fields although some study sections review grants from fields with quite different publication patterns (e.g. chemistry versus biochemistry or mouse models versus human studies).

(ii) "PI publication history" determined by the PI's publication history for the five years prior to the grant application, including the number of publications, the number of citations up to the time of application, the number of publications in the top 0.1%, 1%, and 5% in terms of citations in the year of application, and these same factors limited to first-author or last-author publications.

(iii) "PI career characteristics" determined by degree type (Ph.D., M.D., or both) and the number of years since the completion of her/his terminal degree.

(iv) "PI grant history" categorized as one previous R01 grant, more than one previous R01 grant, 1 other type of NIH grant, or 2 or more other NIH grants.

(v) "PI institution/demographics" determined by whether the PI's institution falls within the top 5, top 10, top 20, or top 100 institutions within this data set in terms of the number of awards, along with demographic parameters (gender, ethnicity (Asian, Hispanic)) estimated from PI names.

Including each of the factors sequentially in the regression analysis did not affect the value of β substantially, particularly for citations as an output. This was interpreted to mean that the statistically significant relationship between percentile score and subsequent productivity metrics persists even when correcting for these factors. In addition, examining results related to these factors revealed the following (from the supplementary material):

"In particular, we see that competing renewals receive 49% more citations, which may be reflective of more citations accruing to more mature research agendas (P<0.001). Applicants with M.D. degrees amass more citations to their resulting publications (P<0.001), which may be a function of the types of journals they publish in, citation norms, and number of papers published in those fields. Applicants from research institutions with the most awarded NIH grants garner more citations (P<0.001), as do applicants who have previously received R01 grants (P<0.001). Lastly, researchers early in their career tend to produce more highly cited work than more mature researchers (P<0.001)."

So what is the bottom line? This paper does appear to demonstrate that NIH peer review does predict subsequent productivity metrics (numbers of publications and citations) at a population level even correcting for many potential confounding factors in reasonable ways. In my opinion, this is an important finding given the dependence of the biomedical enterprise on the NIH peer review system. At the same time, one must keep in mind the relatively shallow slope for the overall trend and the large amount of variation at each percentile score. A 1 percentile point change in peer review score resulted in, on average, a 1.8% decrease in the number of citations attributed to the grant. By my estimate (based on the model in this paper), the odds that funding a grant with a 1 percentile point better peer review score over an alternative will result in more citations are 1.07 to 1. The slight slope and the large amount of "scatter" are not at all surprising given that grant peer review is largely about predicting the future, is a challenging process, and the NIH portfolio includes many quite different areas of science.
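To see where an odds estimate of this kind comes from, here is a rough Monte Carlo sketch (my own, not the paper's calculation). The expected citation count of about 8 for the better-scored grant is a hypothetical value; the exact odds depend on it:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = -0.0158  # regression coefficient from the paper

# Hypothetical expected citation counts for two otherwise-identical grants
# whose percentile scores differ by 1 point.
lam_better = 8.0
lam_worse = lam_better * np.exp(beta)

# Monte Carlo: how often does the better-scored grant actually come out ahead?
n = 1_000_000
x = rng.poisson(lam_better, n)
y = rng.poisson(lam_worse, n)
odds = np.sum(x > y) / np.sum(y > x)
print(round(odds, 2))
```

The result is in the neighborhood of the 1.07 to 1 quoted above; larger assumed λ values push the odds slightly higher, smaller ones pull them toward 1.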

One disappointing aspect of this paper is the title: "Big names or big ideas: Do peer-review panels select the best science proposals?" This is an interesting and important question, but the analysis is not suited to address it except peripherally. The analysis does demonstrate that PI factors (e.g. publication history, institutional affiliation) do not dominate the effects seen with peer review, but this does not really speak to "big names" versus "big ideas" in a more general way. Furthermore, while the authors admit that they cannot study unfunded proposals, it is likely that some of the "best science proposals" fall into this category. The authors do note that some of the proposals funded with poor percentile scores (presumably picked up by NIH program staff) were quite productive.

There is a lot more to digest in this paper. I welcome reactions and questions.

UPDATE

Aaron and Drugmonkey commented on the fact that the figure showed an expected value of 40 publications for grants at the 1st percentile. As I noted in the post and in the comments, the analysis depends on a parameter α which does not affect the conclusion about the predictive power of percentile scores but does affect the appearance of the curves. When I first started analyzing this paper, I estimated α to be 3.7 by eye and did not go back and do a reality check on this.

The supplementary material for the paper includes a histogram of the number of publications per grant shown below:

This shows the actual distribution of publications from this data set.

From this distribution, the value of α can be estimated to be 2.2. This leads to revised plots for the expected number of publications at the 1st percentile and an overall expected number of publications per grant shown below:

These data are consistent with results that I obtained in my earlier analysis of one set of NIGMS grants.
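The effect of the revised α on the model's expected counts can be checked in one line per estimate (β = -0.0158 as before):

```python
import math

beta = -0.0158
for alpha in (3.7, 2.2):  # original eyeball estimate vs. revised estimate
    # Expected publications at the 1st percentile under each alpha.
    lam_first = math.exp(alpha + beta * 1)
    print(alpha, round(lam_first, 1))
```

The revision brings the expected count at the 1st percentile down from roughly 40 to roughly 9, which is much more consistent with the histogram.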

I am sorry for the confusion caused by my rushing the analysis.

What do you make of the decision to analyze Type 1 and 2 grants together in the same dataset? On the surface, it seems like it could be a source of noise in the data.

I agree. There is some discussion of the differences in the supplementary material and they claim that the results still hold even correcting for these differences, but I need to dig through this more carefully.

While the study examines correlation to previous grant success in terms of being awarded other NIH grants, I would be very interested to see the extent to which the percentile scores on individual grants correlate with total concurrent NIH dollars. This is of particular interest since you demonstrated at NIGMS increased productivity with increased funding, up to a certain threshold.

I just can't understand what is valuable about showing that 1%ile difference in voted score leads to 2% difference in total citations of papers attributed to that grant award. All discussions of whether NIH peer review is working or broken center on the supposed failure to fund meritorious grants and the alleged funding of non-meritorious grants.

Please show me one PI that is upset that her 4%ile funded grant really deserved a 2%ile and that shows that peer review is horribly broken.

I think it is a basic question of characterizing the assay. Suppose that the result had been that a 1%ile difference had led to a 50% difference in total citations (I agree this would have been a surprising finding). This could have policy implications. Similarly, if the difference had been in the opposite direction (more citations for grants with poorer scores). I think the results obtained are consistent with at least my expectations based on my earlier analyses, but it is important to have them documented more thoroughly.

With regard to your last point, I got into a public discussion with a Nobel Prize winner who was claiming that the 139 he received for his funded grant that led to the Nobel Prize was a poor score (compared to his previous 100). See http://www.sciencemag.org/content/319/5865/900.4.full.

Ok. You have a point that it *could* have shown something amazingly different and confirming ~the null therefore has value.

Maybe I'm missing something, but a highly scored R01 results in ~40 expected publications? As in 8 per year? More than one every other month?

If you look at the first figure with the actual data, most grants even at the first percentile had substantially fewer than 40 publications. The number 40 may be a result of my mis-estimating the parameter alpha (which I estimated by eye) or of the deviations of the distributions from Poisson distributions. I would estimate that the true median is likely 25-30 for grants at the 1st percentile (almost all of which are likely competing renewals).

So total papers including a prior funded interval?

Paper says pretty clearly that they are counting pubs within 5 years of grant approval. Which makes it really hard to believe this density all the way to 40-50 pubs.

https://loop.nigms.nih.gov/2010/09/measuring-the-scientific-output-and-impact-of-nigms-grants/

This looks more like reality. How are 40 papers so common?

Again, the Poisson distribution is my eyeball estimate and it may be high. Given the number of publications in the study, it is hard to judge what the curve at a given percentile score looks like.

And doesn't this seem... concerning? That's ~$31k per paper, less than 8 postdoc-months (definitely a unit) even assuming absolutely no M&S expenses! If there are so many grants that somehow completely defy reasonable explanation, it makes me wonder about drawing conclusions from the data.

Of course, proposals with lower, unfundable scores will have very little productivity since they have no funds to do the study.

Any way to determine if the number of pubs and grant %tile scores are conflated from double-dipping (citing 40 papers each on 2-3 different R01's, back when it was in fashion for a single lab to easily have 2-3 grants)?

datahound, what I take from all of this is that while lower scored applications have a lower "productivity", the difference isn't that steep.

However, the "relationship between NIH funding rates and percentile scores" is rather steep.

This says to me that while the study sections are doing the best they can, the program staff/council, who make the funding decisions, should be doing much more picking and choosing. Why are all the applications with a less than 10% score getting funded? I know I'd be screaming crazy if I had an application scored at 6% and it wasn't funded, but what if it overlaps with something already funded? Isn't it better for NIH to invest in other areas of research?

Just as an aside, I have a friend who is a patient-advocate. This drives him crazy, because it implies that the goal of NIH is to maximize the number of papers published (or citations) with their funds. He wants to see death rates drop, diseases cured, etc. He doesn't care about scientists' careers.

I agree with your analysis and with the fact that this supports the potential role for advisory councils and NIH program staff selecting applications within ranges and looking carefully at overlap and productivity for well-funded investigators. With that said, such funding recommendations should be made by processes that are as transparent as possible, with relatively well-defined criteria, so that decisions are made based on judgements of scientific and medical potential rather than other factors.

I have considerable sympathy for the views of patient-advocates (and patients). However, it is quite challenging to link changes in parameters such as mortality with particular grants (or even fields). Moreover, while it is tempting to focus on grants that have a shorter path to the clinic, time has shown that basic knowledge is just as important, since attempts to translate results that are not understood fairly deeply usually fail. It takes both basic research (and associated publications and patents) and translational research (to move the knowledge forward) to improve human health.

Has there been any change in policy or procedure over the years for program staff/councils screening applications, as opposed to strictly following percentile scores?

You posted two graphs of percent funded vs percentile score above. One for 2007 and the other for 1980-2006. It appears that the 1980-2006 graph has a gentler slope, as in staff/council reviewed more applications. While the 2007 staff/council followed study sections more closely.

The gentler slopes in these curves are largely due to differences in paylines or effective paylines between institutes (since this is for all of NIH) rather than due to program behavior. The curves for individual institutes and centers for a given year would almost certainly be steeper. I do not know if there has been any substantial change in program behavior over this period of time.

Regarding the update... The most frequent value is zero? Do I have that right?

Yes, it appears that about 16-17% of the grants had zero publications linked to them. Note that the methods used may miss publications (for example, many journals did not allow the inclusion of grant numbers in acknowledgements and I don't know how well their methods captured these).

fair point. OldeSkoolers are/were used to citing "NIH support" without a specific grant number specifically so as to facilitate double-dipping on their competing continuation descriptions of progress. May have taken them some time to gradually start referencing the specific grant. As this process gets tighter and tighter from the NIH end- and publishers facilitate easy linkage during manuscript submission- one would expect this to improve.