The Demise of 38,000 NIH-funded Investigators

(by datahound) Aug 19 2015

In my first post using R to analyze NIH data, I examined the number of unique investigators funded by NIH per year as a function of time. The definition of "unique PIs" was based on the number of unique "Contact PI Person ID" numbers in the NIH RePORT database from 1985 to 2014. Overall, this number was 216,521.

As I prepared my data set for more analysis, I discovered that some investigators had more than one Contact PI Person ID number. I have spent the past 2 months trying to sort this out and I am still not done. One investigator in the intramural program has well over 100 ID numbers over time! Getting this sorted out is crucial for future analyses, particularly the longitudinal ones that are so important. Otherwise, an investigator might appear to have a gap or termination in funding just because their ID number changed.

In addition, there are problems in the other direction, with multiple names associated with one ID number. A very small number of these appear to be cases where different people have been assigned the same ID number. Most are related to non-uniformity in how names are entered (e.g. with or without a middle initial, with or without a period on the middle initial). Some are useful to have captured, such as PI name changes associated with changes in marital status.
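To make the name-entry problem concrete, here is a minimal sketch of how such variants could be reconciled. It is written in Python rather than the R used for the actual analysis, and the "LAST, FIRST MIDDLE" layout, the function names, and the rule for comparing names are all assumptions for illustration, not a description of the scripts behind this post.

```python
# Placeholder tokens treated as "no middle name". "NMN" is the one mentioned
# in the post; any others would have to be found by inspecting the data.
NO_MIDDLE_NAME_TOKENS = {"NMN"}


def normalize_pi_name(raw_name: str) -> str:
    """Reduce an assumed 'LAST, FIRST MIDDLE' string to 'LAST, FIRST' plus a middle initial."""
    # Upper-case and turn periods into spaces so 'A.' and 'A' compare equal.
    cleaned = raw_name.upper().replace(".", " ")
    last, _, rest = cleaned.partition(",")
    last = " ".join(last.split())
    given = [tok for tok in rest.split() if tok not in NO_MIDDLE_NAME_TOKENS]
    if not given:
        return last
    first, middles = given[0], given[1:]
    # Keep only the first letter of the first middle name, if there is one.
    initial = f" {middles[0][0]}" if middles else ""
    return f"{last}, {first}{initial}"


def names_compatible(name_a: str, name_b: str) -> bool:
    """True if two entries could be the same person under the rules above."""
    norm_a, norm_b = normalize_pi_name(name_a), normalize_pi_name(name_b)
    if norm_a == norm_b:
        return True
    # A name without a middle initial is compatible with the same name plus one.
    shorter, longer = sorted((norm_a, norm_b), key=len)
    return longer.startswith(shorter + " ")
```

For example, `names_compatible("SMITH, JOHN", "Smith, John A.")` is true, while two entries with different middle initials are kept apart. A rule like this only flags candidates: name changes and cases where different people share an ID number still need to be checked by hand.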

At this point, I am down to 178,122 unique ID numbers and I expect this number to fall further. While this has been a great exercise in learning R as well as examining creative practices in data entry (I did not previously know that NMN would be entered in some cases where an individual gives No Middle Name), I am ready to finish up this stage and get on with more interesting analyses. But, with "data science" as with other types of science, time spent checking the validity of raw data before other analyses are done is time well spent.


First Outstanding Investigator (R35) Awards from NCI

(by datahound) Aug 12 2015

The R35 mechanism is emerging at NIH as a means of providing more stable (i.e. longer-term and for research programs rather than projects) support for selected investigators. The first R35 program out of the box was the NCI Outstanding Investigator Award, followed by the NIGMS MIRA Award. NINDS has also recently announced an R35 program.

The first 17 R35 awards from NCI appeared in NIH RePORTER recently. These investigators cover the NCI mission fairly well (biology, genomics, surveillance, prevention including behavior, treatment). These investigators also have a wide range of funding with core support for FY2014 ranging from $230 K annual total costs to $5.8 M with a median of approximately $700 K total costs (although these values are somewhat subject to judgment since considerable support comes from P30 Cancer Center grants and program project grants (P01s)). I tried to provide lower estimates. The investigators are relatively diverse with regard to age with estimated ages ranging from 41 to 74 with an estimated median age of 56. The initial group includes 13 men and 4 women.

More awards are appearing in RePORTER; 4 additional awards have appeared since I did this initial analysis so expect updates.


Percentages of Faculty Salary Support at Academic Medical Centers

(by datahound) Jun 17 2015

There has been much discussion of the percentages of faculty salaries coming from internal versus external sources. In the context of helping prepare a recent paper from leaders of academic medical centers, I was able to obtain some data from the AAMC (Association of American Medical Colleges) regarding the distribution of levels of extramural support across 72 academic medical centers for 2013. These data are shown below:

STL Figure


These data were collected under terms of strict anonymity for institutions. Furthermore, as noted in the caption, they were collected by obtaining the total amount of extramural support going to faculty salaries and dividing by the total amount going to salaries for individuals with at least some extramural support. Thus, distributions of levels of support across a given institution are not available. Nonetheless, these distributions provide some sense of the range of individual institutional behavior that is more informative than an overall median with no other information.


IC Distributions for R01s from PIs with Multiple R01s

(by datahound) Jun 08 2015

In my previous post, I examined the fraction of NIH PIs who had either a single R01 (or R37 Merit Award) or multiple R01s for fiscal year 2014. Overall, about 30% of R01 PIs had more than 1 R01. In the comments and on Twitter, the issue came up about whether those with multiple R01s had them from the same IC or from multiple institutes.

To address this question, I asked: If a PI had an R01 from one institute, what is the distribution of ICs for the additional R01s going to the same PI? The results are tabulated below:



Overall, the percentage of those additional R01s coming from the same IC ranges from 47 to 75%. For those that do not come from the same IC, the number of ICs contributing substantially ranges from a few to many, as illustrated below (which depicts the data above displayed as the fraction of the R01s from the different ICs given an R01 from a particular IC).

Mult PI IC Graph

For example, if a PI has one R01 from AA (NIAAA), 61% of additional R01s come from AA and 18% come from DA (NIDA), leaving 21% for the remaining ICs. In contrast, if a PI has a grant from GM (NIGMS) or CA (NCI), it takes 4 additional ICs to reach 18% of additional R01s.
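For readers who want to try this kind of tabulation themselves, here is a minimal sketch in Python (the actual analysis was done in R). It assumes the input is a list of hypothetical `(pi_id, ic)` pairs, one per R01, and treats each R01 held by a PI in turn as the "given" grant, with that PI's other R01s counted as "additional". The counting convention is an assumption and may differ in detail from the one behind the table above.

```python
from collections import Counter, defaultdict
from itertools import permutations


def additional_r01_distribution(grants):
    """Return {given_ic: {other_ic: fraction}} for PIs holding 2+ R01s.

    `grants` is an iterable of (pi_id, ic) pairs, one pair per R01.
    """
    ics_by_pi = defaultdict(list)
    for pi_id, ic in grants:
        ics_by_pi[pi_id].append(ic)

    counts = defaultdict(Counter)
    for ics in ics_by_pi.values():
        # Ordered pairs of distinct grants held by the same PI; a PI with a
        # single R01 contributes no pairs.
        for given_ic, other_ic in permutations(ics, 2):
            counts[given_ic][other_ic] += 1

    return {
        given_ic: {ic: n / sum(c.values()) for ic, n in c.items()}
        for given_ic, c in counts.items()
    }


# Toy data only, not real NIH records.
toy = [(1, "AA"), (1, "AA"), (2, "AA"), (2, "DA"), (3, "GM")]
dist = additional_r01_distribution(toy)
```

With the toy data, two thirds of the additional R01s for an AA grant come from AA and one third from DA, which is the same kind of statement as the 61% and 18% figures quoted above.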

Which ICs are linked by having PIs with multiple R01s? I examined the top two contributions of additional R01s for each IC (in addition to the IC itself). In these "top two lists", I joined the pairs of ICs. I used a bold line if the link was bidirectional, that is, if each IC appeared on the top two list of the other. The results are depicted below:

IC-IC graph-2-rev


Overall, the patterns that emerge are as might be anticipated. The bidirectional links are between AA-DA, MH-NS, DK-HL, CA-GM, and CA-AI. Some of the larger ICs are linked to many other ICs, reflecting both their size and their relatively broad missions.


As noted in the comments, some of these connections could be attributed to the size of the ICs. Thus, NCI appeared to be linked to many other ICs, but this could be due to the large number of R01s awarded by NCI rather than by actual content overlap.

To address this, I simulated results assuming that the probability of an additional grant coming from a particular IC was proportional to the number of grants that this IC awarded in this data set. I then compared the simulated results with the actual results. Of course, the number of grants going to the same IC was much higher than would be expected. Since this distorted the other statistics, I set all of these values equal to 0 and re-simulated the data. I (or, more correctly, R) performed 1000 simulations and then calculated the mean, standard deviation, and other statistics for these distributions of grant numbers. I then compared these with the actual values observed in the data. The results (presented as the log (base 10) of the probability of occurring by chance) are shown below:



These results allow assessment of the strength of the interactions corrected for IC size.
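The size-corrected simulation described above can be sketched roughly as follows. This is a Python illustration, not the R code I actually ran: the function name, the `ic_sizes` dictionary, and the toy numbers are assumptions, and excluding the given IC from the draw stands in for setting the same-IC values to 0.

```python
import random
import statistics


def simulate_additional_grants(ic_sizes, given_ic, n_additional, n_sims=1000, seed=0):
    """Simulate where additional R01s land if only IC size matters.

    ic_sizes: {ic: number of R01s awarded by that IC} (hypothetical input).
    given_ic: the IC of the "given" R01; it is left out of the draw.
    n_additional: number of additional R01s to allocate in each simulation.
    Returns {ic: (mean count, standard deviation)} across simulations.
    """
    rng = random.Random(seed)
    others = [ic for ic in ic_sizes if ic != given_ic]
    weights = [ic_sizes[ic] for ic in others]
    draws = {ic: [] for ic in others}
    for _ in range(n_sims):
        # Each additional grant is assigned with probability proportional to IC size.
        sample = rng.choices(others, weights=weights, k=n_additional)
        for ic in others:
            draws[ic].append(sample.count(ic))
    return {ic: (statistics.mean(v), statistics.stdev(v)) for ic, v in draws.items()}


# Toy sizes only, not real award counts.
sim = simulate_additional_grants({"CA": 300, "GM": 200, "AA": 100}, "AA", n_additional=50)
```

An observed count can then be compared with the simulated mean and standard deviation for the same pair of ICs. Turning that comparison into the log probabilities reported here takes a further step that this sketch does not attempt.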

The strongest interactions are between NIDA and NIAAA, with probabilities of occurring by chance of < 10^-88.

The other strong interactions are:


NIAMS and NIDCR (which was still detected previously even though these are both relatively small ICs)




NIDCD and NEI (which was not detected previously)

The link between NCI and NIGMS is still the strongest link between NCI and another IC, but it is substantially less pronounced than the other links above.

Thanks for the comments. I think this is a much improved analysis and I had an excuse to explore additional R tools.

I am now working on generating a 2-dimensional figure that is more consistent with these connectivities in a more formal way.


Single vs Multiple R01 Holders by IC

(by datahound) Jun 05 2015

On a recent Drugmonkey post on the new NIGMS MIRA Award announcement, a commenter suggested that NIGMS might have more R01 PIs with more than 1 R01. With my new R tools, it was relatively straightforward to check this.

Below is a table with the number of PIs (not counting multiple PIs in this analysis) from each IC who have 1 R01 or more than 1 R01 for fiscal year 2014 (R37s are also included). The abbreviations for the ICs are shown with the IC number. Note that the additional R01s can be from the same or a different IC.



As can be seen, NIGMS (GM) is actually slightly below the median (not weighted by the number of PIs) of 0.304 and below all of the other large ICs (CA, AI, HL).
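For anyone who wants to run a similar check, the counting logic is simple enough to sketch. The Python below is illustrative only (the real analysis used R), and it assumes a list of hypothetical `(pi_id, ic)` pairs with one pair per R01 or R37. A PI is counted under every IC from which they hold a grant, and the additional R01s may come from any IC.

```python
from collections import Counter, defaultdict


def multi_r01_fraction_by_ic(grants):
    """Return {ic: (PIs with 1 R01, PIs with >1 R01, fraction with >1)}.

    `grants` is a list of (pi_id, ic) pairs, one pair per R01 or R37.
    """
    grants = list(grants)
    # Total number of R01s per PI across all ICs.
    n_by_pi = Counter(pi_id for pi_id, _ in grants)
    pis_by_ic = defaultdict(set)
    for pi_id, ic in grants:
        pis_by_ic[ic].add(pi_id)

    table = {}
    for ic, pis in pis_by_ic.items():
        multi = sum(1 for pi_id in pis if n_by_pi[pi_id] > 1)
        table[ic] = (len(pis) - multi, multi, multi / len(pis))
    return table


# Toy data only, not real NIH records.
toy = [(1, "GM"), (2, "GM"), (2, "CA"), (3, "CA"), (3, "CA"), (4, "GM")]
table = multi_r01_fraction_by_ic(toy)
```

In the toy data, one of the three GM PIs holds more than one R01, so the GM fraction is 1/3. That fraction is the quantity compared with the median of 0.304 above.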

Other queries welcome!

Updated:  I discovered an error in the table that I originally posted. A revised table is included. None of the conclusions were affected.


R01-equivalent PIs: 1985-2014

(by datahound) May 28 2015

I recently posted data on the number of unique NIH PIs for all mechanisms listed in the NIH RePORT database.

I have now analyzed data for R01-equivalent grants (primarily R01s but also R23, R29, and R37 (MERIT) awards) as shown below:

R01 PI plot

This shows curves for all PIs (including multiple PIs) and for Contact PIs only. These curves clearly reveal the impact of the NIH budget "doubling" (from FY1998 to 2003) and the subsequent decline due to the worse-than-flat NIH budget over the past 12 years (with the exception of the ARRA funding).

The correction for multiple PIs is significant (although, of course, being PI on a multiple PI grant likely provides fewer resources than being the sole PI on an award of the same size). Of the 3,564 New (Type 1) R01 grants in FY2014, 771 had multiple PIs.


The Number of NIH PIs 1985-2014: The Effect of Multiple PIs

(by datahound) May 28 2015

I recently posted a somewhat startling curve showing the total number of NIH contact PIs for all mechanisms in the NIH RePORT database. This showed a drop in the total number of PIs from FY2010 to the present.

As I lay awake thinking about this curve and what it might mean, I thought it might change somewhat if I included all PIs instead of just Contact PIs. Recall that the NIH multiple PI policy only went into effect around 2005.

I was able to examine this point relatively quickly. The results are shown below:

NIH PI Plot wNonContact


This shows that the inclusion of all PIs decreases the magnitude of the drop since FY2010.

Some other interesting statistics about non-Contact PIs are:

Total Contact PIs:  216,521

Total PIs listed as other than Contact PI:  11,504

PIs who have never been Contact PI:  2,873
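The three totals above come down to simple set operations. Here is a minimal Python sketch (not the R code used for the post), assuming each grant record is reduced to a hypothetical `(contact_pi_id, other_pi_ids)` pair:

```python
def pi_role_summary(records):
    """Count contact PIs, PIs listed in other roles, and PIs never listed as contact.

    `records` is an iterable of (contact_pi_id, other_pi_ids) pairs, one per grant record.
    """
    contact, other = set(), set()
    for contact_pi_id, other_pi_ids in records:
        contact.add(contact_pi_id)
        other.update(other_pi_ids)
    return {
        "contact": len(contact),
        "other": len(other),
        # PIs who appear only in a non-contact role.
        "never_contact": len(other - contact),
    }


# Toy data only, not real NIH records.
summary = pi_role_summary([("A", []), ("B", ["C"]), ("C", ["D"]), ("A", ["D", "B"])])
```

In the toy data, B and C have each served as a contact PI on another record, so only D counts as never having been a Contact PI.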



Analysis of Subsequent Years of K99-R00 Program

(by datahound) May 28 2015

I had previously done some analysis of the NIH K99-R00 program for the first two cohorts. I wrote R scripts to assemble information about the R00 and R01 (as well as DP1 and DP2) awards subsequently obtained by K99 recipients and to analyze these results. I included precise grant start and end times rather than simply fiscal years as I had done in my initial analysis.

The results for the first K99 cohort (from fiscal year 2007) are shown below. This shows the number of investigators (out of 182 initial K99 awardees) who had K99 awards, R00 awards, or R01 (or DP1, or DP2) awards aligned with the start dates for the initial K99 award at time 0.

2007 K99 Cohort Plot-3

This shows that more than 90% of these K99 awardees transitioned to the R00 phase and that more than 100 of these PIs had obtained at least one R01 (or equivalent) award as shown previously but now with more precision about the timing of these awards.
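The alignment to time 0 can be sketched as follows. This is an illustrative Python version, not the R scripts themselves. The record layout `(pi_id, mechanism, start, end)`, the whole-month resolution, and the function names are assumptions. Sets are used so that a PI holding two awards of the same type in a given month is counted once.

```python
from datetime import date


def months_between(start, when):
    """Whole calendar months from `start` to `when` (negative if `when` is earlier)."""
    return (when.year - start.year) * 12 + (when.month - start.month)


def active_counts(awards, k99_start, max_months):
    """Count PIs with an active award of each type, by months since their K99 start.

    awards: list of (pi_id, mechanism, start_date, end_date).
    k99_start: {pi_id: start date of that PI's initial K99 award}.
    Returns {mechanism: [count at month 0, ..., count at month max_months]}.
    """
    active = {}
    for pi_id, mechanism, start, end in awards:
        t0 = k99_start[pi_id]
        first = max(0, months_between(t0, start))
        last = min(max_months, months_between(t0, end))
        for month in range(first, last + 1):
            slots = active.setdefault(mechanism, [set() for _ in range(max_months + 1)])
            slots[month].add(pi_id)
    return {mech: [len(s) for s in slots] for mech, slots in active.items()}


# Toy data only, not real awards.
k99_start = {1: date(2007, 1, 1), 2: date(2007, 7, 1)}
awards = [
    (1, "K99", date(2007, 1, 1), date(2008, 6, 30)),
    (1, "R00", date(2008, 7, 1), date(2011, 6, 30)),
    (2, "K99", date(2007, 7, 1), date(2009, 6, 30)),
]
curves = active_counts(awards, k99_start, max_months=36)
```

Plotting each list in `curves` against the month index gives curves of the kind shown in these figures, with every investigator's clock starting at their own K99 start date.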

With these scripts in hand, it was straightforward to analyze subsequent K99 cohorts. The results are shown below:


K99 Awards Plot


This graph reveals that the overall pattern for the K99 phase is remarkably consistent from year to year, with substantial transitions at the end of year 1, a steady decline and then a sharp drop at the end of year 2, and the remaining ~20% of PIs transitioning off the K99 by the end of year 3.

The results for the R00 phase are shown below:

R00 Award Plot


Again, the pattern is quite consistent. The fraction of K99 awardees who have transitioned to the R00 phase is approximately 50% at the end of year 2 (since the start of the K99 award) and peaks at between 80 and 90% in the middle of year 3. The curves are different for the FY2010, FY2011, and FY2012 K99 cohorts since they have not yet had time to fully transition, but the curves look quite similar for the regions that overlap the other curves.

The final curve shows the transition to R01 awards (I also included DP1 (Pioneer) and DP2 (New Innovator) awards).

R01 Award Plot-2


Here, the curves differ more. For the first (FY2007) cohort, more than 50% of the K99 awardees have transitioned to R01 funding. More than 40% of the FY2008 cohort have transitioned, but comparison of the FY2007 and FY2008 curves suggests that this cohort is transitioning more slowly or will not achieve the same level as the FY2007 cohort. This trend continues with the FY2009 cohort. Of course, these attempted transitions to R01 funding are occurring over the period where the overall number of NIH supported PIs dropped (as revealed in my previous post). The FY2010 cohort showed an initial burst above the FY2008 and FY2009 curves but has slowed since then. It is too early to say much about the FY2011 and FY2012 cohorts.

The ability to analyze these data in kinetic detail with relative ease allowed some comparisons that were much harder to make in my previous analysis. I am impressed with the continuing development of R by a large open community (especially Hadley Wickham) that is making R an ever-more-powerful tool.


Analyzing NIH Data with R

(by datahound) May 28 2015

Most of the analysis of NIH data that I have done has been done using Excel. While Excel does have some useful features, it has many limitations. My son, who as an actuary does considerable data analysis for a living, urged me to migrate to a more powerful platform, R, for my analyses. He can be quite convincing and I have spent time over the past month developing some rudimentary R skills (in part through an on-line course). I am now fully convinced that he was right.

I downloaded all of the data used by NIH RePORTER (from NIH ExPORTER) and wrote R scripts to parse the data into forms that could be easily analyzed by R. The full file has 1,907,841 grant records with readable contact PI numbers for fiscal years 1985 to 2014. These correspond to 216,521 unique contact PIs.

As an initial exercise with these data, I decided to plot the number of unique contact PIs as a function of fiscal years. The result is shown below:

Unique PI Plot-2


What I attempted as a test of my data analysis skills revealed a striking result. The number of unique contact PIs had grown almost linearly from 1985 to about 2009-2010 (the ARRA years) but subsequently dropped quite sharply from 2010 to 2014. This graph provides much clearer evidence for "the cull" than I anticipated.
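The counting behind this plot is straightforward. Here is a minimal sketch in Python (my own scripts are in R), assuming the records have already been reduced to hypothetical `(fiscal_year, contact_pi_id)` pairs, with an empty ID wherever the contact PI number was not readable:

```python
from collections import defaultdict


def unique_contact_pis_by_year(records):
    """Return {fiscal_year: number of distinct contact PI IDs}, sorted by year."""
    pis_by_year = defaultdict(set)
    for fiscal_year, contact_pi_id in records:
        if contact_pi_id:  # skip records without a readable contact PI ID
            pis_by_year[fiscal_year].add(contact_pi_id)
    return {fy: len(pis) for fy, pis in sorted(pis_by_year.items())}


# Toy data only, not real NIH records.
toy = [(1985, "1"), (1985, "1"), (1985, "2"), (1986, "2"), (1986, "")]
per_year = unique_contact_pis_by_year(toy)
```

A PI with several grants in one year is counted once for that year, which is why the sets are keyed by fiscal year. The overall total is a separate count of distinct IDs across all years, not the sum of the yearly values.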

Despite this bottom line, considerable work remains to be done to probe this further since this includes a wide variety of mechanisms. With the powerful file manipulation and analysis tools in R, this should be relatively straightforward.

Let the analysis begin!


Please comment: NIH RFI on "Optimizing Funding Policies..."

(by datahound) May 07 2015

NIH released an RFI on April 2 on Optimizing Funding Policies and Other Strategies to Improve the Impact and Sustainability of Biomedical Research. Responses are due by May 17th (10 more days).

Please take the time to go and provide input. My recent post on the potential emeritus award RFI should make it very clear that your input is necessary if you don't want the response to be dominated by those with quite different perspectives from yours.

Here is the link for the RFI and the comment areas are listed below to get your thinking started.

Please limit comment to a maximum of 500 words.

Now's your chance. There is really no excuse for not contributing your thoughts.

