The Demise of 38,000 NIH-funded Investigators

Aug 19 2015. Published under Uncategorized

In my first post using R to analyze NIH data, I examined the number of unique investigators funded by NIH per year as a function of time. The definition of "unique PIs" was based on the number of unique "Contact PI Person ID" numbers in the NIH RePORT database from 1985 to 2014. Overall, this number was 216,521.

As I prepared my data set for further analysis, I discovered that some investigators have more than one Contact PI Person ID number. I have spent the past 2 months trying to sort this out and I am still not done. One investigator in the intramural program has well over 100 ID numbers over time! Getting this sorted out is crucial for future analyses, particularly longitudinal ones. Otherwise, an investigator might appear to have a gap or termination in funding just because their ID number changed.
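
To make the longitudinal problem concrete, here is a minimal sketch in Python (not R, and not the code used for this analysis). It assumes a hypothetical, hand-curated `alias_of` table mapping each duplicate ID to the ID it should be merged into. The award rows and ID values are invented for illustration.

```python
def canonical_id(person_id, alias_of):
    """Follow alias links until reaching an ID that is not an alias of another."""
    seen = set()
    while person_id in alias_of and person_id not in seen:
        seen.add(person_id)  # guard against accidental cycles in the table
        person_id = alias_of[person_id]
    return person_id


def funded_years(awards, alias_of=None):
    """Map each (merged) investigator ID to the set of years with funding."""
    alias_of = alias_of or {}
    years = {}
    for person_id, year in awards:
        key = canonical_id(person_id, alias_of)
        years.setdefault(key, set()).add(year)
    return years


# Invented example: one person whose ID changed from 1 to 7 in 2010.
awards = [(1, 2008), (1, 2009), (7, 2010), (7, 2011)]

raw = funded_years(awards)                      # treats 1 and 7 as two people
merged = funded_years(awards, alias_of={7: 1})  # hypothetical curated merge
```

Without the merge, ID 1 looks like an investigator whose funding ended in 2009 and ID 7 looks like a new investigator in 2010. With the merge, there is one investigator with continuous funding from 2008 to 2011.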

In addition, there are problems in the other direction, with multiple names associated with one ID number. A very small number of these appear to be cases where different people have been assigned the same ID number. Most are related to non-uniformity in how names are entered (e.g. with or without a middle initial, with or without a period after the middle initial). Some are useful to have captured, such as PI name changes associated with changes in marital status.
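
One way to separate harmless entry variants from cases that need a closer look is to normalize names before comparing them. The sketch below is in Python (not R, and not the code used for this analysis). It assumes names are stored as "LAST, FIRST MIDDLE"; the function names and sample records are invented for illustration.

```python
from collections import defaultdict


def normalize_name(raw):
    """Collapse common data-entry variants of a 'LAST, FIRST MIDDLE' name."""
    last, _, rest = raw.upper().partition(",")
    # Strip periods and drop the 'NMN' (No Middle Name) placeholder.
    tokens = [t.strip(".") for t in rest.split()]
    tokens = [t for t in tokens if t and t != "NMN"]
    first = tokens[0] if tokens else ""
    # Keep only the first given name, so 'JANE', 'JANE Q' and 'JANE Q.' agree.
    return f"{last.strip()}, {first}"


def ids_with_conflicting_names(records):
    """Return {person_id: normalized names} for IDs that still have several names."""
    names_by_id = defaultdict(set)
    for person_id, raw_name in records:
        names_by_id[person_id].add(normalize_name(raw_name))
    return {pid: names for pid, names in names_by_id.items() if len(names) > 1}


# Invented records: (Contact PI Person ID, name as entered).
records = [
    (101, "Doe, Jane Q."),
    (101, "DOE, JANE Q"),
    (101, "Doe, Jane NMN"),
    (202, "Smith, Ann"),
    (202, "Jones, Ann"),  # a name change, or two people sharing one ID
]
conflicts = ids_with_conflicting_names(records)
```

Here ID 101 collapses to a single name and drops out, while ID 202 is flagged for manual review. Dropping the middle initial is a deliberate simplification: it will also merge genuinely different people who share a first and last name under one ID, so flagged and unflagged cases both deserve spot checks.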

At this point, I am down to 178,122 unique ID numbers and I expect this number to fall further. While this has been a great exercise in learning R as well as examining creative practices in data entry (I did not previously know that NMN would be entered in some cases where an individual gives No Middle Name), I am ready to finish up this stage and get on with more interesting analyses. But, with "data science" as with other types of science, time spent checking the validity of raw data before other analyses are done is time well spent.

15 responses so far

  • Pinko Punko says:


    Sadly, I feel when you get to the clean data, it will just be depressing.

    • datahound says:

      PP: It depends on the questions being asked. I am interested in getting substantial data to characterize how different cohorts of PIs have fared over time. Yes, I am sure some of this will be depressing, but it will also be useful to have some well supported facts to figure out what needs addressed most urgently.

      • Pinko Punko says:

        Oh, I definitely agree. I blew it on the recent NIH RFI about their strategic plan. The strategic plan mentioned basic research and long term unexpected payoffs for such, but much of this type of research is considered incremental or less significant now because outside of being normal science, it is hard to justify as significant because immediate impact isn't clear. Things like "these RNAs seem really interesting, but they are only present in x… or their functional roles are not clear…" That sort of thing. I am biased because I work on fundamental, central dogma type stuff, but to me this stuff - whether it be in weird, odd organisms, or in model systems where incredible detail is accessible - has really been useful in the long term for science, and I think this sector is really taking a hit - especially early to mid career folks. It is already obvious based on the greater fraction of dollars going to senior investigators.

        It is a mess. I suspect the data will indicate greater shrinkage in some sectors over others.

        I wanted to also ask you if you have seen the graph going around indicating increasing numbers of authors on papers? Well, that graph also shows a dip in total scientific papers in the last 3-4 years. If those data are accurate, it could be stark evidence of contraction.

  • DJMH says:

    click-bait headline! and it worked, of course.

    Looking forward to seeing the analyses that come out...

  • Established PI says:

    Is this an old problem or an ongoing problem? If the latter, there will be no hope of implementing the proposal that NIH track all postdocs via unique identifiers.

    • datahound says:

      Both, but I think a substantial amount of the problem comes from the fact that no one knew why this was important. I would hope that NIH would be careful when they know the purpose. Also, it is not hard to fix with an ongoing effort (i.e. checking annually that ID numbers are correct).

  • Dave says:

    I'm slowly going through the Hopkins data science track, and coming from zero programming background, R has a very steep learning curve indeed. I applaud your efforts on such a large dataset, but I know R is your only shot at sorting it out. Good luck!

  • SaG says:

    A big BOW WOW WOW to Datahound. That is a lot of Work! When you are done with this can you go clean up the AshleyMadison data dump? 😉

  • Philapodia says:

    Do you happen to know if OER sanitizes their inputs this way for the data that they have presented in the past, or are their analyses inaccurate due to this confused data? I'm wondering how rigorous they have actually been when trying to understand the system.

  • Namesaste_Ish says:

    would you like to co-blog DH and you can deliver the depressing as fuckke news and I can insert some happy kitten jifs and what not?

  • Vanguard says:

    Dear Datahound,

    What % of grants are awarded to NIH reviewers?

    • datahound says:

      I do not know, but many reviewers are chosen from individuals who have received grants. Many grants to new investigators likely go to those without NIH review experience.

      • PO says:

        My guess is that the percentage is very high since success at getting an NIH grant makes you a "peer" of those sending in grants. Though there are probably many more peers of unsuccessful applicants.
