Analyzing NIH Data with R

Published May 28, 2015, under Uncategorized

Most of the analysis of NIH data that I have done has been done using Excel. While Excel does have some useful features, it has many limitations. My son, who as an actuary does considerable data analysis for a living, urged me to migrate to a more powerful platform, R, for my analyses. He can be quite convincing, and I have spent time over the past month developing some rudimentary R skills (in part through an on-line course). I am now fully convinced that he was right.

I downloaded all of the data used by NIH RePORTER (from NIH ExPORTER) and wrote R scripts to parse the data into a form that could be easily analyzed by R. The full file has 1,907,841 grant records with readable contact PI numbers for fiscal years 1985 to 2014. These correspond to 216,521 unique contact PIs.
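
For readers curious what such a tally might look like, here is a minimal base R sketch. It is not the script behind the numbers above. It runs on a made-up six-row table, and the column names `FY` and `PI_IDS`, the semicolon-separated ID format with an optional "(contact)" tag, and the helper `contact_pi` are all assumptions for illustration.

```r
# Toy stand-in for parsed grant records; every value here is invented.
# ASSUMPTION: columns FY and PI_IDS, with IDs separated by semicolons and
# the contact PI optionally tagged "(contact)" on multi-PI records.
projects <- data.frame(
  FY = c(1985, 1985, 1985, 1986, 1986, 1986),
  PI_IDS = c("1001;", "1001;", "1002 (contact); 1003;",
             "1002;", "", "1004; 1005 (contact);"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull the contact PI number out of one PI_IDS string.
# Returns NA when no readable (all-digit) number is present.
contact_pi <- function(ids) {
  parts <- trimws(strsplit(ids, ";", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  if (length(parts) == 0) return(NA_character_)
  flagged <- grepl("(contact)", parts, fixed = TRUE)
  # Prefer the entry tagged as contact; otherwise take the first ID listed.
  chosen <- if (any(flagged)) parts[flagged][1] else parts[1]
  id <- trimws(sub("(contact)", "", chosen, fixed = TRUE))
  if (grepl("^[0-9]+$", id)) id else NA_character_
}

projects$CONTACT_PI <- vapply(projects$PI_IDS, contact_pi,
                              character(1), USE.NAMES = FALSE)

# Keep only records with a readable contact PI number.
readable <- projects[!is.na(projects$CONTACT_PI), ]

n_records <- nrow(readable)                      # grant records retained
n_unique  <- length(unique(readable$CONTACT_PI)) # distinct contact PIs overall

# Distinct contact PIs within each fiscal year.
unique_by_fy <- tapply(readable$CONTACT_PI, readable$FY,
                       function(x) length(unique(x)))

print(n_records)
print(n_unique)
print(unique_by_fy)
```

A PI funded in several years is counted once in the overall total but once in each fiscal year in which they hold a grant, so the yearly counts do not sum to the overall count. A call such as `plot(as.integer(names(unique_by_fy)), unique_by_fy, type = "b")` would draw the per-year counts.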

As an initial exercise with these data, I decided to plot the number of unique contact PIs as a function of fiscal year. The result is shown below:

[Figure: Unique PI Plot-2, the number of unique contact PIs by fiscal year]


What I attempted as a test of my data analysis skills revealed a striking result. The number of unique contact PIs had grown almost linearly from 1985 to about 2009-2010 (the ARRA years) but subsequently dropped quite sharply from 2010 to 2014. This graph provides much clearer evidence for "the cull" than I had anticipated.

Despite this bottom line, considerable work remains to probe this further, since the count includes a wide variety of grant mechanisms. With the powerful file manipulation and analysis tools in R, this should be relatively straightforward.

Let the analysis begin!

33 responses so far

  • Comradde PhysioProffe says:

    "my son, who as an actuary"

    Dude, that shit is clearly genetic.

    • datahound says:

      Our other son is a lawyer. I guess I passed along my FOIA tendencies. Our daughter is training to be an engineer. I hope she can help come up with solutions.

  • lurker says:

    Have fun, DH/JB! Looking forward to you resuming your posts. I would be very interested to see where the "Great Cull," as I've heard it called, is falling on whom generationally. Boomers taking the hint? Mid-careers getting squeezed out? Noobs fizzling into oblivion?

  • Josh W. says:

    Great post. Out of curiosity, what online resource would you suggest for R? I learned the basics a few years ago but have lost most of my skills from lack of use and I need to brush up now. Thanks for all that you do for the community!

    • datahound says:

      I have been taking an on-line course from Coursera that is part of a "Data Science" program through Johns Hopkins School of Public Health. They have a package called "swirl" that is fairly helpful. There is also an R package called dplyr that is very powerful and intuitive.

  • Namaste_Ish says:

    Typically, when someone posts something that is traumatic, they will give a #TriggerWarning as a means to let people know they are going to be crushed. Just a way you might want to consider interacting with your followers in the future before dropping these bombs.

    • datahound says:

      I do try to give warnings when I have them. In this case, I was just testing my script, ran the graphing tool without looking at the data and went "Oooooffff!"

  • NewInvestigator says:

    What happened in 2003?! I'm too young to remember that being a traumatic year.

    • datahound says:

      I have been wondering the same thing. That was the last year of the doubling. I am looking into it.

      • MC83ny says:

        Anecdotally, I have been told that with the doubling, funding went almost entirely to established labs (swelling their numbers of postdocs) rather than to new investigators. If that person was right, then that could potentially be what's reflected here?

        • drugmonkey says:

          Do you mean ARRA?

          • MC83ny says:

            No, the doubling of the NIH budget was what the person I heard speak was talking about. They are working on looking at the post doc problem (perma-postdocs, people being unable to secure faculty positions), and they referenced the doubling of the NIH budget back in the early 2000s as being a contributing factor to the postdoc culture we have now - because when the budget was doubled, the majority of the money went to giving more money to already existing PIs, rather than funding the work of new PIs.

            This is something someone else presented, so I don't have any research to back it up, but this chart seems to verify what they were saying.

  • Vivek says:

    Datahound - would be good to see the R code.


  • Philapodia says:

    DH, you need to publish all of these analyses that you've been doing! It's invaluable for the whole scientific community, but I doubt that most of the scientific community putters around scientopia and sees them. Then maybe the Jedi Council (and everyone else) will finally understand what's been going on under their noses for years.

  • […] are occurring over the period where the overall number of NIH supported PIs dropped (as revealed in my previous post). The FY2010 cohort showed an initial burst above the FY2008 and FY2009 curves but has slowed since […]

  • […] recently posted a somewhat startling curve showing the total number of NIH contact PIs for all mechanisms in the […]

  • […] key analyses on the state of the NIH-funded extramural work force. In the first one he presents the number of unique PIs from 1985-2014. It looks to me, roughly, that there are about 18% fewer PIs than the peak and approximately 10% […]

  • DJMH says:

    Drugmonkey must be feeling pretty vindicated right about now. Nice analysis.

  • drugmonkey says:

    DH- are you able to code by the first FY in which the PI appears? It won't be perfect due to the start of your dataset and NI type issues, but it might be an approximation of age cohort......

  • drugmonkey says:

    I am not "vindicated", DJMH. Why would you think that?

    • DJMH says:

      Because you've been talking about The Cull and trying to get at this question by less direct means, and now there are numbers. And there is a cull.

      • drugmonkey says:

        Hmm. Well I certainly didn't coin the term and I've mostly talked about how reducing the numbers of PIs is the only way we will get lasting change in the grant pressure cooker. I'm also one that argues that for NIH to let a cull go unmanaged is unwise. As we see from the K99 numbers DataHound posted.

  • qaz says:

    Can you measure this by level somehow?

    Say by some code of position? Or by age? Or by date of PhD (is that coded somewhere)? If you can't, then could you do it by date of first R01 (as suggested by DM)?

    I'd really really like to know where the cull is hitting.

  • Tami says:

    Any chance you can share a link to the full file so that others can easily try some similar analyses?

  • becca says:

    ... and Jeremy Berg singlehandedly solves the unemployed PhD problem.

    Granted, it was by rendering us all depressed-as-heck and motivated to learn R to go become data wonks, but hey any progress counts.

  • DJMH says:

    As long as NIH hires him back with a giant office he can fill with data-hacking Nate Silver-wannabes, we are golden!

  • eeke says:

    DH - can you clarify if this chart shows cumulative numbers? That is, if the software recognizes a "unique PI" in 1985, this person's name is included in every subsequent year until they aren't there anymore? Or maybe return (an analysis that you had described some time ago)?

    I would also be very interested in working with the code or files you have if there is any way you could make them accessible. I'd like to do an analysis using key words that would show some sort of profile for my own research field.

    In any case, thanks so much for this.

  • […] research? Two new analyses show that both appear to be true. DataHound has been using R to sort out whether the number of unique PIs declined from 1985-2014 and whether having a K99 gave postdocs any funding traction in subsequent years. He […]

  • […] announcement, a commenter suggested that NIGMS might have more R01 PIs with more than 1 R01. With my new R tools, it was relatively straightforward to check […]

  • […] my first post using R to analyze NIH data, I examined the number of unique investigators funded by NIH per year as a function of time. The […]
