In my first post using R to analyze NIH data, I examined the number of unique investigators funded by NIH each year. The definition of "unique PIs" was based on the number of unique "Contact PI Person ID" numbers in the NIH RePORT database from 1985 to 2014. Across the whole period, there were 216,521 unique ID numbers.
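As a rough illustration of that definition, here is a minimal base R sketch that counts unique IDs per fiscal year and overall. The data frame and its column names (`fiscal_year`, `contact_pi_id`) are made up for the example and are not the actual RePORT export headers.

```r
# Toy data; column names are hypothetical, not the real RePORT headers.
grants <- data.frame(
  fiscal_year   = c(1985, 1985, 1985, 1986, 1986),
  contact_pi_id = c("101", "102", "101", "101", "103"),
  stringsAsFactors = FALSE
)

# Number of unique contact PI IDs within each fiscal year.
pis_per_year <- tapply(
  grants$contact_pi_id, grants$fiscal_year,
  function(ids) length(unique(ids))
)

# Number of unique contact PI IDs across all years combined.
total_unique <- length(unique(grants$contact_pi_id))
```

Note that the overall count is not the sum of the yearly counts, since an investigator funded in several years is counted once per year but only once overall.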
As I prepared my data set for more analysis, I discovered that some investigators had more than one Contact PI Person ID number. I have spent the past 2 months trying to sort this out and I am still not done. One investigator in the intramural program has well over 100 ID numbers over time! Getting this sorted out is crucial for future analyses, particularly the longitudinal ones that are so important. Otherwise, an investigator might appear to have a gap or termination in funding just because their ID number changed.
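To make the longitudinal problem concrete, here is a small base R sketch under invented assumptions: one investigator whose ID changes partway through, and a hand-built lookup (`crosswalk`, a hypothetical name) that maps every ID seen for a person to a single canonical ID. It is only meant to show the kind of artifact involved, not the actual cleaning procedure.

```r
# Hypothetical funding records for one investigator whose ID changed in 1990.
grants <- data.frame(
  fiscal_year   = c(1988, 1989, 1990, 1991),
  contact_pi_id = c("A1", "A1", "B7", "B7"),
  stringsAsFactors = FALSE
)

# Hypothetical crosswalk built during cleaning: each observed ID maps to
# one canonical ID for the person.
crosswalk <- c(A1 = "A1", B7 = "A1")
grants$canonical_id <- unname(crosswalk[grants$contact_pi_id])

# Last funded year per ID, before and after merging the IDs.
last_year_raw    <- tapply(grants$fiscal_year, grants$contact_pi_id, max)
last_year_merged <- tapply(grants$fiscal_year, grants$canonical_id, max)
```

With the raw IDs, "A1" looks like an investigator whose funding ended in 1989 and "B7" like a new investigator starting in 1990. After merging, there is one investigator funded continuously through 1991.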
In addition, there are problems in the other direction, with multiple names associated with one ID number. A very small number of these appear to be cases where different people have been assigned the same ID number. Most are related to non-uniformity in how names are entered (e.g., with or without a middle initial, with or without a period after the middle initial). Some are worth capturing, such as PI name changes associated with changes in marital status.
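One way to separate harmless entry differences from cases that need a closer look is to reduce each name to a crude key and then flag IDs that still have more than one key. The sketch below is base R on invented data, and it assumes names are stored as "LAST, FIRST MIDDLE", which may not hold for every record. The function `name_key` and the column names are hypothetical.

```r
# Reduce a name to "LAST, FIRST": upper-case, drop periods, and keep only
# the first given name. Assumes a "LAST, FIRST MIDDLE" layout.
name_key <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub(".", "", x, fixed = TRUE)
  parts <- strsplit(x, ",", fixed = TRUE)
  last  <- trimws(sapply(parts, `[`, 1))
  given <- trimws(sapply(parts, `[`, 2))
  first <- sub("\\s.*$", "", given)
  paste(last, first, sep = ", ")
}

# Toy data; "NMN" stands in for a "no middle name" placeholder.
pis <- data.frame(
  contact_pi_id = c("101", "101", "101", "101", "102", "102"),
  pi_name = c("SMITH, JANE Q", "Smith, Jane Q.", "SMITH, JANE",
              "SMITH, JANE NMN", "DOE, JOHN", "ROE, JOHN"),
  stringsAsFactors = FALSE
)
pis$key <- name_key(pis$pi_name)

# IDs that still carry more than one distinct key after normalization.
keys_per_id  <- tapply(pis$key, pis$contact_pi_id,
                       function(k) length(unique(k)))
needs_review <- names(keys_per_id)[keys_per_id > 1]
```

Here the four spellings under ID 101 collapse to a single key, while ID 102 is flagged. A flagged ID is not necessarily an error: it could be a legitimate name change or two different people sharing an ID, so these cases still need to be checked by hand.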
At this point, I am down to 178,122 unique ID numbers and I expect this number to fall further. While this has been a great exercise in learning R as well as in examining creative practices in data entry (I did not previously know that NMN would be entered in some cases where an individual gives No Middle Name), I am ready to finish up this stage and get on with more interesting analyses. But with "data science," as with other types of science, time spent checking the validity of raw data before other analyses are done is time well spent.