Data Literacy, Data-Mining, Librarianship, Research

Research updates, and how I geeked out over purposive sampling

In my last few posts, I’ve alluded to the process I’m currently steeped in with my research project on data literacy in biomedical research environments. I documented a bit of my writing process for the updated proposal on a blog created as part of my participation in the Institute for Research Design in Librarianship (IRDL), and I’ve come leaps and bounds since my last update.

First, the proposal got finished! Have a look if you feel like reading ten pages of justification and logistics. And yes, I got it down to ten pages :) It is soooo much better than my first attempt.

MUCH better proposal

You will notice, though, that this proposal (which I finished prior to submitting my study to the IRB) is missing details about the sampling strategy and timeline for the study. These details were contingent on acquiring funding for a transcription service for the interviews, and on developing a workflow to identify researchers who fit my proposed study population. Since completing the above proposal for IRDL, both issues have been resolved. First, I applied for and received a research grant from the NY/NJ Chapter of the Medical Library Association! You can read my successful application here. The funding will allow a much faster turnaround between data collection (interviews) and the analysis/synthesis phase of the qualitative arm of my study. Having secured this funding, I am confident that the diagram below pretty accurately represents the expected timeline:

Data Literacy Study Timeline


As far as recruitment methods go (I’m currently in the recruitment phase!), after working closely with my co-investigators (Dr. Stephen Johnson and Dr. Joshua Richardson), we decided to take a very systematic approach to identifying researchers for the study. You see, WCMC is a giant place, and identifying researchers who work with clinical data but are not doing clinical trials research, who have currently active federal grants, and who are full-time Cornell faculty members is not easy. We settled on this group because we felt it would best represent the research being done at WCMC, but it’s difficult to know whom to solicit given the complexities of the college/hospital’s infrastructure (just determining how many departments/centers/etc. we have is a challenge). So we agreed that the best way to come up with a population from which to sample was to use all available institutional data sources and cross-reference them to find where all conditions were met:

[Diagram: intersection of the institutional data sources where all sampling criteria are met]

To do this, I got access to and pulled data from our institutional grants-tracking database (this took a few weeks) to learn who has federal grants and the date ranges associated with their applications and renewals (along with each researcher’s network ID). I then combined that data with data pulled from our faculty affairs database to get everyone’s authoritative titles (challenging when someone holds more than one appointment) and their primary departments. I was also able to pull their educational backgrounds, giving me information about what types of degrees these researchers hold. I then cross-referenced this set with a dataset pulled from our researcher-profiling system, VIVO, where I had constructed a query to *hopefully* (it has face validity) pull out researchers who work with patient data but are not doing clinical trials. Why not just use VIVO for all the data, you ask? Because VIVO is still in development, and unfortunately not all of its data is easily scraped or validated.

Once I had cross-referenced all of these datasets and removed administrators, fellows, and department chairs who likely won’t give me the time of day (I built a simple SQL database for this in Querious and canned my queries), I generated a report of 62 researchers who fit my criteria. If I also remove full professors (as they might likewise not give me the time of day), I would be down to 40 individuals to sample from. I am meeting with my collaborators next week to decide which population we’re going to reach out to. Either way, I will send them my email templates (tailored somewhat to make them more personal) and hope they let me interview them and some members of their research teams.
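The cross-referencing step above can be sketched with a few SQL joins. This is a minimal illustration, not my actual Querious database: the table names, column names, and rows are all hypothetical stand-ins for the three institutional sources, keyed on researcher network ID.

```python
# Minimal sketch of the cross-referencing step using SQLite.
# Tables and columns are hypothetical stand-ins for the grants-tracking
# database, the faculty affairs database, and the VIVO query results.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE grants  (net_id TEXT, grant_active INTEGER);
CREATE TABLE faculty (net_id TEXT, title TEXT, full_time INTEGER);
CREATE TABLE vivo    (net_id TEXT, clinical_data INTEGER);

INSERT INTO grants  VALUES ('a1',1),('b2',1),('c3',1),('d4',0);
INSERT INTO faculty VALUES ('a1','Assistant Professor',1),
                           ('b2','Professor',1),
                           ('d4','Instructor',1);
INSERT INTO vivo    VALUES ('a1',1),('b2',1),('e5',1);
""")

# Inner joins keep only researchers who satisfy every criterion:
# an active federal grant, full-time faculty status, and (per the
# VIVO query) working with patient data outside clinical trials.
rows = cur.execute("""
    SELECT g.net_id, f.title
    FROM grants g
    JOIN faculty f ON f.net_id = g.net_id
    JOIN vivo    v ON v.net_id = g.net_id
    WHERE g.grant_active = 1 AND f.full_time = 1 AND v.clinical_data = 1
    ORDER BY g.net_id
""").fetchall()
print(rows)  # only 'a1' and 'b2' appear in all three tables with all flags set
```

The same WHERE clause can be extended to drop ranks unlikely to respond (e.g. `AND f.title != 'Professor'`), which mirrors the narrowing from 62 to 40 candidates described above.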

Side note: I wanted to see if I could narrow these results further using network analysis. I pulled all (disambiguated) publications from VIVO and crossed them with my current purposive sample. From the resulting list of publications, I extracted co-authorship networks to see if I could readily identify any teams to target for interviews. Below you can find a (VERY simple) network I created using a Kamada-Kawai unweighted force-directed layout of the co-authorship tendencies of the researchers in my sample (all labels are removed because these people are potential research subjects). I hope to explore this method further, but I will need to consult with my co-investigators as to whether it is advisable to limit my population in this way. Either way, we plan to write a brief methods paper about using data pulls from various institutional systems to identify a purposive sample.


[Figure: unlabeled co-authorship network of sampled researchers, Kamada-Kawai layout]
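For the curious, the network step can be sketched in a few lines with networkx. This is a toy reconstruction under stated assumptions: the publication lists and author IDs are invented, and the plotting itself is omitted; it just shows how co-authorship edges, a Kamada-Kawai layout, and candidate teams (connected components) fall out of the data.

```python
# Toy sketch of the co-authorship network step (hypothetical data).
from itertools import combinations

import networkx as nx

# Each publication is a list of anonymized author IDs from the sample.
publications = [
    ["r1", "r2", "r3"],
    ["r2", "r4"],
    ["r5", "r6"],
]

G = nx.Graph()
for authors in publications:
    # every pair of co-authors on a paper gets an unweighted edge
    G.add_edges_from(combinations(authors, 2))

# Kamada-Kawai force-directed layout: {node: (x, y)} coordinates,
# which a plotting step (e.g. nx.draw) would use without labels.
pos = nx.kamada_kawai_layout(G)

# Connected components surface clusters that may correspond to teams.
teams = list(nx.connected_components(G))
print(teams)  # two clusters: {r1, r2, r3, r4} and {r5, r6}
```

Here the components suggest one four-person cluster and one pair, which is the kind of structure that would make a team easy to target for group interviews.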


So I think my sampling problem is mostly solved :) The rest of my time on this project has been spent developing my interview guide (MAKING IT SHORT) so that it pulls the relevant data I need to construct my follow-up survey. I’m also probably going to make a poster for RDAP on creating a streamlined data-literacy assessment tool aimed at identifying the social and technical obstacles impeding data literacy in biomedical research centers. The interview protocol is almost done and will be distributed post-study.

This whole process has basically been a combination of fulfilling excitement at seeing my ideas actually start to become reality, and an emotional, no-good, very-bad, headache-inducing mess of trying to communicate the purpose of this study to others and getting the IRB protocol submitted. It feels like falling down a flight of stairs that has cake at the bottom… Can’t wait to see how it all turns out.


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.