Daina Bouquin

Data Geek. Librarian. Dangerous Lady.

Ice, Software Carpentry, and Data Analytics

Ok, I have a bad habit of disappearing for long intervals (from my own site!) but I promise it’s just because I’ve been so busy! Not because I don’t like writing here. I really do like writing here.

I have some excellent things to share briefly though and figured even if it’s just a taste of what I’ve been doing, it’s good to put these things down somewhere. There are some less awesome things to write about, but for now here are the highlights. The biggest things I’ve been up to lately:

I won 3 events in the North American Winter Swimming Championship, which was held in Newport, VT back in February: the 25, 50, and 100 m freestyle


This is me :)

Got my picture on the homepage of NPR with the story too :)


I also successfully obtained funding for and organized a Software Carpentry workshop to be held at Weill Cornell. It took months and months, but it’s all finally paying off next week! Check it out here. Funding was provided by the Weill Cornell Graduate School of Medical Sciences, and both the Applied Bioinformatics Core and the Institute for Computational Biomedicine have been giving me excellent organizational and logistical support along the way. It’ll be free to all attendees.




In other news, I got accepted into the CUNY School of Professional Studies’ MS program in Data Analytics! This will be its own post soon, I’m sure. I’m nervous but very excited to start in on this. I had to take an exam that covered topics in the following:

  • Statistics and probability including descriptive statistics, skewness/kurtosis, histograms, statistical error, correlation, single variable linear regression analysis, significance testing, probability distributions, and basic probability modeling;
  • Linear algebra including basic matrix manipulation, dot and cross products, inverse matrices, eigenvalues, representing problems as matrices, and solving small systems of linear equations;
  • Programming in a high level language (e.g. Python, Java, JavaScript, C++, C, Ruby, SAS). Coding from scratch;
  • Relational databases including connecting to and manipulating data, working with tables, joins, basic relational algebra, and SQL queries;
  • Analytical thinking including the ability to translate real-world phenomena into quantitative representations and, conversely, the ability to interpret quantitative representations with practical explanations.
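To give a flavor of a couple of those topics, here’s a tiny self-made illustration (my own sketch, not from the exam, with made-up numbers): descriptive statistics plus single-variable least-squares regression in plain Python.

```python
import statistics

# Made-up data: hours studied (x) vs. exam score (y)
x = [1, 2, 3, 4, 5]
y = [52, 60, 71, 80, 88]

# Descriptive statistics
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)        # 70.2
stdev_y = statistics.stdev(y)      # sample standard deviation

# Single-variable least-squares regression: y = a + b*x
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

print(f"slope={b:.2f}, intercept={a:.2f}")  # slope=9.20, intercept=42.60
```

The same slope/intercept formulas showed up on the exam in matrix form too, which is where the linear algebra topics tie in.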

Analytics Exam fun

I don’t have a formal math or computer science background so I have had to teach myself A LOT as I’ve come down the data science path. I’m continuing in this fashion as I pursue online coursework in web application frameworks– I’m working on these online courses:

Web Application Architectures – University of New Mexico via Coursera

Ruby on Rails – Udemy

Full Stack Web Development – Udemy


Meanwhile, I have been writing blog posts for the e-Science Portal over the last few months (check them out here) and also published my first peer-reviewed article in the open source journal In the Library with the Lead Pipe, which you can see here– I got to work with some fantastic librarians on this one so please do give it a read!


And my husband and I adopted a new kitten.

His name is Herald and he can see a little bit out of his one eye.


And in case that last part slipped by you, I got married! It’s been such a busy few months, but I really can’t complain.

Owen and Daina

Brighton Beach, Brooklyn. Dec. 13, 2014



Learning New Skills: Git and Web Development

The other day, I wrote another post for the e-Science Community Blog, so I figured it wouldn’t hurt to share it here too. You can read the original version of this most recent post on the portal here.

And counter to the guidance I cite below to be “single-minded” in my endeavors to learn new skills: as the weather has gotten colder, I have begun eating lunch at my desk, which means I’ve got about an hour in the middle of the day where I’m trying to plug away a little more at the skills I’m trying to learn. Namely, I’ve started a Web Development course offered through Udemy :) Hopefully in about 2-3 months I’ll have worked through it all, along with the Pro Git book I reference, and have a whole new pile of tricks up my sleeve! Wish me luck. Basically I’ve told myself I need to stay rigorous and, as often as possible, do the web dev lectures over lunch and the Git book in the evening when I have a bit more time.

And just so you don’t think I don’t do anything else with my time: on the train lately I’m reading too many poorly written murder mysteries like this one (can’t not read). I’ve also been swimming in the cold, cold ocean :) This pic was taken at 6am just off Coney Island last Friday. The air temp was about 46°F and the water temp was 53°F.


I’m the one in the middle. The lovelies with me are wonderful people.

But now, on to the blog post!

Learning Git

Submitted by guest contributor Daina Bouquin, Data & Metadata Services Librarian, Weill Cornell Medical College of Cornell University, dab2058@med.cornell.edu

The role of the data librarian extends far beyond helping researchers write data management plans. Rather, librarians working where data-intensive science is happening spend their time answering questions about the entire data life cycle—data pre-processing, analysis, visualization, and data validation are all important, and sometimes highly intricate, parts of the research process. As a data services librarian I have personally found myself advising researchers to rework their workflows to make use of tools available to them that help make their research more replicable, efficient, and shareable at these various stages of the research process. Unfortunately, though, I do not always have hands-on experience with the tools and techniques I’m advising researchers to use—nor is it possible for me to have experience with every tool out there available to researchers in computational environments. However, I do believe it’s important for me to get as much hands-on experience as possible with the most useful, commonly used tools, so that I can develop both refined expertise in my field and empathy for my patrons. E-Science Portal editor Donna Kafel recently wrote a wonderful post where she reflected upon, and pulled advice from others about, self-learning and the challenges associated with it. Here, I aim to outline how I’m making use of some of the excellent advice offered in that post, while focusing in on an area of the data life cycle that I believe is sometimes oversimplified in discussion—I’m referring to the version control processes inherent in good data management.

“Be single-minded. Identify one topic or skill you want to learn and focus on mastering it.” – Donna Kafel, Challenges of Self Learning

I decided the advice I would take to heart most fiercely from Donna’s self-learning post was the above take-away. It rang true with me because I regularly encounter problems by trying to tackle too many new topics at once. If I don’t use something regularly, it’s difficult for me to become proficient—especially with technically challenging tools. It makes sense that I should focus on mastering a single skill before moving on to anything new, but how to choose what to focus on? This is where Version Control Systems (VCS) or “Revision Control Systems” come in. VCSs are incredibly diverse in both complexity and application, and while I rarely see them discussed at length by librarians, I find them to be exceedingly important to researchers in collaborative environments. I regularly read discussions on file naming as an approach to control versioning and to aid researchers in a multitude of data management processes, and I do not want to discredit that discussion because it is so important (check out some of the great writing on this topic right here on the portal blog!), but I’m hoping to extend that conversation a bit more in this post. Below I focus in on Git as both a self-learning opportunity and an incredibly useful VCS.


Git is a technology that “records changes to a file or set of files over time so that you can recall specific versions later.”1 You can use Git for just about any type of file, but it is primarily used by people working with code files. Oftentimes, people use simpler version-control methods, like copying files into a time-stamped directory, but this tactic is risky—one could forget which directory files are stored in or accidentally write over the wrong file (good file naming helps here). An even better approach is using a tool like Git.1

Git is what is called a Distributed Version Control System (DVCS), but it is easier to understand DVCSs if you first understand Centralized Version Control Systems (CVCSs). A CVCS has a single server that contains all the versioned files a group of people are working on. Individuals “check out” files from that central place, so everyone knows to some extent what other people on the project are doing. Admins have control over who can do what, so there is some centralized authority, making it easier to manage than local version control solutions. Examples of CVCSs include the popular Apache tool Subversion.1

There are, though, some drawbacks to using a CVCS—namely, the single-server situation. If the server goes down, no one can make any changes to anything that’s being worked on; and if the server is damaged and corrupted, the individuals working on the project are completely reliant on there being sufficient backups of all versions of their files. This is, again, quite risky.

To mitigate this problem, DVCSs were developed. In distributed systems (like Git) people do not just check out the latest version of a file—they completely “mirror” the repository. This way, if the server dies, anyone who mirrored the repository can copy it back to the server and restore it. Every time someone checks out a file, the data is fully backed up.1

Distributed systems are also capable of working well with several remote repositories at once, allowing people to collaborate with multiple groups in different ways concurrently on the same project.1

However, I did not decide to focus my single-minded self-learning on Git just because it is so useful for version control—I wanted to learn as many skills as possible while still staying focused. You see, in learning to use Git, I’d have more opportunity to learn about the Bash Unix shell. While I have some background in using command-line interfaces, I am still a beginner with the terminal, and I figured that learning Git would make me much more proficient at navigating my computer via the command line, which in turn could help me build up the confidence to learn how to use a Linux operating system. Learning Git would also help me learn how to use GitHub, which is growing by the day in popularity as a place for people to store and share code. The GitHub graphical user interface would also help get me off the ground. So I found Git to be the great door-opener to many other skillsets on my list of self-learning goals.
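To make that concrete, here’s the kind of minimal first Git session I’m practicing at the command line (the file names and commit messages below are just examples of my own, not from Pro Git):

```shell
# A minimal first Git session: create a repository, save two versions
# of a file, and look back through the history.
cd "$(mktemp -d)"                 # practice somewhere disposable
git init -q my-analysis
cd my-analysis
git config user.name "Daina"      # identity for this repo only
git config user.email "daina@example.com"

echo 'data <- read.csv("results.csv")' > analysis.R
git add analysis.R
git commit -q -m "First pass at analysis script"

echo 'summary(data)' >> analysis.R
git commit -q -am "Summarize the data"

git log --oneline                 # both versions are now recallable
```

From here, `git checkout` can bring back either saved version of `analysis.R`—exactly the “recall specific versions later” idea from the definition above.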

Thus, I have begun learning to use Git and GitHub. I got some hands-on experience with them by participating in a Software Carpentry Bootcamp this past summer, but didn’t find the time to follow up on it—I was not staying focused on learning a single new skill. So now I am regrouping. I have primarily been using the resources below; there is so much more out there, but these are a great place to start, and having made some headway in my own reading of these documents, I hope to be trying out Git more in the very near future.

Pro Git – A great free eBook and videos on getting started with and better understanding Git and version control. I used this excellent book in writing this post.

Pro Git Documentation External Links – Tutorials, books, and videos to help get you started.

Even if you don’t think learning to use Git is right for you, learning more about the tools researchers are using to work with their data and getting a look under the hood about how those technologies work can be a great way to continue to grow professionally. I hope you all have the opportunity to join me in exploring a new skill and share your experiences with the e-Science Portal Community.


1. Chacon, S. (2014). Pro Git. Berkeley, CA: Apress. http://git-scm.com/book/en/v2

And just in case you weren’t already overwhelmed, here’s a great TED Blog post on places to learn how to code!

Research updates, and how I geeked out over purposive sampling

In my last few posts, I’ve alluded to the process I’m currently steeped in with my research project on data literacy in biomedical research environments. I documented a bit of my writing process for my updated proposal on a blog created as part of my participation in the Institute for Research Design in Librarianship (IRDL), and have come leaps and bounds since my last update.

First, the proposal got finished! Look at it if you feel like reading 10 pages of justification and logistics. And yes I got it down to ten pages :) It is soooo much better than my first attempt.

MUCH better proposal

You will notice, though, that in this proposal (which I finished prior to submitting my study to the IRB) there are details missing regarding the sampling strategy and timeline for the study. These details were contingent upon the acquisition of funding to make use of a transcription service for interviews, and the development of a workflow to identify researchers who would fit into my proposed study population. Since completing the above proposal for IRDL, both issues have been resolved– first, I applied for and received my research grant from the NY/NJ Chapter of the Medical Library Association! You can read my successful application here. The funding will allow me a much faster turnaround time between data collection (interviews) and the analysis/synthesis phase of the qualitative arm of my study. Having secured this funding, I am confident that the diagram below pretty accurately represents the expected timeline:

Data Literacy Study Timeline


As far as recruitment methods go (I’m currently in the recruitment phase!), after working closely with my co-investigators (Dr. Stephen Johnson and Dr. Joshua Richardson), we have decided to take a very systematic approach to identifying researchers for the study. You see, WCMC is a giant place, and identifying researchers who work with clinical data, but are not doing clinical trials research, who have currently active federal grants, and who are full-time faculty members with Cornell, is not easy. We settled on this group because we felt they would best represent the research being done at WCMC, but it’s difficult to know whom to solicit given the complexities of the college/hospital’s infrastructure (just determining how many departments/centers/etc. we have is a challenge). So we agreed that to come up with a population from which to sample, we would need to use all available institutional data sources and cross-reference them to find where all conditions were met.


To do this, I got access to and pulled data from our institutional grants-tracking database (this took a few weeks) to get data on who has federal grants, the date ranges associated with their applications and renewals, and the network IDs of each researcher. I then combined that with data I pulled from our faculty affairs database to get everyone’s authoritative titles (with more than one appointment this is challenging) and their primary departments. I was also able to pull their educational backgrounds, giving me information on what types of degrees these researchers have. I then cross-referenced this set with a dataset I pulled from our researcher profiling system, VIVO, where I had constructed a query to *hopefully* (it has face validity) pull out researchers who work with patient data but are not doing clinical trials. Why not just use VIVO for all the data, you ask? Because VIVO is still in development and unfortunately not all the data is easily scraped or validated. Once I cross-referenced all of these datasets and removed administrators, fellows, and department chairs who likely won’t give me the time of day (I built a simple SQL-based DB for this in Querious and canned my queries), I generated a report consisting of 62 researchers who fit my criteria. If I remove full professors (as they might also not give me the time of day), I would be down to 40 individuals to sample from. I am meeting with my collaborators next week to determine which population we’re going to reach out to. Either way, I will send them my email templates (which I’ll tailor some to make them more personal) and hope they let me interview them and some members of their research teams.
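In spirit, all that cross-referencing boils down to set intersections on researcher IDs. A toy sketch of the idea (every name and ID below is hypothetical, not actual WCMC data):

```python
# Each institutional source yields a set of researcher network IDs
federal_grants = {"abc1001", "abc1002", "abc1003", "abc1004"}    # grants database
fulltime_faculty = {"abc1002", "abc1003", "abc1004", "abc1005"}  # faculty affairs
clinical_data_users = {"abc1003", "abc1004", "abc1006"}          # VIVO query

# Researchers meeting all three conditions at once
candidates = federal_grants & fulltime_faculty & clinical_data_users
print(sorted(candidates))  # ['abc1003', 'abc1004']
```

In the real workflow this happened as SQL joins across tables loaded into the database, but the logic is the same: only IDs present in every source survive.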

Side note: I wanted to see if I could narrow these results further using network analysis. I pulled all (disambiguated) publications from VIVO and crossed them with my current purposive sample. From the resulting list of publications I extracted co-authorship networks to see if I could readily identify any teams to target for interviews. Below you can find a (VERY simple) network I created using a Kamada-Kawai unweighted force-directed layout of the co-authorship tendencies of researchers identified in my sample — note that all labels are removed because these people are potential research subjects. I hope to explore this method further, but will need to consult with my co-investigators as to whether or not it is advisable to limit my population further in this way. Either way, we plan to write a brief methods paper about using data pulls from various institutional systems to identify a purposive sample.
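Extracting co-authorship edges from a publication list is conceptually simple: every pair of authors on the same paper shares an edge, weighted by how often they co-publish. A stdlib-only sketch (author names hypothetical):

```python
from itertools import combinations
from collections import Counter

# Each publication is the set of (disambiguated) authors on it
publications = [
    {"A", "B", "C"},
    {"A", "B"},
    {"C", "D"},
]

# Count how often each unordered author pair co-authors
edges = Counter()
for authors in publications:
    for pair in combinations(sorted(authors), 2):
        edges[pair] += 1

print(edges.most_common(1))  # [(('A', 'B'), 2)]
```

The weighted edge list can then be handed to a layout algorithm (Kamada-Kawai is available in tools like networkx and Gephi) to draw the network.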


[Figure: co-authorship network, labels removed]


So I think I have my sampling problem mostly solved :) The rest of my time on this project has been spent developing my interview guide (MAKING IT SHORT) in such a way that it pulls the relevant data I need to construct my follow-up survey. I’m also probably going to make a poster for RDAP on creating a streamlined data literacy assessment tool aimed at identifying social and technical obstacles impeding data literacy in biomedical research centers. The interview protocol is almost done and will be distributed post-study.

This whole process has basically been a combination of excellent, fulfilling excitement at seeing my ideas actually start to come into reality, and an emotional, no-good-very-bad, headache-inducing mess trying to communicate to others the purpose of this study and get the IRB protocol submitted. It feels like falling down a flight of stairs that has cake at the bottom… Can’t wait to see how it all turns out.


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Avoiding my blog all summer was a bad idea.

Since my last post in June, so much has happened it’s difficult to know where to start. But I need to start somewhere, so I’m going to begin by outlining the things that I will be posting about in the next week or so:

  • Data Literacy Research Study Design and the Time Consuming Mess that is Completing an IRB Protocol (… will need a shorter title)
  • Using REDCap to manage a Systematic Review Service (Actually happening finally! Woot! See my past post about REDCap for details)
  • New project: Researcher Needs and Navigating Data-Related Institutional and Funder Policies (Waaaay more challenging than I originally expected)
  • I entered a data visualization into an art show? (Vis having to do with contradictory language in biomedical publication abstracts)
  • Applying for my first grant (Easier than originally expected)
  • I wrote this thing on databases and spreadsheets for the eScience Portal for Librarians back in July (consider this your update)
  • And a friend and fellow IRDL Scholar (Chris Eaker) and I co-wrote a post to the same blog about our experience with the Institute for Research Design in Librarianship

This post though, I hereby dedicate to the aversion to blogging that I developed after completing my research training in LA. I was doing so much writing and drafting there and had so much more to do when I got back, that the very idea of writing a post for my blog made me laugh. I also recently moved (stressful) to Brighton Beach in Brooklyn and have been spending a lot of my spare time swimming in the ocean and trying to work on my work-life balance (it’s really important!). I love the area and am loving the now deserted beaches even more :)

I’m learning though by looking at the backlog I’ve developed that the writing would have been time well-spent. I know that writing out what I’ve been doing helps me reflect and more clearly think through next steps, but I’m still learning how to keep myself disciplined in writing.

Brighton Beach, Brooklyn. Autumn sunrise near my new home.



The Institute for Research Design in Librarianship


A few months ago I was accepted into a program called The Institute for Research Design in Librarianship, and after much ado, I am finally in Los Angeles participating as a scholar in the inaugural class.  I wrote a bit about getting accepted a few months ago in another blog post, but here I’m going to elaborate a bit.

The goal of the Research Institute is to provide new researchers in Information and Library Science fields with in-depth training in research methods, and to support us in developing professional research networks as we embark on our first attempts at comprehensive research and publishing in peer-reviewed journals. We spend about 9 hours a day in class doing exercises and discussing study design and the research process, while focusing on revising our initial study proposals. 


Our instructors include Greg Guest, the author of our textbook and many other works on research methods, along with others in the field who consult with us on our specific project needs and help advise us as we move through the program.

We have about an hour in the afternoon to write and are encouraged to participate in an online community and incorporate our reflective writing into a blog so we can help each other throughout our research process. Because of this online community you can now follow my progress as I design my first research study on my IRDL blog: “Daina and Research Design” http://buddy.irdlonline.org/dbouquin/ 


New Experiences with Teaching- Computational Methods in Health Informatics

I am currently sitting in an apartment in Los Angeles, California (first time on the west coast!) feeling exhausted after my first day participating as a research scholar at the Institute for Research Design in Librarianship at Loyola Marymount University. I described the institute in another post when I was first accepted, and I am positive I will have much more to write on this whole experience as I go through it, but tonight I am going to focus on my experience so far teaching my first full-semester course at Cornell Medical– Computational Methods in Health Informatics.

So, how did I come to be teaching this class? Last October, I was asked by the professor designing this new Comp Methods course for the Master’s Program in Health Informatics for suggestions on textbooks and resources that could act as an introduction to data mining methods, wouldn’t be too overwhelming for beginners, and could possibly make use of a common algorithm suite– I suggested Weka (and this is the book we initially decided to go with). But the more we discussed the course objectives, the more it became obvious that the students should really learn basic R coding– it’s a skillset they wouldn’t likely learn in their other classes and one that is much more versatile than any other platform they could be using in this domain. And so we were off looking for another textbook, and our conversation about resources grew. I put together a list of suggested books, but in the end the head instructor decided to go with a book suggested to him by a colleague (this one) that had a much less in-depth focus than any I had chosen. The chosen book is more concerned with specific applications of data mining methods and is shorter than any other text we could find, which is both good and bad: there is now an assumption that the students are familiar with much of the background and terminology inherent in this area (and that is certainly not the case). So we decided that I would develop a resource guide for students to supplement the textbook and serve as a place students could go for help on topics that went beyond the scope and goals of the course. The course syllabus is available here: Syllabus_HINF5008. Basically, we wanted a course that would act as an introduction to data mining and computational approaches so that the students could make informed decisions about methodology and better communicate and collaborate with others using these techniques.

So I produced this resource guide for the class and was invited to introduce these resources and topics in the first lecture. And because I am an R user with graduate-level training in data mining, I was also invited to teach a few other lectures in which I could include some conversation about data literacy and good data management practices. For example, I was invited to teach our lectures on working with unknown values and on exploring datasets graphically– each of these topics is premised on the idea that there may not be enough documentation associated with the data (metadata) to fully understand it, and thus more exploration is warranted before anything else can happen with the dataset. That situation is a good framework for discussing the value of proper data management and curation to prevent situations where not much is known about the data, so I stepped in and talked about some of the services offered by the consulting group I’m in, the Cornell Research Data Management Services Group, and some best practices :) I was also invited to teach a few lectures on specific data mining techniques that interested me, and to introduce the final project and places to go to find and download datasets relevant to the students’ interests. The final project rubric isn’t finished yet, but we will be having the students come up with their own data mining question pertaining to a dataset that they go out and find, and run/explain their analyses while documenting their work. They will then present their findings individually or in groups.
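The “unknown values” framing is easy to demo: before modeling anything, tally how much of each field is actually missing. A quick sketch of the idea (toy records of my own, not course material):

```python
# Toy patient-style records with missing (None) entries
records = [
    {"age": 54, "bp": 120, "glucose": None},
    {"age": None, "bp": 135, "glucose": 98},
    {"age": 61, "bp": None, "glucose": 105},
]

# Fraction of missing values per field
fields = records[0].keys()
missing = {f: sum(r[f] is None for r in records) / len(records) for f in fields}
print(missing)  # each field here is missing in 1 of 3 records
```

Seeing that a field is, say, a third missing is exactly the cue to go looking for documentation (metadata) before deciding how to impute or drop.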

As I mentioned, I am not the only instructor for this class. Rather, we ended up basically splitting the content in half (this makes my being out of town much more reasonable– my co-instructor is teaching the two weeks that I am away), and as of this past Wednesday I finished teaching the first half of my content: three 1-hour lectures and three 2-hour labs. So what have I learned?

First, it feels like I’m relearning everything I know about these topics. I’m having to think about ways of approaching data mining and management topics as someone who has never heard anything about them. I was in that position just a few years ago but it’s still a challenge to step back and reframe my explanations in ways that are relatable.

Next, I almost immediately realized that teaching a full semester (even just half of it) is an incredibly time-consuming ordeal! I am still in the process of finding out how/if I will be paid for this unplanned additional teaching outside of the library, but it looks like I will be getting additional compensation since the classes are in the evening. This experience has been so much more intense than the few 1-2 hour lectures I’ve given in the past and I have a whole new-found appreciation for all of the planning that goes into a class!

I’m also seeing that teaching and working with the students is helping me to more fully understand the content of my lectures– they are asking questions! And it’s making me have to think about new angles and ways of interpreting data analyses that I wouldn’t have seen on my own. The students aren’t coming up with their own data stories yet but they are questioning the results of analyses they’re running out of their book which is very useful. It means they are thinking critically about what they’re doing. So I’m finding that teaching things that you’ve learned yourself helps you more comprehensively understand them. I’m also interacting with the students via Canvas which is a completely new experience for me.

To conclude, here are some lecture outlines from my first and most recent lectures to give you an idea of what’s going on: Lecture1_HINF5008 and Lecture7_HINF. And below is an example of how we’re running the lab. We decided to have the students use the textbook as a lab guide and to use lectures as our venue to more fully discuss theory. This is the lab script I wrote for my most recent week as instructor. Looking forward to the second half of this semester and getting feedback from the students about how they liked the class.

# HINF: 5008 Week 7 Lab
# June 11th, 2014; 5-7 pm
# Predicting: Support Vector Machines, Monte Carlo Evaluation
# Book section 3.1- (pp. 126-164)
# Name your file using the convention: LastNameFirstInitial_Lab7

# You will need objects that you created in lab 6. If you have cleared your workspace, re-run any needed script from lab 6.
# Remember that the prediction models we're using were all selected because these techniques are well known for their ability
# to handle highly nonlinear regression problems-- like those inherent in time series prediction.
# Many other approaches can be applied to problems like ours, though, so do not assume you are limited to the approaches we discuss here.

# You will find numbered questions throughout this lab guide.
# These correspond to questions in a Word document available for download in the Assignments tab of Canvas.
# (The 9 questions are the same in both places.)
# As usual, you will be handing in 1) lab script, and 2) homework assignment
# Please hand in your lab script and homework assignment as two separate files.

# The homework assignment for this week is to answer the 9 questions.
# Use the homework Word document (rather than the lab script) as a template to record your answers for the homework.
# To facilitate grading, please include only the answers to the questions in the homework. In other words, all code
# and terminal output should be restricted to the lab script; please do not include any script or code in the homework
# unless it's necessary to answer the question.

# Load packages from book:
library(DMwR)    # trading.signals(), sigs.PR(), and the book's datasets
library(e1071)   # svm()
library(kernlab) # ksvm()

# Support Vector Machines (SVMs) - supervised learning method for classification and regression tasks

# ** Question 1: Review - What is the difference between classification and regression? 
#    You may use online resources or resources on the class guide at http://med.cornell.libguides.com/HINF5008 
#    to answer this question if necessary- please cite your source if this is the case

# SVMs are better at generalizing than our earlier ANN
# The basic idea behind SVMs is that of mapping the original data into a
# new, high-dimensional space so that it's possible to apply linear models to
# obtain a separating hyperplane (pg. 127)
# The mapping of the original data into this new space is carried out with the help of kernel functions
# See lecture 7 notes and Section 9.3 in Han's "Data Mining: Concepts and Techniques" (available on the resource guide)
# for more information on kernel functions
# SVMs maximize the separation margin between cases belonging to different classes (pg. 127)

# Try a regression task with SVM (pg.128)
sv <- svm(Tform, Tdata.train[1:1000, ], gamma = 0.001, cost = 100)
s.preds <- predict(sv, Tdata.train[1001:2000, ])
sigs.svm <- trading.signals(s.preds, 0.1, -0.1)
true.sigs <- trading.signals(Tdata.train[1001:2000, "T.ind.GSPC"],0.1,-0.1)
sigs.PR(sigs.svm, true.sigs)

# ** Question 2: What can we observe about the precision and recall of this example compared to the ANN from week 6 (pg.126)?

# Now try a classification task with SVM (pg.128)
signals <- trading.signals(Tdata.train[, "T.ind.GSPC"], 0.1, -0.1) # class labels needed below
data <- cbind(signals = signals, Tdata.train[, -1])
ksv <- ksvm(signals ~ ., data[1:1000, ], C = 10)
ks.preds <- predict(ksv, data[1001:2000, ])
sigs.PR(ks.preds, data[1001:2000, 1])

# ** Question 3: Why did we change the C parameter of the ksvm() function? 
#    See pg. 128, online resources, or use the ?ksvm command for more information about this function

# We will skip to Section 3.5 (pg. 130)
# Predictions into Actions -
# We will examine how the signal predictions we obtained with our models can be used (assuming we are trading in futures markets).

# Stock specific terms: (pg. 131)
# Futures markets are based on contracts to buy or sell a commodity on a certain date at
# the price determined by the market at that future time
# Long positions are opened by buying at time t and price p, and selling later (at t + x)
# Short positions are opened when a trader sells at time t with the obligation of buying back in the future
# Generally, we open short positions when we believe prices are going down, and long positions when we believe prices are going up.
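The profit arithmetic behind long and short positions can be sketched with a toy example (the prices are made up, not from the book):

```r
# Toy numbers: profit per unit for each position type
p.open  <- 100                     # price when the position is opened
p.close <- 105                     # price when it is closed
long.profit  <- p.close - p.open   # long: buy at 100, sell at 105 ->  5
short.profit <- p.open - p.close   # short: sell at 100, buy back at 105 -> -5
```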

# The trading strategies defined on pages 131-132 are summarized here:

# First trading strategy we will employ:
# At the end of the first day, the models provide evidence that prices are going down-- a low value of T (the sell signal)
# Therefore we issue a sell order if one is not already open.
# When this order is carried out by the market at a price pr sometime in the future, we will immediately post 2 other orders:
# 1. a "buy limit order" with a limit price of pr - p%, where p% is the target profit margin --
# this will only be carried out if the market price reaches the target limit price or below--
# This order expresses our target profit for the short position just opened -- we will wait 10 days for the target to be reached
# If the order isn't carried out by the 10th day, we will buy at the closing price of the 10th day.
# 2. a "buy stop order" with a price limit of pr + 1% --
# this order is placed with the goal of limiting eventual losses to 1%-- it will be executed if the market reaches the price pr + 1%

# Second trading strategy we will employ:
# At the end of the first day, the models provide evidence that prices are going up-- a high value of T (the buy signal)
# Therefore we issue a buy order if one is not already open.
# This buy order will be accomplished at time t and price pr, and we will immediately post 2 other orders:
# 1. a "sell limit order" with a limit price of pr + p%, where p% is the target profit margin --
# this will only be carried out if the market price reaches the target limit price or above-- Sell limit order will have 10 day deadline
# 2. a "sell stop order" with a price limit of pr - 1% --
# this order is placed with the goal of limiting eventual losses to 1%-- it will be executed if the market reaches the price pr - 1%

# The metrics from 3.3.4 do not fully translate to overall economic performance, so we will use the R package
# PerformanceAnalytics to analyze our performance metrics
# With respect to the overall results we will use:
# 1. Net balance between initial capital and the capital at the end of the testing period (profit/loss)
# 2. Percentage return that this net balance represents
# 3. The excess return over the buy and hold strategy
# More on these metrics is available on pg. 132
# For risk-related measures, we will use the Sharpe ratio coefficient to measure the return per unit of risk
# (the standard deviation of the returns)
# We will also calculate the maximum draw-down-- this measures the maximum cumulative successive loss of the model
# The performance of the positions held during the test period will also be evaluated
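As a base-R sketch of the two risk measures just described (the returns vector is hypothetical; in the lab these numbers come from the simulator and PerformanceAnalytics):

```r
rets <- c(0.01, -0.02, 0.015, -0.005, 0.02)  # hypothetical daily returns
sharpe <- mean(rets) / sd(rets)              # return per unit of risk
equity <- cumprod(1 + rets)                  # cumulative value of 1 invested unit
maxDD  <- max(1 - equity / cummax(equity))   # largest peak-to-trough loss so far
```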

# A simulated trader will be used to put everything together (pg. 133)
# The function trading.simulator() will be used-- this function is in the book package DMwR
# The result of the trader is an object of class tradeRecord containing information about the simulation--
# the object can be used in other functions to obtain economic evaluation metrics or graphs of the trading activity
# The user needs to supply the simulator with trading policy functions written in such a way that the user is aware
# of how the simulator calls them
# at the end of each day d, the simulator calls the trading policy with 4 main arguments:
# 1. a vector with predicted signals until day d
# 2. market quotes up to day d
# 3. the currently opened positions
# 4. the money currently available to the trader

# Run the trading strategies, reading the comments so that you understand the functions

# Strategy 1:
policy.1 <- function(signals,market,opened.pos,money,
                     bet=0.2,hold.time=10,
                     exp.prof=0.025, max.loss= 0.05)
{
  d <- NROW(market) # this is the ID of today
  orders <- NULL
  nOs <- NROW(opened.pos)
  # nothing to do!
  if (!nOs && signals[d] == 'h') return(orders)
  # First lets check if we can open new positions
  # i) long positions
  if (signals[d] == 'b' && !nOs) {
    quant <- round(bet*money/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(1,-1,-1),order.type=c(1,2,3),
                                 val = c(quant,
                                         market[d,'Close']*(1+exp.prof),
                                         market[d,'Close']*(1-max.loss)),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)))
    # ii) short positions
  } else if (signals[d] == 's' && !nOs) {
    # this is the nr of stocks we already need to buy
    # because of currently opened short positions
    need2buy <- sum(opened.pos[opened.pos[,'pos.type']==-1,
                               "N.stocks"])*market[d,'Close']
    quant <- round(bet*(money-need2buy)/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(-1,1,1),order.type=c(1,2,3),
                                 val = c(quant,
                                         market[d,'Close']*(1-exp.prof),
                                         market[d,'Close']*(1+max.loss)),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)))
  }
  # Now lets check if we need to close positions
  # because their holding time is over
  if (nOs)
    for(i in 1:nOs) {
      if (d - opened.pos[i,'Odate'] >= hold.time)
        orders <- rbind(orders,
                        data.frame(order=-opened.pos[i,'pos.type'],
                                   order.type=1,
                                   val = NA,
                                   action = 'close',
                                   posID = rownames(opened.pos)[i]))
    }
  orders
}

# Strategy 2:
policy.2 <- function(signals,market,opened.pos,money,
                     bet=0.2,exp.prof=0.025, max.loss= 0.05)
{
  d <- NROW(market) # this is the ID of today
  orders <- NULL
  nOs <- NROW(opened.pos)
  # nothing to do!
  if (!nOs && signals[d] == 'h') return(orders)
  # First lets check if we can open new positions
  # i) long positions
  if (signals[d] == 'b') {
    quant <- round(bet*money/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(1,-1,-1),order.type=c(1,2,3),
                                 val = c(quant,
                                         market[d,'Close']*(1+exp.prof),
                                         market[d,'Close']*(1-max.loss)),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)))
    # ii) short positions
  } else if (signals[d] == 's') {
    # this is the money already committed to buy stocks
    # because of currently opened short positions
    need2buy <- sum(opened.pos[opened.pos[,'pos.type']==-1,
                               "N.stocks"])*market[d,'Close']
    quant <- round(bet*(money-need2buy)/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(-1,1,1),order.type=c(1,2,3),
                                 val = c(quant,
                                         market[d,'Close']*(1-exp.prof),
                                         market[d,'Close']*(1+max.loss)),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)))
  }
  orders
}

# ** Question 4: Explain the input parameters for the functions that define policy.1 (pg. 133)
# ** signals, market,opened.pos,money, bet=0.2, hold.time=10, exp.prof=0.025, max.loss= 0.05

#Run the trading simulator with the first policy:
# Train and test periods
start <- 1
len.tr <- 1000
len.ts <- 500
tr <- start:(start+len.tr-1)
ts <- (start+len.tr):(start+len.tr+len.ts-1)
# getting the quotes for the testing period
date <- rownames(Tdata.train[start+len.tr,])
market <- GSPC[paste(date,'/',sep='')][1:len.ts]
# learning the model and obtaining its signal predictions
s <- svm(Tform,Tdata.train[tr,],cost=10,gamma=0.01)
p <- predict(s,Tdata.train[ts,])
sig <- trading.signals(p,0.1,-0.1)
# now using the simulated trader
t1 <- trading.simulator(market, sig, "policy.1", list(exp.prof = 0.05, bet = 0.2, hold.time = 30))

# Check the results:
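A likely way to check them (an assumption based on the tradeRecord object returned by the simulator):

```r
summary(t1)            # overview of the trading record
tradingEvaluation(t1)  # economic evaluation metrics (DMwR)
```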

# Try plotting the results:
plot(t1, market, theme = "white", name = "SP500")

# Results of this trader are bad-- there is a negative return. Try the second policy
t2 <- trading.simulator(market, sig, "policy.2", list(exp.prof = 0.05, bet = 0.3))

# the return decreased further
# try a different training and testing period:
start <- 2000
len.tr <- 1000
len.ts <- 500
tr <- start:(start + len.tr - 1)
ts <- (start + len.tr):(start + len.tr + len.ts - 1)
s <- svm(Tform, Tdata.train[tr, ], cost = 10, gamma = 0.01)
p <- predict(s, Tdata.train[ts, ])
sig <- trading.signals(p, 0.1, -0.1)
# remember to update the market quotes to match the new testing period
date <- rownames(Tdata.train[start + len.tr, ])
market <- GSPC[paste(date, "/", sep = "")][1:len.ts]
t2 <- trading.simulator(market, sig, "policy.2", list(exp.prof = 0.05, bet = 0.3))

# This result was even worse-- do not be fooled by a few repetitions of the same experiment,
# even if it includes 2 years of training and testing periods--
# we need more repetitions under different conditions to ensure the statistical reliability of our results

## Model Evaluation and Selection-- How to obtain reliable estimates of the selected evaluation criteria

# Monte Carlo Estimates
# We will use these to estimate the reliability of our evaluation metrics because we cannot use cross-validation

# ** Question 5: Why can we not use cross-validation? (pg. 141)

# We will use a train + test setup in which both the train and test sets are smaller than N,
# so that we can randomly generate different experimental scenarios
# We will use a training set of 10 years and a test set of 5 years (pg. 142) in a Monte Carlo experiment to obtain reliable 
# measures of our evaluation metrics

# ** Question 6: Which windowing technique are we using here? (pg. 122 for review)

# We will then carry out paired comparisons to obtain statistical confidence levels on the observed differences in mean performance

# Create the following functions (pg. 143-144) that will be used to carry out the full train + test + evaluate cycle using different models
# Names ending in R are regression models; names ending in C are classification models

MC.svmR <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                    ...) {
  t <- svm(form, train, ...)
  p <- predict(t, test)
  trading.signals(p, b.t, s.t)
}
MC.svmC <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                    ...) {
  tgtName <- all.vars(form)[1]
  train[, tgtName] <- trading.signals(train[, tgtName],
                                      b.t, s.t)
  t <- svm(form, train, ...)
  p <- predict(t, test)
  factor(p, levels = c("s", "h", "b"))
}
MC.nnetR <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                     ...) {
  t <- nnet(form, train, ...)
  p <- predict(t, test)
  trading.signals(p, b.t, s.t)
}
MC.nnetC <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                     ...) {
  tgtName <- all.vars(form)[1]
  train[, tgtName] <- trading.signals(train[, tgtName],
                                      b.t, s.t)
  t <- nnet(form, train, ...)
  p <- predict(t, test, type = "class")
  factor(p, levels = c("s", "h", "b"))
}
MC.earth <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                     ...) {
  t <- earth(form, train, ...)
  p <- predict(t, test)
  trading.signals(p, b.t, s.t)
}
single <- function(form, train, test, learner, policy.func,
                   ...) {
  p <- do.call(paste("MC", learner, sep = "."), list(form,
                                                     train, test, ...))
  eval.stats(form, train, test, p, policy.func = policy.func)
}
slide <- function(form, train, test, learner, relearn.step,
                  policy.func, ...) {
  real.learner <- learner(paste("MC", learner, sep = "."),
                          pars = list(...))
  p <- slidingWindowTest(real.learner, form, train, test,
                         relearn.step)
  p <- factor(p, levels = 1:3, labels = c("s", "h", "b"))
  eval.stats(form, train, test, p, policy.func = policy.func)
}
grow <- function(form, train, test, learner, relearn.step,
                 policy.func, ...) {
  real.learner <- learner(paste("MC", learner, sep = "."),
                          pars = list(...))
  p <- growingWindowTest(real.learner, form, train, test,
                         relearn.step)
  p <- factor(p, levels = 1:3, labels = c("s", "h", "b"))
  eval.stats(form, train, test, p, policy.func = policy.func)
}

# The above functions obtain predictions and collect the evaluation statistics that we want to estimate 
# We do this using eval.stats (pg. 145) defined as:
eval.stats <- function(form,train,test,preds,b.t=0.1,s.t=-0.1,...) {
  # Signals evaluation
  tgtName <- all.vars(form)[1]
  test[,tgtName] <- trading.signals(test[,tgtName],b.t,s.t)
  st <- sigs.PR(preds,test[,tgtName])
  dim(st) <- NULL
  names(st) <- paste(rep(c('prec','rec'),each=3),
                     c('s','h','b'),sep='.')
  # Trading evaluation
  date <- rownames(test)[1]
  market <- GSPC[paste(date,"/",sep='')][1:length(preds),]
  trade.res <- trading.simulator(market,preds,...)
  c(st,tradingEvaluation(trade.res))
}

# Next we set up a loop to go over a set of alternative trading systems (pg. 145)
# that calls the Monte Carlo routines (single, slide, and grow) with the proper parameters to obtain estimates of their performance
pol1 <- function(signals,market,op,money)
  policy.1(signals,market,op,money,bet=0.2,exp.prof=0.025,max.loss=0.05,hold.time=10)
pol2 <- function(signals,market,op,money)
  policy.1(signals,market,op,money,bet=0.2,exp.prof=0.05,max.loss=0.05,hold.time=20)
pol3 <- function(signals,market,op,money)
  policy.2(signals,market,op,money,bet=0.5,exp.prof=0.05,max.loss=0.05)

# We are now able to run the Monte Carlo experiment (code on pages 146-147) but we will NOT-- 
# Just look over the code and read the comments

# ** Question 7: Why aren't we running the Monte Carlo code? (pg. 146)

# Results Analysis
# Download the objects resulting from the code at http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR/extraFiles.html
# We will NOT examine the file "earth.Rdata"

getwd() #make sure the files are in your working directory
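Assuming the downloaded objects keep the file names used on the book's site (the earth file is skipped per the note above):

```r
load("svmR.Rdata")
load("svmC.Rdata")
load("nnetR.Rdata")
load("nnetC.Rdata")
```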


# Precision is more important than recall in this application (pg. 148)
# We will use the function rankSystems() to examine our results

# ** Question 8: Why is precision more important than recall here? (pg. 148)

# Examine: the return of the systems (Ret), the return over the buy and hold strategy (RetOverBH),
# the percentage of profitable trades (PercProf), the Sharpe ratio (SharpeRatio), and the maximum draw-down (MaxDD) (pg. 149-150)

tgtStats <- c('prec.sb','Ret','PercProf','MaxDD','SharpeRatio')
allSysRes <- join(subset(svmR,stats=tgtStats), subset(svmC,stats=tgtStats),
                  subset(nnetR,stats=tgtStats), subset(nnetC,stats=tgtStats),
                  by = 'variants')
rankSystems(allSysRes, 5, maxs = c(T,T,T,F,T))

# We have suspicious scores in our precision of buy/sell signals (obtaining 100% precision seems odd)
# Inspect these results closer:
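One way to inspect a suspicious system (the variant name below is hypothetical-- substitute one of the top-ranked names from your own rankSystems() output):

```r
getVariant("single.svmC.v5", svmC)  # hypothetical variant name
```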


# At most, these methods made a single trade over the testing period, with an average return of 0.25%,
# which is −77.1% below the naive buy and hold strategy. These models are useless (pg. 151)

# To reach some conclusions on the value of these variants we need to add some constraints on some of the stats
# We want a reasonable number of average trades (more than 20), an average return that is at least greater than 0.5%, and
# a percentage of profitable trades higher than 40%
# Check to see if there are systems that satisfy these constraints:

fullResults <- join(svmR, svmC, nnetC, nnetR, by = "variants")
nt <- statScores(fullResults, "NTrades")[[1]]
rt <- statScores(fullResults, "Ret")[[1]]
pp <- statScores(fullResults, "PercProf")[[1]]
s1 <- names(nt)[which(nt > 20)]
s2 <- names(rt)[which(rt > 0.5)]
s3 <- names(pp)[which(pp > 40)]
namesBest <- intersect(intersect(s1, s2), s3)

# only 3 of the trading systems satisfy these criteria, and all of them use the regression task (have an R at the end of their name)
# The Ret of the single.nnetR.v2 shows marked instability (pg. 153) so we will compare the other two which have similar scores:

compAnalysis(subset(nnetR,
                    stats = c('Ret', 'RetOverBH', 'PercProf', 'MaxDD'),
                    vars = namesBest)) # it's ok if you get warnings here (pg. 154)

# Despite the variability of the results, the above Wilcoxon significance test tells us that the average return of
# “single.nnetR.v12” is higher than those of the other systems with 95% confidence.
# Yet, with respect to the other statistics, this variant is clearly worse.

# Try plotting to get a better idea of the distribution of the scores across all 20 repetitions:
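A plausible plotting call, assuming the comparison objects defined above (fullResults and namesBest):

```r
plot(subset(fullResults,
            stats = c("Ret", "PercProf", "MaxDD"),
            vars = namesBest))
```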

#The scores of the two systems using windowing schemas are very similar, but the results of “single.nnetR.v12” are distinct. 
# We can observe that the high average return is achieved thanks to an abnormal (around 2800%) return in one of 
# the iterations of the Monte Carlo experiment. 
# The remainder of the scores for this system seem inferior to the scores of the other two.

# Evaluating the final Test Data
# This section presents the results obtained by the "best" models in the final evaluation period. 
# This period is formed by 9 years of quotes and we will apply the five selected systems (pg. 156)

# obtain the evaluation statistics of these systems on the 9-year test period
# We need the last 10 years before the evaluation period-- the models will be obtained with these 10 years of data 
# and then will be asked to make their signal predictions for the 9 year evaluation period

#Check out our best model:

getVariant("grow.nnetR.v12", fullResults) # (pg. 157)

# Conduct a deeper analysis to obtain the trading record of the system during this period
data <- tail(Tdata.train, 2540)

model <- learner("MC.nnetR", list(maxit = 750, linout = T,
                                  trace = F, size = 10, decay = 0.001))
preds <- growingWindowTest(model, Tform, data, Tdata.eval,
                           relearn.step = 120)
signals <- factor(preds, levels = 1:3, labels = c("s", "h", "b"))
date <- rownames(Tdata.eval)[1]
market <- GSPC[paste(date, "/", sep = "")][1:length(signals), ]
trade.res <- trading.simulator(market, signals, policy.func = "pol2")

#plot the results
plot(trade.res, market, theme = "white", name = "SP500 - final test")

# ** Question 9: Save your final plot as a .png and insert it into your word doc homework submission

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Jargon and the e-Science Community Blog

Yesterday, I wrote a post for the e-Science Community Blog, which is:

“intended to serve as both a bulletin board for news, upcoming events, and continuing education/job opportunities as well as a forum that librarians can use to post questions or to initiate and engage in discussions. All librarians interested in the emerging area of e-science librarianship are welcome to participate!”

And as the new content editor for Data Management for the accompanying e-Science Portal, I’m expected to periodically write blog posts to the community site. I tried to think of something that could be kept around 600 words or less and would be relatable (harder than I thought it would be)– what I ended up with was a post on e-Science jargon. I wanted to elaborate and give more comprehensive examples, especially for the first section, but I guess that’s what this blog is for (post on teaching Computational Methods in Healthcare Informatics soon to come!). My first post for the Community blog is reposted below. The original can be found here.

Also, here’s the Prezi I mention in the post :)

Tips for New Data Librarians: Working Around the Jargon

Submitted by guest contributor Daina Bouquin, Data & Metadata Services Librarian, Weill Cornell Medical College of Cornell University, dab2058@med.cornell.edu

In my experience, making metadata part of the conversation is one of the hardest things about being a data librarian. That is to say that working data information literacy and data management into the conversation with anyone who isn’t already concerned with these topics can be incredibly difficult– even more so when words and phrases like “metadata,” “version control,” “data integration,” “semantic data structures,” “repositories,” and even “e-Science” are so foreign. Data librarians need to learn how to navigate both the jargon associated with their field and the need to communicate these issues with their patrons, without alienating them or losing their interest from the start.

As a relatively new data librarian in a biomedical research setting, I have come to understand that I need to strategize how I introduce these topics– especially to those patrons who come to me hunting for analysis resources and aren’t as focused on the other issues inherent in data management and curation. Based on my own experience, these are some of my tips for getting around the jargon and getting things done.

First, leverage the discussions you are already having. Whether it’s talking about bibliographic management (managing metadata associated with literature) or finding literature to support someone’s research interests (talk a bit about their study design if you can), see how you can introduce e-Science topics into the conversation. For example, I was approached and asked to teach a class on using Prezi to the post-docs association at my institution, and used that opportunity to integrate data visualization and presentation basics into the topics I covered. The students found it valuable, and the next time I taught the Prezi class I spent half the time talking about data vis and the value of having clean, well-managed data to make data visualization simpler and more effective (this was at the request of those organizing the class). I have many more instances like this where reframing the conversation just a little led to a lot of data-related outreach.

Second, try to avoid a lot of jargon in your constructive criticism of a researcher’s current data management practices. Reframing the discussion to be relevant to the researcher is key. It’s very easy to confuse someone who isn’t familiar with the terms and concepts you’re discussing and it can come off as alienating and long-winded. Focus on asking questions and making consultations a constructive conversation– consultations are as much about learning about the researcher’s needs and how best to address them as they are about anything else. Ask them what their plans are if a lead investigator leaves, or if they have a secure backup strategy, or if they would like to explore more options for making their research more efficient– you don’t have to necessarily talk about “metadata” much, instead you can focus on data organization and research replicability which may be more straightforward.

Which brings me to my last point: make the jargon secondary as often as you can– planning a data organization and collection strategy, discussing workflows, and long-term storage are all words and phrases that are more straightforward than “data management planning”, “data collection instrument selection”, “data validation and audit capability”, “data citation” and “archiving”. Literacy is incredibly important, but literacy goes way beyond just knowing the vocabulary. Try focusing on the strategies employed in the Data Curation Profiles Toolkit when doing consultations and interviews and familiarize yourself with the Glossaries of Data Management Terms and the DMPTool so you are sure you can explain what terms and policies mean when you need to, but focus mostly on making positive changes and being approachable– we all know change is hard, try making it friendlier.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

REDCap and Library Data Management

In my experience (over the past year) I have come to realize that as much as librarians talk about research data management, they’re actually pretty bad at managing their own research workflows– as are most people. I mean, I myself have a growing backlog of blog ideas with messy notes scattered about, but I have a lot of trouble making time to sit and write here. So I do think it’s realistic to let some parts of personal data/information management fall by the wayside; in a professional setting, however, this is much less acceptable.

Case in point: my library has a service where some of our scientific research librarians will (by request) perform systematic reviews of the literature on complex biomedical topics. Now, a systematic review is a time-intensive and difficult process that involves multiple librarians spending hours searching and sifting through literature across multiple databases to come up with a comprehensive list of articles and other literature regarding a specific topic with particular study designs and constraints in order to foster the production of a meta-analysis– these meta-analyses make up the capstone of evidence based medicine and are commonly seen as the authoritative source of best practices in biomedical research.

And producing these systematic reviews means collecting a lot of data and metadata. You see, in order for a systematic review to be valid, like all research it should be replicable (though current data management and archival standards are lax in biomedical research– Damn you, NIH, you already got the memo!). So ideally the librarians doing these searches would keep their records in a way that required controls for versioning, comprehensive audit trails, controlled input/export formats, well-discussed archival processes, documented inclusion/exclusion criteria for articles, a set metadata structure and data architecture, and allow for multimodal interaction with the system in which the data/metadata was stored (e.g. writing notes, querying records, composing metrics, etc). But sadly this is just not the case with the workflow that my library currently has in place– or any that I have heard of for that matter. This is not to say that librarians are not keeping records and documenting their searches; it is to say that it is not being done in a way that is seamless or comprehensive. The librarians involved are making do with what they know and what they have (these are not data librarians that I’m talking about, so their training is not in computing or data-specific problem-solving– that’s my job). And as a result searches are tracked on wikis, in spreadsheets, in word documents, etc. without controls, a high degree of functionality, or sensitivity to archival practices.

So, I’m going to be working with my colleagues at the library to fix this problem. I have proposed and will hopefully very soon begin a relatively small (yet ambitious thanks to anticipated reluctance to change) project wherein I will be designing a data entry system and clarifying the workflow for the systematic review process. And I’m going to do it using REDCap.


REDCap is a really great clinical data management platform available to us for free through our Clinical Translational Sciences Center. Any customization we decide to leverage may end up costing us some money, but I’m hopeful that others will see the value in it. I’ve already run the idea by our Associate Director of the Biomedical Informatics Program within the CTSC and she was very supportive of me finding new ways to use the REDCap platform and encouraged me to pursue this project and its implementation as a research interest– so we’ll see if that happens; for now I’d just be happy to see better data management practices within my library :) I also want to make note here that I’m not saying that REDCap works for all data management problems (it doesn’t! e.g. varied schedule longitudinal clinical studies) but for us, it’s way better than me trying to develop my own SQL-based solution (time consuming) or even a simple MS Access DB, which would be easy to set up but not comprehensive. I’m going with REDCap because it gives people even very new to the system the ability to design data collection instruments and forms and to put clear controls on fields, with comprehensive audit trails and version control so librarians can clearly track the systematic review process. It is also fully HIPAA compliant and cloud-hosted to allow for easy access, security, and archival procedures. It will also give the librarians experience using a system that is used by our clinical researchers so that they will be more adept and comfortable if questions about it arise as they interact with researchers on a day-to-day basis.

Wish me luck :) I’ll keep you posted.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Mad Amounts of Professional Development

"your proposal was accepted"

So “professional development” is really important in library land, but what is it? All over the place I read how important it is to continue learning new skills and developing an attuned understanding of “emerging technologies” in order to address the needs of patrons as they evolve. Networking and skills training are emphasized, and I’m pretty sure everyone has accepted just how cool MOOCs can be, but really, how one approaches professional development is going to vary a lot from person to person. I personally am not a big fan of webinars and would rather meet people at a conference or go through a whole bunch of Codecademy tutorials (finally learning Python) than go to a workshop, but that’s mostly thanks to a few bad workshop experiences– they really don’t have to suck but sometimes they do and it always feels like a gamble to me. Plus, all this takes time, so you need to prioritize to make anything happen– sometimes professional development falls to the back.

But the almighty Wikipedia entry on professional development tells me that professional development in workplaces refers to “the acquisition of skills and knowledge both for personal development and for career advancement. Professional development encompasses all types of facilitated learning opportunities…” And what I’ve been taking that to mean is just keep your eyes open– you need to find fun/cool stuff to do so that people know you’re good at things and like learning. And you actually do need to like what you’re learning to take anything from professional development stuff… at least for me, if I’m not having some fun I won’t really invest myself in what I’m trying to do and it’ll just be a big waste of time (e.g. an RDA workshop that made me want to die). So recently I’ve been trying to figure out what I’m going to do to keep up with professional development, while not feeling like I’m wasting my energy on things I feel like I should do rather than things I want to do.

So what I did was start scanning and quickly reading through everything I could on every medical and/or academic library / health informatics listserv I could get on– not as time consuming as you’d think. I came up with a list of keywords for topics I wanted to invest time in, searched the lists for those (hooray command-F!), and organized the results chronologically. Keywords were things like “data”, “management”, “research”, “funding”, “statistics”, “metrics”, “recent graduates”, “metadata”, etc. I ended up with a list of emails detailing workshops, meetings, MOOCs, essay contests (still applying to these), research opportunities, jobs, and articles written by people in the field. Then I started applying to everything that didn’t look like something I could do on my own and that would help me meet people I could learn from. What I ended up with is a short list of things that I think will help me get some more professional development in ways that work for me.

I took time on the weekends and a little at work to apply for some great opportunities on that list, and I’m happy to report that some of them came through :)

First, I was invited to join the Editorial Team at the UMass Medical eScience Portal for Librarians after applying in January. I'm honored and excited to get to work with such a fantastic team as I start collaborating with them remotely as the new Content Editor for Research Data Management. As an editor I'm tasked with working in collaboration with the rest of the eScience Portal Team to provide and manage information that meets the scope and purpose of the Portal– so basically I need to manage the resources collected on the portal regarding Research Data Management. I'll be researching and gathering links to web resources, creating blog posts, and managing links and other materials related to research data management as needed on the portal site. I will also need to attend some in-person meetings with the rest of the team as we redesign the portal and look for ways to further develop resources for our community. I really do hope this opportunity is something I can build on and continue to grow with without it being too burdensome a time commitment– I'll keep you posted :)

Next though, I am happy to say that I have been accepted into the first cohort of scholars at the Institute for Research Design in Librarianship at Loyola Marymount University in LA. I am soooooo excited for this one :) I found out about it less than two weeks before the application deadline and had to scramble to put together a research proposal, but it paid off. The Institute is designed to support and train library researchers and to help them develop professional research networks as they embark on their first attempts at comprehensive research and publishing in peer-reviewed journals. You can read an abbreviated version of my proposal below:


So I’m going to LA this summer and I cannot wait. It’s been a lot of work to put all of this together and continue with my professional development this way, but I think this just works better for me. I’ll continue participating in consortia and online learning opportunities, but for me, working with others on concrete projects makes me feel better about the skills I’m developing. I’m excited to see what comes next.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Getting Connected: Librarian Meets Clinical Data Management

I'm pretty sure that no matter where you go, the bigger the organization, the more communication problems you're going to encounter. That's certainly the case at my current institution; you need to be plugged in culturally (and technologically) in order to get anywhere and find out what's going on. This is what I've slowly been doing more of over the last few months as I've finally gotten connected with an important group at Cornell– Architecture for Research Computing in Health (ARCH). This group is charged with developing secure repositories for clinical research data where researchers' needs are more complex than what current data management solutions offer. I'm loving finally getting to work with them, even if it took 5 months to get an in– the delay was thanks to poor communication and a lot of assumptions being made up high in the hierarchy, but I made it happen so I'm happy anyway– it was a process I'm hoping to learn from.

This clinical data repository program is trying to fill gaps left by tools like REDCap and to focus on long-term storage and curation of clinical research data from a variety of sources– including but not limited to biobank data, electronic health record data, and clinical trial data. What I'll be doing is helping design our metadata dictionary user interface and training researchers on how to use it :) I'll have screenshots and examples of the interface soon, but the idea is that with systems this complex (dozens of tables and hundreds of variables) you need a tool to explore what's available and understand things like when the data feed was updated, who owns the data, how a variable was measured, access permissions surrounding the data, data collection dates, etc. The interface should help users explore what's available in our repositories and apply for different levels of access to conduct various clinical research activities.
The tool will also have a query builder built into the interface to let researchers start generating reports and creating subsets of data for analysis. I'm also consulting on what metadata should be included in the metadata discovery tool and what can be left out. It's an interesting process that I'm glad to be involved with.
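To make the "dozens of tables and hundreds of variables" problem concrete, here's a rough sketch of what one entry in a metadata dictionary like this might look like, plus the kind of filtering a discovery interface does. The field names and search function are illustrative assumptions on my part, not the actual ARCH schema or interface:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical shape of one entry in a clinical metadata dictionary.
# Field names are illustrative, not the real repository schema.
@dataclass
class VariableMetadata:
    name: str               # variable name as stored in the repository
    table: str              # table the variable lives in
    source: str             # originating system (EHR, biobank, trial, ...)
    owner: str              # data steward to contact about the data
    measurement: str        # how the value was measured or derived
    access_level: str       # permission tier required to view the data
    collected_from: date    # start of the collection window
    last_feed_update: date  # when the upstream feed last refreshed

def searchable(entries, term):
    """Return entries whose variable name or table mentions the term."""
    term = term.lower()
    return [e for e in entries
            if term in e.name.lower() or term in e.table.lower()]
```

The point of recording things like `owner`, `access_level`, and `last_feed_update` alongside each variable is that a researcher browsing the dictionary can answer "can I use this, and is it current?" without having to email anyone first.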

Really though, the biggest obstacle to breaking into the clinical data realm and getting a seat at the table was getting the proper introductions. People I didn't know existed didn't know I existed, and the people who needed to make the connections didn't know they should make them. It also didn't help that the group I was trying to reach had nearly no web presence– they still don't have a website. The only advice I can really give in this situation is to keep insisting on learning more about what everyone around you does– don't get so hung up on your own work that you fail to notice the work being done by your colleagues and other working groups. By asking people I met about what they were working on, I was able to figure out who I should be introduced to and to get my supervisors to make the necessary arrangements for me to meet up with these people and find ways to collaborate. I'm now sitting in on their weekly data-issues meetings and getting to be involved with the whole build-out of the repositories. If only they taught you in library school just how much of the job is networking and navigating the politics of higher education. Learning by the seat of my pants.

© 2015 Daina Bouquin
