Daina Bouquin

Data Geek. Librarian. Dangerous Lady.

The Institute for Research Design in Librarianship


A few months ago I was accepted into a program called The Institute for Research Design in Librarianship (IRDL), and after much ado, I am finally in Los Angeles participating as a scholar in its inaugural class. I wrote about getting accepted in an earlier blog post, but here I’m going to elaborate a bit.

The goal of the Research Institute is to provide new researchers in Information and Library Science fields with in-depth training in research methods, and to support us in developing professional research networks as we embark on our first attempts at comprehensive research and publishing in peer-reviewed journals. We spend about 9 hours a day in class doing exercises and discussing study design and the research process, while focusing on revising our initial study proposals. 


Our instructors include Greg Guest, author of our textbook and many other works on research methods, along with others in the field who consult with us on our specific project needs and advise us as we move through the program.

We have about an hour in the afternoon to write, and we are encouraged to participate in an online community and incorporate our reflective writing into a blog so we can help each other throughout the research process. Because of this online community, you can now follow my progress as I design my first research study on my IRDL blog: “Daina and Research Design” http://buddy.irdlonline.org/dbouquin/


New Experiences with Teaching: Computational Methods in Health Informatics

I am currently sitting in an apartment in Los Angeles, California (first time to the west coast!) feeling exhausted after my first day participating as a research scholar at the Institute for Research Design in Librarianship at Loyola Marymount University. I described the institute in another post when I was first accepted, and I am positive I will have much more to write on this whole experience as I go through it, but tonight I am going to focus on my experience so far teaching my first full semester course at Cornell Medical– Computational Methods in Health Informatics.

So, how did I come to be teaching this class? Last October, I was asked by the professor designing this new Comp Methods course for the Master’s Program in Health Informatics for suggestions on textbooks and resources that could act as an introduction to data mining methods without being too overwhelming for beginners, and that could possibly make use of a common algorithm suite– I suggested Weka (and this is the book we initially decided to go with). But the more we discussed the course objectives, the more obvious it became that the students should really learn basic R coding– it’s a skillset they wouldn’t likely pick up in their other classes, and R is much more versatile than any other platform they could be using in this domain. And so we were off looking for another textbook, and our conversation about resources grew. I put together a list of suggested books, but in the end the head instructor decided to go with a book suggested to him by a colleague (this one) that is much less in-depth than any I had chosen. The chosen book focuses on specific applications of data mining methods and is shorter than any other text we could find, which is both good and bad: it assumes the students are already familiar with much of the background and terminology in this area (and that is certainly not the case). So we decided that I would develop a resource guide to supplement the textbook and give students a place to go for help on topics beyond the scope and goals of the course. The course syllabus is available here: Syllabus_HINF5008, but basically we wanted a course that would act as an introduction to data mining and computational approaches so that the students could make informed decisions about methodology and better communicate and collaborate with others using these techniques.

So I produced this resource guide for the class and was invited to introduce these resources and topics in the first lecture. And because I am an R user with graduate-level training in data mining, I was also invited to teach a few other lectures where I could work in some conversation about data literacy and good data management practices. For example, I was invited to teach our lectures on working with unknown values and exploring datasets graphically– each of these topics is premised on the idea that there may not be enough documentation associated with the data (metadata) to fully understand it, and thus more exploration is warranted before anything else can happen with the dataset. That situation is a good framework for discussing the value of proper data management and curation to prevent situations where not much is known about the data, so I stepped in and talked about some of the services offered by the consulting group I’m in, the Cornell Research Data Management Services Group, and some best practices :) I was also invited to teach a few lectures on specific data mining techniques that interested me and to introduce the final project and places to go to find and download datasets that might be relevant to the students’ interests. The final project rubric isn’t finished yet, but we will have the students come up with their own data mining question about a dataset they go out and find, run and explain their analyses while documenting their work, and then present their findings individually or in groups.
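To give a flavor of what those lab topics look like in practice, here is a minimal sketch (not the actual lecture material) using R’s built-in airquality dataset, which ships with missing values:

# Minimal sketch (not the actual lecture material): a first look at a dataset with unknown values
data(airquality)
summary(airquality)                # NA counts appear in each column's summary
colSums(is.na(airquality))         # how many unknown values per variable
# A quick graphical exploration before deciding how to handle the NAs
hist(airquality$Ozone, main = "Ozone", xlab = "Ozone (ppb)")
boxplot(Ozone ~ Month, data = airquality, xlab = "Month", ylab = "Ozone (ppb)")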

As I mentioned, I am not the only instructor for this class. Rather, we ended up splitting the content roughly in half (which makes my being out of town much more reasonable– my co-instructor is teaching the two weeks that I am away), and as of this past Wednesday I finished teaching the first half of my content: three 1-hour lectures and three 2-hour labs. So what have I learned?

First, it feels like I’m relearning everything I know about these topics. I have to think about how to approach data mining and data management as someone who has never heard anything about them. I was in that position just a few years ago, but it’s still a challenge to step back and reframe my explanations in ways that are relatable.

Next, I almost immediately realized that teaching a full semester (even just half of it) is an incredibly time-consuming ordeal! I am still in the process of finding out how/if I will be paid for this unplanned additional teaching outside of the library, but it looks like I will be getting additional compensation since the classes are in the evening. This experience has been so much more intense than the few 1-2 hour lectures I’ve given in the past and I have a whole new-found appreciation for all of the planning that goes into a class!

I’m also seeing that teaching and working with the students is helping me more fully understand the content of my lectures– they are asking questions! And that forces me to consider new angles and ways of interpreting data analyses that I wouldn’t have seen on my own. The students aren’t coming up with their own data stories yet, but they are questioning the results of the analyses they’re running out of their book, which is very useful. It means they are thinking critically about what they’re doing. So I’m finding that teaching things you’ve learned yourself helps you understand them more comprehensively. I’m also interacting with the students via Canvas, which is a completely new experience for me.

To conclude, here are outlines from my first lecture and my most recent lecture to give you an idea of what’s going on: Lecture1_HINF5008 and Lecture7_HINF. Below is an example of how we’re running the lab. We decided to have the students use the textbook as a lab guide and to use lectures as our venue to more fully discuss theory. This is the lab script I wrote for my most recent week as instructor. You can scroll left and right in the code block. Looking forward to the second half of this semester and getting feedback from the students about how they liked the class.


# HINF: 5008 Week 7 Lab
# June 11th, 2014; 5-7 pm
# Predicting: Support Vector Machines, Monte Carlo Evaluation
# Book section 3.1-3.4.2.1 (pp. 126-164)
# Name your file using the convention: LastNameFirstInitial_Lab7

# You will need objects that you created in lab 6. If you have cleared your workspace, re-run any needed script from lab 6.
# Remember that the prediction models we're using were all selected because these techniques are well known for their ability
# to handle highly nonlinear regression problems-- like those inherent in time series prediction.
# Many other approaches can be applied to problems like ours though, so do not assume you are limited to the approaches we discuss here.

# You will find numbered questions throughout this lab guide.
# These correspond to questions in a Word document available for download in the Assignments tab of Canvas.
# (The 9 questions are the same in both places.)
# As usual, you will be handing in 1) lab script, and 2) homework assignment
# Please hand in your lab script and homework assignment as two separate files.

# The homework assignment for this week is to answer the 9 questions.
# Use the homework Word document (rather than the lab script) as a template to record your answers for the homework.
# To facilitate grading, please include only the answers to the questions in the homework. In other words, all code
# and terminal output should be restricted to the lab script; please do not include any script or code in the homework
# unless it's necessary to answer the question.

#Load packages from book:
install.packages("randomForest")
library(randomForest)
install.packages("quantmod")
library(quantmod)
install.packages("kernlab")
library(kernlab)
install.packages("e1071")
library(e1071)
install.packages("mda")
library(mda)
install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)

# Support Vector Machines (SVMs) - supervised learning method for classification and regression tasks

# ** Question 1: Review - What is the difference between classification and regression? 
#    You may use online resources or resources on the class guide at http://med.cornell.libguides.com/HINF5008 
#    to answer this question if necessary- please cite your source if this is the case

# SVM is better at generalizing than our earlier ANN
# The basic idea behind SVMs is that of mapping the original data into a
# new, high-dimensional space so that it's possible to apply linear models to
# obtain a separating hyperplane (pg.127)
# The mapping of the original data into this new space is carried out with the help of kernel functions
# See lecture 7 notes and Section 9.3 in Han's "Data Mining: Concepts and Techniques" (available on the resource guide) for more
# information on kernel functions
# SVMs maximize the separation margin between cases belonging to different classes (pg.127)

# Try a regression task with SVM (pg.128)
sv <- svm(Tform, Tdata.train[1:1000, ], gamma = 0.001, cost = 100)
s.preds <- predict(sv, Tdata.train[1001:2000, ])
sigs.svm <- trading.signals(s.preds, 0.1, -0.1)
true.sigs <- trading.signals(Tdata.train[1001:2000, "T.ind.GSPC"],0.1,-0.1)
sigs.PR(sigs.svm, true.sigs)

# ** Question 2: What can we observe about the precision and recall of this example compared to the ANN from week 6 (pg.126)?

# Now try a classification task with SVM (pg.128)
data <- cbind(signals = signals, Tdata.train[, -1])
ksv <- ksvm(signals ~ ., data[1:1000, ], C = 10)
ks.preds <- predict(ksv, data[1001:2000, ])
sigs.PR(ks.preds, data[1001:2000, 1])

# ** Question 3: Why did we change the C parameter of the ksvm() function? 
#    See pg. 128, online resources, or use the ?ksvm command for more information about this function

# We will skip to Section 3.5 (pg. 130)
# Predictions into Actions - 
# We will examine how the signal predictions we obtained with our models can be used (assuming we are trading in futures markets).

# Stock specific terms: (pg. 131)
# Futures markets are based on contracts to buy or sell a commodity on a certain future date at
# the price determined by the market at that time
# Long positions are opened by buying at time t and price p, and selling later (t + x)
# Short positions are opened when a trader sells at time t with the obligation of buying in the future
# Generally, we open short positions when we believe prices are going down, and long positions when we believe prices are going up.

# The trading strategies defined on pages 131-132 are summarized here:

# First trading strategy we will employ:
# End of first day, models provide evidence that prices are going down-- a low value of T (the sell signal)
# Therefore we issue sell order if one is not already being issued.
# When this order is carried out by the market at price pr sometime in the future, we will immediately post 2 other orders:
# 1. a "buy limit order" with a limit price of pr - p%, where p% is the target profit margin --
# this will only be carried out if the market price reaches the target limit price or below--
# This order expresses our target profit for the short position just opened -- we will wait 10 days for the target to be reached
# If the order isn't carried out by the 10th day we will buy at the closing price of the 10th day.
# 2. a "buy stop order" with a price limit of pr + 1% --
# this order is placed with the goal of limiting eventual losses to 1%-- it will be executed if the market reaches the price pr + 1%

# Second trading strategy we will employ:
# End of first day, models provide evidence that prices are going up-- a high value of T (the buy signal)
# Therefore we issue a buy order if one is not already being issued.
# We will post a buy order that will be accomplished at time t and price pr, and immediately post 2 other orders:
# 1. a "sell limit order" with a limit price of pr + p%, where p% is the target profit margin --
# this will only be carried out if the market price reaches the target limit price or above-- Sell limit order will have 10 day deadline
# 2. a "sell stop order" with a price limit of pr - 1% --
# this order is placed with the goal of limiting eventual losses to 1%-- it will be executed if the market reaches the price pr - 1%
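
# --- Optional numeric illustration (not from the book) of the limit/stop prices above ---
# Suppose a long position is opened at pr = 100, with a 2.5% target profit and a 1% maximum loss:
pr <- 100
exp.prof <- 0.025    # target profit margin (p%)
max.loss <- 0.01     # maximum acceptable loss
pr * (1 + exp.prof)  # sell limit order price: 102.5 -- take the profit if the market reaches it
pr * (1 - max.loss)  # sell stop order price:  99    -- cut the loss if the market falls to it
# For a short position opened at pr the signs flip: a buy limit at pr * (1 - exp.prof) and a buy stop at pr * (1 + max.loss)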

# The metrics from 3.3.4 do not fully translate to overall economic performance, so we will use the R package
# PerformanceAnalytics to analyze our performance metrics
# With respect to the overall results we will use:
# 1. Net balance between initial capital and the capital at the end of the testing period (profit/loss)
# 2. Percentage return that this net balance represents
# 3. The excess return over the buy and hold strategy
# More on these metrics is available on pg. 132
# For risk-related measures, we will use the Sharpe ratio coefficient to measure the return per unit of risk 
# (the standard deviation of the returns)
# We will also calculate maximum draw-down-- this measures the maximum cumulative successive loss of the model
# Performance of the positions held during the test period will be evaluated
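
# --- Optional toy example (not from the book) to get a feel for these two risk measures ---
# The return series below is randomly generated, purely for illustration
toy.ret <- xts::xts(rnorm(100, mean = 0.0005, sd = 0.01), order.by = Sys.Date() - 100:1)
colnames(toy.ret) <- "toy.returns"
SharpeRatio(toy.ret, FUN = "StdDev")  # average return per unit of risk (standard deviation of returns)
maxDrawdown(toy.ret)                  # largest cumulative successive loss over the period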

# A simulated trader will be used to put everything together (pg. 133)
# The function trading.simulator() will be used-- this function is in the book package DMwR
# the result of the trader is an object of class tradeRecord containing information about the simulation--
# the object can be used in other functions to obtain economic evaluation metrics or graphs of the trading activity
# the user needs to supply the simulator with trading policy functions written in such a way that the user is aware 
# of how the simulator calls them
# at the end of each day d, the simulator calls the trading policy with 4 main arguments:
# 1. a vector with predicted signals until day d
# 2. market quotes up to day d
# 3. the currently opened positions
# 4. the money currently available to the trader

# Run the trading strategies reading the comments so that you understand the functions

# Strategy 1:
policy.1 <- function(signals,market,opened.pos,money,
                     bet=0.2,hold.time=10,
                     exp.prof=0.025, max.loss= 0.05
)
{
  d <- NROW(market) # this is the ID of today
  orders <- NULL
  nOs <- NROW(opened.pos)
  # nothing to do!
  if (!nOs && signals[d] == 'h') return(orders)
  
  # First lets check if we can open new positions
  # i) long positions
  if (signals[d] == 'b' && !nOs) {
    quant <- round(bet*money/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(1,-1,-1),order.type=c(1,2,3),
                                 val = c(quant,
                                         market[d,'Close']*(1+exp.prof),
                                         market[d,'Close']*(1-max.loss)
                                 ),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)
                      )
      )
    
    # ii) short positions
  } else if (signals[d] == 's' && !nOs) {
    # this is the nr of stocks we already need to buy
    # because of currently opened short positions
    need2buy <- sum(opened.pos[opened.pos[,'pos.type']==-1,
                               "N.stocks"])*market[d,'Close']
    quant <- round(bet*(money-need2buy)/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(-1,1,1),order.type=c(1,2,3),
                                 val = c(quant,
                                         market[d,'Close']*(1-exp.prof),
                                         market[d,'Close']*(1+max.loss)
                                 ),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)
                      )
      )
  }
  
  # Now lets check if we need to close positions
  # because their holding time is over
  if (nOs)
    for(i in 1:nOs) {
      if (d - opened.pos[i,'Odate'] >= hold.time)
        orders <- rbind(orders,
                        data.frame(order=-opened.pos[i,'pos.type'],
                                   order.type=1,
                                   val = NA,
                                   action = 'close',
                                   posID = rownames(opened.pos)[i]
                        )
        )
    }
  
  orders
}

#Strategy 2.
policy.2 <- function(signals,market,opened.pos,money,
                     bet=0.2,exp.prof=0.025, max.loss= 0.05
)
{
  d <- NROW(market) # this is the ID of today
  orders <- NULL
  nOs <- NROW(opened.pos)
  # nothing to do!
  if (!nOs && signals[d] == 'h') return(orders)
  
  # First lets check if we can open new positions
  # i) long positions
  if (signals[d] == 'b') {
    quant <- round(bet*money/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(1,-1,-1),order.type=c(1,2,3),
                                 val = c(quant,
                                         market[d,'Close']*(1+exp.prof),
                                         market[d,'Close']*(1-max.loss)
                                 ),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)
                      )
      )
    
    # ii) short positions
  } else if (signals[d] == 's') {
    # this is the money already committed to buy stocks
    # because of currently opened short positions
    need2buy <- sum(opened.pos[opened.pos[,'pos.type']==-1,
                               "N.stocks"])*market[d,'Close']
    quant <- round(bet*(money-need2buy)/market[d,'Close'],0)
    if (quant > 0)
      orders <- rbind(orders,
                      data.frame(order=c(-1,1,1),order.type=c(1,2,3),
                                 val = c(quant,
                                         market[d,'Close']*(1-exp.prof),
                                         market[d,'Close']*(1+max.loss)
                                 ),
                                 action = c('open','close','close'),
                                 posID = c(NA,NA,NA)
                      )
      )
  }
  
  orders
}

# ** Question 4: Explain the input parameters for the functions that define policy.1 (pg. 133)
# ** signals, market, opened.pos, money, bet=0.2, hold.time=10, exp.prof=0.025, max.loss=0.05

#Run the trading simulator with the first policy:
# Train and test periods
start <- 1
len.tr <- 1000
len.ts <- 500
tr <- start:(start+len.tr-1)
ts <- (start+len.tr):(start+len.tr+len.ts-1)
# getting the quotes for the testing period
data(GSPC)
date <- rownames(Tdata.train[start+len.tr,])
market <- GSPC[paste(date,'/',sep='')][1:len.ts]
# learning the model and obtaining its signal predictions
library(e1071)
s <- svm(Tform,Tdata.train[tr,],cost=10,gamma=0.01)
p <- predict(s,Tdata.train[ts,])
sig <- trading.signals(p,0.1,-0.1)
# now using the simulated trader
t1 <- trading.simulator(market,sig,
                        'policy.1',list(exp.prof=0.05,bet=0.2,hold.time=30))

# Check the results:
t1
tradingEvaluation(t1)

# Try plotting the results:
plot(t1, market, theme = "white", name = "SP500")

# Results of this trader are bad-- there is a negative return. Try the second policy
t2 <- trading.simulator(market, sig, "policy.2", list(exp.prof = 0.05, bet = 0.3))
summary(t2)
tradingEvaluation(t2)

# the return decreased further
# try a different training and testing period:
start <- 2000
len.tr <- 1000
len.ts <- 500
tr <- start:(start + len.tr - 1)
ts <- (start + len.tr):(start + len.tr + len.ts - 1)
s <- svm(Tform, Tdata.train[tr, ], cost = 10, gamma = 0.01)
p <- predict(s, Tdata.train[ts, ])
sig <- trading.signals(p, 0.1, -0.1)
t2 <- trading.simulator(market, sig, "policy.2", list(exp.prof = 0.05, bet = 0.3))
summary(t2)
tradingEvaluation(t2)

# This result was even worse-- do not be fooled by a few repetitions of the same experiment
# even if it includes 2 years of training and testing periods--
# we need more repetitions under different conditions to ensure statistical reliability of our results

## Model Evaluation and Selection: How to obtain reliable estimates of the selected evaluation criteria

# Monte Carlo Estimates
# We will use these to estimate the reliability of our evaluation metrics because we cannot use cross-validation

# ** Question 5: Why can we not use cross-validation? (pg. 141)

# We will use a train + test setup to obtain our estimates, ensuring that both the train and test sets
# used are smaller than N so that we can randomly generate different experimental scenarios
# We will use a training set of 10 years and a test set of 5 years (pg. 142) in a Monte Carlo experiment to obtain reliable 
# measures of our evaluation metrics

# ** Question 6: Which windowing technique are we using here? (pg. 122 for review)

# We will then carry out paired comparisons to obtain statistical confidence levels on the observed differences in mean performance
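
# --- Optional sketch (not the book's code): the intuition behind ONE Monte Carlo repetition ---
# A start point is chosen at random so that a train window followed by a test window fits inside the data;
# repeating this many times with different random starts gives a distribution for our evaluation metrics.
# The window sizes in the example call are illustrative placeholders, not the book's settings.
mc.one.split <- function(data, train.size, test.size) {
  start <- sample(1:(NROW(data) - train.size - test.size + 1), 1)
  list(train = data[start:(start + train.size - 1), ],
       test  = data[(start + train.size):(start + train.size + test.size - 1), ])
}
# e.g. sp <- mc.one.split(Tdata.train, 1000, 500)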

# Create the following functions (pg. 143-144) that will be used to carry out the full train + test + evaluate cycle using different models
# Names ending in R are regression models; names ending in C are classification models

MC.svmR <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                    ...) {
  require(e1071)
  t <- svm(form, train, ...)
  p <- predict(t, test)
  trading.signals(p, b.t, s.t)
}
MC.svmC <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                    ...) {
  require(e1071)
  tgtName <- all.vars(form)[1]
  train[, tgtName] <- trading.signals(train[, tgtName],
                                      b.t, s.t)
  t <- svm(form, train, ...)
  p <- predict(t, test)
  factor(p, levels = c("s", "h", "b"))
}
MC.nnetR <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                     ...) {
  require(nnet)
  t <- nnet(form, train, ...)
  p <- predict(t, test)
  trading.signals(p, b.t, s.t)
}
MC.nnetC <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                     ...) {
  require(nnet)
  tgtName <- all.vars(form)[1]
  train[, tgtName] <- trading.signals(train[, tgtName],
                                      b.t, s.t)
  t <- nnet(form, train, ...)
  p <- predict(t, test, type = "class")
  factor(p, levels = c("s", "h", "b"))
}
MC.earth <- function(form, train, test, b.t = 0.1, s.t = -0.1,
                     ...) {
  require(earth)
  t <- earth(form, train, ...)
  p <- predict(t, test)
  trading.signals(p, b.t, s.t)
}
single <- function(form, train, test, learner, policy.func,
                   ...) {
  p <- do.call(paste("MC", learner, sep = "."), list(form,
                                                     train, test, ...))
  eval.stats(form, train, test, p, policy.func = policy.func)
}
slide <- function(form, train, test, learner, relearn.step,
                  policy.func, ...) {
  real.learner <- learner(paste("MC", learner, sep = "."),
                          pars = list(...))
  p <- slidingWindowTest(real.learner, form, train, test,
                         relearn.step)
  p <- factor(p, levels = 1:3, labels = c("s", "h", "b"))
  eval.stats(form, train, test, p, policy.func = policy.func)
}
grow <- function(form, train, test, learner, relearn.step,
                 policy.func, ...) {
  real.learner <- learner(paste("MC", learner, sep = "."),
                          pars = list(...))
  p <- growingWindowTest(real.learner, form, train, test,
                         relearn.step)
  p <- factor(p, levels = 1:3, labels = c("s", "h", "b"))
  eval.stats(form, train, test, p, policy.func = policy.func)
} 

# The above functions obtain predictions and collect the evaluation statistics that we want to estimate 
# We do this using eval.stats (pg. 145) defined as:
eval.stats <- function(form,train,test,preds,b.t=0.1,s.t=-0.1,...) {
  # Signals evaluation
  tgtName <- all.vars(form)[1]
  test[,tgtName] <- trading.signals(test[,tgtName],b.t,s.t)
  st <- sigs.PR(preds,test[,tgtName])
  dim(st) <- NULL
  names(st) <- paste(rep(c('prec','rec'),each=3),
                     c('s','b','sb'),sep='.')
  
  # Trading evaluation
  date <- rownames(test)[1]
  market <- GSPC[paste(date,"/",sep='')][1:length(preds),]
  trade.res <- trading.simulator(market,preds,...)
  
  c(st,tradingEvaluation(trade.res))
}

# Next we set up a loop to go over a set of alternative trading systems (pg. 145)
# that calls the Monte Carlo routines (single, slide, and grow) with proper parameters to obtain estimates of their performance
pol1 <- function(signals,market,op,money)
  policy.1(signals,market,op,money,
           bet=0.2,exp.prof=0.025,max.loss=0.05,hold.time=10)
pol2 <- function(signals,market,op,money)
  policy.1(signals,market,op,money,
           bet=0.2,exp.prof=0.05,max.loss=0.05,hold.time=20)
pol3 <- function(signals,market,op,money)
  policy.2(signals,market,op,money,
           bet=0.5,exp.prof=0.05,max.loss=0.05)

# We are now able to run the Monte Carlo experiment (code on pages 146-147) but we will NOT-- 
# Just look over the code and read the comments

# ** Question 7: Why aren't we running the Monte Carlo code? (pg. 146)

# Results Analysis
# Download the objects resulting from the code at http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR/extraFiles.html
# We will NOT examine the file "earth.Rdata"

getwd() #make sure the files are in your working directory

load("svmR.Rdata")
load("svmC.Rdata")
load("nnetR.Rdata")
load("nnetC.Rdata")

# Precision is more important than recall in this application (pg. 148)
# We will use the function rankSystems() to examine our results

# ** Question 8: Why is precision more important than recall here? (pg. 148)

# Examine: the return of the systems (Ret), the return over the buy and hold strategy (RetOverBH), 
# Percentage of profitable trades (PercProf), SharpeRatio, and Maximum Draw-down (MaxDD) (pg. 149-150)

tgtStats <- c('prec.sb','Ret','PercProf',
              'MaxDD','SharpeRatio')
allSysRes <- join(subset(svmR,stats=tgtStats),
                  subset(svmC,stats=tgtStats),
                  subset(nnetR,stats=tgtStats),
                  subset(nnetC,stats=tgtStats),
                  by = 'variants')
rankSystems(allSysRes,5,maxs=c(T,T,T,F,T))

# We have suspicious scores in our precision of buy/sell signals (obtaining 100% precision seems odd)
# Inspect these results closer:

summary(subset(svmC,
                 stats=c('Ret','RetOverBH','PercProf','NTrades'),
                 vars=c('slide.svmC.v5','slide.svmC.v6')))

#At most these methods made a single trade over the testing period with an average return of 0.25%, 
# which is −77.1% below the naive buy and hold strategy. These models are useless (pg. 151)

# To reach some conclusions on the value of these variants we need to add some constraints on some of the stats
# We want a reasonable number of average trades (more than 20), an average return that is at least greater than 0.5%, and
# a percentage of profitable trades higher than 40%
# Check to see if there are systems that satisfy these constraints:

fullResults <- join(svmR, svmC, nnetC, nnetR, by = "variants")
nt <- statScores(fullResults, "NTrades")[[1]]
rt <- statScores(fullResults, "Ret")[[1]]
pp <- statScores(fullResults, "PercProf")[[1]]
s1 <- names(nt)[which(nt > 20)]
s2 <- names(rt)[which(rt > 0.5)]
s3 <- names(pp)[which(pp > 40)]
namesBest <- intersect(intersect(s1, s2), s3)
summary(subset(fullResults,
               stats=tgtStats,
               vars=namesBest))

# only 3 of the trading systems satisfy these criteria, and all of them use the regression task (have an R at the end of their name)
# The Ret of the single.nnetR.v2 shows marked instability (pg. 153) so we will compare the other two which have similar scores:

compAnalysis(subset(fullResults,
                      stats=tgtStats,
                      vars=namesBest)) # it's ok if you get warnings here (pg. 154)

# Despite the variability of the results, the above Wilcoxon significance test tells us that the average return of
# “single.nnetR.v12” is higher than those of the other systems with 95% confidence.
# Yet, with respect to the other statistics, this variant is clearly worse.

# Try plotting to get a better idea of the distribution of the scores across all 20 repetitions:
plot(subset(fullResults,
            stats=c('Ret','PercProf','MaxDD'),
            vars=namesBest))

#The scores of the two systems using windowing schemas are very similar, but the results of “single.nnetR.v12” are distinct. 
# We can observe that the high average return is achieved thanks to an abnormal (around 2800%) return in one of 
# the iterations of the Monte Carlo experiment. 
# The remainder of the scores for this system seem inferior to the scores of the other two.

# Evaluating the final Test Data
# This section presents the results obtained by the "best" models in the final evaluation period. 
# This period is formed by 9 years of quotes and we will apply the five selected systems (pg. 156)

# obtain the evaluation statistics of these systems on the 9-year test period
# We need the last 10 years before the evaluation period-- the models will be obtained with these 10 years of data 
# and then will be asked to make their signal predictions for the 9 year evaluation period

#Check out our best model:

getVariant("grow.nnetR.v12", fullResults) # (pg. 157)

# Conduct a deeper analysis to obtain the trading record of the system during this period
data <- tail(Tdata.train, 2540)

model <- learner("MC.nnetR", list(maxit = 750, linout = T,
                                  trace = F, size = 10, decay = 0.001))
preds <- growingWindowTest(model, Tform, data, Tdata.eval,
                           relearn.step = 120)
signals <- factor(preds, levels = 1:3, labels = c("s", "h",
                                                  "b"))
date <- rownames(Tdata.eval)[1]
market <- GSPC[paste(date, "/", sep = "")][1:length(signals),
                                           ]
trade.res <- trading.simulator(market, signals, policy.func = "pol2")

#plot the results
plot(trade.res, market, theme = "white", name = "SP500 - final test")

# ** Question 9: Save your final plot as a .png and insert it into your word doc homework submission


Jargon and the e-Science Community Blog

Yesterday, I wrote a post for the e-Science Community Blog, which is:

“intended to serve as both a bulletin board for news, upcoming events, and continuing education/job opportunities as well as a forum that librarians can use to post questions or to initiate and engage in discussions. All librarians interested in the emerging area of e-science librarianship are welcome to participate!”

And as the new content editor for Data Management for the accompanying e-Science Portal, I’m expected to periodically write blog posts to the community site. I tried to think of something that could be kept around 600 words or less and would be relatable (harder than I thought it would be)– what I ended up with was a post on e-Science jargon. I wanted to elaborate and give more comprehensive examples, especially for the first section, but I guess that’s what this blog is for (post on teaching Computational Methods in Healthcare Informatics soon to come!). My first post for the Community blog is reposted below. The original can be found here.

Also, here’s the Prezi I mention in the post :)

Tips for New Data Librarians: Working Around the Jargon

Submitted by guest contributor Daina Bouquin, Data & Metadata Services Librarian, Weill Cornell Medical College of Cornell University, dab2058@med.cornell.edu

In my experience, making metadata part of the conversation is one of the hardest things about being a data librarian. That is to say, working data information literacy and data management into the conversation with anyone who isn’t already concerned with these topics can be incredibly difficult– even more so when words and phrases like “metadata,” “version control,” “data integration,” “semantic data structures,” “repositories,” and even “e-Science” are so foreign. Data librarians need to learn how to navigate both the jargon associated with their field and the need to communicate these issues with their patrons, without alienating patrons or losing their interest from the start.

As a relatively new data librarian in a biomedical research setting, I have come to understand that I need to strategize how I introduce these topics– especially to those patrons who come to me hunting for analysis resources and aren’t as focused on the other issues inherent in data management and curation. Based on my own experience, these are some of my tips for getting around the jargon and getting things done.

First, leverage the discussions you are already having. Whether it’s talking about bibliographic management (managing metadata associated with literature) or finding literature to support someone’s research interests (talk a bit about their study design if you can), see how you can introduce e-Science topics into the conversation. For example, I was approached and asked to teach a class on using Prezi to the post-docs association at my institution, and used that opportunity to integrate data visualization and presentation basics into the topics I covered. The students found it valuable, and the next time I taught the Prezi class I spent half the time talking about data vis and the value of having clean, well-managed data to make data visualization simpler and more effective (this was at the request of those organizing the class). I have many more instances like this where reframing the conversation just a little led to a lot of data-related outreach.

Second, try to avoid a lot of jargon in your constructive criticism of a researcher’s current data management practices. Reframing the discussion to be relevant to the researcher is key. It’s very easy to confuse someone who isn’t familiar with the terms and concepts you’re discussing and it can come off as alienating and long-winded. Focus on asking questions and making consultations a constructive conversation– consultations are as much about learning about the researcher’s needs and how best to address them as they are about anything else. Ask them what their plans are if a lead investigator leaves, or if they have a secure backup strategy, or if they would like to explore more options for making their research more efficient– you don’t have to necessarily talk about “metadata” much, instead you can focus on data organization and research replicability which may be more straightforward.

Which brings me to my last point: make the jargon secondary as often as you can– phrases like “planning a data organization and collection strategy” or “discussing workflows and long-term storage” are more straightforward than “data management planning”, “data collection instrument selection”, “data validation and audit capability”, “data citation” and “archiving”. Literacy is incredibly important, but literacy goes way beyond just knowing the vocabulary. Try focusing on the strategies employed in the Data Curation Profiles Toolkit when doing consultations and interviews, and familiarize yourself with the Glossaries of Data Management Terms and the DMPTool so you are sure you can explain what terms and policies mean when you need to, but focus mostly on making positive changes and being approachable– we all know change is hard, so try making it friendlier.


REDCap and Library Data Management

In my experience over the past year, I have come to realize that as much as librarians talk about research data management, they’re actually pretty bad at managing their own research workflows– as are most people. I myself have a growing backlog of blog ideas with messy notes scattered about, but I have a lot of trouble making time to sit and write here. So I do think it’s realistic to let some parts of personal data/information management fall by the wayside; in a professional setting, however, this is much less acceptable.

Case in point: my library has a service where some of our scientific research librarians will (by request) perform systematic reviews of the literature on complex biomedical topics. Now, a systematic review is a time-intensive and difficult process in which multiple librarians spend hours searching and sifting through literature across multiple databases to come up with a comprehensive list of articles and other literature on a specific topic, with particular study designs and constraints, in order to foster the production of a meta-analysis– these meta-analyses make up the capstone of evidence-based medicine and are commonly seen as the authoritative source of best practices in biomedical research.

And producing these systematic reviews means collecting a lot of data and metadata. You see, in order for a systematic review to be valid, like all research it should be replicable (though current data management and archival standards are lax in biomedical research– damn you, NIH, you already got the memo!). So ideally the librarians doing these searches would keep their records in a way that required controls for versioning, comprehensive audit trails, controlled input/export formats, well-discussed archival processes, documented inclusion/exclusion criteria for articles, and a set metadata structure and data architecture, and that would allow for multimodal interaction with the system in which the data/metadata are stored (e.g. writing notes, querying records, composing metrics, etc.). But sadly this is just not the case with the workflow that my library currently has in place– or any that I have heard of, for that matter. This is not to say that librarians are not keeping records and documenting their searches; it is to say that it is not being done in a way that is seamless or comprehensive. The librarians involved are making do with what they know and what they have (these are not data librarians I’m talking about, so their training is not in computing or data-specific problem-solving– that’s my job). And as a result, searches are tracked on wikis, in spreadsheets, in Word documents, etc., without controls, a high degree of functionality, or sensitivity to archival practices.

So, I’m going to be working with my colleagues at the library to fix this problem. I have proposed and will hopefully very soon begin a relatively small (yet ambitious thanks to anticipated reluctance to change) project wherein I will be designing a data entry system and clarifying the workflow for the systematic review process. And I’m going to do it using REDCap.


REDCap is a really great clinical data management platform available to us for free through our Clinical Translational Sciences Center. Any customization we decide to leverage may end up costing us some money, but I’m hopeful that others will see the value in it. I’ve already run the idea by the Associate Director of the Biomedical Informatics Program within the CTSC and she was very supportive of me finding new ways to use the REDCap platform; she encouraged me to pursue this project and its implementation as a research interest– so we’ll see if that happens. For now I’d just be happy to see better data management practices within my library :) I also want to note here that I’m not saying REDCap works for all data management problems (it doesn’t! e.g. longitudinal clinical studies with varied schedules), but for us it’s way better than me trying to develop my own SQL-based solution (time consuming) or even a simple MS Access DB, which would be easy to set up but not comprehensive. I’m going with REDCap because it gives people, even those very new to the system, the ability to design data collection instruments and forms and to put clear controls on fields, with comprehensive audit trails and version control, so librarians can clearly track the systematic review process. It is also fully HIPAA compliant and cloud-hosted to allow for easy access, security, and archival procedures. It will also give the librarians experience using a system that is used by our clinical researchers, so they will be more adept and comfortable if questions about it arise as they interact with researchers on a day-to-day basis.

Wish me luck :) I’ll keep you posted.


Mad Amounts of Professional Development

"your proposal was accepted"

So “professional development” is really important in library land, but what is it? All over the place I read how important it is to continue learning new skills and developing an attuned understanding of “emerging technologies” in order to address the needs of patrons as they evolve. Networking and skills training are emphasized, and I’m pretty sure everyone has accepted just how cool MOOCs can be, but really, how one approaches professional development is going to vary a lot from person to person. I personally am not a big fan of webinars and would rather meet people at a conference or go through a whole bunch of Codecademy tutorials (finally learning Python) than go to a workshop, but that’s mostly thanks to a few bad workshop experiences– they really don’t have to suck, but sometimes they do, and it always feels like a gamble to me. Plus, all of this takes time, so you need to prioritize to make anything happen, and sometimes professional development falls to the back of the list.

But the almighty Wikipedia entry on professional development tells me that professional development in workplaces refers to “the acquisition of skills and knowledge both for personal development and for career advancement. Professional development encompasses all types of facilitated learning opportunities…” And what I’ve been taking that to mean is: just keep your eyes open– you need to find fun/cool stuff to do so that people know you’re good at things and like learning. And you actually do need to like what you’re learning to take anything away from professional development… at least for me, if I’m not having some fun I won’t really invest myself in what I’m trying to do and it’ll just be a big waste of time (e.g. an RDA workshop that made me want to die). So recently I’ve been trying to figure out how I’m going to keep up with professional development without wasting my energy on things I feel like I should do rather than things I want to do.

So what I did was start scanning over and quickly reading through everything I could on every medical and/or academic library and health informatics listserv I could get on– not as time consuming as you’d think. I just came up with a list of keywords for topics that I wanted to invest time in, searched the lists for those (hooray command-F!), and started organizing the results chronologically. Keywords were things like “data”, “management”, “research”, “funding”, “statistics”, “metrics”, “recent graduates”, “metadata”, etc. I ended up with a list of emails detailing workshops, meetings, MOOCs, essay contests (still applying to these), research opportunities, jobs, and articles written by people in the field. Then I started applying to everything I couldn’t replicate on my own that would also help me meet people I could learn from. What I ended up with is a short list of things that I think will help me get some more professional development in ways that work for me.

I took time on the weekends and a little at work to apply for some great opportunities on that list, and I’m happy to report that some of them came through :)

First, I was invited to join the Editorial Team at the UMass Medical eScience Portal for Librarians after applying in January. I’m honored and excited to get to work with such a fantastic team as I start collaborating with them remotely as the new Content Editor for Research Data Management. As an editor I’m tasked with working in collaboration with the rest of the eScience Portal Team to provide and manage information that meets the scope and purpose of the Portal– so basically I need to manage the resources that are collected on the portal regarding Research Data Management. I’m going to be researching and gathering links to web resources, creating blog posts, and managing links and other materials related to research data management as needed on the portal site. I will also need to attend some in-person meetings with the rest of the team as we go about redesigning the portal and looking for ways to further develop resources for our community. I really do hope this opportunity is something I can build on and continue to grow with without it being too burdensome a time commitment– I’ll keep you posted :)

Next, though, I am happy to say that I have been accepted into the first cohort of scholars at the Institute for Research Design in Librarianship at Loyola Marymount University in LA. I am soooooo excited for this one :) I found out about it less than two weeks before the application deadline and had to scramble to put together a research proposal, but it worked out. The Institute is designed to support and train library researchers and to help them develop professional research networks as they embark on their first attempts at comprehensive research and publishing in peer-reviewed journals. You can read an abbreviated version of my proposal below:

IRDL_BouquinProposal

So I’m going to LA this summer and I cannot wait. It’s been a lot of work to put all of this together and continue with my professional development this way, but I think this just works better for me. I’ll continue participating in consortia and online learning opportunities, but for me, working with others on concrete projects makes me feel better about the skills I’m developing. I’m excited to see what comes next.


Getting Connected: Librarian Meets Clinical Data Management

I’m pretty sure that no matter where you go, the bigger the organization, the more communication problems you’re going to encounter. I think that’s the case with my current institution; you need to be plugged in culturally (and technologically) in order to get anywhere and find out what’s going on. This is what I’ve slowly been trying to do more and more of for the last few months as I’ve finally gotten connected with an important group at Cornell– Architecture for Research Computing in Health (ARCH). This group is charged with developing secure repositories for clinical research data for cases where the needs of the researchers are more complex than what the data management solutions currently available to them offer. I’m loving finally getting to work with them, even if it took 5 months to get an in– the delay was thanks to poor communication and a lot of assumptions being made up high in the hierarchy, but I made it happen so I’m happy anyway– it was a process I’m hoping to learn from. This clinical data repository program is trying to fill gaps that aren’t addressed by tools like REDCap, and it focuses on long-term storage and curation of clinical research data from a variety of sources– this includes but isn’t limited to biobank data, electronic health record data, clinical trial data, etc. What I’ll be doing is helping design our metadata dictionary user interface and training researchers on how to use it :) I’ll have screenshots and examples of the interface soon, but basically the idea is that with systems this complex (dozens of tables and hundreds of variables) you’re going to need a tool to explore what’s available and understand things like when the data feed was updated, who owns the data, how a variable was measured, access permissions surrounding the data, data collection dates, etc. The interface should help users explore what’s available in our repositories and apply for different levels of access to the data to conduct various clinical research activities. The tool is also going to have a query builder built into the interface to allow researchers to begin generating reports and creating subsets of data for analysis. I’m also consulting on what metadata should be included in the metadata discovery tool and what can be left out. It’s an interesting process that I’m glad to be involved with.

Really, though, it seemed the biggest obstacle in breaking into the clinical data realm and getting a seat at the table was getting the proper introductions. People I didn’t know existed didn’t know I existed, and the people who needed to make the connections didn’t know they should make them. It also didn’t help that the group I was trying to get connected to had nearly no web presence– they still don’t have a website. The only advice I can really give in this situation is to keep insisting on learning more about what everyone around you does– don’t get so hung up on what you’re doing that you fail to notice the work being done by your colleagues and other working groups. By asking to find out more about what the people I met were working on, I was able to figure out who I should be introduced to and to get my supervisors to make the necessary arrangements for me to meet up with these people and find ways to collaborate. I’m now sitting in on their weekly data-issues meetings and getting to be involved with the whole build-out of the repositories. If only they taught you in library school just how much of the job is networking and navigating the politics of higher education. Learning by the seat of my pants.


Data Librarianship: Updates and Exploring

I’m a few more months into being a new librarian, and the thing that’s hardest so far isn’t trying to do what others ask of me– it’s getting myself to prioritize without smothering my itch to branch out and try new things.

First though, here are a few updates on what I’ve been doing.

Lately at work I’ve been focusing on developing a collection of resources and tools to act as a sort of concierge for meeting researchers’ data-related needs. I’ve been continually updating my resource guide and have given a few presentations at research faculty meetings on the services I’m offering. I also presented this Prezi at our library faculty meeting about what I’ve been doing and what my goals are moving forward. You can click through it here.

As a result of the presentations, though, I got a lot of useful feedback, have started helping researchers in our public health department with various requests, and was just recently invited to help teach a few topics for a new Computational Health Informatics course currently being developed. I was asked by the professor designing the course for suggestions on textbooks that could act as an introduction to data mining and visualization methods without being too overwhelming for beginners, and that could possibly make use of a common algorithm suite (this is the one we decided to go with). But as a result of those conversations I’ve been asked to potentially help with instruction or to become embedded as a liaison from the library into the computational informatics programs. So we’ll see how that goes :) Keep your fingers crossed for me.

I have, however, started getting some teaching experience. This past week I gave two lectures on data management and the data lifecycle, along with finding and using data and statistics. I was absolutely amazed at just how new most of these concepts were to the students. The course I taught in is called Information Skills, and it’s part of the MS program in Clinical Epidemiology and Health Services Research here at Weill Cornell. I found myself wishing I had a full month to teach them instead of just two days (the course is broken up into topics), but hopefully I got them thinking and they’ll come to me for more information. I’m planning on writing a whole other post about it after I get a little more teaching under my belt, but for now here’s something I made for the class– I developed my own simplified diagram of the data lifecycle for them, one that includes a description of a data management plan so they have a quick reference to some of the broader ideas we covered:

Data Cycle

Outside of teaching and finding resources though, I’ve started getting more involved with initiatives within the other Cornell University Libraries. I officially became an active consultant for Cornell University’s Research Data Management Services Group just recently, and have been working with other groups like our Cornell Institute for Social and Economic Research to get access to remote labs so researchers down here in the city know that they have access to an amazing number of computing resources through our main branch in Ithaca.

But really, I’m kind of starting to want to branch out more at work into other fun areas. And here’s where I’m starting to have to grapple with reining myself in so I don’t get overwhelmed or in too deep. There’s so much fun technology out there that I find myself wanting to jump into all kinds of things all at once, because being more familiar with them could help me work with researchers and expand the types of services I’m trying to offer, but I need to draw some lines. For example, I’d love to learn Python and Processing, but I can only do so much at once, so I have to prioritize what I’m going to put my energies into. So I’ve started wandering around the spooky world of data visualization and have decided to start investing some serious time in expanding my skills with R and learning D3.js. You see, I’ve had a few researchers approach me with ideas to demonstrate how collaborative and effective their working groups are by visualizing their bibliographic histories. What that means is that I’ve been creating visualizations for researchers illustrating their history of co-publishing papers together. For example:

TRIPLL2013_MenteesRed

Here I’m showing the co-publication history of individuals within a Translational Research Institute on Pain in Later Life working group. The thicker the network edge, the more times those people published papers together. The red individuals are mentees within the mentoring program, and the group is trying to demonstrate their interactions for inclusion in their grant re-approval process. These are fun to make (I used Sci2 and a lot of my own editing), but once you start adding more individuals it turns into a messy hairball and isn’t very useful– this one’s for one of our cardiology groups:

MittalCoAuthorFinalNetwork

So the problem here is the limitation of my current methods, plus the desire to expand this service, because I really freaking like making visuals. So here's the D3.js solution. Play around with it and you'll see why it's so much better than what I've been doing. I'll keep you posted on whether any of these attempts are successful :) And I'll keep exploring.
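
For anyone curious about what feeds these graphs, here's a minimal sketch in R (using the igraph package) of how a weighted co-authorship network like the ones above can be built from a simple paper-author table. This isn't my actual Sci2 workflow, and the data frame and column names below are made up for illustration:

# A minimal sketch (not my actual Sci2 workflow) of building a weighted
# co-authorship network in R with the igraph package. The data frame and
# its column names ("paper_id", "author") are invented for the example.
library(igraph)

pubs <- data.frame(
  paper_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  author   = c("Smith", "Lee", "Bouquin", "Smith", "Lee", "Lee", "Bouquin", "Park"),
  stringsAsFactors = FALSE
)

# For each paper, list every pair of co-authors, then count how many
# papers each pair has written together
author_pairs <- do.call(rbind, lapply(split(pubs$author, pubs$paper_id), function(a) {
  if (length(a) < 2) return(NULL)
  t(combn(sort(a), 2))
}))
edges <- aggregate(list(weight = rep(1, nrow(author_pairs))),
                   by = list(from = author_pairs[, 1], to = author_pairs[, 2]),
                   FUN = sum)

# Edge thickness is proportional to the number of co-authored papers
g <- graph_from_data_frame(edges, directed = FALSE)
plot(g, edge.width = E(g)$weight * 2, vertex.color = "lightblue")

The resulting edge list (nodes, links, weights) is essentially the same structure that D3.js force-directed layouts consume, which is part of why making that jump appeals to me.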

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Being a New Data Librarian and Author Disambiguation

It's hard to describe what the last few months have been like. It's all gone by so fast but it's been so exciting. Since my last post, I finished my Master's Degree in Library and Information Science and my Certificate of Advanced Study in Data Science at Syracuse U.

466462_4397661793117_1641166480_o

I then proceeded to pack up my life and my cats and moved to Brooklyn.

941049_4668916965667_2138588821_n

1004077_4994270779309_1252144280_n

994758_4811509210384_1617181827_n

And about two months ago I started my new job as Data & Metadata Services Librarian at Weill Cornell Medical here in the city.

999028_4816790942424_82285603_n

993957_10200198733275796_1862832885_n

I love my job, and now that I’m a bit more settled into the position, I figured it might be fun to write a post about my first experiences here and show some of the stuff I’m working on.

So first things first: what am I doing? Well, when I first got here, that was the big question. It seemed that I had the opportunity to shape the position quite a bit, and I found myself trying to figure out what I'd be focusing on while dealing with the bureaucratic mess that is academic on-boarding… After spending a good week and a half on everything from gaining access to the right parts of the building and getting on the right listservs to setting up my new computer and figuring out what I didn't know (but needed to know), I started to construct a guide to both establish my presence in the library and outline the services I wanted to provide.

You can explore my guide here 

My basic goal with the guide is to help researchers and students make the most of their data. This means helping them better understand the obstacles and benefits of data-intensive science, and not be intimidated by new ways of analyzing, visualizing, and sharing their research data. I'm also hoping the guide will generate useful user feedback so I can refine my plans moving forward– the types of questions I get about the resources and information I've listed will hopefully inform my outreach strategies.

In order to develop concrete services beyond the library itself, though, I've started consulting and helping with a few projects that were already in their planning stages when I was hired. These range from helping a researcher develop a new data model for a longitudinal study on consciousness in coma and brain trauma patients, to finding resources on the NIH Data Sharing Policy for co-funded grant appointees, to helping develop the metadata procedures for a new curriculum mapping project for the college. I'll be writing more posts delving more deeply into these projects as they continue, but to start, I'm going to focus on an author disambiguation project my colleague Paul Albert designed– I was brought in to help with these efforts a few weeks ago.

Put simply, the goal of the author disambiguation project is to develop an improved method of clarifying and quantifying who individual authors are in the scholarly literature. Many authors publish under different versions of their names (e.g. Daina R. Bouquin, D.R. Bouquin, Bouquin, Daina R., D.Bouquin, etc.) in addition to having the same name as another author (e.g. more than one John Smith, Sam Lee, etc.) and therefore it is difficult to quantify the contributions of an individual or ascribe other metrics to bibliographic data. Researchers also don’t tend to maintain their scholarly profiles very well, particularly their publication information, so using that data isn’t very reliable. This presents a major issue for those seeking to make use of this data. Paul’s idea is for data researchers and home institutions to collaborate with other author disambiguation initiatives to leverage existing data about individuals, along with institutional and global tendencies, and take advantage of end user feedback to build a better author disambiguation tool.

2013-VIVO-Disambiguation

Though there are many techniques already being used to overcome problems associated with ambiguous authorship, the technique we're using here is novel– we're using MeSH terms and well-disambiguated institutional data, combined with a set-similarity measure known as the Jaccard index, to predict how likely it is that two journal articles have the same author. We are looking for collaborators (so look at the poster and contact me if you're interested!) and hope to develop this idea further into a tool that other institutions can contribute to. As institutional data improves, the tool's accuracy should improve with it. The poster above elaborates more fully on the goals and underlying logic.
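
To make the Jaccard piece concrete: the index is just the size of the intersection of two sets divided by the size of their union, so for two articles it measures how much their MeSH term lists overlap. Here's a tiny illustrative sketch in R of that calculation (the MeSH terms are invented for the example; this isn't the project's actual code):

# Jaccard index between two articles' MeSH term sets: |A n B| / |A u B|
# (illustrative only; the terms below are made up)
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

article_1 <- c("Pain", "Aged", "Chronic Disease", "Mentors")
article_2 <- c("Pain", "Aged", "Quality of Life")

jaccard(article_1, article_2)
# 0.4: two shared terms out of five distinct terms overall; the higher the
# value, the more likely we predict the two articles share an author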

And just for fun, here are some sample R scripts that I wrote to compute Jaccard indices between journal articles and predict how likely it is that they have the same author.

Screen Shot 2013-08-08 at 3.33.43 PM

So, for now, I’m having a lot of fun exploring new things and learning. Hoping to keep doing that.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Data Librarianship!

So I’m back after a bit of a hiatus. Thanks to life intervening and then finals, various applications, end-of-the-semester madness, holidays, and much more, I haven’t had much time for blogging. But here I am and with exciting news– this spring I will be moving to New York City and starting my career at Weill Cornell Medical College‘s Samuel J. Wood Library as a Data & Metadata Services Librarian.

Weill Cornell Medical Center

A view from above.

Here’s where the library’s located.

I'm so excited. Having considered various paths for after I complete my MLIS, I decided that a job like this was more than I could pass up. I'll be able to get my hands dirty and really see what's going on with researchers and what their data needs are, and that experience can only benefit me if I decide to go back and pursue the PhD.

So this brings me to Data Librarianship… what is it?

Data librarians hold positions that are difficult to define, not only because the job title is relatively new, but also because we potentially have so much on our plates. In order to fully explain the situation data librarians are working in, I'm going to first introduce the shift in research methodology rapidly taking place within the scientific community known as the "fourth paradigm", or the move toward science that's based on data-intensive computing.

You see, there have already been three other paradigms. A thousand years ago, science was empirical (just describing natural phenomena); in the last few hundred years it became theoretical (think Newton's Laws and whatnot); but then the theoretical models started getting too complicated to solve analytically, and so came the third paradigm: computation. Computation allows complex phenomena to be simulated, but now these simulations are producing a ridiculous amount of data, so what comes next? This is the fourth paradigm: data exploration. Data exploration unifies theory, experiment, and simulation, which is cool, but complicated and messy. Research in the fourth paradigm relies heavily on the ability to share data and collaborate across disciplines.

For example, data exploration and sharing are speeding up biomedical research in areas like Alzheimer's disease, where the National Institutes of Health is collaborating with the private sector to help bring about earlier recognition and treatments through a project called the Alzheimer's Disease Neuroimaging Initiative. ADNI was launched in 2004 to improve clinical trials on Alzheimer's treatments; currently, the only way to definitively diagnose Alzheimer's is through post-mortem examination. The initiative works by combining data from multiple volunteer subject groups and several diagnostic methods (including MRIs and PET scans), from healthy individuals on up through those showing various stages of impairment before being diagnosed with the disease; the hope is to identify markers that would help doctors track and treat it. The initiative is enabled by the researchers' ability to share data between the 14 different centers working on the task, and by their making the data publicly available within a week of its collection– this catalyzes the energy of neuroscientists both within those centers and all over the world. There have been tens of thousands of data downloads from the ADNI website and several dozen papers published using ADNI data, a significant number of which were not authored by researchers funded by the project (1).

Now, this new realm of discovery is super exciting; however, data-intensive science comes with a pile of challenges that, luckily, librarians are uniquely suited to address (enter the data librarian). By fostering transparency, reproducibility, and long-term community access to information, librarians have always been an integral part of the scientific research enterprise, and these principles are no less valuable when it comes to data. It all comes down to the librarian's ability to focus on three areas in particular: the capture, curation, and analysis of data for meaningful use. And curation is key.

The best definition of curation I came across was provided at the UK eScience All Hands Meeting in 2004, which describes curation as:

“The activity of managing and promoting the use of data, starting from the point of creation to ensure its fitness for contemporary purposes and availability for discovery and reuse.”

So really, the act of data curation requires the librarian to be a person whom researchers can look to for help with all aspects of the data life cycle (which will probably be the topic of a post all its own).

https://www.lib.umn.edu/datamanagement/archiving

By working with researchers to plan for and meet their specific data needs, the data librarian can be part of translating raw data into usable applications, helping make the researcher's goals a reality (and getting involved with a really cool research area called translational science). The data librarian can also facilitate conversations across disciplines and within institutions to help foster data-intensive solutions to complex problems. These librarians need knowledge of analytics, statistics, research methodology, and discipline-specific vocabularies in order to communicate effectively with researchers, and I believe data librarians can do just that. Through study, continual assessment of strategies, and keeping up to date with new technologies, these objectives are achievable.

Data librarians also have the ability to bring to light curation issues like archiving, which may be overlooked by researchers who have not yet had to consider how their research data will fare as storage technology changes. Archiving ensures that data is properly selected, stored, and easily accessed in a logical way that upholds the data's integrity over time– this means identifying the proper metadata and access points for the researcher's work, because the data must be able to validate their research findings long after the research has been published. Which brings me to yet another reason why librarians are important here: librarians can help researchers create data management plans, which are becoming an essential part of acquiring funding from agencies like the National Science Foundation. As of January 2011, the NSF requires that grant proposals include a supplementary data management plan in order to be considered for funding (and the NSF is not the only group making these mandates). The requirement of data management plans by top funding agencies is evidence of the importance of data services in libraries, and I can't wait to get involved.

I'm excited to start my career as a data scientist/librarian at Cornell Medical, and I hope to make a dent there. My job will primarily comprise the following responsibilities:

  • Providing data consulting and project analysis for faculty and students in support of their teaching, research, and learning needs
  • Helping to develop data services and promoting them through outreach to support data analysis, data visualization, metadata creation, and data curation
  • Advising on policies, standards, and workflows regarding institutional data projects, and collaborating on data-intensive research with the greater Cornell University community
  • Serving as an active consultant for Cornell University's Research Data Management Services Group

We’ll see what happens, but I have a feeling it’s going to be fun.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

NYC Data Week

Last week I was sent by the Syracuse University iSchool to be a student representative at NYC Data Week. I got into Manhattan on Wednesday and stayed through Friday, which gave me the opportunity to do a good amount of exploring and get back to Syracuse before Sandy hit the coast (Donate to the Red Cross to help with the disaster relief efforts here). While I was there I got the chance to see some incredibly insightful, and in some cases beautiful, applications of data science and data analysis– as was the case with some of the exhibits at the Data Visualization Showcase, particularly Wind Map by artists Fernanda Viégas and Martin Wattenberg*. The Wind Map is the artists' attempt to best represent the available data from the National Digital Forecast Database on wind patterns in the United States. It really is a dynamic piece that shows how data moves beyond numbers and how visualizations can help tell the data's stories. The showcase got me thinking about how I could create some visuals for my own projects (a future blog post for sure).

Still image. Click the link to see the real map.

But aside from the visuals, the presentations I went to came from just about every perspective I could have imagined. There were leaders there from all kinds of industries, ranging from biotech groups like the NY Genome Center, to international organizations like UN Pulse, and even the fashion company Estee Lauder. But each of these representatives had drastically different ideas about what it takes to be a data scientist, which to me is quite fitting. Because data science is such a diverse field, it makes sense that different industries and organizations would weigh some variables more heavily than others. And although everyone I spoke to agreed that a data scientist needs to be capable of solving complicated data-related problems, there was controversy over what skills and what level of expertise are needed to do so. It seemed to depend heavily on the nature of the given problem; the NYC Department of Information Technology and Telecommunications, for example, is very interested in making data readily available to the public and in fostering civic engagement. NYC DOITT therefore focused on the need for experience in public policy, along with the ability to be a skillful communicator who is savvy with statistics. Generally, though, there was consensus about a few broad characteristics employers are looking for in future data scientists (this is in no way a comprehensive list):

  • Expertise in some scientific discipline or business
  • Ability to work with mathematics, specifically statistics and the computer sciences
  • Experience with relevant tools– these are specific to different work environments and problems
  • Good communication skills
  • Ability to work within a team environment
  • Curiosity

It wound up being Amy O'Connor from Nokia who added the curiosity element to the equation, as an extension of Drew Conway's Data Science Venn Diagram.

O'Connor spoke about how, although all of the skills represented above are necessary in a data scientist, no one data scientist is expected to be strong in all of these areas– hence the need to work as a team. Some people may come to the table with a strong background in math and statistics, while others may have much more knowledge of machine learning or traditional research skills (the latter being more of what I'm experienced with, though I'm working on my stats skills). What she emphasized most in her discussion of how to "build" data scientists, however, was the importance of curiosity as a key feature that encompasses all of the other skills. Prior to this talk, I had mostly heard people emphasizing which tools are useful to learn or understand (R, Hadoop, Python, LINQ, massively parallel processing, MapReduce models, etc.), but I hadn't heard many people focusing on how much overlap there needs to be between disciplines in order to be an effective data scientist– you need multiple skills and the curiosity to get there. There had also been very little mention of personal characteristics, or of what drives people to become involved with "big data," and I think motivation is key when you're dealing with puzzles constantly.

So I found the idea of curiosity both insightful and meaningful to what turned into a very complicated conversation about how I should develop myself for work with data. Because I have a background in Library and Information Science along with Data Science, I'm interested in a pretty wide range of possible data-related occupations– data librarianship within academia/medicine and consultant positions being the jobs I'm particularly hopeful for. I can see the value of curiosity in any of the careers I'm looking into, though, and in just about any career that allows you to think creatively. I mean really, how creative can you be if you're not curious?

All in all, I realized that there are some tools I need to get more proficient with or learn (R and Hadoop, respectively), but I also need to stay curious. And I guess that means I need to keep hunting down more data to mess with.

*Wattenberg also has some other really amazing visualizations that weren't on display at the conference. Check out Shape of Song; it's one of my favorites.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
