And counter to the guidance I cite below to be “single-minded” in my endeavors to learn new skills, as the weather has gotten colder I have begun eating lunch at my desk. That means I’ve got about an hour in the middle of the day to plug away a little more at the skills I’m trying to learn. Namely, I’ve started a Web Development course offered through Udemy :) Hopefully in about 2-3 months I’ll have worked through it all, along with the Pro Git book I reference, and have a whole new pile of tricks up my sleeve! Wish me luck. Basically I’ve told myself I need to stay rigorous: as often as possible, web dev lectures over lunch, and the Git book in the evening when I have a bit more time.
And just so you don’t think I don’t do anything else with my time: on the train lately I’m reading too many poorly written murder mysteries like this one (can’t not read). I’ve also been swimming in the cold, cold ocean :) This pic was taken at 6am just off Coney Island last Friday. The air temp was about 46 and the water temp was 53.
But now, on to the blog post!
Submitted by Content Editor Daina Bouquin, Data & Metadata Services Librarian, Weill Cornell Medical College of Cornell University, firstname.lastname@example.org
The role of the data librarian extends far beyond helping researchers write data management plans. Librarians working where data-intensive science is happening spend their time answering questions about the entire data life cycle: data pre-processing, analysis, visualization, and validation are all important, and sometimes highly intricate, parts of the research process. As a data services librarian, I have personally found myself advising researchers to rework their workflows to make use of the tools available to them, to help make their research more replicable, efficient, and shareable at these various stages. Unfortunately, though, I do not always have hands-on experience with the tools and techniques I’m recommending, nor is it possible for me to have experience with every tool available to researchers in computational environments. However, I do believe it’s important for me to get as much hands-on experience as possible with the most useful, commonly used tools, so that I can develop both refined expertise in my field and empathy for my patrons. E-Science Portal editor Donna Kafel recently wrote a wonderful post reflecting on, and pulling advice from others about, self-learning and the challenges associated with it. Here, I aim to outline how I’m making use of some of the excellent advice offered in that post, while focusing on an area of the data life cycle that I believe is sometimes oversimplified in discussion: the version control processes inherent in good data management.
“Be single-minded. Identify one topic or skills you want to learn and focus on mastering it.” – Donna Kafel, Challenges of Self Learning
I decided the advice from Donna’s self-learning post I would take to heart most fiercely was the above take-away. It rang true with me because I regularly run into problems by trying to tackle too many new topics at once. If I don’t use something regularly, it’s difficult for me to become proficient—especially with technically challenging tools. It makes sense that I should focus on mastering a single skill before moving on to anything new, but how to choose what to focus on? This is where Version Control Systems (VCSs), or “Revision Control Systems,” come in. VCSs are incredibly diverse in both complexity and application, and while I rarely see them discussed at length by librarians, I find them exceedingly important to researchers in collaborative environments. I regularly read discussions of file naming as an approach to versioning and an aid in a multitude of data management processes, and I do not want to discredit that discussion, because it is so important (check out some of the great writing on this topic right here on the portal blog!), but I’m hoping to extend that conversation a bit in this post. Below I focus on Git as both a self-learning opportunity and an incredibly useful VCS.
Git is a technology that “records changes to a file or set of files over time so that you can recall specific versions later”1. You can use Git with just about any type of file, but it is primarily used by people working with code. Oftentimes people use simpler version-control methods, like copying files into a time-stamped directory, but that tactic is risky: you can forget which directory a file is stored in or accidentally write over the wrong file (careful file naming helps here). A tool like Git is a far safer approach. 1
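To make “recall specific versions later” concrete, here is a minimal sketch of recording two versions of a file with Git on the command line. All directory and file names are placeholders, and the local `git config` lines are only there so the example runs in a fresh environment:

```shell
# Create a scratch repository and record two versions of a file
mkdir git-demo && cd git-demo
git init -q
git config user.email "demo@example.com"   # placeholder identity for the example
git config user.name "Demo User"

echo "draft 1" > notes.txt
git add notes.txt
git commit -q -m "First draft"

echo "draft 2" > notes.txt
git commit -q -am "Second draft"

# Recall the earlier version without losing the current one
git show HEAD~1:notes.txt   # prints "draft 1"
```

Notice that nothing was copied into a time-stamped folder by hand; every version is recorded in the repository and can be recalled on demand.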
Git is what is called a Distributed Version Control System (DVCS), but DVCSs are easier to understand if you first understand Centralized Version Control Systems (CVCSs). A CVCS has a single server that contains all the versioned files a group of people are working on. Individuals “check out” files from that central place, so everyone knows to some extent what other people on the project are doing. Administrators control who can do what, and that central authority makes a CVCS easier to manage than local version control solutions. Examples of CVCSs include the popular Apache tool Subversion. 1
There are, though, some drawbacks to using a CVCS, namely the single server. If the server goes down, no one can make any changes to anything being worked on, and if the server is damaged and its data corrupted, the people on the project are completely reliant on there being sufficient backups of all versions of their files. This is, again, quite risky.
To mitigate this problem, DVCSs were developed. In a distributed system (like Git), people do not just check out the latest version of a file; they completely “mirror” the repository. That way, if the server dies, anyone who mirrored the repository can copy it back to the server and restore it. Every time someone clones a repository, the data is fully backed up.
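Here is a quick sketch of that “mirroring” idea using only local directories, so no server is needed; all the names and the sample data file are placeholders:

```shell
# Create an "original" repository with one commit
mkdir dvcs-demo && cd dvcs-demo
git init -q original
cd original
git config user.email "demo@example.com"   # placeholder identity for the example
git config user.name "Demo User"
echo "temp,salinity" > readings.csv
git add readings.csv
git commit -q -m "Add readings"
cd ..

# Cloning copies the ENTIRE history, not just the latest files,
# so the clone could restore the original if it were ever lost
git clone -q original backup
cd backup
git log --oneline   # shows the same commit history as the original
```

The key point is that `backup` is not a snapshot of the latest files; it is a full copy of the repository, history and all.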
Distributed systems are also capable of working well with several remote repositories at once, allowing people to collaborate with multiple groups in different ways concurrently on the same project. 1
However, I did not decide to focus my single-minded self-learning on Git just because it is so useful for version control; I wanted to learn as many skills as possible while still staying focused. You see, in learning to use Git, I’d have more opportunity to learn the Bash Unix shell. Though I have some background with command line interfaces, I am still a beginner with the Terminal, and I figured that learning Git would make me much more proficient at navigating my computer via the command line, which in turn could help me build up the confidence to learn a Linux operating system. Learning Git would also help me learn GitHub, which is growing in popularity by the day as a place for people to store and share code. The GitHub graphical user interface would also help get me off the ground. So I found Git to be a great door-opener to many other skillsets on my list of self-learning goals.
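As a small taste of how Git and GitHub fit together, here is a hypothetical sketch of telling a local repository about a remote. The URL below is a placeholder, not a real repository, so nothing is actually pushed anywhere:

```shell
# Create a fresh local repository
mkdir remote-demo && cd remote-demo
git init -q

# Register a remote named "origin"; with a real GitHub URL you could then
# share work with `git push` and fetch collaborators' changes with `git pull`
git remote add origin https://github.com/example-user/example-repo.git
git remote -v   # lists the remotes this repository knows about
```

GitHub is simply a popular place to host that remote copy, with a web interface layered on top of the same Git commands.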
Thus, I have begun learning to use Git and GitHub. I got some hands-on experience by participating in a Software Carpentry Bootcamp this past summer, but I didn’t find the time to follow up on it; I was not staying focused on learning a single new skill. So now I am regrouping. I have primarily been using the resources provided below, though there is so much more out there. These resources are just a great place to start, and having made some headway in my own reading of them, I hope to be trying out Git more in the very near future.
Pro Git: A great free eBook and videos on getting started with and better understanding Git and version control. I used this excellent book in writing this post.
Pro Git Documentation External Links: Tutorials, books, and videos to help get you started.
Even if you don’t think learning to use Git is right for you, learning more about the tools researchers are using to work with their data and getting a look under the hood about how those technologies work can be a great way to continue to grow professionally. I hope you all have the opportunity to join me in exploring a new skill and share your experiences with the e-Science Portal Community.
1. Chacon, S. (2014). Pro Git. Berkeley, CA: Apress. http://git-scm.com/book/en/v2
And just in case you weren’t already overwhelmed, here’s a great TED Blog post on places to learn how to code!