
Dr. Domenico Giusti
Paläoanthropologie, Senckenberg Centre for Human Evolution and Palaeoenvironment
/data, /figures, /src, /doc, README, LICENSE
View(), str() and summary() functions
: operator
seq() and the seq_along() functions
rep() function
length() function
"Simple approaches that involve less typing are generally best".
"It's also important for your code to be readable, so that you and others can figure out what's going on without too much hassle."
>, >=, <, <=, ==, !=
|, &, !
paste() function
Complete swirl modules '5: Missing Values' and '6: Subsetting Vectors' and submit your successful completion via email (domenico.giusti@uni-tuebingen.de).
swirl modules do not count towards your final grade, but completing them is highly recommended: their topics will be covered on the final exam.
If you don't have it already, now might be a good time to install Git and sign up on GitHub to get started.
(Git is a command line program. Do not install any GUI client)
Make your work more organised, efficient, collaborative, and reproducible!
I use both Git and GitHub for keeping track of changes in manuscripts, presentations, lectures, data and data analysis projects. I use Git mostly as a version control system for my projects, and GitHub for social networking, for exploring others' projects and eventually contributing to them. By hosting your repositories, GitHub also works as a cloud back-up service.
RStudio has super simple point-and-click version control (via git) baked in for “projects”. No excuses for not using it! (via @danmcglinn) @phylorich
"If a dataset that is the basis for computing a scientific result changes without version control, reproducibility can be threatened: results may become invalid, or scripts that are based on file names that change between versions can break. Especially if original data gets replaced with new data with no version control in place, the original results of the analysis may not be reproduced." The Turing Way - Guide for Reproducible Research
This image (and the previous one) was created by Scriberia for The Turing Way community and is used under a CC-BY licence
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. Using a version control system (VCS) also generally means that if you screw things up or lose files, you can easily recover.
In Distributed Version Control Systems (such as Git), clients don’t just check out the latest snapshot of the files; rather, they fully mirror the repository, including its full history. Thus, if any server dies, and these systems were collaborating via that server, any of the client repositories can be copied back up to the server to restore it. Every clone is really a full backup of all the data. Pro Git book
"With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again, just a link to the previous identical file it has already stored. Git thinks about its data more like a stream of snapshots." Pro Git book
"Most operations in Git need only local files and resources to operate — generally no information is needed from another computer on your network. [...] Because you have the entire history of the project right there on your local disk, most operations seem almost instantaneous."
"This also means that there is very little you can’t do if you’re offline"
"Everything in Git is checksummed before it is stored and is then referred to by that checksum. This means it’s impossible to change the contents of any file or directory without Git knowing about it. [...] You can’t lose information in transit or get file corruption without Git being able to detect it." Pro Git book
A checksum [SHA-1 hash] looks something like this:
24b9da6552252987aa493b52f8696cd6d3b00373
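To get a feeling for where such a checksum comes from, here is a minimal sketch using `git hash-object`, which prints the checksum Git would assign to a piece of content. The sample text is an arbitrary illustration (not taken from a real project), and the 40-character length assumes Git's default SHA-1 object format.

```shell
# Ask Git for the checksum it would give to a short piece of text.
# The text is arbitrary and only serves as an example.
checksum=$(printf 'version control\n' | git hash-object --stdin)
echo "$checksum"

# Identical content always yields the identical checksum...
same=$(printf 'version control\n' | git hash-object --stdin)
# ...while changing a single character yields a different one.
changed=$(printf 'Version control\n' | git hash-object --stdin)
echo "$same"
echo "$changed"
```

This is why Git notices any change to a file: different content can no longer match the stored checksum.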
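Git has three main states that your files can reside in: modified, staged, and committed:
- modified: you have changed the file but have not committed it yet
- staged: you have marked a modified file to go into your next commit
- committed: the data is safely stored in your local repository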
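The basic Git workflow goes something like this:
- you modify files in your working directory
- you stage the changes you want to include in your next commit
- you commit, which stores the staged snapshot permanently in your Git repository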
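The modify, stage, commit cycle can be tried out safely in a throwaway repository. The following is only a sketch: the file name `notes.txt`, the commit message and the "Demo User" identity are placeholders made up for illustration.

```shell
# Work in a temporary directory so nothing on your system is touched.
cd "$(mktemp -d)"
git init -q
# Identity for this demo repository only (placeholder values).
git config user.name "Demo User"
git config user.email "demo@example.org"

# 1. Modify: create or change a file in the working directory.
echo "First draft" > notes.txt
git status --short            # notes.txt is listed as untracked (??)

# 2. Stage: choose what goes into the next snapshot.
git add notes.txt
git status --short            # notes.txt is now listed as staged (A)

# 3. Commit: store the staged snapshot in the repository.
git commit -q -m "Add first draft of notes"
git log --oneline             # one commit, shown with its abbreviated checksum
```

Repeating steps 1 to 3 after each meaningful change builds up the history that you can later inspect with `git log`.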
benmarwick/1989-excavation-report-Madjedbebe
Research compendium of data, code, and text associated with the publication: Clarkson et al. 2015. The archaeology, chronology and stratigraphy of Madjedbebe (Malakunanja II): A site in northern Australia with early occupation. Journal of Human Evolution 83, 46–64 http://dx.doi.org/10.1016/j.jhevol.2015.03.014
--> Marwick, B. Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation. J Archaeol Method Theory 24, 424–450 (2017). https://doi.org/10.1007/s10816-015-9272-9
Good news: RStudio has git built in. Bad news: The set of things you can do with git through RStudio is quite limited, but it might be enough for your needs, and, if you’re doing your work there already, it is quite convenient.
Once you've installed your preferred version control system [git], you'll need to activate it on your system by following these steps:
git config --global user.name "[name]"
git config --global user.email "[email]" # same as GitHub account email
git config --global --list
Now the top right panel has a Git tab!
Advanced version control systems such as Git allow non-linear development of your project with branches. "A branch creates a local copy of the main repository where you can work and try new changes. Any work you do on your branch will not be reflected on your main project (referred to as your master branch) so it remains secure and error-free. [...] When you are happy with the new changes, you can introduce them to the main project." The Turing Way - Guide for Reproducible Research
Typical scenario: when you collaborate with others and everyone works on the master branch simultaneously, there can be a lot of confusion and conflicting changes.
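A minimal sketch of working on a branch and then merging it back, run in a throwaway repository. The branch name `new-analysis`, the file `report.txt` and the "Demo User" identity are made up for illustration; the name of the main branch is read from Git because it can be `master` or `main` depending on your setup.

```shell
# Set up a small demo repository with one commit on the main branch.
cd "$(mktemp -d)"
git init -q
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.org"
echo "Introduction" > report.txt
git add report.txt
git commit -q -m "Start report"
main_branch=$(git symbolic-ref --short HEAD)   # "master" or "main"

# Create a branch and try out new changes there.
git checkout -q -b new-analysis
echo "Draft results" >> report.txt
git commit -q -am "Add draft results"

# Back on the main branch, the file is still unchanged.
git checkout -q "$main_branch"
cat report.txt

# Happy with the changes? Introduce them into the main branch.
git merge -q new-analysis
cat report.txt
```

Until the final `git merge`, nothing done on `new-analysis` affects the main branch, so experiments there cannot break it.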
So far we’ve worked with a local Git repository...
"Remote repositories are versions of your project that are hosted on the Internet or network somewhere [GitHub]. Collaborating with others involves managing these remote repositories and pushing and pulling data to and from them when you need to share work." Pro Git book
As research becomes increasingly collaborative and multiple people work on the same project, it becomes difficult to keep track of changes made by others if not done systematically. Moreover, it is time-consuming to manually incorporate the work of different participants in a project, even when all of their changes are compatible. Hosting the project on an online repository hosting service like GitHub is beneficial to make collaborations open and effective.
git clone
- create a local copy of an online repository
git pull
- (fetch + merge) update the local version of an online repository or pull others’ work into your copy
git push
- push changes to the remote online repository
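The three commands can be practised without a GitHub account. In the sketch below a local "bare" repository (`shared.git`) stands in for the online repository, and `alice` and `bob` are two hypothetical collaborators; all names, file contents and identities are invented for illustration.

```shell
cd "$(mktemp -d)"
# A bare repository plays the role of the online repository.
git init -q --bare shared.git

# Alice clones it, adds a file and pushes her commit.
git clone -q shared.git alice 2>/dev/null   # hides the "empty repository" warning
cd alice
git config user.name "Alice"                # placeholder identity
git config user.email "alice@example.org"
echo "Site A: 12 finds" > data.csv
git add data.csv
git commit -q -m "Add site A"
git push -q -u origin HEAD                  # first push also sets the upstream branch
cd ..

# Bob clones the repository and gets Alice's work.
git clone -q shared.git bob

# Alice keeps working and pushes again.
cd alice
echo "Site B: 7 finds" >> data.csv
git commit -q -am "Add site B"
git push -q
cd ..

# Bob updates his copy with Alice's new commit.
cd bob
git pull -q --ff-only    # fetch + merge; --ff-only refuses anything but a simple update
cat data.csv
```

With GitHub, the only difference is that the address given to `git clone` is the URL of the online repository instead of a local path.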