Version control

I briefly mentioned version control in the reproducible research chapter, where it appeared as one of the four pillars of reproducible research. So what is version control?

In the broad sense, version control is a way of keeping track of different versions of something. This something can be a process, a design, a document, an analysis or even a piece of software. The means of keeping track can vary: you can simply make a separate copy of the file for each version, or you can use dedicated software to automate the process.

In this chapter, I want to argue in favour of using software to automate the process of version control, but the principles can be applied to manual systems as well. It’s up to you to implement them.

The elements of good version control

To make it clear once more: version control can be applied to any kind of concept that changes over time. For now, let’s use a piece of software as an example of such a concept. We usually need to write some code to perform an analysis on our computer. This software gets updated along the way as new features are added and bugs are fixed. At each stage, the software is either in a theoretically stable version or in a certainly unstable one. I use the words ‘theoretically’ and ‘certainly’ because one can only assume that a piece of software behaves exactly as she desires, whereas she can be certain when it does not perform according to the specifications. That is an important distinction to make: software that is known to be unstable should not be trusted for its results, while stable software can be used with caution. Therefore, our version control should be able to distinguish between stable and unstable versions for us.

Once we know which are the trusted stable versions, we need to be able to run them against the data to be processed. A simple way to test software is test-driven development, a topic that is covered in another chapter of this book. To run such tests against earlier states of the code, our version control should allow us to jump back to older versions and check how the code behaved then.

Let’s say we figure out that our software is not behaving the way we want it to in our current version, but we know that it was working well a few versions ago. Wouldn’t it be great if we could narrow down the changes that we made, in order to figure out where the bug was introduced? But we made changes in so many lines, and it’s impossible to remember them all… That would be a cool feature for our version control, right? To show the differences in our concept between different versions. In addition, especially if you’re working in a team, you might want to know who made the changes and when, so that you can contact her and ask for clarifications.
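The idea of narrowing down where a bug was introduced can be sketched as a binary search over the history. This is only a rough illustration, not how any particular tool does it: the version names, the `first_bad_version` function and the `is_good` check are all made up, and the sketch assumes that the oldest version is good, the newest is bad, and that once the bug appears it stays in every later version.

```python
from typing import Callable, Sequence


def first_bad_version(versions: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_good() is False.

    Assumes versions are ordered oldest to newest, the first one is good,
    the last one is bad, and every version after the first bad one is bad.
    """
    low, high = 0, len(versions) - 1  # versions[low] is good, versions[high] is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(versions[mid]):
            low = mid   # the bug was introduced after mid
        else:
            high = mid  # the bug was introduced at mid or earlier
    return versions[high]


# Hypothetical history in which the bug slipped in at "v5".
history = ["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"]
checked = []


def is_good(version: str) -> bool:
    """Stand-in for running the tests against one version."""
    checked.append(version)
    return int(version[1:]) < 5


culprit = first_bad_version(history, is_good)
print(culprit)  # v5
print(checked)  # ['v4', 'v6', 'v5']
```

With eight versions, only three of them had to be tested. That only works if jumping back to an older version is cheap, which is exactly what we are asking of our version control.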

Manual version control?

The most common way of keeping track of something on your computer that changes over time is to create either different versions of a file with slightly different filenames (let’s say, by appending a date or a version number), or separate folders with slightly different names. The files in each folder reflect the state at the corresponding date or version.

The benefit of this approach, and the reason for its widespread use, is its simplicity: it requires no particular skills. Unless creating and renaming folders is considered a significant skill. But the benefits of manual version control stop right here. Having all these different folders not only makes it very confusing to determine the correct order of the versions, but also makes it difficult to come up with distinct names for the files and folders. Additionally, especially if you’re working on projects with a lot of files, it can have a significant impact on disk space usage. Imagine that for every single file you change, you need to make a copy of all the project’s files. Doesn’t sound very efficient.

In all fairness though, manual version control makes it very easy to test previous versions and determine when a certain problem appeared. However, knowing when the problem appeared is not enough: one needs to know what was changed in order to solve it. Unfortunately, manual version control does not help in determining the differences between versions, which is one of the most important elements of version control. To do that, you would need to go through all the files and compare them line by line for changes. I hope you have better use of your time than this.
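To make concrete what an automated comparison gives you instead, here is a minimal sketch using the `difflib` module from Python’s standard library. The two versions and the filenames `analysis_v1.py` and `analysis_v2.py` are invented for the example; they stand in for two manually saved copies of the same file.

```python
import difflib

# Two hypothetical copies of the same file, saved by hand as v1 and v2.
old_version = """def mean(values):
    return sum(values) / len(values)
""".splitlines(keepends=True)

new_version = """def mean(values):
    if not values:
        return 0.0
    return sum(values) / len(values)
""".splitlines(keepends=True)

# unified_diff yields header lines plus one line per unchanged (' '),
# added ('+') or removed ('-') line.
diff_lines = list(
    difflib.unified_diff(
        old_version,
        new_version,
        fromfile="analysis_v1.py",
        tofile="analysis_v2.py",
    )
)
print("".join(diff_lines))
```

The lines starting with `+` are the ones added in the newer copy. Even with a helper like this, you would still have to run it yourself for every pair of files and every pair of versions.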

Software-based version control to the rescue?

Since all these are very common problems, some people decided to try to solve them using software. They developed what are called version control systems (VCS): software made specifically for keeping track of different versions of files. This kind of software tries to address all the elements of proper version control:

  • Storing different versions: Usually, the VCS creates a database of all the different versions of each file and places it in a specific folder. Therefore, you don’t have to create the folder structure and come up with the filenames yourself. That saves a lot of time and mental power, which can be used for what you’re actually working on.

  • Finding differences between versions: Together with the ability to store different versions, a VCS also allows for easy comparison between them. Using simple commands, one can check the differences between any two versions in the project. This highlights changes down to the level of single characters added or removed, in a very fast and easy way.

  • Understanding what each version was about: With most advanced VCS, the user is forced to add a short descriptive message for each version. The time when the new version was created is also stored automatically. Therefore, when checking a previous version, you can more easily understand what the changes were about by reading the version message.

  • Efficient storage of versions: Usually, when working with text files, only a few things change in your project from one version to the next. Therefore, it is rarely efficient to store the files of the whole project under each new version. Software-based version control takes care of that and stores only the files that were affected, or even just the parts of the files that were changed.

  • Collaboration: Using a VCS usually also means a lot of support when working in teams where you need to distribute the changes. There are VCS using a client-server model, where the project with all its versions is stored in a central location and everyone can synchronise their changes with it. More recently, a more flexible, distributed model has emerged, where each client can also be a server. In both cases, it is fairly easy to synchronise your changes and versions with your team-mates over the internet, even if they are physically located on the other side of the world.
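To tie some of these elements together, here is a toy, in-memory sketch in Python. It is not how any real VCS is implemented: the `ToyRepository` class and all of its method names are made up for illustration, it keeps everything in memory, and it skips collaboration entirely. It only shows the ideas of storing versions, requiring a message, recording author and time automatically, and not storing unchanged files twice.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Version:
    message: str           # what this version was about
    author: str            # who made the change
    created: datetime      # when, recorded automatically
    files: Dict[str, str]  # filename -> hash of the file's content


class ToyRepository:
    """Hypothetical toy repository; real systems are far more sophisticated."""

    def __init__(self) -> None:
        self.blobs: Dict[str, str] = {}   # content hash -> content, stored once
        self.versions: List[Version] = []

    def commit(self, files: Dict[str, str], message: str, author: str) -> int:
        """Store a new version of the project and return its number."""
        if not message.strip():
            raise ValueError("a descriptive message is required")
        snapshot = {}
        for name, content in files.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            # Unchanged content hashes to the same key, so it is not stored again.
            self.blobs.setdefault(digest, content)
            snapshot[name] = digest
        created = datetime.now(timezone.utc)
        self.versions.append(Version(message, author, created, snapshot))
        return len(self.versions) - 1

    def checkout(self, number: int) -> Dict[str, str]:
        """Return the files exactly as they were in the given version."""
        version = self.versions[number]
        return {name: self.blobs[digest] for name, digest in version.files.items()}

    def changed_files(self, old: int, new: int) -> List[str]:
        """List the files that differ between two versions."""
        a, b = self.versions[old].files, self.versions[new].files
        return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


repo = ToyRepository()
first = repo.commit(
    {"analysis.py": "print(1)\n", "README": "My project\n"}, "Initial version", "alice"
)
second = repo.commit(
    {"analysis.py": "print(2)\n", "README": "My project\n"}, "Fix the printed value", "bob"
)

print(repo.changed_files(first, second))    # ['analysis.py']
print(len(repo.blobs))                      # 3: the README is stored only once
print(repo.checkout(first)["analysis.py"])  # print(1)
```

Two versions of a two-file project end up as three stored contents rather than four, because the unchanged README is shared between them.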

Do you need more than this?

Conclusion

There have been many version control systems over the years, the most popular of them being CVS, SVN, Mercurial, Fossil and Git. Each of them has its benefits and shortcomings, but they all share one thing in common: they help you do your job faster, more easily and more efficiently. So if you want to reduce the overhead of managing versions and searching through them for changes, stop using e-mail or USB sticks for sharing your work, and learn something new, you know what you need to do from now on: start learning more about version control.
