Reproducible research

Last updated on May 26, 2019

Research is a long, ongoing process. You will see this sentence repeated often in this book. Because it is long and ongoing, you will often have to jump back in time and re-evaluate things that you have done. That might be necessary because you (or someone else) suspect that you made a mistake somewhere, because you have gained new knowledge that may help you improve your work, or simply because you want to expand on what you did earlier.

Besides you, there will be many others who might want to re-evaluate what you have done. It can happen that, after you publish some results in a journal, someone asks you for details of your analysis. Or it might happen that, during the revision process of a publication, you need to redraw a specific figure to improve its colour contrast or to add new data. Or finally, it will (almost certainly) happen that a colleague of yours needs part of your results for her analysis.

But going back in time is not always easy. At least, it is not as easy as it seems while you are creating your figures, writing your code or arranging your files on your computer. It requires remembering all the details of what you did, maybe years ago: where your files are stored, what procedure you need to follow to get the required result and, finally, how to interpret what you see on the screen.

To help your colleagues better, revise your articles more easily and enable better knowledge transfer amongst researchers, it is important to engage in what is called ‘reproducible research’.

What do we mean by reproducible research?

Reproducible research is a trend of the last decades that helps deal with the continuously growing volume of research data and outcomes. One could assume that having more data available is always beneficial for the research community, but that is only true if these data are well organised and easily reproducible.

By definition, reproducible research is a way of conducting research in which:

All the results of a study can be reproduced by a procedure that is as automatic as possible
Figures and tables, not only the results, are generated as output of this automatic procedure
The steps followed, the devices used and the adaptations made during the research are well documented
The results, figures and tables are in a format that is easily accessible by anyone

Where can it be useful?

The main reasons for this trend are, first, to allow researchers to build more easily upon research that has already been conducted and, second, to make it possible to verify research outcomes independently. This way, research progress is more rapid and more reliable.

More and more journals, institutions and funding bodies ask researchers to deliver not only their written outcomes in publications, but also all the original and processed data, as well as the source programs that were used to produce these outcomes.

How to do it?

In my view, reproducible research is based on four pillars, which are described below. All of them are explained briefly in this chapter, and a separate chapter is dedicated to each one of them later on.

First pillar: Open Source Software

One of the main pillars of reproducible research is the use of open source software and of open formats for the outcomes. Open source software (OSS) is software whose source code is publicly available and that anyone can freely download and install on a computer. It is usually maintained by a team of people on a volunteer basis, although there are multiple examples of for-profit companies producing and distributing OSS. OSS is also usually cross-platform, which means that you can run it on different operating systems (e.g. Windows, macOS, Linux).

The rationale behind the use of OSS is to allow anyone with basic computer access to analyse the results of a specific study without the need for expensive and difficult-to-obtain software. Another important reason is the future availability of the software. With proprietary software there is no guarantee that it will still be available in the following years: the company that produces it might decide to stop its distribution, making further analysis of your results impossible. Furthermore, a big advantage of OSS is that it is usually free of charge, which allows research institutions to save money that could be invested in personnel or equipment.

Second pillar: Scripting

Scripting is the automation of a process using pre-defined commands, written in a specific language, that run in a specific order. The idea behind it is to delegate all the repetitive and ‘boring’ tasks of a user to a computer. The benefit for the user is not only that these tedious tasks are performed faster than if the user ran them by hand, but also that they can run unattended. The user can therefore focus on another task in the meantime, parallelise the work and save valuable time.

Furthermore, automating a process brings useful side effects: a script requires that input and output files are well structured in folders and that their filenames follow a very specific structure. Therefore, not only do you improve your performance, but you also force yourself to keep your files in order.
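To make the point about filenames concrete, here is a minimal sketch of how a script can rely on a naming convention. The convention itself (`<date>_<sample>_run<number>.csv`), the pattern and the function name are my own assumptions for illustration; your project will need its own.

```python
import re
from pathlib import Path

# Hypothetical convention, invented for this example:
#   <YYYY-MM-DD>_<sample>_run<number>.csv
FILENAME_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<sample>[A-Za-z0-9]+)_run(?P<run>\d+)\.csv"
)


def parse_filename(path):
    """Return the metadata encoded in a file name, or None if it breaks the convention."""
    match = FILENAME_PATTERN.fullmatch(Path(path).name)
    if match is None:
        return None
    info = match.groupdict()
    info["run"] = int(info["run"])  # store the run number as an integer
    return info


# A file that follows the convention, and one that does not.
good = parse_filename("raw/2019-05-26_sampleA_run03.csv")
bad = parse_filename("raw/final_version2_NEW.csv")
print(good)
print(bad)
```

A script built this way simply cannot find a file called `final_version2_NEW.csv`, which is exactly the kind of discipline that keeps a folder tidy.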

Finally, when it comes to reproducible research, scripting is extremely valuable because it forces you to break a long process down into small, simple steps. Each of these steps is easier to comprehend and to adjust if necessary. As long as each step produces its output in the format that the following step expects as input, individual steps can be adapted and improved independently. Consequently, if the whole process is scripted, anyone who has the script and the raw data can reproduce your results and figures.
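The idea of chaining small steps can be sketched as follows. This is only an illustration under simple assumptions: the step names, the raw data and the chosen summary (count and mean) are invented, and a real analysis would read its raw data from files.

```python
import statistics


def load_raw(lines):
    """Step 1: turn raw text lines into numbers; blank lines become None."""
    return [float(line) if line.strip() else None for line in lines]


def clean(values):
    """Step 2: drop the missing measurements produced by step 1."""
    return [value for value in values if value is not None]


def summarise(values):
    """Step 3: reduce the cleaned measurements to the numbers that get reported."""
    return {"n": len(values), "mean": statistics.mean(values)}


def run_pipeline(lines):
    """Chain the steps: each one consumes exactly what the previous one produces."""
    return summarise(clean(load_raw(lines)))


# Hypothetical raw data with one missing measurement.
raw_lines = ["1.0", "", "2.0", "3.0"]
result = run_pipeline(raw_lines)
print(result)
```

Because each step only agrees with its neighbours on the format of the data, you could replace `clean` with a stricter version without touching the other two.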

Third pillar: Documentation

When it is important for others to understand your work, it is crucial that it is documented properly. That means that your data have a clear structure and that this structure is explained somewhere. Not only your data should be nicely organised, but also all the files related to them: the source code that you might have used, the output data and the figures.

All processes that were used in performing your research should also be documented. Make sure to write down the steps of each experiment, the conditions that were measured, the date, the people involved and any other information, whether or not you think it is relevant at the time.
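One possible way to keep such notes consistent is to store them as a small structured record next to the data. The sketch below is an assumption of mine, not a standard: the field names and the example values are invented, and you should adapt them to what your experiments actually need.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    """Hypothetical record holding the details listed above for one experiment."""

    title: str
    date: str          # ISO format, e.g. "2019-05-26"
    people: list       # everyone involved in the experiment
    conditions: dict   # the conditions that were measured
    steps: list = field(default_factory=list)  # the procedure, in order
    notes: str = ""    # anything else, relevant or not

    def to_json(self):
        """Serialise the record as JSON, an open plain-text format."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Example values, invented purely for illustration.
record = ExperimentRecord(
    title="Calibration run",
    date="2019-05-26",
    people=["A. Researcher"],
    conditions={"temperature_c": 21.5},
    steps=["Warm up the device", "Record the baseline"],
)
print(record.to_json())
```

Saving the output of `to_json()` in a file beside the measurements means the description travels with the data.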

Documentation is even more important when dealing with programming and scripting (explained above). Make sure that all variable names make sense and that there are well-defined conventions for them. Furthermore, add comments that explain what each chunk of code is used for, what its inputs are and what its outputs are. This will help your colleagues to understand your code and to detect possible problems or improvements.
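As a small illustration, compare the two functions below. They do exactly the same thing, but only the second one tells the reader what goes in and what comes out. The conversion itself is just an example I picked; the names are hypothetical.

```python
# Hard to understand: what is x, and what does f return?
def f(x):
    return [i + 273.15 for i in x]


# Offset between the Celsius and kelvin scales.
KELVIN_OFFSET = 273.15


def celsius_to_kelvin(temperatures_c):
    """Convert temperatures from degrees Celsius to kelvin.

    Input:  temperatures_c, a list of numbers in degrees Celsius.
    Output: a new list of numbers in kelvin, in the same order.
    """
    return [temperature + KELVIN_OFFSET for temperature in temperatures_c]


measured_c = [0.0, 25.0]
print(celsius_to_kelvin(measured_c))
```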

Documentation is not only important for your colleagues, but also for yourself. It is pretty common to look back at your files/programs and to not recognise the structure or the meaning of each part. And that is something you need to take into consideration when writing your documentation: your future self.

Fourth pillar: Version control

Keeping good documentation and structure on your files and data is not a trivial job. It requires constant effort and discipline, and the will to go through all your files every now and then to tidy them up. This process becomes even more complicated when you want to store different versions of various documents, code or results. It is not always desirable to throw away older versions of an article in progress or of a piece of code, as you might want to go back to such a version for reference.

To address this kind of issue, some smart people developed software to perform what is known as ‘version control’. A version control system (VCS) is a program that helps you store different versions of a set of files in a structured way. It allows you to quickly browse through different versions and to compare them. Each version is associated with a short message that describes what changed in that specific revision. It is therefore easier to keep track of the progress of a project, both for you and for whoever has access to it.
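To show the idea of versions, messages and comparisons, here is a toy sketch. It is not how real version control software is implemented and it is no replacement for it: the class `ToyHistory` and its methods are invented for this illustration, and it only tracks a single piece of text in memory.

```python
import difflib


class ToyHistory:
    """Toy illustration only: stores versions of one text, each with a message."""

    def __init__(self):
        self._versions = []  # list of (message, text) pairs, oldest first

    def commit(self, text, message):
        """Store a new version and return its number."""
        self._versions.append((message, text))
        return len(self._versions) - 1

    def log(self):
        """List every version number together with its message."""
        return [(number, message) for number, (message, _) in enumerate(self._versions)]

    def diff(self, old, new):
        """Return the line-by-line differences between two stored versions."""
        old_lines = self._versions[old][1].splitlines()
        new_lines = self._versions[new][1].splitlines()
        return list(
            difflib.unified_diff(
                old_lines,
                new_lines,
                fromfile=f"version {old}",
                tofile=f"version {new}",
                lineterm="",
            )
        )


history = ToyHistory()
first = history.commit("Results\nmean = 2.0", "Add first draft of results")
second = history.commit("Results\nmean = 2.1", "Correct the mean after re-measuring")
print(history.log())
print("\n".join(history.diff(first, second)))
```

Reading the log tells you why each version exists, and the diff tells you exactly which line changed between them.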

What is in it for you?

These four basic pillars are, in my opinion, the most important aspects of reproducible research. But why should you bother to make your research reproducible? Well, here is a short summary of the benefits of this practice:

Your research and your results are well structured and searchable
It is easier for colleagues to understand and contribute to your work
Your results are more reliable when you publish your analysis along with them
It is easier for you to understand your own results after long periods of time
If small adaptations are needed to your results or figures, it is fairly easy to adjust them

I-Want-a-Phd