Marie-Donnie

Reproducibility in computer science

29 Apr 2019 - marie

Reproducibility is crucial for any experiment and is considered a hallmark of the scientific method. While the scientific community widely acknowledges this, many experiments still do not apply any of the methodologies that can be used to achieve it. Moreover, reproducibility is a widely used term, but it lacks a standard description of what exactly is required. We will dive a bit more into the different levels of reproducibility and how they can be achieved.


In computer science, the subject of an experiment is usually a piece of software, a software stack, or an operating system. The experiment can be used to evaluate its performance in terms of time, security, scalability, etc. Experimenting on complex software is time-consuming, error-prone, and hard to reproduce. But reproducing experiments is extremely important to broaden their scope and increase confidence in the results. There are a lot of sources on reproducibility, but the definitions often vary from one to another.

The ACM (Association for Computing Machinery) defines three levels of reproducibility [1]. It suggests a badging system for scientific teams who want to publish reproducible experiments. This system requires an artifact, which is reviewed to check which level of reproducibility is achieved. The artifact is composed of everything that can be used to test the reproducibility of the experiment, e.g., scripts or software produced to run the experiments, input or output datasets, etc. The three levels of reproducibility, according to the ACM, are:

Repeatability (Same team, same experimental setup)

The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.

Replicability (Different team, same experimental setup)

The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author’s own artifacts.

Reproducibility (Different team, different experimental setup)

The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.

These definitions are explained further in From Repeatability to Reproducibility and Corroboration [2]. The author distinguishes two other levels: variation, where someone either repeats or replicates the experiment with a modified parameter, and corroboration, where someone obtains the same results by other means.

Instinctively, we feel that the purpose of reproducibility is to verify the results of an experiment. While this is true, it is not the only reason. The main purpose of reproducing an experiment is to enlarge its scope and to get a general, basic understanding of the measured system. Each level of reproducibility helps towards this goal.

Repeatability

Repeatability is defined as the exact repetition of an experiment under the same conditions. It is the minimum level expected for any experiment. How can one trust the results of an experiment if even the original team cannot duplicate their own results? Though this does not really increase confidence in the experimental results, it paves the way towards a usable artifact by pushing the experimenter to package everything and list all required configurations [2]. Usually the original team of experimenters repeats the experiment to verify their own results before publishing them, but also because they will probably need to execute the workflow multiple times to refine the experimentation code (bug fixing, scalability issues, etc.).
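One way to make "stated precision" concrete is to run the experiment several times and compare the spread of the measurements to a tolerance chosen beforehand. The sketch below is only an illustration under that assumption: `run_experiment`, `is_repeatable`, and the 5% tolerance are hypothetical names and values, not something the ACM definition prescribes.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for a real measurement (e.g. a latency in ms)."""
    rng = random.Random(seed)
    # Simulated measurement: a base value of 100 plus small bounded noise.
    return 100.0 + rng.uniform(-1.0, 1.0)


def is_repeatable(measurements: list[float], rel_tolerance: float) -> bool:
    """Return True if the relative spread of the trials stays within tolerance.

    The spread is the sample standard deviation divided by the mean
    (coefficient of variation). The tolerance is an assumption that the
    experimenter has to state up front.
    """
    mean = statistics.mean(measurements)
    spread = statistics.stdev(measurements)
    return spread / abs(mean) <= rel_tolerance


if __name__ == "__main__":
    trials = [run_experiment(seed) for seed in range(10)]
    print("repeatable within 5%:", is_repeatable(trials, rel_tolerance=0.05))
```

The coefficient of variation is just one possible measure of spread. A confidence interval or a maximum absolute deviation may suit a given experiment better.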

Requirements

To repeat an experiment, one needs access to the same or an equivalent infrastructure and to the artifact used for the experiment. When running any experiment, it is highly recommended to build a solid set of scripts or, better yet, a proper piece of software. It might feel time-consuming, but good libraries lessen the burden, and re-executing the experiment becomes much faster. Moreover, it helps prepare for further levels of reproducibility.
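As a minimal sketch of what such a script could look like, the example below runs a seeded trial and stores the result next to a small manifest describing the parameters and the environment. Everything here is hypothetical (`build_manifest`, `run_trial`, `run_and_record`, the `samples` parameter, the `run_001.json` file name), and the manifest fields are an assumption about what is worth recording; a real experiment would likely need more, such as package versions or hardware details.

```python
import json
import platform
import random
import sys


def build_manifest(params: dict, seed: int) -> dict:
    """Collect what someone would need to re-run the experiment."""
    return {
        "params": params,
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


def run_trial(params: dict, seed: int) -> float:
    """Hypothetical experiment: a seeded computation standing in for a benchmark."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(params["samples"])) / params["samples"]


def run_and_record(params: dict, seed: int, path: str) -> dict:
    """Run one trial and write the result next to its manifest as JSON."""
    record = {
        "manifest": build_manifest(params, seed),
        "result": run_trial(params, seed),
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
    return record


if __name__ == "__main__":
    record = run_and_record({"samples": 1000}, seed=42, path="run_001.json")
    print(record["manifest"])
```

Keeping the seed and the parameters in the same file as the result means a later run can be started from the record itself instead of from memory.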

Replicability

Replicability provides important data about robustness to slight variations between the subjects of an experiment and the replicas (even though the experimental tools are supposed to be the same). In theory, replicating means repeating the same experiment under the same conditions. But as we know, the exact state of a complex system is really hard, if not impossible, to reach again [3][4]. So when replicating an experiment, we have to take into account that the results might not be exactly the same; what matters is that the trend is preserved.
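"Keeping the trend" can be interpreted in several ways. One simple reading, sketched below, is that the original run and the replica rank the tested configurations in the same order, even if the absolute numbers differ. The function names, configuration names, and response times are made up for illustration.

```python
def ranking(results: dict[str, float]) -> list[str]:
    """Order configuration names from lowest to highest measured value."""
    return sorted(results, key=lambda name: results[name])


def same_trend(original: dict[str, float], replica: dict[str, float]) -> bool:
    """True if both runs rank the same configurations in the same order."""
    if original.keys() != replica.keys():
        raise ValueError("both runs must cover the same configurations")
    return ranking(original) == ranking(replica)


if __name__ == "__main__":
    # Hypothetical response times (ms) for three made-up configurations.
    original = {"1_node": 120.0, "2_nodes": 75.0, "4_nodes": 48.0}
    replica = {"1_node": 131.0, "2_nodes": 80.0, "4_nodes": 55.0}
    print(same_trend(original, replica))
```

A rank comparison is fragile when two configurations have nearly equal results, so a real comparison would probably also look at how far apart the measurements are.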

Requirements

TLDR

There is an analogy in Feitelson’s article [2], taken from Schmidt’s paper [5]:

“Imagine, I were to show you a knife I have recently invented that cuts stone as easily as butter. I demonstrate this several times (replication) to you by cutting pieces of stone. You might be impressed by the demonstration but not really convinced that it works. However, you might be a bit more convinced if I were to demonstrate it again on a different type of stone, and even more convincing if I were to give you the knife and you (as a different person) were also able to cut one of my stones. But there might still be a trick or something wrong with the material I am employing. So you will be a lot more convinced if you could repeat the experiment in your home (different place). But I think the most convincing strategy of all would be to give you a proper description of how to produce such a tool, so you can manufacture your own different knife, completely independently from what I have done.”

I find this pretty accurate, though it refers to the variations of an experiment (different place) that I did not talk about much.

I illustrated this in the figure below. Blue cuts the stone with the knife she crafted. This is the experiment. She does it several times with the same knife; this is repetition. She then gives her knife to Green, who replicates the experiment on the same stones. Finally, Green is given instructions to produce his own knife, and reproduces the experiment by interpreting these instructions.

Figure 1: The stone cutter experiment repeated, replicated and reproduced.

Tools

Further reading

Some unquoted sources for really interested people

These links are some articles I’ve read for the production of this article, mostly about tools to repeat/replicate/reproduce the experiments:

Sources