Starting page

Introduction

REproDucible ComputAtional Research, or simply REDCAR, is a project initiated by TU Delft HumTechLab. The goal of the project is to help students and researchers make their computational results reproducible, more easily understandable, and accessible to others.

Background

Big data, powerful computers, and programming languages such as Python and R have brought quantitative research to a whole new level. Now you can apply a machine learning algorithm to train a car or a drone to operate autonomously with 2 lines of code (proof).

Along with the opportunities, this technological leap brought complications. Scientists became overwhelmed by all the details that have to be taken into account while conducting a computational study. Under deadline pressure, they are forced to choose between "quick-and-dirty" and reproducible research, and the choice is rarely in favor of the latter. Manual corrections of the raw data, conflicting versions of software packages, and a lack of instructions on how the code should be executed make it hard, if not impossible, to reproduce the results of a computational study. As a result, more and more scholars have started to highlight the importance of reproducibility, propose ways to achieve it, and pose it as a minimum standard for assessing the value of scientific claims [1,2].

But what do we mean by reproducible? In this project we will use the definition introduced in [3]:

"The ability to implement, as exactly as possible, the experimental and computational procedures, with the same data and tools, to obtain the same results."

We won't talk about empirical or statistical reproducibility; instead, we focus on computational reproducibility [4].
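
As a minimal illustration of computational reproducibility in this sense, the Python sketch below fixes the random seed of a toy simulation and records the interpreter version next to the result. The function name `run_simulation` and the seed value are hypothetical, chosen only for this example.

```python
import platform
import random


def run_simulation(seed, n_samples=1000):
    """Toy 'study': the mean of n_samples pseudo-random draws."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    draws = [rng.random() for _ in range(n_samples)]
    return sum(draws) / n_samples


# Same seed and same tools give the same result on every run.
first = run_simulation(seed=42)
second = run_simulation(seed=42)
print(first == second)  # prints True

# Record the tool version alongside the result so others can match it.
print({"python": platform.python_version(), "result": first})
```

Without the fixed seed, two runs would most likely disagree, and a reader would have no way to tell a bug from ordinary randomness.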

Alright. Now you understand what reproducibility is and why it's important. But how do you ensure it? In short, you need to follow the rules described in [1] and use the tools and practices that were developed for that purpose (see, e.g., [5]). But in the REDCAR project, you'll find a bit more than that.

REDCAR

With the REDCAR project, we aim to achieve more than reproducibility 😎. We realize how important the structure of a study, the formatting of its code, and its accessibility to other researchers or the general public are. We translated these additional principles into 2 extra components: understandable and shared (see Figure on top).

Understandable here stands for how easily others can figure out what you have done. For example, if your project folder looks like this, then "Houston, we have a problem." The same holds for the code. It's much easier to reuse and modify a program that was written in compliance with coding standards. Suppose you come across two variables, d = 5 and elapsed_time_in_days = 5. Which one is better?
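
To make the comparison concrete, here is a small Python sketch with the same calculation written twice. The function names and the speed value are made up for illustration; only the two variable names come from the text above.

```python
# Hard to follow: the reader has to guess what d, s and r stand for.
def calc(d, s):
    r = d * s * 24
    return r


# Easier to follow: names carry the meaning and the unit.
HOURS_PER_DAY = 24


def distance_travelled_km(elapsed_time_in_days, speed_km_per_hour):
    elapsed_time_in_hours = elapsed_time_in_days * HOURS_PER_DAY
    return elapsed_time_in_hours * speed_km_per_hour


# Both functions compute the same number; only readability differs.
print(calc(5, 10) == distance_travelled_km(elapsed_time_in_days=5, speed_km_per_hour=10))
```

The second version is longer, but a reader can check it without asking the author what each letter means.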

If you want to share the results of your computational research, there may be no need for others to install Anaconda Distribution or NetLogo. Instead, you can use tools such as Binder Project, Google Colab or NetLogo Web. All of them allow a person to execute the code in the cloud and therefore significantly simplify the process of sharing and collaboration.

We linked all these 3 components into a system and supported them with tools and practices. They are distributed across 3 workshops and can be found in the directories with corresponding names. By following the workshops in sequence, you will ensure that your research is (a) reproducible, (b) more easily understandable, and (c) accessible to others.

├── 1-reproducible
│   ├── 1.1-get-started-anaconda.md   <- Create virtual environment and example project structure
│   ├── 1.2-get-started-git.md        <- Learn the basics of Git and GitHub
│   ├── 1.3-git-jupyterlab.md         <- Setup JupyterLab extensions to make life easier
│
├── 2-understandable
│   ├── 2.1-workflows.md              <- Workflows for data science and simulation studies
│   ├── 2.2-better-code.md            <- Practice standards, conventions and common sense
│   ├── 2.3-case-study.md             <- Try it all on a case study
│
├── 3-shared
│   ├── 3.1-setup-binder.md           <- Make MyBinder.org work with your repo
│   ├── 3.2-colaboratory.md           <- Try Google Colab as an alternative
│   ├── 3.3-aws-s3.md                 <- Store your large data set on AWS servers

To get the most out of the project, we invite you to come and participate in the hands-on exercises. But if that doesn't work out, get your hands dirty with the tutorials by yourself 💪. We tried to make them as clear as possible so they can serve as cheat sheets as well. Forgot something? Just open the book and follow the instructions.

To be prepared

To participate in the workshops you will need a laptop and a couple of tools installed. The preparation process will take less than 30 minutes.

  • Download and install Anaconda Distribution with Python 3.7 from here. The process is pretty straightforward: select your operating system, download the installer and follow the steps. If you already have it, make sure that it works by running any script in JupyterLab (that's the IDE that we will work in). If you prefer to use the R programming language, no problem! After installing Anaconda Distribution, open it and install RStudio. To use R in Jupyter Notebook, follow this simple tutorial.

  • Install Git from here. The same principle works here: select your operating system and follow the steps.

  • Create a GitHub account here. Don't forget about the GitHub Student Developer Pack. It provides free access and discounts to plenty of services and tools.
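
If you want to check the setup from within JupyterLab, a minimal sketch like the one below may help. It only uses the Python standard library; the helper name `check_setup` is hypothetical, and the check for Python 3.7 or newer is an assumption based on the version mentioned above.

```python
import shutil
import sys


def check_setup(tools=("git", "conda")):
    """Return a dict that maps each check to True (ok) or False (missing)."""
    report = {"python>=3.7": sys.version_info >= (3, 7)}
    for tool in tools:
        # shutil.which returns the executable's path, or None if not on PATH
        report[tool] = shutil.which(tool) is not None
    return report


for name, ok in check_setup().items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

On Windows, `conda` may only be on the PATH inside Anaconda Prompt, so a "missing" there is not necessarily an error.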

That's it! All set now.

Contributing & authors

We're highly interested in your opinion on the project! To contribute, please either fork it and submit a pull request, or contact us via Twitter or email.

Mikhail Sirenko @mikhailsirenko, Nicolas Dintzner, Jason R. Wang @jasonrwang, Trivik Verma @TrivikV, and Bartel Van de Walle @bvdwalle.

License

CC-BY-NC-SA-4.0

Acknowledgements

While working on this project, the REDCAR team was inspired by Cookiecutter Data Science [6], made by the friendly folks at DrivenData, and by the Reproducible Research module of the Data Science Specialization [7] by Jeff Leek, Roger D. Peng and Brian Caffo.

We would also like to thank Jan Kwakkel, Igor Nikolic and Alexander Verbraeck for their input in shaping the project.

References

  1. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. PLoS computational biology. 2013 Oct;9(10).

  2. Peng RD. Reproducible research in computational science. Science. 2011 Dec 2;334(6060):1226-7.

  3. Goodman SN, Fanelli D, Ioannidis JP. What does research reproducibility mean? Science translational medicine. 2016 Jun 1;8(341):341ps12.

  4. Stodden V. Reproducible research: Tools and strategies for scientific computing. Computing in Science & Engineering. 2012 Jul;14(4):11-2.

  5. Stodden V, Leisch F, Peng RD, editors. Implementing reproducible research. CRC Press; 2014 Apr 14.

  6. DrivenData. Cookiecutter Data Science. Available from https://drivendata.github.io/cookiecutter-data-science/ [Accessed 03 March 2020].

  7. Coursera Inc. Data Science Specialization. Available from https://www.coursera.org/specializations/jhu-data-science [Accessed 03 March 2020].
