Workshop 2 Understandable

Introduction

The goals of the second workshop are:

  • to show you how to organize your study so that others (fellow students, researchers or programmers) can easily understand what an amazing job you did;

  • to remind you how to name variables and write functions effectively, and why code formatting is important (PEP 8, Black, etc.);

  • to practice all of this on a case study.

Workflows

"Workflow is an orchestrated and repeatable pattern of business activity enabled by the systematic organization of resources into processes that transform materials, provide services, or process information."

Workflows are usually depicted with a diagram. Data science studies are often depicted as directed acyclic graphs (DAGs), whereas in reality the process looks more like a System Dynamics diagram (going back and forth from one stage to another).
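
To make the contrast concrete, here is a minimal sketch using Python's standard graphlib module (Python 3.9+). The step names are hypothetical and only stand in for the stages of your own study. A tidy DAG has a valid execution order, while a single "go back and rethink" arrow turns it into a cycle that a DAG cannot express.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "get_data": {"formulate_problem"},
    "explore_data": {"get_data"},
    "build_model": {"explore_data"},
    "report_findings": {"build_model"},
}


def execution_order(steps):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(steps).static_order())


print(execution_order(workflow))

# In practice the findings often send you back to the problem statement.
# Adding that feedback arrow creates a cycle, so the graph is no longer a DAG.
iterative_workflow = dict(workflow)
iterative_workflow["formulate_problem"] = {"report_findings"}

try:
    execution_order(iterative_workflow)
    has_cycle = False
except CycleError:
    has_cycle = True

print("Feedback loop detected:", has_cycle)
```

The DAG is still a useful way to communicate the plan; the feedback arrow is a reminder that the plan will be revisited.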

This sounds obvious, of course, but there are a few important implications to keep in mind. First, qualitative and quantitative workflows have a similar skeleton. Consider the step-wise approach proposed by Hermans & Cunningham (2018) and the workflow for data scientists by Byrne (2017).

Such a high-level similarity can help establish transdisciplinary collaboration and promote a joint understanding of the work process.

Second, it is important to remember that you are addressing a problem or a question (and not the method). Identifying the "right" problem is time-consuming (as is finding an appropriate method). Remember Russell Ackoff's famous saying?

"Successful problem solving requires finding the right solution to the right problem. We fail more often because we solve the wrong problem than because we get the wrong solution to the right problem."

In the case of a data science project, this may seem trivial. The methods allow you to predict, to cluster, or to try to explain a certain phenomenon. But once you start working, you may realize that a different angle seems more promising. Just remember: deviating from the original problem, or doing research in an exploratory fashion, can on the one hand bring unexpected benefits, but on the other hand add an extra burden and shift your deadlines.

Finally, students should remember that workflow ≠ thesis structure. While it seems attractive to have a set of predefined steps, the modeling process is not the same as the scientific research process. Models serve to help answer research questions, propose solutions, or evaluate policies. They are instruments of analysis.

Better code

Here we have two things to discuss: how to organize your project (the files) and how you write your code.

The project structure follows from the chosen workflow. For example, if you are doing data science research and have decided to follow the workflow above, you should have Jupyter Notebooks and scripts dedicated to each of the "steps." Simulation studies conducted "purely" with programming languages (see the amazing and free Think Complexity by Allen Downey) follow the same logic (i.e. separate the simulation model from the analysis of experiments). If you are using dedicated simulation modeling software, for example Simio, you are forced to follow its internal logic.
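
One way to keep files aligned with the workflow is to derive the notebook names from the step names. The sketch below is only an illustration: the step names, the notebooks folder and the numbering scheme are assumptions, not a prescribed layout.

```python
from pathlib import PurePosixPath

# Hypothetical step names; replace them with the steps of your own workflow.
WORKFLOW_STEPS = ["get_data", "explore_data", "build_model", "report_findings"]


def plan_notebooks(steps, folder="notebooks"):
    """Return one numbered notebook path per workflow step, in workflow order."""
    return [
        str(PurePosixPath(folder) / f"{number:02d}_{step}.ipynb")
        for number, step in enumerate(steps, start=1)
    ]


for notebook in plan_notebooks(WORKFLOW_STEPS):
    print(notebook)
```

The numeric prefix keeps the notebooks sorted in the order of the workflow, so a newcomer can read the study from top to bottom.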

It matters how you name your variables and functions and how you comment your code. Consider a variable called blah4. Yes, that is a real variable! It is a data frame with a bus schedule inside. The name was given by a junior employee of a mid-size data science company working in sustainable transportation. And you know what, it is not a problem! First make it work, then make it right! The problems begin when he is placed on another project. When he comes back after a week or a month, how easily will he figure out what blah4 stands for? So instead of exercising your memory, give your variables human-readable names: bus_schedule. Easy-peasy, right?
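
The difference is easy to see side by side. In this sketch a plain list of dictionaries stands in for the data frame, the rows are invented placeholders, and the function names are hypothetical.

```python
# Hard to read: what is blah4, and what does f return?
blah4 = [
    {"line": "A", "departure": "08:00"},
    {"line": "B", "departure": "08:15"},
    {"line": "A", "departure": "08:30"},
]


def f(x, y):
    return [i["departure"] for i in x if i["line"] == y]


# Easier to read: the names say what the data and the function are for.
bus_schedule = blah4


def departures_for_line(schedule, line):
    """Return the departure times of one bus line, in schedule order."""
    return [trip["departure"] for trip in schedule if trip["line"] == line]


print(departures_for_line(bus_schedule, "A"))
```

Both versions compute the same thing; only the second one can be understood without opening the data.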

  • we follow a certain workflow,

  • name our notebooks according to the step,

  • variable and function names are also fine,

Case study

Let us practice these and the tools from Workshop 1 Reproducible on a case study. We prepared 3 options for you:

  1. Exploring emergency calls in the Netherlands;

  2. Understanding dynamics of COVID-19 in the Netherlands;

  3. A project that you are currently working on.

We do not state a goal for each of these studies; you can formulate it yourself. But if you feel lost, here are some options:

  1. Predict the number of firefighter calls based on the socio-demographic and housing variables (try AutoML);

  2. Cluster the positive-test curves (see tslearn) and look for similarities among cities in the same cluster (e.g. population size).

The data can be found here:

  1. Formulate a problem or a hypothesis;

  2. Get the data;

  3. Visualize it;

  4. Apply a simple model;

  5. Report your findings;

  6. Upload the work to GitHub.
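
Steps 2 to 5 above can be sketched as one small function per step. This is only an illustration: the numbers are invented placeholders rather than real measurements, the function names are hypothetical, and a straight-line least-squares fit stands in for "a simple model." A real study would load the actual data and use a plotting library.

```python
def get_data():
    """Step 2: return (x, y) observations. Invented placeholder values."""
    return [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]


def visualize(xs, ys):
    """Step 3: build a crude text chart, one row per observation."""
    return [f"{x:>3} | " + "#" * y for x, y in zip(xs, ys)]


def fit_line(xs, ys):
    """Step 4: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def report(slope, intercept):
    """Step 5: summarize the finding in one sentence."""
    return f"Fitted line: y = {slope:.2f} * x + {intercept:.2f}"


xs, ys = get_data()
print("\n".join(visualize(xs, ys)))
slope, intercept = fit_line(xs, ys)
print(report(slope, intercept))
```

Keeping each step in its own function mirrors the workflow and makes it easy to move a step into its own notebook or script later.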

References

  1. Wikipedia contributors. (2020, March 23). Workflow. In Wikipedia, The Free Encyclopedia. Retrieved March 23, 2020, from https://en.wikipedia.org/w/index.php?title=Workflow&oldid=946935282

  2. Hermans, L., & Cunningham, S. W. (2018). Actor and Strategy Models. Wiley Blackwell.

  3. Byrne, C. (2017). Development Workflows for Data Scientists. O’Reilly Media, Inc.

  4. Richardson, G. P., & Pugh III, A. L. (1981). Introduction to System Dynamics Modeling with DYNAMO. Productivity Press Inc.
