Workshop 2 Understandable
The goals of the second workshop are:
show you how to organize your study so that others (fellow students, researchers or programmers) can easily understand what an amazing job you did;
remind you how to name variables and functions efficiently, and why code formatting is important (PEP-8, Black, etc.);
practice all of this on a case study.
Let's talk about workflows first. What is a workflow in simple terms? Well, it's a flow of work: the steps that you need to undertake to solve a problem or a task within a domain (e.g. data science, engineering). A workflow can be pretty extensive, prescribing the tools that support each of the actions, or more high-level (a so-called step-wise approach). Here is a formal definition from the Business Process Management Center of Excellence Glossary (Wikipedia contributors, 2020):
"Workflow is an orchestrated and repeatable pattern of business activity enabled by the systematic organization of resources into processes that transform materials, provide services, or process information."
Even though different domains require domain-specific knowledge to define a workflow and follow it, there are certain similarities across problem-oriented workflows. You start with a problem, do something in the middle and propose a solution or an insight at the end. Here is a great example:
Workflows are usually depicted with a diagram. Data science studies are often depicted as directed acyclic graphs (DAGs), whereas in reality the process looks more like the System Dynamics one (Richardson & Pugh, 1981): going back and forth from one stage to another.
Sounds obvious, of course, but there are a couple of important implications to keep in mind. First, qualitative and quantitative workflows have a similar skeleton. Compare the step-wise approach proposed by Hermans & Cunningham (2018) with the workflow for data scientists by Byrne (2017):
Such a high-level similarity can help establish transdisciplinary collaboration and promotes a joint understanding of the work process.
Second, it is important to remember that you are addressing a problem or a question (and not the method). Identification of the "right" problem is time-consuming (as is finding an appropriate method). Remember Russell Ackoff's famous saying?
"Successful problem solving requires finding the right solution to the right problem. We fail more often because we solve the wrong problem than because we get the wrong solution to the right problem."
In the case of a data science project this may seem trivial. The methods allow you to predict, to cluster, or to try to explain a certain phenomenon. But once you start working, you may realize that a different angle seems more promising. Just remember: deviating from the original problem or doing research in an exploratory fashion can, on the one hand, bring unexpected benefits, but, on the other hand, add an extra burden and shift your deadlines.
Finally, students should remember that the workflow ≠ the thesis structure. While it seems attractive to have a set of predefined steps, the modeling process does not equal the scientific research process. Models serve to help answer research questions, propose solutions or evaluate policies. They are instruments of analysis.
Here we have two things to discuss: how to organize your project (the files) and how to write your code.
The project structure follows from the chosen workflow. For example, if you are doing data science research and decided to follow the workflow from above, you should have Jupyter Notebooks and scripts dedicated to each of the "steps." Simulation studies conducted "purely" with programming languages (see the amazing and free Think Complexity by Allen Downey) follow the same logic (i.e. separate the simulation model from the analysis of experiments). If you are using dedicated simulation modeling software, for example Simio, you are forced to follow its internal logic.
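For illustration only, a data science project that follows such a step-wise workflow might be organized like this (the folder and file names below are an assumption, not a prescription):

```
project/
├── data/                            # raw and processed data (or how to obtain it)
├── notebooks/
│   ├── 01_get_data.ipynb            # one notebook per step of the workflow
│   ├── 02_explore_and_visualize.ipynb
│   └── 03_model.ipynb
├── src/                             # reusable functions imported by the notebooks
├── results/                         # figures and tables that go into the report
└── README.md                        # what the project is about and how to run it
```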
It is important how you name your variables and functions and how you comment your code. Consider a variable called blah4. Yes, that's a real variable! It is a data frame with a bus schedule inside. The name was given by a junior employee of a mid-size data science company working in sustainable transportation. And you know what, it is not a problem! First make it work, then make it right! The problems begin when he is placed on another project. When he comes back after a week or a month, how easily will he figure out what blah4 stands for? So instead of exercising your memory, try to name your variables in a human-readable format: bus_schedule. Easy-peasy, right?
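As a small, made-up illustration of the difference (the data frame and the function are hypothetical; only the naming is the point):

```python
import pandas as pd

# A tiny stand-in for the real bus schedule data.
trips = pd.DataFrame({
    "line": [1, 1, 2],
    "departure": ["08:00", "08:15", "08:05"],
    "delay_minutes": [2, 7, 0],
})

# Hard to understand a month later: what is blah4 and what does f() return?
blah4 = trips
def f(x):
    return x[x["delay_minutes"] > 5]

# Readable, PEP-8 style snake_case names that say what the objects are.
bus_schedule = trips
def select_delayed_trips(schedule, min_delay_minutes=5):
    """Return the trips delayed by more than `min_delay_minutes`."""
    return schedule[schedule["delay_minutes"] > min_delay_minutes]

print(select_delayed_trips(bus_schedule))
```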
Guess what is another way to reduce the effort needed to understand the code? Exactly! Commenting it! However, commenting is also tricky: you do not want to overwhelm yourself, a colleague or another researcher with a novel on how you did this and that. Here is a trick: if we follow a certain workflow, name our notebooks according to the steps, and our variable and function names are fine, then far fewer comments are needed! And of course, communities help us with standards (e.g. PEP-8 or Vensim naming conventions).
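A rough illustration of the difference between a comment that explains why and one that merely repeats the code (the function and the numbers are made up):

```python
def cases_per_100k(cases, population):
    """Return the number of cases per 100,000 inhabitants."""
    # Why: municipalities differ in size, so raw counts are not comparable.
    return cases / population * 100_000

# Not helpful: this comment only restates what the code already says.
x = 120 / 35_000 * 100_000  # divide 120 by 35000 and multiply by 100000

print(cases_per_100k(120, 35_000))
```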
Let us practice these and the tools from Workshop 1 Reproducible on a case study. We prepared 3 options for you:
Exploring emergency calls in the Netherlands;
Understanding dynamics of COVID-19 in the Netherlands;
A project that you are currently working on.
We do not state a goal for each of these studies; you can formulate it yourself. But if you feel lost, here are some options:
Predict the number of firefighter calls based on the socio-demographic and housing variables (try AutoML);
Cluster the curves of positive tests (see tslearn) and look for similarities between cities in the same cluster (e.g. population size); a rough sketch of such a clustering follows this list.
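If you go for the clustering option, a minimal sketch (assuming tslearn is installed) could look like the block below. It uses synthetic curves instead of the real per-municipality counts, and three clusters is an arbitrary choice; treat it as a starting point, not as the analysis itself.

```python
import numpy as np
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.clustering import TimeSeriesKMeans

# In the real exercise each row would be the daily positive-test curve of one
# municipality; here we generate synthetic curves so the sketch runs on its own.
rng = np.random.default_rng(42)
n_municipalities, n_days = 30, 120
curves = np.cumsum(rng.poisson(lam=2.0, size=(n_municipalities, n_days)), axis=1)

# Scale each curve so that its shape, not its absolute size, drives the clustering.
scaled = TimeSeriesScalerMeanVariance().fit_transform(curves)

# Cluster the curves with DTW-based k-means (k=3 is an arbitrary assumption here).
model = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=42)
labels = model.fit_predict(scaled)

for cluster in range(3):
    print(f"Cluster {cluster}: municipalities {np.where(labels == cluster)[0]}")
```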
The data can be found in the data sets table near the end of this page. We expect you to do a minimalist study:
Formulate a problem or a hypothesis;
Get the data;
Visualize it;
Apply a simple model (a rough sketch follows this list);
Report your findings;
Upload the work to GitHub.
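To make these steps concrete, here is a minimal sketch for the firefighter-calls option, roughly covering "get the data" and "apply a simple model". The file name and the predictor columns are made up; swap in the variables of the actual data set, and feel free to replace the plain linear regression with an AutoML run.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical file and column names; adjust them to the data set you use.
data = pd.read_csv("data/firefighter_calls.csv")
features = ["population", "mean_income", "share_old_buildings"]  # made-up predictors
target = "n_calls"                                               # made-up target

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data[target], test_size=0.2, random_state=42
)

# A plain linear regression as the "simple model" step.
model = LinearRegression().fit(X_train, y_train)
print("MAE on the test set:", mean_absolute_error(y_test, model.predict(X_test)))
```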
References:
Wikipedia contributors. (2020, March 23). Workflow. In Wikipedia, The Free Encyclopedia. Retrieved March 23, 2020, from https://en.wikipedia.org/w/index.php?title=Workflow&oldid=946935282
Hermans, L., & Cunningham, S. W. (2018). Actor and Strategy Models. Wiley Blackwell.
Byrne, C. (2017). Development Workflows for Data Scientists. O'Reilly Media.
Richardson, G. P., & Pugh, A. L., III (1981). Introduction to System Dynamics Modeling with DYNAMO. Productivity Press.
Warning: These are just exercises that we propose you work on in class. Without a doubt, such a simple analysis cannot capture the whole complexity of any of these problems. So be aware and do not jump to conclusions.
Tips: Try the Wayback Machine to get past data. For example, you can complete the RIVM data set on COVID-19 by municipality with snapshots of the RIVM page archived on earlier dates.
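A hedged sketch of how such a lookup could work in Python with the Wayback Machine availability API (the RIVM URL below is only a placeholder, not the exact query used in the workshop):

```python
import requests

# Ask the Wayback Machine for the snapshot closest to a given date.
API = "https://archive.org/wayback/available"
params = {
    "url": "https://www.rivm.nl/coronavirus-covid-19/actueel",  # placeholder page
    "timestamp": "20200401",  # YYYYMMDD: the date you want the data for
}

response = requests.get(API, params=params, timeout=30)
closest = response.json().get("archived_snapshots", {}).get("closest", {})
if closest.get("available"):
    print("Snapshot found:", closest["url"])
else:
    print("No snapshot archived near that date.")
```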
So, select an exercise and roll on! After you finish the exercise, we will be happy to check it out and give some feedback. Just send us the link to your GitHub repo.
[Table: links to the original data sets and to processed sample data sets]
When? | What?
10:15 - 10:30 | Getting ready with BBB
10:30 - 10:45 | Recap of Workshop 1
10:45 - 11:00 | REDCAR project and Workshop 2 introduction
11:00 - 11:30 | Talking about workflows
11:30 - 11:45 | Break
11:45 - 12:15 | How to write better code?
12:15 - 13:00 | Starting with a case study