How to optimize a data pipeline

This is a personal reflection after working for more than two years at Datawheel. My opinions are my own and in no way reflect those of my former employer.

What I’ll try to explain here is the famous data science workflow diagram from R4DS, from an angle that is rarely covered:

Because what I express here is a personal perspective, I shall first present my background, so you can be aware of potential biases in this account. I am Chilean and did my undergraduate studies in Chile. I also lived in China, where everything was different from what I knew back home.

My major was in Business Administration and Economics, but I also took more than 10 courses from other departments, including a graduate diploma – courses that ranged from optimization, real analysis, and game theory to international studies. You can imagine I had a lot to read and a lot of theorems to prove, but I never wrote code before my senior year, and I soon found that you can’t do serious data science in a GUI.

In Chile, college involves an extra year after the bachelor’s degree, somewhat similar to an applied master’s degree. My thesis advisor was Dr. Edgar E. Kausel, a psychologist by training who knows a lot of statistics and how to solve ODEs/PDEs. I was lucky to have such a “Renaissance man” as an advisor! It is because of Edgar’s influence that I now write R packages and have a penchant for linear models.

After that, I worked for more than two years at Datawheel. When I started, I assumed that most of my day would be about data wrangling, but I was totally wrong. I also thought that afterwards I could move to the MIT Media Lab, and I was wrong again.

I built a part of Data Chile, a project I joined at its very beginning. I was also in charge of curating data for The Observatory of Economic Complexity, a project I wish I had started but joined at a later stage.

Data Chile was a good bootcamp for human skills. Because public data is often sent by email, or you have to visit an office with a USB stick to obtain it, I got used to asking for data and explaining why I needed it and how I was going to use it. I interacted a lot with lawyers who work for the government. The project involved a lot of design, and I put a ton of effort into making the data as good as possible; the superb sketches and designs I received were what moved me to work well. Data Chile was data-cleaning intensive, not data-modeling intensive, and the lack of documentation made me make a lot of phone calls and write tons of emails.

The Observatory of Economic Complexity was really different. Instead of interacting with humans, obtaining data was simple: I only had to read the UN Comtrade API documentation, and human interaction was limited to colleagues who taught me a lot about design and front-end. Unlike Data Chile, the OEC made me set up a server (my laptop was not enough) and put a lot of effort into modeling and into writing performant code. The raw data was quite clean, but processing it demanded a lot of thinking about metrics and aggregations. I released all the scripts I wrote on GitHub, where you can find that I used parallelization and Rcpp after removing bottlenecks. What’s more, I documented everything here using gh-pages and bookdown.
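The actual OEC scripts live on GitHub, but as a minimal sketch of what I mean by parallelization: R’s base parallel package lets you split independent chunks of work (for example, one trade year per worker) across cores. The `process_year` function and the computation inside it are hypothetical placeholders, not the real scripts:

```r
library(parallel)

# Hypothetical per-year job: in real scripts this would read and
# aggregate one year of trade data; here it is a stand-in computation.
process_year <- function(year) {
  sum(seq_len(year %% 100))
}

years   <- 2000:2005
# Fork-based parallelism (Unix/macOS); each year is processed on its own core
results <- mclapply(years, process_year, mc.cores = 2)
```

On Windows, `mclapply` with `mc.cores > 1` is not supported; `parLapply` with a PSOCK cluster is the usual substitute.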

Data can be challenging, no matter if it comes from human transcripts on a spreadsheet or straight from IoT devices. We haven’t yet developed the political, social, or legal environments needed for better data, and because of that, an important part of my work consisted of building and maintaining ties.

Your ability to communicate clearly determines how well you can perform at work. Data analysis skills improve if and only if communication skills improve at the individual, within-group, and between-group levels.

You may well be a Tidyverse wizard or even an Rcpp wizard. It won’t matter if your badly formulated explanations mislead the people who use your data. Another common situation: you realize your documentation is limited and your questions were poorly prepared, so the survey or dataset creators cannot give you the answers you need.

A good data analyst or data scientist – and those are different roles – knows how to interact with a lawyer or a designer. If you don’t succeed at communicating, then all the steps in your data pipeline are useless.

Do you know all the articles in the Constitution? Do you know how to use Photoshop really well? Probably not. The mirror image of this is that when you have to discuss legal aspects of data or design aspects of data-driven websites, the people interacting with you don’t necessarily know how to use R or Python. Keep it simple for them, and don’t be afraid to talk at length as long as you express ideas rather than code.

While it is necessary to read books and blogs and take MOOCs to keep learning, your brain needs both fiction and non-fiction. How to Win Friends and Influence People is a book that can help you optimize your data pipeline in an unexpected way.

Please read! I thought all the theorems I studied in college were useless until, while fitting some regressions, I had a eureka moment: “wow, ANOVA is linear regression”. It is because of the theorem-proof approach that I can sit down and code efficiently – not because of the theorems themselves, but because they taught me to think and to look at things from different angles.
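To make that eureka moment concrete, here is a small sketch in R: a one-way ANOVA is just a linear regression on group indicator (dummy) variables, so both fits yield the same F statistic. The built-in PlantGrowth dataset is my choice for illustration, not from the original analysis:

```r
# One-way ANOVA and linear regression on the same formula
fit_lm  <- lm(weight ~ group, data = PlantGrowth)   # regression with dummies
fit_aov <- aov(weight ~ group, data = PlantGrowth)  # classic ANOVA

# Both produce the same F statistic and p-value for the group effect
anova(fit_lm)
summary(fit_aov)
```

Under the hood, `aov()` even calls `lm()`; the two differ only in how results are summarized.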

If you pick up a Measure Theory book, there’s a chance you will code better in a few months. The same applies if you pick up One Hundred Years of Solitude or On the Road. A data analyst with a rich lexicon can explain code to people who have never coded before and who have no idea what %>% is.