Data wrangling is a necessary process when working with big data; with most data, in reality. This opinion piece is not meant to diminish its importance. Nor is this to be confused with Data Engineering. But I will argue that data wrangling is career strangling, in that it is holding you back in your career progression. Let me explain…
Firstly, let’s agree that the whole basis of big data is to whittle it down to little data, that we call “Insights”. The point of any data analysis is to identify a trend or anomaly. The point of a machine learning model is to find a set of defined patterns or assign a probability.
Observe any Data Scientist or Analyst presentation and the only pieces that get talked about are the Insights and the model. Zero time is spent explaining how the data was wrangled, despite that being 60-80% of the effort.
I am making the argument that data wrangling is low-level, tedious work that is wasted when an expensive resource such as a Data Scientist or Data Engineer or Analyst decides to take this on.
The best consultants know that:
You don’t get paid for the hour. You get paid for the value you bring to the hour.
The more time you spend on lower value work, the more you diminish your value.
And if you’re an Analyst / Data Scientist spending a greater portion of your time wrangling data, that’s much less time you’re spending on understanding the data, much less time you’re spending on analyzing the data, and much less time you’re spending on delivering business value from the data.
When it comes to big data, I believe that folks are starting to realize that robust software engineering practices need to be put in place to ensure quality of the data pipeline and #datagovernance. …Cue the Data Engineer.
In today’s episode (Aug 14) of the Digital Analytics Power Hour (a wonderful podcast, btw), there was a great discussion about raw data and data virtualization. I didn’t feel that there was any consensus, so I’ll throw in my 2 cents.
A company must adopt a tool or process to virtualize the raw data for the Data Scientists and Analysts. Drawing from software principles, the solution — built in-house or purchased — must be robust, scalable, extendable, and re-usable.
This will save an immense amount of time (and headache).
For example, when working with raw clickstream data, you have billions of atomic events. In most cases, identity resolution is required over a specified period of time. If every Data Scientist or Analyst is starting with the raw data, I guarantee that each will resolve the identity in a different manner (different “code”). This leads to multiple, inconsistent “truths”. The Analysts / Data Scientists should only work from a consistent, consolidated schema for the vast majority of cases.
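To make that concrete, here is a minimal sketch of what a single, shared identity-resolution step might look like. Everything in it is my own assumption for illustration: the `Event` fields, the function names, and especially the rule that the earliest login seen on a device within the window wins. A real pipeline would pick its own rule; the point is only that the rule lives in one place.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical raw clickstream event (field names are assumptions)."""
    ts: int                  # event time, epoch seconds
    device_id: str           # anonymous cookie/device identifier
    user_id: Optional[str]   # present only on logged-in events
    action: str


def resolve_identities(events, window_start, window_end):
    """Map each device_id seen in [window_start, window_end) to one identity.

    Assumed rule: the earliest login observed on a device within the
    window wins; devices with no login get an "anon:" identity.
    """
    in_window = [e for e in events if window_start <= e.ts < window_end]
    canonical = {}
    for e in sorted(in_window, key=lambda e: e.ts):
        if e.user_id is not None and e.device_id not in canonical:
            canonical[e.device_id] = e.user_id
    return {
        e.device_id: canonical.get(e.device_id, f"anon:{e.device_id}")
        for e in in_window
    }


def to_consolidated(events, window_start, window_end):
    """Return the consolidated rows that analysts would query."""
    ids = resolve_identities(events, window_start, window_end)
    return [
        {"ts": e.ts, "identity": ids[e.device_id], "action": e.action}
        for e in sorted(events, key=lambda e: e.ts)
        if window_start <= e.ts < window_end
    ]
```

If every Analyst and Data Scientist reads from the output of something like `to_consolidated` instead of re-deriving identities from the raw events, then a change to the resolution rule happens once and everyone's numbers move together.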
So, when I say “Data wrangling is career strangling”, it’s because you’re devoting too much time to work with a lower-assigned value.
[Tangential anecdote: I use Salesforce a lot in my work. If I’m to be diligent, the data entry could take up to 4 hrs a week. I hired a VA, on my own dime, to handle this. This allows me to spend more time on higher value (and quite frankly, more fun) tasks. I value my time.]
In the end, businesses are results-oriented. If you can produce more positive business results in a shorter time frame, then your career trajectory will move up-and-to-the-right at an accelerated pace.
And it’s a compounding factor. Those who produce results are given more opportunities. The sooner you produce results, the sooner those opportunities present themselves.
Focus on value delivered.
The faster you iterate, the faster you grow.