I was hopping mad after I read "For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights."
Equating my work with a janitorial service? The nerve!
But admiring the view from my window and investigating two reports of possible data corruption in one day took precedence over hyperventilating about one ill-informed article.
How many times does the author have to quote men saying that data science is "sexy" and data wrangling/munging/cleaning is not? Notice that only the men say that. The women speak more holistically about data work.
If the majority of our time--whether it is the 50-80% quoted in the article or the 80-90% I hear in meetings with other data veterans--is spent on data preparation, then doesn't that make it our "real" work?
I'm going to risk stating the painfully obvious:
SCIENCE IS BUILT UPON A FOUNDATION OF DATA. IF WE DO NOT ENSURE THE INTEGRITY OF THE DATA, THEN THE ENTIRE SCIENTIFIC ENTERPRISE COLLAPSES.
It's all about the data. And data support work is a necessary and critical step in order to get correct answers. Otherwise, it is GIGO (garbage in, garbage out).
It's late, and I need to write a tutorial to teach others how to use the open-source data language R to read and manipulate GRIdded Binary (GRIB) weather data from NOAA/NCEP in order to answer their real-world questions.
After that, I'll be writing tutorials to teach techniques for data fusion--combining different datasets--for new insights.
I'll do that in tandem with curation of an old dataset made for a defense purpose, but with value to many fields. This requires writing new documentation to introduce the dataset to a new audience of researchers in disciplines as disparate as computer vision/pattern recognition and wind energy. (Introducing non-expert users to new-to-them data has to be done carefully because terminology varies between fields. That deserves a post of its own.)
OK, this won't all get done in one night. More later.