Monday, August 18, 2014

Janitor or sexy librarian?

I was hopping mad after I read "For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights."

Equating my work with a janitorial service?  The nerve!

But, admiring the view from my window and investigating two reports of possible data corruption in one day took precedence over hyperventilating about one ill-informed article.
After rereading the article, I don't think it's as bad as the headline would suggest. Steve Lohr is only guilty of selecting unfortunate quotes and choosing to interview data gold rush miners while ignoring data veterans in the government.

How many times does he have to quote men saying that data science is "sexy" and data wrangling/munging/cleaning is not?  Notice that only the men say that.  The women speak more holistically about data work.

If the majority of our time--whether it is the 50-80% quoted in the article or the 80-90% I hear in meetings with other data veterans--is spent on data preparation, then doesn't that make it our "real" work?

I'm going to risk stating the painfully obvious:

SCIENCE IS BUILT UPON A FOUNDATION OF DATA.  IF WE DO NOT ENSURE THE INTEGRITY OF THE DATA, THEN THE ENTIRE SCIENTIFIC ENTERPRISE COLLAPSES.

It's all about the data.  And data support work is a necessary and critical step in order to get correct answers.  Otherwise, it is GIGO (garbage in, garbage out).

It's late, and I need to write a tutorial to teach others how to use the open-source data language R to read and manipulate GRIdded Binary (GRIB) weather data from NOAA/NCEP in order to answer their real-world questions.

After that, I'll be writing tutorials to teach techniques for data fusion--combining different datasets--for new insights.

I'll do that in tandem with curation of an old dataset made for a defense purpose, but with value to many fields.  This requires writing new documentation to introduce the dataset to a new audience of researchers in disciplines as disparate as computer vision/pattern recognition and wind energy.  (Introducing non-expert users to new-to-them data has to be done carefully because terminology varies between fields.  That deserves a post of its own.)

OK, this won't all get done in one night.   More later.

4 comments:

  1. "...investigating two reports of possible data corruption in one day took precedence over..."

    I see many parallels between what you describe as your job of the day and what I do. Maybe plumber is more accurate than janitor, since plumbers bring fresh water in and sewage out.

    1. Yes, but I'd like to earn as much as my plumber. ;-)

      I met with 2 other data managers today and we decided that data wrangling is our preferred term. We need terminology that reflects the high level of skill necessary to figure out what is "right" or "wrong" with the data.

  2. People are always surprised by the amount of behind-the-scenes work you have to do before you can do any real data mining! I've used the term data wrangling for gathering up all the disparate resources and linking them together, and data cleaning for fixing all the mistakes in the source data that make your mining tools crash...

    1. Yes, I hate to call it prep-work or pre-work, because it often is the bulk of the work. Backstage support is largely invisible, but the show would not go on without it.

      I wrote a follow-up.
      http://badmomgoodmom.blogspot.com/2014/08/data-archeology.html
