Hearing the Oracle

Will the real Data Scientist please stand up?


guess my bi job

When fresh buzz is still shaking out around a trend in IT, as it is now, job descriptions, resumes, and internal hot-project hype can distort to the max. Roles and expectations are well solidified for something like an operational grid DBA, but during this wild and woolly shakeout period for dataists, everyone has their own shtick. Bloggers, commentators, and analytics shops have all stepped up to the plate to disambiguate this lively mess. Here are a few of my favorite takes.

“Analyzing the Analyzers” is a free e-book available from O’Reilly Press which takes the tack of surveying working data analysts themselves. The sample size is not huge, as might be expected, but some worthy insights emerge from their in-the-trenches point of view. Both graphics below are from this effort. The first attempts to abstract a typology of four basic dataists: businesspeople, creatives, developers, and researchers. These are not cleanly disjoint categories; skillsets overlap, and different organizations might assemble teams from differing ingredients, depending on their size and on what resources are already around.

data scientists self-eval; source: O'Reilly Press

how big is your scientist's data? source: O'Reilly Press

One further tidbit of note from this study is that not many with this particular job description were actually working within the petabyte range of data volumes. Gigabytes and terabytes were commonly cited, which tends to drive a wedge between Big Data per se and practicing data scientists. Does this argue for dataists getting involved only at the analytics stage, after the volumes have been reduced a bit? Yes and no. Management consultant Nick Kolegraff thinks the schism is a natural one, and draws attention to the difference between one-off data projects and building analytic products. But his commenter Bill Shannon, a biostatistician, dislikes not having data scientist involvement earlier on, during the ETL and research stage (see the first comment at the bottom of Kolegraff’s blog post).

Jumping into the Fray

Favoring the jack-of-all-trades generalist sentiment:

“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
    ~Josh Wills (@Cloudera)

Where might you and your skillset fit into this mosaic? Mathbabe has some interesting material from a youthful, academic perspective, including a journal of a recent one-semester course in Data Science offered at Columbia by ex-Googler Rachel Schutt. Over at Stats with Cats, Charlie Kufs offers a discussion with guidelines on how to evaluate your own suitability as a data scientist. He’s coming from a mature perspective, with over 30 years of experience as a statistician. He proposes his own typology of sorts, classifying practitioners as either organizers or analyzers, and as either generalists or specialists in their methodology.

In conclusion, here are some observations and advice from Josh Wills (courtesy of Mathbabe), who guest-lectured during the Columbia class. Note that he also comes from the search giant’s culture; for better or for worse, his slant might be Google-centric.

About the job of a data scientist:

         • I spend all my time doing data cleaning and preparation. 90% of the work is data engineering.
         • On solving problems vs. finding insights: I don’t find insights, I solve problems.
         • Start with problems, and make sure you have something to optimize against.
         • Parallelize everything you do.
         • It’s good to be smart, but being able to learn fast is even better.
         • We run experiments quickly to learn quickly.

