“Analyzing the Analyzers” is a free e-book from O’Reilly Media that takes the tack of surveying working data analysts themselves. As might be expected, the sample size is not huge, but some worthwhile insights emerge from the respondents’ in-the-trenches point of view. Both graphics below are from this effort. The first attempts to abstract a typology of four basic dataists: businesspeople, creatives, developers, and researchers. These are not cleanly disjoint categories; skillsets overlap, and different organizations might assemble teams from differing ingredients, depending on their size and on what resources are already at hand.
One further tidbit of note from this study is that few people with this particular job description were actually working with petabyte-scale data. Gigabytes and terabytes were the commonly cited volumes, which tends to drive a wedge between Big Data per se and practicing data scientists. Does this argue for dataists getting involved only at the analytics stage, after the volumes have been reduced? Yes and no. Management consultant Nick Kolegraff thinks the schism is a natural one, and draws attention to the difference between one-off data projects and building analytic products. But one of his commenters, Bill Shannon, a biostatistician, dislikes leaving data scientists out of the earlier ETL and research stages (see the first comment at the bottom of Kolegraff’s blog post).
Jumping into the Fray
Favoring the jack-of-all-trades generalist sentiment:
“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
~Josh Wills (@Cloudera)
Where might you and your skillset fit into this mosaic? Mathbabe has some interesting material from a youthful, academic perspective, including a journal of a recent one-semester course in Data Science offered at Columbia by ex-Googler Rachel Schutt. Over at Stats with Cats, Charlie Kufs offers guidelines for evaluating your own suitability as a data scientist. He writes from a mature perspective, with over 30 years of experience as a statistician. He proposes his own typology of sorts, classifying practitioners as either organizers or analyzers, and as either generalists or specialists in their methodology.
In conclusion, here are some observations and advice from Josh Wills (courtesy of Mathbabe), who guest-lectured during the Columbia class. Note that he also comes from the search giant’s culture; for better or for worse, his slant might be Google-centric.
About the job of a data scientist:
• I spend all my time doing data cleaning and preparation. 90% of the work is data engineering.
• On solving problems vs. finding insights: I don’t find insights, I solve problems.
• Start with problems, and make sure you have something to optimize against.
• Parallelize everything you do.
• It’s good to be smart, but being able to learn fast is even better.
• We run experiments quickly to learn quickly.
~RS