Netflix has been open and unapologetic about publicizing its strategy for extracting value from all the event data at its fingertips, courtesy of its streaming service’s user base. Its data miners have been taking things far beyond the level of Amazon’s ‘smart’ book recommendations. This new drama, premiering tonight, has been statistically vetted regarding timeslot, programming, script, and even cast, with heavy input from their S3/Hadoop cloud (they use AWS for their platform) of user preference and activity data. Netflix execs bid heavily and won rights to the program, two seasons’ worth, pilots be damned, because of the confidence they have in their statistical analyses.
From a technical standpoint, this conference presentation by Kurt Brown, Data Science director at Netflix, gives an interesting look at how their big data internals evolved from about 2008 to the present. Oracle’s role in their architecture is strictly that of a backend feeder, passing dimensional data up the channel to where the analyses happen. They have a multi-tool environment consisting of Teradata, Cassandra, S3, Hadoop, Hive, Pig, and various homegrown metatools. Some highlights of their infrastructure design: (1) the core of everything is their S3 repository, which is replenished by backend data warehouses and also grows with the results they choose to keep from frequent mining explorations; and (2) they run dual Hadoop clusters differentiated by SLA demands. One of the Hadoop clusters is earmarked for business-critical batch jobs, and it receives an infusion of spare nodes overnight when development work dies down. The other cluster is for “cowboy” work: less regulated, and used for spur-of-the-moment queries.
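To make the dual-cluster idea concrete, here is a minimal Python sketch of how jobs might be routed by SLA and how nodes might shift between clusters overnight. Everything in it is an assumption for illustration: the cluster names, the node counts, the night-hour window, and the `Job`, `route`, and `node_split` names are all hypothetical and do not come from the presentation or from any Netflix tooling.

```python
from dataclasses import dataclass

# Hypothetical numbers and names throughout; none come from the presentation.
DAY_SPLIT = {"production": 40, "adhoc": 60}    # assumed daytime node counts
NIGHT_SPLIT = {"production": 80, "adhoc": 20}  # assumed overnight node counts
NIGHT_HOURS = set(range(0, 6)) | {22, 23}      # assumed quiet window, 22:00-05:59


@dataclass
class Job:
    name: str
    business_critical: bool  # True for SLA-bound batch work


def route(job: Job) -> str:
    """Send SLA-bound batch jobs to one cluster, exploratory queries to the other."""
    return "production" if job.business_critical else "adhoc"


def node_split(hour: int) -> dict:
    """Return the assumed node allocation per cluster for a given hour (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    # Overnight, spare nodes move from the ad hoc cluster to the batch cluster.
    return dict(NIGHT_SPLIT if hour in NIGHT_HOURS else DAY_SPLIT)


if __name__ == "__main__":
    print(route(Job("nightly_rollup", business_critical=True)))
    print(route(Job("quick_hunch_query", business_critical=False)))
    print(node_split(14), node_split(3))
```

The point of the sketch is only the shape of the design: one routing decision based on SLA, and one time-based rebalancing rule. A real deployment would resize clusters through its own provisioning tools rather than a lookup table.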