LONDON: Google formally took the wraps off Cloud Dataflow, the hosted offering designed to allow developers with average Java and Python skills to build sophisticated analytic “pipelines” that process huge amounts of data.
Google introduced Cloud Dataflow about a year ago as a next-generation platform for building systems that can ingest, transform, normalize, and analyze huge amounts of data, well into the exabyte range, Google executives said. The service targets the same kinds of workloads as Hadoop and Spark, but under the covers it relies heavily on Google’s FlumeJava and MillWheel technologies to move data through Google’s hosted infrastructure, and there’s not a trace of classic MapReduce.
The idea behind Dataflow is simple: by concealing the underlying complexity of a big data setup behind straightforward SDKs and APIs, and by offloading the infrastructure to Google’s cloud, big data analytics can be put within reach of mere mortals rather than confined to the realm of data scientist superheroes with million-dollar pedigrees.
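To give a sense of what that looks like in practice, here is a minimal word-count sketch in the style of the Cloud Dataflow Java SDK as published at launch (the com.google.cloud.dataflow.sdk packages). The bucket paths are placeholders, so treat this as illustrative rather than canonical:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class WordCount {
  public static void main(String[] args) {
    // Runner, project, and staging location come from the command line;
    // run locally by default, or on the hosted service when so configured.
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(TextIO.Read.from("gs://example-bucket/input/*.txt"))   // ingest
     .apply(ParDo.of(new DoFn<String, String>() {                  // transform
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) {
             c.output(word.toLowerCase());                         // normalize
           }
         }
       }
     }))
     .apply(Count.<String>perElement())                            // analyze
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://example-bucket/output/counts"));

    p.run();  // the service decides how to parallelize and scale the job
  }
}
```

The developer describes only the shape of the pipeline; provisioning, parallelization, and scaling are left to the service, which is the complexity Dataflow is meant to hide.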
“Up until now, big data has not been accessible to your average developer or your average analyst in an enterprise because it’s just too hard,” says Tom Kershaw, Google’s director of product management. “Big data is not easy to do. Spinning up a Hadoop cluster and programming MapReduce jobs and all of those things are tasks that only data scientists have been able to do.”