Wednesday, July 29, 2015

Taming The Big Data Beast

The future of big data has to do with the “Big Variety” problem, or what Stonebraker likes to call “the 800-pound gorilla in the corner.” The problem dates back to the emergence of data warehouses in the 1990s and the need to “clean” and integrate data coming from a number of sources, making sure it conforms to a global data dictionary (e.g., “salary” in one data source is a synonym for “wages” in another). The process of data integration invented then, Extract-Transform-Load (ETL), is still used today. But it doesn’t scale, argues Stonebraker: it fails when you try to integrate data from thousands of sources, increasingly a “business as usual” reality for enterprises trying to tap the abundance of public sources now available on the Web, to say nothing of what’s to come with the Internet of Things and whatever new data-generating technologies have yet to emerge.
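To make the scaling problem concrete, here is a minimal sketch (with invented schemas and attribute names) of the kind of hand-coded mapping classic ETL relies on. Every new source needs its own hand-written rules conforming to the global dictionary, which is manageable for ten sources and hopeless for thousands:

```python
# Minimal sketch of hand-coded ETL mapping (hypothetical schemas).
# Each new source needs its own rules; with thousands of sources,
# building and maintaining this table by hand stops scaling.

# Hand-written per-source mappings ("salary" vs. "wages", etc.)
# to the global data dictionary's canonical terms.
SOURCE_MAPPINGS = {
    "hr_system":      {"emp_id": "employee_id", "full_name": "name", "salary": "salary"},
    "payroll_system": {"id": "employee_id", "employee": "name", "wages": "salary"},
}

def transform(source: str, record: dict) -> dict:
    """Rename one record's attributes to the global dictionary's terms."""
    mapping = SOURCE_MAPPINGS[source]  # raises KeyError for any unmapped source
    return {mapping[k]: v for k, v in record.items() if k in mapping}

print(transform("payroll_system", {"id": 7, "employee": "Ada", "wages": 95000}))
# -> {'employee_id': 7, 'name': 'Ada', 'salary': 95000}
```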

“The trouble with doing global upfront data models is that no one has figured out how to make them work,” says Stonebraker. “The only thing you can do is put the data together after the fact.” The solution is a mix of automated machine learning and the crowdsourcing of domain experts; the resulting startup, Tamr, was launched in 2013.
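As a hedged illustration of what putting data together “after the fact” might look like: below, simple string similarity stands in for the statistical matchers a system like Tamr would actually use, and the attribute names and synonym list are my inventions. The point is that matches are proposed from the data itself rather than dictated by an upfront global model:

```python
# Sketch of proposing attribute matches "after the fact" instead of
# hand-coding them upfront. difflib's similarity is a stand-in for
# real statistical matchers; all names here are illustrative.
from difflib import SequenceMatcher

CANONICAL = ["salary", "name", "employee_id"]

# Synonyms learned from previously confirmed matches let the matcher
# know that "wages" means "salary" even though the strings differ.
CONFIRMED_SYNONYMS = {("wages", "salary"), ("comp", "salary")}

def score(candidate: str, canonical: str) -> float:
    """Confidence that a source attribute matches a canonical one."""
    if (candidate.lower(), canonical.lower()) in CONFIRMED_SYNONYMS:
        return 1.0
    return SequenceMatcher(None, candidate.lower(), canonical.lower()).ratio()

for attr in ["emp_salary", "wages", "full_name"]:
    best = max(CANONICAL, key=lambda c: score(attr, c))
    print(f"{attr!r} -> {best!r} (confidence {score(attr, best):.2f})")
```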

Despite businesses being told for years to abolish their data silos in order to thrive, Turing Award winner Stonebraker says that we must instead preserve them. He and his collaborators came up with a way to mask the data silos, superimposing on them a layer of software that adapts to the constantly changing semantic environment of the organization, based on a human-computer collaborative process.
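That collaborative process might be sketched as a triage step (the thresholds and names below are assumptions for illustration, not Tamr’s actual design): the machine keeps the decisions it is confident about and routes the ambiguous middle to domain experts:

```python
# Illustrative human-computer collaboration: auto-accept confident
# matches, queue ambiguous ones for a domain expert, drop the rest.
# Thresholds are assumptions to be tuned per deployment.

AUTO_ACCEPT = 0.90
AUTO_REJECT = 0.40

def triage(proposed):
    """Split (source_attr, canonical_attr, confidence) proposals."""
    accepted, expert_queue = [], []
    for src, canon, conf in proposed:
        if conf >= AUTO_ACCEPT:
            accepted.append((src, canon))            # machine decides
        elif conf > AUTO_REJECT:
            expert_queue.append((src, canon, conf))  # ask a human
        # below AUTO_REJECT: discarded
    return accepted, expert_queue

accepted, queue = triage([
    ("wages", "salary", 0.97),
    ("comp_total", "salary", 0.55),  # borderline: goes to an expert
    ("notes", "salary", 0.12),       # clearly unrelated: dropped
])
print(accepted)  # [('wages', 'salary')]
print(queue)     # [('comp_total', 'salary', 0.55)]
```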

Preserving data silos is also about the future of the business. “Agility is going to be crucial to successful enterprises,” says Stonebraker, “and I don’t see a way to do that realistically without decomposing into independent business units. The minute you do that you either anoint a Chief Data Officer to keep everybody from diverging or you say, ‘look, run as fast as you can.’ I would err on the side of agility rather than standardization.”

Tamr is his “fourth attempt at doing data integration,” Stonebraker says, “and I think we finally got it right.” Not getting it right comes with the territory when you make things happen. All of Stonebraker’s solutions involve some innovative take on known technological tradeoffs, balanced against the cost of not just the technology but also the people and processes around it. A solution may get one or more of those components wrong, or incur unforeseen costs when it is implemented in the real world. Or the timing could be off. “If you are too late, you are toast; if you are too early, you are toast,” says Stonebraker. “There’s a lot of serendipity involved. You have to guess the market and lead it.”

Is there a future beyond the future defined by the three startups Stonebraker is currently involved with? “Right now I’m not interested in starting any more companies,” Stonebraker says flatly. But then he adds: “If I had more bandwidth, it would be what I’m working on at MIT right now, what we call Polystores.”

Again, this is an age-old problem, tackled before with the not-too-successful concept of “federated” databases. Today, it’s an extension and expansion (in my opinion) of the Big Variety problem, what happens after the data has been “curated” (cleaned and integrated). Following his strong convictions about the advantages of special-purpose databases and given the proliferation of not just sources of data but also data types, Stonebraker suggests that “it makes sense to load the curated data into multiple DBMSs. For example, the structured data into an RDBMS, the real-time data into a stream processing engine, the historical archive into an array engine, the text into Lucene, and the semi-structured data into a JSON system.”
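As a rough illustration of that division of labor (the backend names below are placeholders, not a real deployment; an actual system would call concrete client libraries such as a Postgres driver or a Lucene/Elasticsearch client):

```python
# Rough illustration of routing curated records to special-purpose
# engines along the lines Stonebraker describes. Backend names are
# placeholders standing in for real client libraries.

ROUTES = {
    "structured":      "rdbms",          # relational tables
    "realtime":        "stream_engine",  # stream processing
    "historical":      "array_engine",   # archival/array DBMS
    "text":            "lucene",         # full-text search
    "semi_structured": "json_store",     # JSON documents
}

def route(record: dict) -> str:
    """Pick the backend for one curated record based on its data kind."""
    return ROUTES[record["kind"]]

print(route({"kind": "text", "body": "quarterly report ..."}))  # lucene
print(route({"kind": "realtime", "tick": ("AAPL", 125.3)}))     # stream_engine
```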

Big Data’s future is really about simplifying the applications deployed over multiple, special-purpose database engines. “If your application is managing what you want to think of as a single database which is in fact spread over multiple engines,” says Stonebraker, “with different data models, different transaction systems, different everything, then you want a next-generation federation mechanism to make it as simple as possible to program.”
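A toy sketch of what such a federation layer could look like to the application (my invention, not the actual API of MIT’s polystore work): one logical query surface that dispatches each sub-query to whichever engine owns the relevant data model:

```python
# Toy federation facade (illustrative only): the application talks to
# one logical database; the facade routes each sub-query to the engine
# that owns the relevant data model.

class FakeEngine:
    """Stand-in for a real DBMS client (relational, text, stream, ...)."""
    def __init__(self, name: str):
        self.name = name

    def execute(self, query: str) -> str:
        return f"[{self.name}] ran: {query}"

class Polystore:
    def __init__(self):
        self.engines = {}  # data model -> engine adapter

    def register(self, model: str, engine: FakeEngine) -> None:
        self.engines[model] = engine

    def query(self, model: str, q: str) -> str:
        """Route a sub-query to the engine holding that data model."""
        return self.engines[model].execute(q)

db = Polystore()
db.register("relational", FakeEngine("rdbms"))
db.register("text", FakeEngine("lucene"))
print(db.query("relational", "SELECT AVG(salary) FROM staff"))
print(db.query("text", "body:gorilla"))
```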

With Stonebraker’s “make it happen” attitude, we may just be one step closer to turning Big Data into Smart Data.
