In 2005, “Enterprise Information Integration: A Pragmatic Approach” was released, describing a methodology for integrating data spread across disparate data sources by leveraging the leading technologies of the time: Service Oriented Architecture (SOA), Web Services, XML, the Resource Description Framework (RDF, an XML-based metadata syntax), and Extract, Transform & Load (ETL). EII solutions produced a close approximation of a unified view of related data elements, but lacked the performance characteristics required to replace data warehousing and multi-dimensional databases. Five years later we have seen significant technological shifts that not only counteract the earlier guidance against operating on decentralized data, but also simplify the aggregation of disparate data under a single container capable of delivering critical insights into that data.
The technologies that rapidly transformed the data management landscape were virtualization, low-cost storage, cloud computing, NoSQL databases, and Hadoop. Virtualization goes far beyond the concept of providing a software instance of a physical server; today we can virtualize the server, the storage, and the network. All this virtualization means that we are not locked into a physical configuration, but can rapidly restructure our physical environment to best support the processing needs we have at any given moment. When processing gigabytes, terabytes, and petabytes of data, the ability to orient the environment around those processing needs was key to beginning the move away from the structured data warehouse; we no longer needed a special environment dedicated to exploring just one aspect of the business.
Low-cost storage eliminates the need for businesses to be frugal in their use of data storage. When storage was expensive, businesses were forced to seek solutions that delivered critical analysis of their efforts from a limited amount of data. High costs forced the business to be very selective about which data was most important and limited how much of that data the system could process. The downside is that the business may have chosen poorly, may not have kept enough history to identify a critical pattern, or may have been dissuaded by cost altogether and left to identify patterns anecdotally.
Cloud computing provides a metered way to access the scale needed to operate on very large data sources and produce a result in a reasonable time frame. Processing big data requires two things: access to elastic storage and access to CPUs. High-speed networks help, but they are not mandatory, as we will see when we explore the role of software in big data. Elastic storage means that businesses are not restricted in the amount or types of data they can operate on, which mitigates the risk of not getting the best answers that came with the data warehousing approach. Access to more CPUs means that businesses can derive an answer in a time frame that fits their budget.
NoSQL databases provide a means of supporting very large scale data without the tinkering, tuning, and engineering required to do the same with traditional relational database technology. Moreover, most NoSQL databases are open source, meaning this can be done without the significant licensing overhead that would otherwise be required to operate on a couple of petabytes of data. NoSQL also offers tremendous flexibility in table structure, which does not need to be defined in advance. This flexibility is probably the single most important feature of NoSQL databases when it comes to facilitating a composite view of disparate data sources and providing an alternative to EII for doing so.
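As a rough illustration of this schema flexibility, the sketch below assumes a local MongoDB instance and its Java driver on the classpath; the database, collection, and field names are purely illustrative. It inserts two records with different shapes into the same collection with no table definition up front.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SchemalessExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("demo").getCollection("customers");

            // Two records from different source systems land in the same
            // collection even though their fields do not match -- no schema
            // definition or ALTER TABLE is required in advance.
            customers.insertOne(new Document("name", "Acme Corp")
                    .append("crmId", 42)
                    .append("region", "NA"));
            customers.insertOne(new Document("name", "Acme Corp")
                    .append("billingAccount", "ACME-001")
                    .append("invoices", java.util.List.of("INV-17", "INV-18")));
        }
    }
}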
NoSQL databases also typically offer built-in data redundancy and distributed data storage. One of the biggest issues in dealing with very large scale data is disk I/O. NoSQL mitigates many of these issues by distributing the data across a number of nodes. When a request is made to retrieve data, these nodes operate in parallel to retrieve the elements they own, so the request is not dependent upon one disk, one disk array, or one network connection. By reducing these I/O bottlenecks, data can be retrieved much more quickly.
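The following is a purely conceptual sketch, in plain Java, of the fan-out-and-collate behavior a distributed store provides internally: each simulated node owns a slice of the data and is read in parallel, so no single disk or connection becomes the bottleneck. The shard count and record contents are invented for illustration.

import java.util.*;
import java.util.concurrent.*;

public class ParallelRead {
    // Stand-in for a storage node returning only the records it owns.
    static List<String> readShard(int shard) {
        return List.of("record-" + shard + "-a", "record-" + shard + "-b");
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<List<String>>> partials = new ArrayList<>();
        for (int shard = 0; shard < 4; shard++) {
            final int s = shard;
            partials.add(pool.submit(() -> readShard(s)));  // each node works in parallel
        }
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : partials) {
            results.addAll(f.get());                        // collate the partial answers
        }
        pool.shutdown();
        System.out.println(results);
    }
}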
Finally, we have Hadoop, which combines the power of all the aforementioned technologies into a framework for dissecting and analyzing data. Some confuse Hadoop with being a NoSQL technology, when in fact Hadoop is a Java framework of distributed components that breaks down the job of “eating the elephant” (pun intended, as Hadoop is named after its creator’s son’s stuffed elephant) one bite at a time.
Hadoop itself is fairly independent of the data being processed. It provides the infrastructure for turning one big query job into multiple small query jobs that are processed in parallel; the results are then collated and aggregated to return an answer. Hence, Hadoop is a framework for parallelizing queries against NoSQL databases, leveraging nodes made available by cloud computing, which in turn runs on low-cost storage and virtualization.
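The canonical Hadoop word-count job below illustrates the pattern: mappers process their own slices of the input in parallel, and the reducer collates and aggregates the partial results into a final answer. It is shown here against plain files rather than a NoSQL store, with the input and output paths supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The "little jobs": each mapper processes its own slice of the input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The collate-and-aggregate step: partial counts are summed per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}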
Kickin’ Old School
When EII first emerged as a practice in the 2003-2004 time frame, a key element of the approach was to avoid having to move the data. Given that most data centers were still operating with sub-Gigabit networking and limited storage for duplicated data, this was understandable. Hence, EII was the right solution at that time given the state of the available technology and the problem domain. However, there are aspects of EII that remain apparent even in big data solutions.
One of these aspects is bringing the process to the data: a key architectural element of a big data solution is that the process moves to the data and not vice versa. With EII, a key tenet was to use the query facilities of the data where it resided. The recommended practice was to develop a Web Service in near network proximity to the data source that presented a common query interface to applications but translated requests into queries against the local database. In this way, the data was unlocked from its proprietary format through an open, Web-based interface. This practice enabled subsets of data to be aggregated quickly to provide the appearance of a single view.
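A minimal sketch of such a facade follows. The original EII guidance centered on SOAP Web Services; this version uses a simpler REST-style (JAX-RS) interface plus JDBC to show the same idea, and it assumes a JAX-RS runtime with a JSON provider on the classpath. The resource path, JDBC URL, credentials, and table are illustrative only.

import java.sql.*;
import java.util.*;
import javax.ws.rs.*;
import javax.ws.rs.core.MediaType;

// An EII-style facade deployed near the data source: it exposes a common
// query interface and translates each request into the local database's SQL.
@Path("/customers")
public class CustomerQueryService {

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public List<Map<String, Object>> byRegion(@QueryParam("region") String region)
            throws SQLException {
        String jdbcUrl = "jdbc:postgresql://crm-db:5432/crm";  // illustrative source system
        List<Map<String, Object>> rows = new ArrayList<>();
        try (Connection c = DriverManager.getConnection(jdbcUrl, "reader", "secret");
             PreparedStatement ps = c.prepareStatement(
                     "SELECT id, name, region FROM customer WHERE region = ?")) {
            ps.setString(1, region);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    Map<String, Object> row = new LinkedHashMap<>();
                    row.put("id", rs.getLong("id"));
                    row.put("name", rs.getString("name"));
                    row.put("region", rs.getString("region"));
                    rows.add(row);
                }
            }
        }
        return rows;  // serialized to JSON by the JAX-RS provider
    }
}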
With the advent of low-cost storage and 10G networking, we are less concerned with data redundancy and with moving data away from its source. However, this introduces its own set of problems. One of the issues with data warehousing is the lack of assurance that the data is pristine. With EII we pushed the concept of a “gold standard” for data: by using the data from its original source, you were assured that the information was accurate and had not been modified.
Since big data prescribes moving the data to a new physical location, the issue of assurance arises once again. Hence, EII practices for obtaining the baseline data are still relevant and important. In fact, the Web Services interfaces that were developed for EII may end up playing a major role in jumpstarting your big data initiatives.
Of course, no discussion of data management would be complete without covering security. EII still holds one advantage over big data solutions in this area. While big data solutions are technologically faster and more agile at delivering an integrated view of data, most lack any type of inherent security. The lack of security is often by design, since security adds processing overhead. However, it also means that it is more difficult to limit access to data that the source systems may have controlled implicitly. Because EII queries the data in the source system, it requires that the process making the query have the appropriate credentials or the query will fail.
Of note, the discussion above approaches the topic from the perspective of implicit security. It is very reasonable to integrate access control lists into the data sets and ensure that security is maintained explicitly as part of the query. However, anyone with the ability to query the NoSQL data source directly would have complete, unfettered access to all of your data.
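A conceptual sketch of that explicit, query-time approach is shown below in plain Java (Java 16+ for the record syntax): every record carries its own access control list and the query layer filters on the caller's group memberships. The record contents and group names are invented for illustration, and the weakness noted above remains: whoever can reach the store directly bypasses this filter entirely.

import java.util.*;
import java.util.stream.*;

public class AclFilteredQuery {

    // Each record carries the groups allowed to read it.
    record Item(String id, String payload, Set<String> acl) {}

    // The query layer enforces security explicitly before returning results.
    static List<Item> query(List<Item> store, Set<String> callerGroups) {
        return store.stream()
                .filter(item -> item.acl().stream().anyMatch(callerGroups::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Item> store = List.of(
                new Item("1", "public marketing data", Set.of("everyone")),
                new Item("2", "payroll detail", Set.of("hr", "finance")));

        // A caller in "everyone" sees only the first record; direct access
        // to the underlying store would expose both.
        System.out.println(query(store, Set.of("everyone")));
    }
}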
Summary
To quote the old Virginia Slims ads, “We’ve come a long way, baby!” The advances in technology discussed in this article have had a significant impact on the way we design data solutions in the second decade of the 2000s. Commoditization and miniaturization have once again removed the barriers that forced a scarcity mindset, allowing architects to focus on the problem itself rather than only on what was pragmatic and implementable. For pennies per hour per resource, we can instantiate a 10,000-node processing engine that can tear through a petabyte in seconds, unlocking the potential of all the data trapped therein.
With these new tools, we must revisit our thinking about how we manage our data moving forward. Why maintain silos of data that are not well integrated and that cost hundreds of thousands of dollars, if not millions, to integrate and mine for intelligence? Data management has been the bane of nearly every mid- and large-sized enterprise. It has been costly to store, manage, access, integrate, and query, but that no longer needs to be the case.
About the Author
JP Morgenthal is one of the world's foremost experts in IT strategy and cloud computing. He has over twenty-five years of experience applying technology solutions to complex business problems. JP has strong business acumen complemented by technical depth and breadth. He is a respected author on topics of integration, software development, and cloud computing, is a contributor to the forthcoming "Cloud Computing: Assessing the Risks", and is the Lead Cloud Computing editor for InfoQ.