Introduction
We already know that business intelligence (BI) can bring many benefits to an organization. Through consolidating, aggregating, and analyzing data, BI can provide many insights into what is currently happening, as well as what is going to happen within the organization. BI allows for identifying trends of where an organization is going or should be going. The road to BI usually starts with extract, transform, and load (ETL). ETL is, generally speaking, a process in data warehousing that involves:
- Extracting data from external and internal data sources.
- Transforming the data to fit business needs.
- Loading the transformed data into the data warehouse (or data mart).
Basically, for us to achieve BI Nirvana, all we need is "just" one input: data. BI needs the data that is hidden within the organization's systems.
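To make the three ETL steps concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the source rows, the `fact_orders` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical raw rows, standing in for data pulled from an operational system.
SOURCE_ORDERS = [
    {"order_id": 1, "region": "emea", "amount": "120.50"},
    {"order_id": 2, "region": "EMEA", "amount": "80.00"},
    {"order_id": 3, "region": "apac", "amount": "42.25"},
]


def extract():
    """Extract: read the raw records from the source (here, an in-memory list)."""
    return list(SOURCE_ORDERS)


def transform(rows):
    """Transform: normalize region codes and convert amounts to numbers."""
    return [(r["order_id"], r["region"].upper(), float(r["amount"])) for r in rows]


def load(rows, conn):
    """Load: write the cleaned rows into a (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three steps against an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    return conn


def totals_by_region(conn):
    """A typical BI-style aggregate over the loaded data."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM fact_orders GROUP BY region"
    ).fetchall()
    return dict(rows)
```

Note that the transform step depends on knowing the shape of the source rows. That intimate knowledge of the data is exactly what becomes a problem once the data sits inside services.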
Enter SOA
In the last few years, we have seen the advance of service-oriented architecture (SOA) to the forefront of IT architectures. As the hype begins to clear and organizations make the transition to SOA, the data that BI requires is suddenly scattered between multiple services and hidden behind contracts.
Looking at the SOA components in Figure 1 (taken from my paper What Is SOA, Anyway?), we can see that apart from the obvious component—the service—SOA has several other components that are related to the interface of the service:
- Contract that the service implements
- Endpoints, where the service can be contacted
- Messages that are moved back and forth between the service and its consumers
- Policies to which the service adheres
- Consumers that interact with the service
This, along with SOA tenets like "share schema, not data" and "services should be autonomous," tells us SOA really cares about its interfaces. This emphasis on communication through rigorously defined interfaces is exactly what brings technical and business advantages of loose coupling, flexibility, and agility to SOA.
Figure 1. SOA components and their relations
There is a real impedance mismatch here, with BI pulling in one direction of intimate understanding of the data and SOA pulling in the other direction of isolating internal data behind interfaces. As Pat Helland explains in Data on the Outside vs. Data on the Inside, the service's internal data should never be exposed outside of the service, yet this is the very data that BI wants. Thus, when we think about this, it is not very surprising that a recent survey by Ventana Research (published by Dan Everett in Dr. Dobb's Journal) shows that only one-third of respondents reported that they believe their internal IT personnel to have the knowledge and skills to implement BI services.
It seems that there are two options: either to go directly at the data and invalidate some of our SOA principles (like "share schema, not data") or to try to make do with the contracts that we have in place and hope that we will have enough data for BI. (A third option, which I will discuss later and is equivalent to the first option, is to create contracts specifically for the BI needs.)
Get That Data or Else...
The first option is to get the data that BI needs by using the same ETL processes that have proven themselves in the past.
SOA presents a little challenge to ETL, as you have to integrate data from many dispersed locations (services). However, converging data from multiple resources is not a new problem for either BI or ETL. Large enterprises already have a lot of data sources: ERP, CRM, all of those departmental data silos, and whatnot. ETL with SOA might even be easier, considering that SOA promises that the enterprise data would be woven into a cohesive fabric and not some point-to-point integration spaghetti. ("Promises" being the operative word here; actually achieving a cohesive fabric of services is not an easy feat. But that is a topic for another article—or book, for that matter.)
As I mentioned earlier, ETL is mature and has a proven record as a basis for building successful BI solutions. However, using ETL basically negates most of the benefits that made us pursue SOA in the first place. One of the main problems in the pre-SOA era (which is still the reality in many organizations) is what is known as integration spaghetti. Consider the situation in Figure 2. Historically, each department built its own systems as new business requirements emerged, and the result was isolated, or stovepipe, systems. As people used these systems, they found that they needed information from other systems, and new point-to-point interfaces were added to meet those integration needs. Figure 2 shows four types of point-to-point integrations: ETL (extract, transform, load), which is a DB-to-DB relationship; online and file-based, both of which are application-to-application relationships; and direct connection to a DB, which is an application-to-database relationship. Note that this is not an exhaustive list; there are additional relation types, such as replication, message-based, and others that are not expressed in Figure 2.
The end result is a spaghetti of systems. Making changes in one system has ripple effects, with results that are unpredictable. The SOA emphasis on general interfaces and autonomy aims to solve these problems.
Figure 2. Typical enterprise-systems integration spaghetti
Adding ETL as a direct pipeline into the services' data just adds a new point-to-point interface—cracking the SOA "interface armor" and introducing a dependency between the BI and the service. It also opens the door for other workarounds. (If we can do that for BI, why not do the same for other applications, services, or systems?)
A variation on doing ETL can be to replicate the SOA data into an external database, and then do ETL on that data. However, it is exactly the same as using ETL on the service's database, as we are still bypassing the contract and we are still coupled to the structure of the internal data.
Okay, so using ETL is probably not the best option. So, let's try to see if the second option of building on the SOA principles by using contracts will fare better at integrating BI and SOA.
Pulling SOA Data (Request/Reply)
The simplest solution for integrating SOA and BI is not to do anything specific for the BI processes. Instead, what if we use the existing contracts—those that were drafted as part of the SOA initiative? To be able to fulfill our BI needs, we would need to poll the services' interfaces on a regular basis, so that we can get trend and historic data.
There are basically two problems with this approach. One is the problem of network bandwidth. Polling each of the services that we need transfers a lot of data on the wire. To solve this problem, we might want to increase the interval in which we poll the services. However, in doing so, we hit the second problem: We run the risk of missing important events that occur during the interval. This is analogous to looking at the sky in the morning and the afternoon, and completely missing a solar eclipse that happened at noon. Thus, unfortunately, this does not look like a very promising direction, although it is probably better than nothing.
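The trade-off can be illustrated with a small simulation. This is only a sketch under invented assumptions: the timeline, the state names, and the function names are hypothetical, and each poll is taken to cost one round trip on the wire.

```python
def state_at(t, changes, initial):
    """Return the service state at time t, given (time, new_state) pairs sorted by time."""
    state = initial
    for when, new_state in changes:
        if when <= t:
            state = new_state
    return state


def poll(changes, initial, interval, end):
    """Sample the service state every `interval` time units, from 0 through `end`."""
    return [state_at(t, changes, initial) for t in range(0, end + 1, interval)]


# Hypothetical timeline: the service is "on_hold" only between t=11 and t=13.
CHANGES = [(11, "on_hold"), (13, "open")]

# Polling every time unit sees the change, at the cost of 25 round trips.
frequent = poll(CHANGES, "open", interval=1, end=24)

# Polling every 8 time units costs only 4 round trips, but samples t=0, 8, 16, 24
# and therefore never observes the "on_hold" state at all.
sparse = poll(CHANGES, "open", interval=8, end=24)
```

Here `frequent` contains the "on_hold" state twice (redundantly), while `sparse` contains it not at all. That is the eclipse at noon.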
Another option for using SOA contracts is to build a specific contract that would serve the BI needs; that is, the contract will enable retrieving data from the internal structures of the service so that the BI can use it. However, that is pretty much the same as using standard ETL; you still create a point-to-point integration and a precedent of a specific contract for a specific use.
The situation thus far does not look very promising. We find that we are between a rock and a hard place; if we are pulling SOA data, we must either invalidate some of SOA's principles and benefits or forget about a good BI solution.
But, hey, wait: Maybe there is a third way, after all...
Making an SOA Mind Shift: Moving to a Push Model
The third option is based on taking SOA forward, beyond the simple request/reply that we are used to thinking about, and combining SOA with another architectural style that is called event-driven architecture (EDA).
In a nutshell, EDA is, like SOA, an architectural style; unlike the request/reply style that is typical of SOA, it is built on the push model. EDA components publish events. In the logical sense, an event is any significant change in the component that publishes the event. The change can be a result of proper conduct, such as an order that has been processed; it can be a fault, such as a database that is down; a threshold that was crossed, such as the millionth customer making a purchase; or anything else that seems important. In the physical sense, events are messages with a header describing the metadata of the event and the body containing the content.
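As a rough illustration of the physical shape of an event, here is a minimal sketch of a header-plus-body message. The field names (`event_type`, `source`, `event_id`, `occurred_at`) and the JSON serialization are my own assumptions for the example; no particular EDA product or standard prescribes them.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class EventHeader:
    """Metadata describing the event (hypothetical field names)."""
    event_type: str
    source: str
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class Event:
    """An event message: a header with metadata and a body with the content."""
    header: EventHeader
    body: dict

    def to_json(self) -> str:
        """Serialize the event so that it can travel as a message."""
        return json.dumps({"header": asdict(self.header), "body": self.body})


# An order that has been processed, published by a hypothetical Orders component.
order_processed = Event(
    header=EventHeader(event_type="OrderProcessed", source="Orders"),
    body={"order_id": 42, "total": 1250.0},
)
```

A fault or a crossed threshold would use the same envelope with a different `event_type` and body.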
As soon as they are produced, events ripple through to subscribing components. After processing the event, these components can also produce new events, and so on. For example, in an airline scenario, an event can be a notice that a flight is delayed. This event can trigger another component that is responsible for connecting flights, to try to find alternate flights for the passengers arriving on the delayed flight. (Yeah, right; as if we will see that ever happening.) A unique characteristic of EDA versus other push technologies is its notions of event stream processing (ESP) and complex event processing (CEP). Instead of treating the events as isolated occurrences, we look at them as a chain of related events. Looking at an event chain—and, even more so, at a combination of several event chains (event cloud)—allows retrospective analysis over time, as well as other advanced analysis of event patterns.
EDA can be used independently of SOA; but fusing them together can be very beneficial.
SOA Meets EDA
What if we add publication messages into the contract? By "publication messages," I mean that the service will publish its state either in a periodic manner or per event to anyone who might be listening. I like to call this service-communication pattern "inversion of communications," because it reverses the request/reply communication style that is the common case for SOA. While it might look as if we get a network load similar to that of polling the services, the network load is actually much lower. Using inversion of communications, each interested service consumer gets an event only once, at most, while a polling consumer would get the same state multiple times (or miss out on data).
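A toy, in-process sketch of inversion of communications might look like the following. The `EventBus` class, the `orders.processed` topic name, and the `OrdersService` class are all hypothetical stand-ins; in a real system the bus would be a network-level publish/subscribe channel.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process stand-in for a publish/subscribe channel."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a callable that receives every event published on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        """Push the event once to each current subscriber of the topic."""
        for handler in list(self._subscribers[topic]):
            handler(event)


class OrdersService:
    """Hypothetical service whose contract includes a publication message."""

    def __init__(self, bus):
        self._bus = bus
        self._orders = {}  # internal data stays inside the service

    def process_order(self, order_id, total):
        self._orders[order_id] = total
        # Publish the state change instead of waiting to be asked about it.
        self._bus.publish("orders.processed", {"order_id": order_id, "total": total})


bus = EventBus()
bi_inbox, warehouse_inbox = [], []
bus.subscribe("orders.processed", bi_inbox.append)
bus.subscribe("orders.processed", warehouse_inbox.append)

service = OrdersService(bus)
service.process_order(1, 250.0)
service.process_order(2, 120000.0)
```

Each consumer receives each order exactly once, and the service knows nothing about who the consumers are or what they do with the events.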
To make the solution complete, you can add additional request/reply or request/reaction messages to allow service consumers to retrieve initial snapshots. Following this approach, you get an event stream of the changes within the service in a manner that is not specific for the BI. In fact, having other services react to the event stream can increase the overall loose coupling in the system; for instance, it can allow caching the state of other services and ease the temporal coupling between services. Additionally, adding EDA to SOA can serve as the basis for solving the reporting problem of SOA, by implementing the aggregated-reporting pattern (early draft).
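The combination of an initial snapshot and a subsequent event stream can be sketched as a small cache in the consuming service. This is a sketch under assumptions of my own: the event types, the field names, and in particular the per-event `version` number used to discard events that the snapshot already reflects are hypothetical.

```python
class CustomerCache:
    """Local copy of another service's state, kept current by its event stream."""

    def __init__(self, snapshot, version):
        # `snapshot` comes from a one-time request/reply call; `version` is the
        # (hypothetical) sequence number of the last change included in it.
        self._data = dict(snapshot)
        self._version = version

    def apply(self, event):
        """Apply a change event; return False for events the cache already reflects."""
        if event["version"] <= self._version:
            return False
        if event["type"] == "CustomerChanged":
            self._data[event["id"]] = event["name"]
        elif event["type"] == "CustomerRemoved":
            self._data.pop(event["id"], None)
        self._version = event["version"]
        return True

    def get(self, customer_id):
        """Answer locally, without calling the owning service."""
        return self._data.get(customer_id)


cache = CustomerCache({1: "Acme"}, version=5)
stale = cache.apply({"version": 5, "type": "CustomerChanged", "id": 1, "name": "Old"})
cache.apply({"version": 6, "type": "CustomerChanged", "id": 2, "name": "Globex"})
cache.apply({"version": 7, "type": "CustomerRemoved", "id": 1})
```

Because lookups are answered from the local copy, the consuming service keeps working even when the owning service is temporarily unavailable.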
EDA on SOA solves the BI problem; as soon as you have event streams on the network, the BI components can grab that data, scrub it as much as they like, and push it to their data marts and data warehouses. However, event streams can also enhance the BI itself by enabling much more complex and interesting analysis of real-time events and real-time trend data, using complex event-processing (CEP) tools to get real-time business-activity monitoring (BAM). What would event processing look like? Imagine that you have an Orders service that publishes an event with an XML description of every order that it processes—something like Listing 1.
OrderDate="2007-04-02T00:00:00"
DueDate="2007-05-15T00:00:00">
Listing 1. Excerpt from an Order summary XML
We can then use ESP or CEP tools to monitor this stream and continuously extract interesting events for further analysis or further actions. For example, Listing 2 shows a query on such an order stream to find orders that are larger than $100,000. Note that while the query looks suspiciously like SQL (from which it was derived), it is also quite different; the query continuously runs on a non-persistent stream of events.
INSERT INTO LargeOrders
SELECT
    orderid AS orderid,
    SUM(Ords.price * Ords.qty) AS TotalValue
FROM
    OrdersStream AS Ords XMLTable (val
        ROWS '//OrderLine'
        COLUMNS
            Orderid AS orderid,
            TO_FLOAT (XMLExtractValue ('@Price')) AS price,
            TO_FLOAT (XMLExtractValue ('@Quantity')) AS qty )
WHERE
    TotalValue > 100000;
Listing 2. A Coral8 Continuous Computation Language (CCL) query to find orders larger than $100,000 in an order stream and insert them into a LargeOrders table
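For readers without a CEP tool at hand, the intent of the query in Listing 2 can be approximated in plain Python. This is not CCL and makes no claim about how Coral8 evaluates the query; it is a sketch that assumes each event is an XML document with `OrderLine` elements carrying `Price` and `Quantity` attributes (names taken from Listing 2). The `Order` root element, the `OrderId` attribute, and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def order_total(order_xml):
    """Return (order id, total value) for one order event."""
    root = ET.fromstring(order_xml)
    total = sum(
        float(line.get("Price")) * float(line.get("Quantity"))
        for line in root.iter("OrderLine")
    )
    return root.get("OrderId"), total


def large_orders(stream, threshold=100_000):
    """Lazily filter an iterable of order events, yielding only the large ones."""
    for order_xml in stream:
        order_id, total = order_total(order_xml)
        if total > threshold:
            yield order_id, total


# Hypothetical events; a real stream would be unbounded and arrive over the network.
SAMPLE_STREAM = [
    '<Order OrderId="A1"><OrderLine Price="60000" Quantity="2"/></Order>',
    '<Order OrderId="A2"><OrderLine Price="10" Quantity="3"/>'
    '<OrderLine Price="5" Quantity="1"/></Order>',
]

found = list(large_orders(SAMPLE_STREAM))
```

Because `large_orders` is a generator, it processes events one at a time as they arrive instead of querying stored data, which is the essential difference from SQL that the text points out.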
The road to mainstream CEP tools is still long, but there are several vendors working on solutions. Even if we do not use CEP, we can still gain a lot of benefit from receiving these events. For example, a service that manages the stocks in the warehouse can listen in on the Order service's orders-processed stream and then take care of ordering new stocks, securing available items, and so on.
When we build our BI with EDA on SOA, we essentially create the BI as a mash-up of services. We can take that even further and have the BI component itself expose its trend data and other analysis results as a service. We can then consume that data and use it in other applications. For instance, if the CEP query in Listing 2 generates an event every time that an order exceeds $100,000, we can present a nice dashboard on the CEO's portal that shows in real time how many large orders the organization processes per hour, per day, and so on, along with a few other meaningful gauges.
Figure 3. Displaying the BI as a mash-up
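The per-hour gauge on such a dashboard boils down to bucketing the large-order events by time. A minimal sketch follows; the event shape (a dictionary with an ISO 8601 `at` timestamp) and the sample values are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def large_orders_per_hour(events):
    """Count large-order events per clock hour, keyed by the start of the hour."""
    counts = Counter()
    for event in events:
        moment = datetime.fromisoformat(event["at"])
        hour = moment.replace(minute=0, second=0, microsecond=0)
        counts[hour.isoformat()] += 1
    return dict(counts)


# Hypothetical large-order events, as a consumer of the LargeOrders output might see them.
LARGE_ORDER_EVENTS = [
    {"order_id": "A1", "at": "2007-04-02T09:05:00"},
    {"order_id": "A7", "at": "2007-04-02T09:40:00"},
    {"order_id": "B3", "at": "2007-04-02T10:01:00"},
]

gauge = large_orders_per_hour(LARGE_ORDER_EVENTS)
```

A per-day gauge would only change the truncation step, zeroing the hour as well.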
We still have not answered one question, however: How can our services produce these events?
But What About Request/Reply?
Looking at the implementation side, we can see that the infrastructure to support this move is already emerging or even present. If you are implementing SOA over an ESB, that is rather easy to implement, as most ESBs support publishing events out-of-the-box. Using the WS-* stack of protocols, you have the WS-BaseNotification, WS-BrokeredNotification, and WS-Topics set of standards.
If you are in the Representational State Transfer (REST) camp or do not want to get into the complexities of the aforementioned, relatively immature WS-* protocols, I guess you will need to implement publish/subscribe by yourself. But, then, we already have that solved, too: It is called RSS. When someone posts on a blog, your RSS reader uses synchronous request/reply to get to that blog and get the posts that were added since the last time that the RSS reader asked. Well, well, guess what: RSS gives us loosely coupled publish/subscribe, including topics (categories), built on top of synchronous request/reply.
Your services can publish their event streams as feeds, just like your blog, which as a bonus also gives us a few architectural benefits. First, the service does not have to manage subscribers. Second, the consumer does not have to be there the moment that the event occurs to be able to consume it. Third, the management and setup are easier and simpler than using queuing engines or any other technology that I can think of.
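To show how little is needed, here is a sketch of a service rendering its events as an RSS 2.0 feed and of a consumer that remembers what it has already seen. The feed title, the example.com link, the event fields, and both function and class names are hypothetical; a real service would expose the feed over HTTP instead of passing a string around.

```python
import xml.etree.ElementTree as ET


def build_feed(events):
    """Render a service's recent events as an RSS 2.0 document (returned as a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Orders service events"
    ET.SubElement(channel, "link").text = "http://example.com/orders/events"
    ET.SubElement(channel, "description").text = "Event stream of a hypothetical Orders service"
    for event in events:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "guid").text = str(event["id"])
        ET.SubElement(item, "category").text = event["type"]  # topic
        ET.SubElement(item, "description").text = event["summary"]
    return ET.tostring(rss, encoding="unicode")


class FeedConsumer:
    """Reads the feed whenever it likes; the service keeps no subscriber list."""

    def __init__(self):
        self._seen = set()

    def read_new(self, feed_xml):
        """Return (guid, description) pairs for items not seen on earlier reads."""
        fresh = []
        for item in ET.fromstring(feed_xml).iter("item"):
            guid = item.findtext("guid")
            if guid not in self._seen:
                self._seen.add(guid)
                fresh.append((guid, item.findtext("description")))
        return fresh


events = [
    {"id": 1, "type": "OrderProcessed", "summary": "Order 1 processed"},
    {"id": 2, "type": "OrderProcessed", "summary": "Order 2 processed"},
]
consumer = FeedConsumer()
first_read = consumer.read_new(build_feed(events))

events.append({"id": 3, "type": "OrderCancelled", "summary": "Order 3 cancelled"})
second_read = consumer.read_new(build_feed(events))
```

The consumer was not listening when the third event occurred, yet it still picks it up on its next read, and it is the consumer, not the service, that tracks what has been delivered.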
Conclusion
Using EDA and SOA together gives us a solution that does not break SOA and solves BI requirements. However, there are two challenges to the EDA and SOA approach. One is that there is not a lot of experience using EDA and SOA as a BI solution (compared to ETL, which is proven). The other is that it needs more work or even rework, as the first wave of SOA implementations builds on the more basic synchronous-messaging approach. Adding EDA to an existing SOA solution is not a small effort. However, neither is using ETL within SOA, because we need to go out and extract data from many sources, as each service holds its own internal data and we are likely to have quite a few of them for any reasonably sized SOA initiative.
My opinion is that, overall, the combination of EDA and SOA wins over ETL from almost every perspective.
From the SOA perspective, adding EDA to SOA is good for the overall SOA initiative. EDA is a valuable tool for building services that are more autonomous. For example, services can now cache relevant data from other services and get notifications when that data changes. Thus, the consuming service can be decoupled in time from the services with which it interacts, instead of depending on their availability as it does when synchronous request/reply is used.
From the BI perspective, things are even better. Utilizing EDA can give us something that was really hard to achieve by using traditional BI mechanisms—which is real-time insights. Using the EDA-generated event stream, we can now get data in real time and, using CEP tools, we can process it to act in real time and handle the emerging trends as they appear.
To summarize, implementing a BI solution by using EDA and SOA is superior to using traditional ETL. Not only do we get our basic BI, but we actually get better, real-time BI—not to mention improvement in the overall quality of our SOA.
About the author
Arnon Rotem-Gal-Oz is a manager and architect with extensive experience in building large, complex, distributed systems on varied platforms (ranging from HP-UX and Solaris to AS400 and Windows). Arnon blogs at www.rgoarchitects.com/blog and writes the Dr. Dobb's Portal blog on Software Architecture & Design at www.ddj.com/dept/architect. You can contact Arnon at arnon@rgoarchitects.com.