By Evan Levy
I recently talked to a client who was fixated on a hub-and-spoke solution to support his company’s analytical applications. This guy had been around the block a few times and had some pretty set paradigms about how BI should work. In the world of software and data, the one thing I’ve learned is that there are no absolutes. And there’s no such thing as a universal architecture.
The premise of a hub-and-spoke architecture is to have a data warehouse function as the clearing house for all the data a company’s applications might need. This can be a reasonable approach if data requirements are well-defined, predictable, and homogeneous across the applications, and if data latency isn’t an issue.
First-generation data warehouses were built as reporting systems. But people quickly recognized the need for data provisioning (e.g., moving data between systems), and data warehouses morphed into storehouses for analytic data. This was out of necessity: developers didn’t have the knowledge or skills to retrieve data from operational systems. The data warehouse became a data provisioning platform not because of architectural elegance but because of resource and skills limitations.
(And let’s not forget that the data contained in all these operational systems was rarely documented, whereas data in the warehouse was often supported by robust metadata.)
If everyone’s needs are homogeneous and well-defined, using the data warehouse for data provisioning is just fine. The flaw of hub-and-spoke is that it doesn’t address issues of timeliness and latency. After all, if it could, why are programmers still writing custom code for data provisioning?
When an airline wants to adjust the cost of seats, it can’t formulate new pricing based on old data—it needs up-to-the-minute pricing details. Large distribution networks, like retailing and shipping, have learned that hub-and-spoke systems are not the most efficient or cost-effective models.
Nowadays most cutting-edge analytic tools are focused on allowing the business to quickly respond to events and circumstances. And most companies have adopted packaged applications for their core financials and operations. Unlike the proprietary systems of the past, these applications are in fact well-documented, and many come with utilities and standard extracts as part of initial delivery. What’s changed in the last 15 years is that operational applications are now built to share data. And most differentiating business processes require direct source system access.
Many high-value business needs require fine-grained, non-enterprise data. To move this specialized, business function-centric content through a hub-and-spoke network designed to support large-volume, generalized data is not only inefficient but more costly. Analytic users don’t always need the same data. Moreover, these users now know where the data is, so time-sensitive information can be available on-demand.
The logistics and shipping industries learned that you can start with a hub-and-spoke design, but when volume reaches critical mass, direct source-to-destination links are more efficient and more profitable. (If this weren’t the case, there would be no such thing as the non-stop flight.) When business requirements are specialized and high-value (e.g., low-latency, limited content), provisioning data directly from the source system is not only justified, it’s probably the most efficient solution.
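To make that trade-off concrete, here is a minimal Python sketch of the routing decision. Everything in it is an illustrative assumption rather than part of any real product: the `DataRequest` fields, the `choose_route` function, and the nightly (24-hour) hub refresh interval are all hypothetical.

```python
from dataclasses import dataclass

# Assumption for illustration: the warehouse hub is refreshed once a day.
HUB_REFRESH_MINUTES = 24 * 60


@dataclass(frozen=True)
class DataRequest:
    """A hypothetical description of one application's data need."""
    name: str
    max_latency_minutes: float  # how stale the data is allowed to be
    enterprise_standard: bool   # True if generalized warehouse content is enough


def choose_route(req: DataRequest,
                 hub_refresh_minutes: float = HUB_REFRESH_MINUTES) -> str:
    """Return "hub" when the warehouse can satisfy the request, else "direct".

    The hub qualifies only if the request tolerates data at least as old as
    the hub's refresh interval AND the generalized content is sufficient.
    """
    if req.enterprise_standard and req.max_latency_minutes >= hub_refresh_minutes:
        return "hub"
    return "direct"


# Example requests (names and numbers are made up for illustration).
weekly_report = DataRequest("weekly_sales_report", 7 * 24 * 60, True)
seat_pricing = DataRequest("seat_pricing", 1, False)

print(weekly_report.name, "->", choose_route(weekly_report))
print(seat_pricing.name, "->", choose_route(seat_pricing))
```

In this sketch a weekly report that can live with day-old, standardized data goes through the hub, while the airline-style pricing request, which needs fresh and specialized data, is routed straight to the source system. A real decision would weigh more factors (volume, cost, governance), so treat this as a starting point only.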

Evan - Nicely boiled down. Thanks. One point I'd add is that in many network-effect endeavors (like knowledge storage/organization) there are negative returns to scale after some point, due to the exponential costs of coordination among the data/semantics/applications relevant to each additional domain (marketing, finance, ops, etc). The "Economies of Scale/Scope" argument works, but only within a natural range.
Posted by: Scott Davis | March 10, 2009 at 10:37 AM
Hey Evan -
Nicely done.
I don't think the issue is hub-and-spoke architecture per se, but the mis-application of that (proven) architecture.
I think the problem is that DW architects are trying to schlep around waaaay too much data (most of it never used), and are using a tightly-coupled architecture. It's overly complex, cannot scale, and can't meet real-time business BI needs.
A better way is to use existing ESB or MOM infrastructure to subscribe to the business events that provide data of interest.
Posted by: Marty Moseley | March 12, 2009 at 10:44 AM
I keep getting error messages where this architecture is being used. "No spoke data". Do you think that the software is at fault? It is unknown if the data is being transmitted to the receiver and not replied or if the data is not being received by the receiver.
Posted by: george okeeffe | August 20, 2009 at 08:27 AM
George,
It sounds as though your current environment hasn't implemented a reliable delivery mechanism. Most hub-and-spoke architectures don't make data available (at the hub) unless they've implemented a method to ensure that all data sent is delivered.
We often find that send/receive problems are associated with custom point-to-point data migration mechanisms. That's one of the reasons so many folks have implemented an enterprise service bus (ESB) to replace their custom solutions. An ESB can ensure delivery of all sent data.
If you have any other questions, you're welcome to contact me directly.
E.
Posted by: evan levy | August 20, 2009 at 10:17 AM
When you say "provisioning data directly from the source system is not only justified, it’s probably the most efficient solution," are you referring to an independent data mart?
Posted by: Bala | August 25, 2009 at 03:13 PM
Thanks for the question. Data provisioning isn't limited solely to data marts. My remarks regarding provisioning data directly from source systems weren't focused on BI systems or data marts. They were addressing data provisioning in general.
I think it's important to consider that systems of all types (operational, analytic, etc.) may need non-enterprise, non-standardized data to support their business functions. A DW positioned as a data provisioning hub may not solve every possible data access need (e.g. a CRM system wanting an updated phone number or recent bill payment details).
Posted by: Evan Levy | August 25, 2009 at 07:38 PM