One of the great things about adding new, senior talent to a storied team working on a large, complex, and successful enterprise application solution is the critical technical review that results in a lot of “why did/didn’t you do it this way?” questions. You have two options for responding to those questions: ignore or passively dismiss them, or take them seriously as a challenge to prove that a decision you and your team made 5 years ago is the one you would still make if you were considering it for the first time today, with today’s frameworks, development methodologies, and the current team makeup and skills inventory. If you choose to dismiss these opportunities to critically review your prior decisions, it says a lot about your management style and your general appreciation of technology and response to its change, and it positions your team to take a reactionary, defensive posture toward architecture rather than a proactive, continuous-improvement perspective. Far more interesting, too, are the questions that ask why the system is architected in a general way, rather than a theological debate on whether a particular technology component is superior to all others or to one’s preferred/familiar choice.
The particular question the new engineer asked was, “Why aren’t we using a service bus?” Instead of answering him directly, I saw this as a good opportunity to explore our previous decision, which not only left an enterprise service bus (ESB) out of the original design, but rejected its inclusion when it was strongly suggested by our first customer because they were standardizing on a service bus-centric architecture themselves. The primary advantage of a service bus is that it layers an abstraction across heterogeneous systems by implementing a centralized communication mechanism between components. By applying this architectural model, you get some key benefits, including orchestration, queuing to handle intermittent component availability, and extensibility points for message routing to alter dispatch logic or transform messages. Implementing the service bus pattern requires some kind of adapter to be written for each component of the system, either as a local modification to each component or by standardizing on a communication channel provided by the ESB. Even in the latter case, some minor accommodation is usually required to allow the ESB to receive and encapsulate the native message for delivery to the destination component.
Our first customer was a notable player in the community banking market, and was productizing multiple new SaaS-based web applications that depended on data feeds coming from many different customers. In their scenarios, data was consumed by one application, parsed, and delivered to other applications, which in turn may have created additional data feeds for other products, forming a cyclic, non-directed graph of communication and dependencies.
Each application was developed by different teams, and there was no unified technology stack adoption – some teams were developing on EJB and Flex, others were pure .NET, and teams generally had the discretion to choose whatever they could argue would solve the job, without a strong technology leader looking to unify the stack for similar applications that delivered CMS and pseudo-online banking functionality using a common input data set.
For this customer, an ESB was a solution to a problem their own choices had created, a highly concurrent development process with multiple independent teams, and it also supported connecting a heterogeneous environment of interdependent components, each of which accomplished limited objectives. This organization was running red-hot, developing ancillary products for a highly engaged and fanatic client base of community banks, and their limiting factor was their speed of innovation and delivery. By agreeing on a common communication mechanism that an ESB could provide, there was something, albeit low-level, on which all teams agreed. In the ‘controlled agile chaos’ they found themselves in, the abstraction bought them flexibility to adapt to changing business requirements using orchestration. In theory, anyway: they ended up moving much slower than they anticipated, but this wasn’t the fault of the ESB.
ESB solves two classes of problems. The first is the common use case of large, disparate enterprises looking to marry systems established from the dawn of client-server architectures to the newest Node.js hotness, without having to bend the will of any particular system to the communication conventions of any other, which may prove impossible if both systems are proprietary. This is a common use case for BizTalk, especially in the financial sector. All the other benefits you can name from a service bus architecture are really secondary advantages to this key objective. The second is the use case that any layer of indirection provides: an abstraction you can use to increase the speed of development when requirements are incomplete or prone to pivot. In each case, you invest in a layer to reduce the cost of future change.
This particular customer chose NServiceBus as their message-oriented middleware. We seriously evaluated both the general architectural concepts of ESB and the particular technology they suggested, and came up with a definitive ‘no’ to that choice.
While it made a lot of sense for our customer, it did not make sense for us because:
- We did not require guaranteed event handling. First, our system connected to a system of record that provided transactional consistency, and virtually all state changes were initiated by users through a web browser. A timeout was preferable to a queued command-handling system because of the possibility of duplicate transactions that frustrated users might initiate, not realizing their requests were queued. Second, our interconnected systems did not provide guaranteed event handling, so the guarantee provided by the ESB would not be honored end-to-end. Third, we are using the Windows Identity Foundation with sliding time expirations end-to-end from the user’s browser through the lowest layer of service components, which doesn’t bode well for delayed delivery situations, even if the user was willing to wait.
- We do require transformation, but not orchestration, between our components. Our system features an adapter-based design that allows multiple types of endpoints to be serviced by a single service implementation: the portions that may need to connect to a different type of third-party system do so through a provider model implementation loaded by dependency injection. We could have chosen to use an ESB for this piece; however, we perceived the long-term maintenance cost of multiple providers with party-specific transformation logic to be lower than maintaining those transforms in ESB scripting or adapters. In reviewing this perception today, I believe it was still the right decision because it allowed us to unit-test our transformation logic without including the ESB.
- An ESB is a single point of failure that would need to scale independently, carrying load that grows with the number of service interconnects in our solution, and it would add some amount of latency to each of them. Because online banking is a mission-critical, customer-facing solution, it cannot have SPOFs in any portion of the architectural design. The SPOF nature of an ESB can be mitigated in multiple ways, but we felt that was at least two layers of complexity we could solve in other, simpler ways.
- All middleware reduces the Mean Time Between Failures (MTBF). This is not a risk specific to ESB, but of any layer added to a system. If you add an ORM, IOC, ESB, or even a logging aspect, something can go wrong with it. Each component has some small but measurable failure rate, and when it is inserted into the communication chain between all components, even a reliability of 99.999% still reduces the overall reliability of a serial system. This is where the KISS principle shines: complexity creates unreliability, so all complexity must generate a compelling benefit in excess of its potential to fail.
- We wanted our application layer to be the platform; we did not want the ESB to be the platform. This was a business case / competitive advantage decision. We wanted it to be a feature of our system that the same services layer that supported our front-end user interfaces was also an open and extensible platform with which our clients could integrate. That would increase the overall value proposition of online banking not only as a sticky end-user experience, but also as the middleware that marries together all the disparate systems within a financial institution, which ultimately online banking does like no other piece of technology within a bank or credit union. We felt that by positioning everything behind an ESB, the perceived value of our technology piece would be lessened without additional client education.
- MSMQ made us feel dirty enough, and we did not want to mandate it for each component because it was in 2009, and still is, relatively difficult to debug, and, as we have learned lately, queues do not work well when used with Layer 7 network load balancing. The new hotness of 0MQ wasn’t around then, and while RabbitMQ was, it was arguably not production-ready by that time. For us, production-ready isn’t just whether a component is capable, but whether it will have general acceptance from the IT departments of our large clients. Many newer technologies that are FOSS or from vendors without an established track record require a ‘sale’ and buy-in during due diligence, long before ink is applied to a contract. Even if they had been options for the ESB queuing mechanism, they would not have resolved the larger aforementioned concerns.
- At the time we made this choice, AMQP was an amorphous draft that did not solidify until later. The lack of a vendor-independent protocol between components and an ESB made the choice to utilize an ESB subject to vendor lock-in, which we were not willing to tolerate for such a critical component.
- Because our product was both the end-user experience and the middleware we were writing, we felt strongly that the application protocol should provide descriptive metadata and support fast client proxy generation using .NET-based tools. REST support was archaic at best (HttpWebRequest, anyone?) in .NET 3.5, and to this day, consuming REST (HttpClient) or AMQP services is intrinsically more verbose in C# and VB.NET than consuming SOAP services, due to a lack of better library and integrated language support. Looking back on this, with the large amount of iterative change we went through from ideation to Version 1.0 of our solution, we could not have moved as fast without a quick way to regenerate proxies, which caused build failures that alerted us to service operation signature changes. Tracking these down at runtime (REST) or having to debug a secondary system (ESB) would have bogged down our delivery timelines.
- A lesser concern was debuggability. While tracing SOAP messages is definitely more difficult than tracing REST calls, we felt it would be more difficult still to debug issues in AMQP or other ESB encapsulation protocols than to inspect SOAP envelopes with the built-in WCF tools already present in the .NET development stack.
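The serial-reliability point in the list above can be made concrete with a little arithmetic. The sketch below assumes independent failures and uses made-up figures (five service hops at 99.99% each, with a 99.999% bus inserted on every hop); none of these numbers come from our system, and the function names are only for illustration.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Hypothetical figures, for illustration only.
hops = [0.9999] * 5   # five service calls on one request path
bus = 0.99999         # "five nines" middleware traversed on each hop

direct = serial_availability(hops)
via_bus = serial_availability(hops + [bus] * len(hops))

print(f"direct:  {direct:.6f} (~{downtime_minutes_per_year(direct):.0f} min/year)")
print(f"via bus: {via_bus:.6f} (~{downtime_minutes_per_year(via_bus):.0f} min/year)")
```

However reliable the added layer is, the second figure can only be lower than the first, which is the cost any new layer has to justify.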
So, that’s quite a case against an ESB, but they do have compelling uses for certain environments, just not ours. Like all technology selection decisions, it’s important to pick the right tool for the job, and improve your tools as needed. A standalone ESB would have provided significant benefits if we were developing with proprietary/closed third-party systems that were part of a call chain that required orchestration, or if we were developing with a heterogeneous mix of technologies. In our case, we had a predictable, homogeneous .NET environment based on web services, the consumers of our API were our own technologies or a limited number of customers who were also using .NET, and we had no legacy baggage. With the widespread adoption of WS-* standards, we have chosen to obtain some of the benefits, such as federation, from those standards rather than from an ESB feature, which ultimately we believe makes our platform easier to support and distribute for our future API consumers. Other side benefits such as logging are handled as separate concerns through dependency injection rather than external interceptors in a communication channel, a possibility for us only because we control the portion of the stack that requires orchestration. And finally, by keeping all communication as SOAP over HTTP/HTTPS, we gain features like load balancing from Layer 7 network devices instead of an ESB process, and those devices are much easier to switch out and upgrade.
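The provider-model idea mentioned earlier (party-specific transformation logic living in a provider that is injected into a single service implementation) might look roughly like the sketch below. It is written in Python for brevity rather than in our .NET stack, and every name in it (`CoreProvider`, `AcmeCoreProvider`, `AccountService`, the `ACCT`/`BAL` fields) is hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Balance:
    """Common model the service layer exposes, whatever the back end."""
    account_id: str
    cents: int


class CoreProvider(ABC):
    """Hypothetical provider contract: one implementation per third-party system."""

    @abstractmethod
    def to_balance(self, raw: dict) -> Balance:
        ...


class AcmeCoreProvider(CoreProvider):
    """Transforms a made-up 'Acme' payload shape into the common model."""

    def to_balance(self, raw: dict) -> Balance:
        return Balance(raw["ACCT"], int(Decimal(raw["BAL"]) * 100))


class AccountService:
    """Single service implementation; the provider and transport are injected."""

    def __init__(self, provider: CoreProvider, fetch):
        self._provider = provider
        self._fetch = fetch  # callable that returns the raw third-party payload

    def balance(self, account_id: str) -> Balance:
        return self._provider.to_balance(self._fetch(account_id))


# A test can inject a canned payload: no bus, queue, or network is involved.
service = AccountService(
    AcmeCoreProvider(),
    fetch=lambda acct: {"ACCT": acct, "BAL": "1234.56"},
)
print(service.balance("001"))
```

Because the transformation is an ordinary class, it can be exercised in a plain unit test, which is the maintenance advantage we weighed against keeping transforms in ESB scripting.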
The central design decision we made was this: ESBs provide some great features, but using them ties you to an ESB. If we could get those features another way that was just as convenient or more so, we preferred the plug-and-play flexibility of leveraging existing solutions for concerns such as caching and load balancing in the environment where our solution operates, or picking those pieces ad hoc, rather than picking the best omnibus solution and working around its shortcomings in any one of them. In reviewing the current industry literature and blog posts and looking at general trends, it would seem our decision not to marry our solution to an ESB is the path many take when they are not required to integrate legacy systems as part of an orchestration chain or to use non-HTTP transport mechanisms. If you’re using one, hopefully it’s for a good and necessary reason! For us, though, we decided not to hop on a service bus that could only take us somewhere we had already arrived.
* As an aside, we actually did end up rolling our own small “ESB” as a TCP port multiplexer that queues and portions out connectivity to a socket-based, legacy third-party component that has no listener back-queue and no port concurrency, which is highly unusual for a server process. Each connection consumes the port fully for the duration of the short transaction, and we had to write a way to buffer M number of requests and hand them off to (M-N) number of available ports as they became available, in a specialized type of producer-consumer problem. In hindsight, this was an opportunity to use an ESB, but we only required message routing and load leveling, and in a few hundred lines of code we implemented what we needed for this particular third-party system, which would have taken us far longer to do as our first time using an ESB. That being said, should we encounter this with another vendor, it would make sense to review using an ESB for this type of functionality.
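For readers curious about the shape of that producer-consumer problem, here is a minimal sketch using Python’s `asyncio`. It is not our implementation: the class name, the port numbers, and the `fake_transact` stand-in for the real socket exchange are all assumptions made for illustration. The idea is simply that callers wait in line until one of a fixed set of exclusive ports is free.

```python
import asyncio


class PortMultiplexer:
    """Makes callers wait until one of a fixed set of exclusive ports is free."""

    def __init__(self, ports):
        self._free = asyncio.Queue()
        for port in ports:
            self._free.put_nowait(port)

    async def send(self, request, transact):
        port = await self._free.get()    # blocks while every port is busy
        try:
            return await transact(port, request)
        finally:
            self._free.put_nowait(port)  # release the port to the next waiter


busy = set()


async def fake_transact(port, request):
    """Hypothetical stand-in for the short socket transaction."""
    assert port not in busy, "a port must never serve two requests at once"
    busy.add(port)
    try:
        await asyncio.sleep(0.01)
        return f"{request} via port {port}"
    finally:
        busy.discard(port)


async def main():
    # Created inside the running loop so the queue binds to it on older Pythons.
    mux = PortMultiplexer(ports=[9001, 9002])
    calls = (mux.send(f"req{i}", fake_transact) for i in range(6))
    return await asyncio.gather(*calls)


results = asyncio.run(main())
for line in results:
    print(line)
```

Six requests share two ports here, and the queue of free ports provides both the buffering and the load leveling; a real version would also need timeouts and error handling around the socket work.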