It was a sunny day in June 2005 and our spirits were high as we watched the new ordering system we'd worked on for the past 2 years go live in our production environment. Our partners began sending us orders and our monitoring system showed us that everything looked good. After an hour or so, our COO sent out an email to our strategic partners letting them know that they should send their orders to the new system. 5 minutes later, one server went down. A minute after that, 2 more went down. Partners started calling in. We knew that we wouldn't be seeing any of that sun for a while.
The system that was supposed to increase the profitability of orders from strategic partners crumbled. The then seething COO emailed the strategic partners again, this time to ask them to return to the old system. The weird thing was that although we had servers to spare, just a few orders from a strategic customer could bring a server to its knees. The system could scale to large numbers of regular partners, but couldn't handle even a few strategic partners.
This is the story of what we did wrong, what we did to fix it, and how it all worked out.
"Best Practices" Aren't Enough
Although we had designed our system according to the documented best practices of the various vendors (stateless request-handling logic, layered architecture, tiered deployment, and separate OLTP and OLAP databases), nobody ever told us about the different kinds of scale our system would have to handle. During 2003, as we were designing the performance-critical aspects of the system, and through 2004, as we went through load and stress testing, we were positive that we had everything covered.
After sifting through the server logs and monitoring system events, we found out that orders from strategic partners were very different from those of regular partners. Where regular partners would order several hundred items at a time, strategic partners were sending us hundreds of thousands of lines at a time. Requests could be hundreds of megabytes. Neither our messaging infrastructure nor our object/relational mapping code had been tested under those kinds of loads. Server cores burned hotter than ever trying to deserialize all that XML. Handling a single request could swallow half a gig of memory. Database locks were held for minutes instead of milliseconds. As threads timed out, the garbage collector would go crazy trying to reclaim all that memory, hurting system availability even more.
The first thing we did was replicate those scenarios in our performance testing lab. Each time we ran the tests and the system crashed, we just looked on in disbelief. I kept thinking to myself, "we did everything by the book - how is this happening?"
Truth be told, it was the first company I worked for that actually gave us the time and the budget to do everything by the book. We didn't have any excuses. But what do you do when the book's just not enough?
Different Kinds of Scalability
It turns out that requests per second is just one dimension of scalability. The other dimensions that we painfully found out about were:
- Message Size
- CPU utilization per request
- Memory utilization per request
- IO (and network) utilization per request
- Total processing time per request
Message size seemed to have a large influence on all the other dimensions. As messages got larger, it took more CPU to deserialize them, more memory to hold the resulting data, and more network and IO to get that data in and out of the database, altogether affecting the total processing time. However, even small requests, like those to discount all pending orders for a partner, were impacted by the amount of data they had to process.
We went over everything, but there was just no getting around it. Unless we could make those big messages smaller, the problems wouldn't go away. Here's a snippet of the kind of conversation we had:
Dan: "Binary serialization might work for the smaller numbers of strategic partners."
Barry: "No good, there's a total of 5 non-compatible platforms between them."
Sasha: "And that wouldn't do much for memory or IO either."
Me: "What about compression? That'll ease the load on the messaging infrastructure."
Dan: "That'll just weigh heavier on the CPU's."
Sasha: "Do I have to repeat the memory and IO thing again?"
Barry: "It's like Request/Response won't work here."
Me: "You know how much I like Publish/Subscribe, but I don't see how that can work."
But as we began delving deeper into the core messaging patterns, we stumbled on the solution.
The Real World is Message Oriented
What surprised us most about the solution was that it suited both regular and strategic partners and greatly improved performance for both. Not only that, but it enabled faster turnaround time for each order and improved inventory management capabilities, things we hadn't even considered at the time.
The solution was quite simple really - instead of one "create order message", partners could send us many "order messages" over time, keyed by their partner id and purchase order number. When they were done with all the items for that purchase order number, they would send us an "order message" with the flag "complete" set to true. It was a stateful interaction.
You see, partners almost always had a procurement department that sent out orders. Those orders were built up over time until they were "complete", and then sent to us. Our solution involved the partner's procurement system sending us partial, non-final order information as it was built up. They could change information that had already been sent, even cancelling parts of the order, all without knowing the order number in our system (which was managed by an existing ERP). In fact, until we received a message indicating the order was complete, we didn't even call the ERP for that purchase order.
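To make the contract concrete, here's a minimal sketch (in Python, purely for illustration - the field names are assumptions, not our actual schema) of what an "order message" might carry:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OrderLine:
    sku: str                                                # the product being ordered
    quantity: int                                           # total quantity for this SKU
    options: Dict[str, str] = field(default_factory=dict)   # configuration choices

@dataclass
class OrderMessage:
    # Correlation key: together these identify the purchase order being built up.
    partner_id: str
    purchase_order_number: str
    # Each message carries the full current set of lines for the SKUs it mentions
    # (see the idempotency discussion below).
    lines: List[OrderLine] = field(default_factory=list)
    # Set to true only once the partner's procurement department considers
    # the purchase order complete.
    complete: bool = False

# A partner might build an order up in several small pieces over days or weeks:
first_piece = OrderMessage("ACME", "PO-1001", [OrderLine("SKU-42", 300)])
final_piece = OrderMessage("ACME", "PO-1001", complete=True)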
When we received an "order message", we would respond with an "order status changed message". If their system didn't receive a response in what they considered a reasonable period of time, they could just resend the previous message. In other words, we made our messages idempotent. This meant that partners had to resend all lines for a given product SKU (containing the various options and configurations) any time they wanted to do anything with that product SKU - actually not that much data.
Idempotent messages are those which, if handled multiple times by a system, have the same effect as if they were handled just once.
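Continuing that sketch, idempotent handling boiled down to replace-by-key semantics: every line for a SKU overwrites whatever was previously held for that SKU, so handling a duplicate message leaves the order unchanged. The module-level dictionary below is only a stand-in for the storage that actually held the accumulating orders:

# Hypothetical store of in-progress orders, keyed by (partner id, purchase order number).
pending_orders: dict = {}

def handle_order_message(msg: OrderMessage) -> None:
    key = (msg.partner_id, msg.purchase_order_number)
    order = pending_orders.setdefault(key, {})   # maps SKU -> OrderLine

    # Replace-by-SKU: each line overwrites whatever we held for that SKU before,
    # so processing the same message twice has exactly the same effect as once.
    for line in msg.lines:
        order[line.sku] = line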
Idempotency had a performance-critical side effect - we no longer had to use durable messaging to keep messages from getting lost. Instead of writing enormous messages to disk all the time, our application protocol arranged for the partner system to manage that state for us - a truly trivial amount of additional complexity at their end.
Scalability Effects of Changed Schema
As we collaborated with our strategic partners on the new "version" of the ordering system (nobody dared call it a rewrite), we found out that the expected size of the new "order messages" was on the order of a couple thousand line items - similar to those of regular partners. As message size decreased, the other dimensions of scalability improved too: CPU, IO, and memory utilization all dropped.
Once resource utilization per request dropped, latency naturally dropped as well, but throughput increased even more than expected. It turned out that the gigantic messages had been "hogging" resources needed by smaller requests. Over time, the entire database connection pool would be held by threads processing big messages, effectively "denial-of-servicing" the threads handling small messages and causing them to time out.
There were other parts of the system which were still affected by increased data volumes and which decreased message size didn't solve. For instance, discounting all pending orders for a partner, and requests like it, just had to be implemented without object/relational design. That request handling logic was better expressed using set-based logic - straight SQL. Instead of loading all the data into memory, iterating over it and changing it in some way, and finally saving it all back to the database, a simple SQL statement could be used:
UPDATE PendingOrders SET Discount=@Discount WHERE PartnerId = @PartnerId
The effect was striking - faster response times and increased stability.
Explicit State Management Scales
One of the most notable differences between the design of the new version and the previous version was that message handling logic was "stateful", something all the vendors had been warning against. Martin Fowler's writings helped us understand that keeping state in the database doesn't magically make a system scale - it only makes the database a bottleneck and helps the database vendor sell more licenses. Our new design handled the state of the multi-message processing explicitly; we used the term "saga" to describe the combination of logic and state. Explicit state management enabled us to choose the most appropriate technology for its storage.
For our regular partners, we had many small objects being written and read at high speed, but the lifetime of those objects was rather short, so we decided to use an in-memory, distributed cache product that kept that state highly available. For our strategic partners, state could grow to hundreds of megabytes and would evolve slowly over a period of weeks and even months. There we ended up using a database, but an open-source one mapped to direct access storage rather than the SAN used for the OLTP database. Stored order data didn't need to be kept as XML, giving us the benefits of binary serialization without compromising interoperability.
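In rough terms, explicit state management came down to an interface along the following lines, with one implementation per storage technology; the shape and names are illustrative, not our production code:

from abc import ABC, abstractmethod
from typing import Optional, Tuple

Key = Tuple[str, str]   # (partner id, purchase order number)

class SagaStore(ABC):
    """Persistence for saga state; where it lives is a deployment decision."""

    @abstractmethod
    def load(self, key: Key) -> Optional[dict]: ...

    @abstractmethod
    def save(self, key: Key, state: dict) -> None: ...

    @abstractmethod
    def delete(self, key: Key) -> None: ...

class InMemorySagaStore(SagaStore):
    """Stand-in for the replicated cache used for regular partners:
    many small, short-lived sagas read and written at high speed."""

    def __init__(self) -> None:
        self._sagas: dict = {}

    def load(self, key: Key) -> Optional[dict]:
        return self._sagas.get(key)

    def save(self, key: Key, state: dict) -> None:
        self._sagas[key] = state

    def delete(self, key: Key) -> None:
        self._sagas.pop(key, None)

# The strategic-partner implementation sat behind the same interface but was
# backed by a database on cheaper storage, holding large, slowly evolving sagas
# in a compact binary form rather than XML.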
Sagas - Benefits and Challenges
Since the new message contract allowed our partners to send us many messages with the same purchase order number, our system needed to be able to differentiate which message related to an existing order-processing saga and which required a new saga to be created. As such, we needed a way to query our saga persistence mechanism by partner id and purchase order number. This requirement eliminated several popular distributed cache technologies as they only allowed querying by id, yet several higher-end solutions worked just fine.
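Concretely, dispatching an incoming message meant looking the saga up by that composite key and starting a new one if nothing was found - a sketch, assuming a store keyed by (partner id, purchase order number) like the one above:

def find_or_start_saga(store: SagaStore, msg: OrderMessage) -> dict:
    # The partner never learns our internal order number, so the only usable
    # correlation identifiers are the ones they already know: their own id
    # and their purchase order number.
    key = (msg.partner_id, msg.purchase_order_number)
    state = store.load(key)
    if state is None:
        state = {"lines": {}}          # first message for this purchase order
        store.save(key, state)
    return state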
The term "saga" was coined in 1987 by the relational database community to describe a style of handling long-lived transactions. Sagas gave up on global atomicity and isolation, treating the process as a sequence of multiple, smaller ACID transactions.
While some on our team were originally apprehensive about moving to a new programming model, they quickly saw that it was practically the same as regular message processing. When a message came in, its data was used to look up an object (saga) from some storage, a method was called on that object, the object changed some of its state, some messages were sent, and the object was saved back to storage. The only difference was that the state the saga managed wasn't so much the data in the master database or the ERP but rather the temporary data of the interaction. When the saga received an order message with the "complete" flag set to true, it would call the ERP system, entering all the data it had accumulated, and send a message back to the partner system notifying them that the status of their order was now "accepted" and not just "received".
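Putting the pieces together, a saga for this interaction might look roughly like the sketch below; erp and bus stand in for the real ERP gateway and messaging infrastructure, whose APIs are assumed here rather than taken from the actual system:

class OrderSaga:
    """Accumulates order lines for one (partner, purchase order) pair until
    the partner marks the purchase order complete."""

    def __init__(self, store: SagaStore, erp, bus) -> None:
        self._store = store   # saga persistence, e.g. the SagaStore sketched above
        self._erp = erp       # gateway to the existing ERP system (assumed API)
        self._bus = bus       # messaging infrastructure (assumed API)

    def handle(self, msg: OrderMessage) -> None:
        key = (msg.partner_id, msg.purchase_order_number)
        state = self._store.load(key) or {"lines": {}}

        # The same idempotent replace-by-SKU merge as before.
        for line in msg.lines:
            state["lines"][line.sku] = line
        self._store.save(key, state)

        if msg.complete:
            # Only now do we touch the ERP, entering everything accumulated so far.
            self._erp.create_order(msg.partner_id, msg.purchase_order_number,
                                   list(state["lines"].values()))
            status = "accepted"
        else:
            status = "received"

        # Tell the partner what happened to their order.
        self._bus.reply(msg, {"purchase_order_number": msg.purchase_order_number,
                              "status": status})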
As development progressed, we realized that other systems in the company would be interested in knowing about an order even before it was complete, not just the partner system that sent it. So instead of only replying to the partner when we received an "order message", we started publishing "order status changed messages" to anyone interested.
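The publishing side can be sketched with a toy bus in which subscribers register interest by message type; the real messaging infrastructure was obviously richer, but the shape of the interaction was the same:

from collections import defaultdict
from typing import Callable, DefaultDict, List

class ToyBus:
    """Minimal publish/subscribe: the publisher never knows who is listening."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, message_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[message_type].append(handler)

    def publish(self, message_type: str, event: dict) -> None:
        for handler in self._subscribers[message_type]:
            handler(event)

bus = ToyBus()
# Fulfilment and inventory management subscribe without the order system caring who listens.
bus.subscribe("order status changed", lambda e: print("fulfilment sees", e))
bus.subscribe("order status changed", lambda e: print("inventory sees", e))

bus.publish("order status changed",
            {"partner_id": "ACME", "purchase_order_number": "PO-1001",
             "status": "received"})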
Publish/Subscribe Increases Enterprise-wide Performance
The first department that took interest was Fulfilment - specifically in the "rush jobs". One of the challenges the company had been facing recently was providing better service to clients requesting rush jobs, and the problem ended up being fulfilment. You see, even if we were able to get all the products ready on time, having the right kind of transportation ready and in the right quantities almost never happened. Some products needed to be refrigerated, others needed to be bubble wrapped, you get the idea. Just having enough staff on hand to handle things was problematic.
Now that the fulfilment systems knew in advance by which date orders would need to be provided and what products they would entail, they could plan in advance to have enough staff on hand, enough trucks with refrigeration, everything they needed to get those jobs out the door on time.
Inventory management followed fulfilment's lead - knowing that an order was coming for certain products enabled them to get inventory ready in advance, pressuring suppliers to deliver earlier when necessary.
Load Testing Results
After putting the new system through its paces using the messages we recorded from the previously unsuccessful launch, we found some interesting things.
First of all, the first time we ran the tests, our test engineers had configured the messaging infrastructure incorrectly, allowing messages to be handled out of order. Although we were pleased with the performance, we went through all the logs and data, checking whether everything was actually correct. Although we knew idempotent messages could be handled any number of times, we'd never considered the issue of ordering before. Just to make sure, we had another round of code reviews looking specifically for issues around message ordering. Finally, we did some more functional testing before we were confident that we could loosen the ordering constraint.
Second, we found that response times slowly degraded over the period of testing, even though the tests were just repeating the same scenarios over and over again. After noticing that the CPU utilization of those later scenarios was the same as the earlier ones, our attention turned to locking issues in the database. Those were quickly ruled out when one of us noticed that the saga table had more than a million rows. As we looked at each other puzzled, someone mumbled, "did we say we'd be deleting sagas when they're complete?" And that did it. One small code change to the saga persistence mechanism and response times levelled off nicely.
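In terms of the saga handler sketched earlier, the fix amounted to something like this (illustrative, not the actual change):

# Previously the saga state was saved back after every message, so completed
# sagas were never removed and the saga table grew past a million rows.
if msg.complete:
    self._store.delete(key)          # the saga's work is done - remove its state
else:
    self._store.save(key, state)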
A Partner's Perspective
Of course, in order for the system to work our partners needed to change their systems as well. You can imagine that before we got a green light to go forward with our design, many partners were consulted both at the technical level and the COO level. Our regular partners were a bit irked by the change but made the "create order message" to "order message" change and left it at that.
After the technical teams at our strategic partners had a look at our demo code, we saw their faces light up. "You didn't think we liked sending those enormous messages, did you? This is going to save on memory and CPU-bound serialization on our end too." They were even willing to open up a port in their firewalls so that the status of their orders could be pushed to them as the orders moved through various phases in our company.
The move to production was a bit touchy though. Our partners had to set it up so that both the old and new versions of their system were online at the same time as we brought our system up. Our COO was understandably on edge throughout the process, which, despite some small hiccups, was a success.
We watched all the previous health indicators - CPU, memory, IO, database locks - for several hours straight before letting out our collective breath. In the days and weeks that followed, both the operations and development staff monitored these stats looking for signs of impending doom, but none came. Response times were twice as fast as what we'd seen the last time around. 3 months after that came the process of decommissioning the previous order system.
Nearly 2 years had gone by since that original summer day in 2005, and the clouds over our heads had finally dissipated. The COO didn't particularly care about all the technically wondrous capabilities of the system, he was too focused on the increased profits from our strategic partners.
Lessons Learned
This project was indeed a trial by fire for many of the best practices of the day. We had blindly put our faith in the technologies and products the vendors had supplied, and were burned. Although performance and scalability were always kept in mind, we hadn't considered the deeper system effects of large data volumes. As a part of our overall agile mindset, we had embraced the partial truth of "premature optimization is the root of all evil". The full quote states:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
-- Donald Knuth
These were not "small efficiencies", nor was the subsequent redesign anything like "optimization". By changing our service contracts and introducing stateful interactions, we were able to manage the performance-critical state of the system. We'd never known that that level of control was possible, and the specific technology choices not only made for a performant system, but also a cost-effective solution.
The key takeaway for me was this: scalability isn't a Boolean value. Beyond the number of concurrent users or servers online, scalability is a multi-dimensional cost function. Given certain response time requirements, peak and average request rates, message sizes, formats, memory working set sizes per request, CPU/IO utilization per request, how much does a given solution cost? Technology choices which made sense for strategic partners were not cost effective for regular partners. Always run the results by the business - they just might change the performance requirements once they see that costs (up-front and ongoing) overshadow projected revenue.
References
- Designing and Unit Testing Sagas
- Sagas - Garcia-Molina, Salem 87
- Idempotent Messaging
- http://www.nServiceBus.com - infrastructure supporting sagas.
About the Author
Udi Dahan is The Software Simplist, recognized by Microsoft Corporation with the coveted Most Valuable Professional award for Solutions Architecture now 3 years running. Udi is a Connected Technologies Advisor working with Microsoft on WCF, WF, and Oslo. He provides clients all over the world with training, mentoring and high-end architecture consulting services, specializing in Service-Oriented, scalable and secure .NET architecture design.
Udi is a member of the European Speakers Bureau of the International .NET Association, an author and trainer for the International Association of Software Architects and a founding member of its Architect Training Committee. He is also a Dr. Dobb's sponsored expert on Web Services, SOA, & XML.
Presenting at international conferences like TechEd USA, SD Best Practices, DevTeach Canada, Prio Germany, Oredev Sweden, TechEd Barcelona, QCon London, and TechEd Israel - Udi covers in-depth mission-critical topics like durable services, persistent domain models, and multi-threaded occasionally connected clients.
Udi can be contacted via his blog www.UdiDahan.com.