In this article, I'll start by considering some aspects of the enterprise architecture of financial systems and compare them to some characteristics of gaming environments that I've observed as a player.
In the second half, I'll go on to discuss some of the technology and best practices that have grown up in the development of cloud-deployed architectures. Finally, from these case studies, I'll look into my crystal ball and imagine some of the gaming possibilities that the synthesis of all of these techniques could unlock.
First, a caveat about financial institutions. The larger firms are huge and can comprise dozens of business lines and a truly bewildering array of systems, of many different types and with widely varying non-functional characteristics. So my description of them is, of necessity, an oversimplification. It's possible for an engineer to spend his or her entire career in an investment bank and still never work on more than a fraction of the systems the bank has.
So when I discuss financial institutions and their systems in this article I am really focusing on the client-facing parts of investment banks. These groups and projects are usually very concerned with the reliability and stability of their systems. This is very much in their nature - the banks are heavily regulated players in a highly competitive and lucrative market.
A typical example of such a system might be a client order management system. This would accept orders for stocks and shares (or commodities or foreign currencies or other financial instruments) on behalf of clients and place these orders directly on an electronic market, with basically no manual intervention by bank staff under normal conditions.
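To make that concrete, here is a minimal sketch of the straight-through path such a system implements. All of the names here (`ClientOrder`, `market_gateway` and so on) are hypothetical, invented purely for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class Side(Enum):
    BUY = "buy"
    SELL = "sell"

@dataclass
class ClientOrder:
    client_id: str
    instrument: str      # e.g. an equity ticker or an FX pair
    side: Side
    quantity: int
    limit_price: float

def place_order(order: ClientOrder, market_gateway) -> str:
    """Validate a client order and route it straight to an electronic market.

    Under normal conditions no human touches this path: the order flows
    from the client-facing API to the exchange automatically.
    """
    if order.quantity <= 0 or order.limit_price <= 0:
        raise ValueError("rejected: invalid order parameters")
    # market_gateway stands in for the bank's exchange connectivity layer
    return market_gateway.submit(order)
```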
Bank clients who use such a system usually have no loyalty to a particular bank. Many clients will have comparable client accounts with several different banks to provide market access.
If the order management system is ever unavailable, even briefly (for a few seconds, or less), then customers will simply switch to a competitor bank to fill their orders, and they may not return to the original provider for months. This is true even at the busiest periods of the market.
This means that these types of banking systems must be engineered for very high reliability, as their customer base is extremely fickle and a single lost customer can seriously impact the profit a division of the bank can make.
Experience in the market has enabled banks to achieve these levels of reliability, but it has come at a significant cost. This cost is both in terms of software and hardware to ensure redundancy and monitor the system, and also in terms of the number of support engineers required to keep the systems running at the required levels of reliability.
By contrast, when gamers have formed a major emotional attachment to a particular game, they can be much more tolerant of outages. For popular games which deploy regular, largish patches (often a few hundred MB in size), potentially slow download times seem to be mostly accepted by users - and no mass exodus to another game occurs.
Even the occasional crash of a server seems to be regarded as a fact of life. As long as it doesn't happen too often, gamers seem to regard crashes, and even the loss of a small amount of game state and experience, as acceptable.
Another clear difference between banking and gaming systems arises from their user impact patterns. No matter how hardcore the gamer, their overall impact on a system, and consumption of system resources, will be limited.
In banking, certain clients are much more important than others - and these important "whale" clients frequently have the capability to consume significant amounts of a system's capacity and processing power.
This leads to a situation where the sharding pattern naturally works for games, because individual gamers can be efficiently divided into roughly equal piles. In banking, this can be far less applicable, or can require significantly more work to implement in a useful manner.
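As a sketch of why the pattern is such a natural fit, here is the core of a simple hash-based shard assignment (the shard count and player IDs are purely illustrative):

```python
import hashlib

NUM_SHARDS = 16  # illustrative; a real game tunes this per world or realm

def shard_for(player_id: str) -> int:
    """Assign a player to a shard by hashing their ID.

    Because each player consumes roughly similar resources, a uniform hash
    yields roughly equal piles. A banking "whale" client breaks exactly
    this assumption: a single key can dominate an entire shard.
    """
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS
```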
One last comparison between banking and gaming tech: one area where both have seen significant optimisation work is the networking stack. Latency and bandwidth in particular are issues that are highly relevant to both gaming and banking.
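One classic stack-level optimisation that turns up in both trading systems and game netcode is disabling Nagle's algorithm, so that small packets are sent immediately rather than being coalesced. In Python it is a one-line socket option (the endpoint below is a placeholder):

```python
import socket

# Disabling Nagle's algorithm via TCP_NODELAY: small packets are sent
# immediately instead of being coalesced, trading bandwidth efficiency
# for lower latency.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(("example.com", 443))  # placeholder endpoint
```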
Since leaving finance I've become involved with some interesting cloud-based startup projects and seen first-hand some fascinating technologies and practices emerging - some of which seem relevant to the ways in which gaming infrastructure could potentially evolve.
The architectures that we want to build in the cloud should display three main non-functional properties, in addition to basic fitness for purpose (actually performing the tasks required of them):
- Redundancy - the architecture should be able to withstand the loss of any individual server. In more advanced use cases, the loss of an entire data centre (or even a whole IaaS region) should not cause service degradation.
- Recoverability - the system should automatically recover to a good state when a transient outage is over (see the sketch after this list).
- Reproducibility - the system should produce sufficient logging and monitoring data that, after an outage has occurred, the problem can be reproduced, analysed for a root cause and then fixed so it can't recur.
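Recoverability, in particular, is often built out of small, unglamorous pieces. A typical building block is retrying with exponential backoff and jitter, so that dependents converge back to a good state on their own once a transient outage ends - a minimal sketch:

```python
import random
import time

def call_with_recovery(operation, max_attempts=5, base_delay=0.5):
    """Retry an operation with exponential backoff and jitter.

    When a transient outage ends, callers converge back to a good state
    on their own, with no operator intervention required.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # the outage was not transient; escalate instead
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```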
With these capabilities in mind, I sometimes find it helpful to regard the evolution of cloud technology and best practices to date as being composed of two distinct, but overlapping, phases.
The first is the transition from managed hosting to Infrastructure-as-a-Service (IaaS), characterized by the development of services which offer APIs for provisioning and command and control. Without the presence of such interfaces, it's a real stretch to consider a solution "Cloud" in any meaningful sense.
In addition to the availability of provisioning APIs, the other technology that I regard as being typical of the first phase is the capability to relocate virtual instances to different physical hardware in a manner transparent to the user of the virtual instance.
The combination of these two capabilities - provisioning API and transparent relocation - starts to provide some of the potential benefits that the cloud offers. These benefits are usually stated in terms of elasticity of scaling, compute as a commodity to be purchased by the hour and potentially greater reliability.
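As an illustration of what those provisioning APIs look like in practice, here is a sketch using the AWS EC2 API via the boto3 Python library (any IaaS provider's equivalent would do, and the machine image ID is a placeholder):

```python
import boto3  # the AWS SDK for Python; other IaaS providers offer equivalents

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Programmatic provisioning: what once took a purchase order and a trip
# to the data centre is now a single API call.
response = ec2.run_instances(
    ImageId="ami-12345678",  # placeholder machine image
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=4,              # elasticity: ask for more capacity as load demands
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]
print("provisioned:", instance_ids)
```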
The second phase is perhaps best characterized by the phrase "Servers are livestock, not pets". Traditionally, systems administrators hand-built servers to order. In such an environment, it is very difficult (even with scripts and hand-rolled automation) to ensure that two servers are built in precisely the same way.
Worse, even if the servers have been built identically, the problem of verifying this fact remains. Over time this problem only gets worse, as servers are individually upgraded and personally cared for by the sysadmins. If an important server starts to have serious problems it is often nursed back to health like a beloved family pet.
The second age of Cloud really started with the rise of techniques like Continuous Deployment and the DevOps movement. Technologies such as Puppet and Chef allow the automated building of uniform servers from scratch, in a way which emphasizes rebuild and redeploy over extensive manual patching. This is the basis of the approach which tends not to value any individual instance very highly, treating instances effectively as livestock.
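The key idea behind these tools is declarative, idempotent configuration: describe the desired state and converge on it, rather than scripting a sequence of steps. What follows is not actual Puppet or Chef code, just a plain Python sketch of that idea (assuming a Debian-style host):

```python
import subprocess

def ensure_package(name: str) -> None:
    """Converge on a desired state rather than scripting a sequence of steps.

    Check the actual state and act only if it differs from the desired
    state, so the same run is safe on a fresh or an already-built server.
    """
    installed = subprocess.run(
        ["dpkg", "-s", name], capture_output=True
    ).returncode == 0
    if not installed:
        subprocess.run(["apt-get", "install", "-y", name], check=True)

# Running this twice leaves the server in an identical state - the property
# that hand-rolled build scripts struggle to guarantee.
ensure_package("nginx")
```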
Interestingly, the financial industry had long had a need to deploy large numbers of servers and to be insouciant in the event of any particular server dying. Morgan Stanley are one of the very few investment banks to speak relatively openly about aspects of their infrastructure. They are on record as having tens of thousands of Unix servers across over 30 locations as early as 1995 (Gittler, Moore and Rambhaskar, LISA 95) - and this was to grow to hundreds of thousands of machines over time.
However, despite the existence of capable infrastructure technology almost 20 years ago, these techniques did not become widespread until relatively recently for two reasons:
- The technology was purely proprietary and in many cases, rather tightly bound to a specific company's problem domain.
- Few companies really had a need to manage and orchestrate that much infrastructure.
The proprietary technologies that highly capable banks developed did provide a foreshadowing of modern large-scale techniques, and so it is no surprise that when companies such as Google began to appear, they used those banks as a primary source of talent.
The development of open-source configuration management solutions such as Chef and Puppet was to prove key to this second phase of cloud techniques, arriving as it did at a time when more and more companies were discovering the potential opportunities that cheap large-scale compute offers.
Looking into the future, containerization is one obvious next step which is starting to emerge. The idea is to ship a self-contained application deployment unit which prevents Dependency Hell and which is fully functional when deployed onto a basic application host.
The first viable product which enables this is Docker, which makes use of Linux Containers (LXC) to provide isolated application environments running on a union mount filesystem. Docker has ambitious aims, but is still really quite immature and should not be deployed in production by teams who can't cope with some rough edges. However, Docker has a decent-sized (and growing) community and support from several major vendors, including Red Hat and Google.
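To show how little the host needs to know about the application, here is a sketch of a deployment step driving the standard `docker build` and `docker run` commands from Python (the image name and port are hypothetical):

```python
import subprocess

# Build an image from the Dockerfile in the current directory, then run it.
# The application's dependencies travel inside the image, so the host needs
# nothing beyond a running Docker daemon - no Dependency Hell on deploy.
subprocess.run(["docker", "build", "-t", "mygame/shard-server", "."], check=True)
subprocess.run(
    ["docker", "run", "-d", "-p", "7777:7777", "mygame/shard-server"],
    check=True,
)
```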
Docker may yet succeed as the dominant technology in this space - or additional credible competitors may appear (which would be similar to what happened with DevOps and configuration management, as additional tooling choices began to emerge).
However, whatever the competitive landscape ends up looking like, the idea of containerization as a deployment method is compelling. For teams which have adopted it, there are clear benefits in how they think about architecture and application packaging.
Finally, let's turn to the question of how far the deployment of cloud techniques can point the way to more efficient and reliable gaming infrastructure.
There are two main benefits that better game infrastructure could deliver: sharply reduced running costs for game developers, and more reliable infrastructure with less sharding for players.
The economics of the cloud work in favour of game producers because they remove upfront costs - there's no need to build datacentres which might sit idle for a long time if a game fails to take off immediately.
The benefit of this to players is immense. If a major cost component (potentially as much as 10% of the operational cost of running a game) can be reduced and made much more scalable, then this opens up the market for more indie games, more appetite for risk in the AAA space and, hopefully, a wider range of gaming experiences.
The reliability techniques that have long been a part of banking architecture can also play a role here, in preventing downtime and reducing the impact of sharding on the overall gaming experience.
About the Author
Ben Evans is the CEO of jClarity, a Java/JVM performance analysis startup. In his spare time he is one of the leaders of the London Java Community and holds a seat on the Java Community Process Executive Committee. His previous projects include performance testing the Google IPO, financial trading systems, writing award-winning websites for some of the biggest films of the 90s, and others.