Key Takeaways
- Understanding the impact of distributing processes and data helps teams make better MVA decisions without over-investing in capabilities they don’t yet need.
- The cloud does not make distribution problems go away. It can make them harder to solve because it hides the underlying infrastructure.
- Changing the location of data can have subtle impacts on application logic that are hard to spot and harder to debug. Paying special attention to how dates and timestamps are handled is worthwhile as part of the MVA design.
- Microservice architectures are great for modularity but make data aggregation more challenging.
- High volumes of cross-service communication may mean that certain services, and their data, need to be co-located to meet performance and responsiveness requirements.
Working on cloud applications can easily convince developers that the location of resources no longer matters, and as long as all the resources you need are in the cloud, this is largely true.
But throw a mobile application into the picture, especially one relying on data residing in legacy data stores, and all of a sudden the location of resources, including data, matters intensely.
And since mobile applications are increasingly the preferred way people interact with software systems, resource location is something that developers must always keep in mind.
In previous articles in this series, we have introduced the Minimum Viable Architecture (MVA) concept, as well as described how MVA changes how you think about using architectural frameworks, patterns, and tactics.
In this article, we explore patterns and tactics related to distributing computing workloads and related data, while discussing the issues that MVAs have to consider when distribution comes into play (and it almost always does).
One of the most frustrating problems for both users and developers can be summarized in one phrase: “it runs fine on my machine…” Sometimes these issues are related to machine configuration, where the developer has one hardware/software configuration and the user has a different one. These problems are, on the whole, relatively easy to diagnose because configurations are static and mismatches are easily found by comparing the two environments.
More pernicious, and often intermittent, are problems relating to the distribution of either application logic or data. The reason for this is that distribution problems often show up only under higher load factors, so unless the developer can simulate real-world processing loads in their environment, they won’t see the problem. Fiber-optic network speeds can hide many distribution problems for a long time, until the network’s capacity is saturated and an application’s performance becomes more sensitive to the architectural decisions made, often implicitly, by developers. Under these circumstances, it is easy for developers to blame the network instead of their own inattention to distribution issues.
What’s the big deal about distribution?
In our age of fiber-optic networks, and globally distributed cloud-based data and processing, haven’t we moved beyond being worried about where applications run and where data lives? In a word, no. Problems with distribution occur even when data resides in the same data center as the applications that use it; global distribution only exacerbates them. Let’s consider two different ways distribution affects the architecture of the application: distributing application logic (code) and distributing data.
Distributing MVA application logic
Today’s applications are highly portable, meaning they can be relatively easily moved from one computing environment to another, either by using portable languages or by using virtual machines or containers. So why is moving where the code runs an architectural concern?
Even when the application code is portable, and even when containers hide the underlying computing environment, vestiges of the underlying physical machine can still trip up the unwary. One simple example is a problem with timestamps, which are typically set by the application based on settings on the underlying hardware. If one application is running in Asia and another in North America, the Asian application can create a timestamp for a date and time that is in the future from the perspective of the North American application, since the two applications sit on opposite sides of the International Date Line. This can lead to faults and failures that crash the entire application, or produce strange results in time-dependent calculations such as interest on overnight bank funds. A similar problem can occur when timestamps are set by using database server DATE functions and those servers are located in time zones different from the application's, since the date of record will be determined by the server's location.
The problem is even harder to spot if the applications are composed of microservices that can be load-balanced to any available server, anywhere in the world. In this case, the time zone used in timestamps is not easily predictable.
One solution to this is to use a globally agreed time that is the same for everyone no matter the location (as mariners do by using Coordinated Universal Time (UTC) in celestial navigation). Deciding to create and use such a service is an important architectural decision. Using UTC does not solve all problems related to establishing common date/time references, but it’s a good start. Some of the remaining concerns include whether date/time stamps actually need a time component (not all applications need one and, for some, including it is confusing), how dates and times should be displayed on screens and reports (should it be the local date/time or the UTC date/time?), and so forth. You’ll still have important issues to resolve, but at least you can make these decisions based on a common understanding of the date/time of record.
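To make this concrete, here is a minimal Python sketch of that approach: record timestamps in UTC and convert to local time only at the presentation layer. The standard-library calls are real, but the storage and display conventions shown are illustrative assumptions rather than the only way to apply the decision.

```python
from datetime import datetime, timezone

# Naive timestamps depend on the host's clock and time zone, so two
# services on opposite sides of the International Date Line can
# disagree about "today".
local_stamp = datetime.now()            # naive: no time zone attached

# Recording in UTC gives every service the same date/time of record.
utc_stamp = datetime.now(timezone.utc)  # aware: explicitly UTC

print(local_stamp.isoformat())          # no offset: meaning depends on the host
print(utc_stamp.isoformat())            # e.g. 2024-01-01T03:00:00+00:00

# Convert to the viewer's local zone only at the presentation layer:
display_stamp = utc_stamp.astimezone()  # host's local zone, for screens/reports
print(display_stamp.isoformat())
```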
A more subtle problem results from inter-service communication. When all services run in the same physical environment, the “cost” of inter-service communication in terms of elapsed time is very low; in other words, communication latency is low. If those services are moved so that they are now no longer on the same machine, but quite possibly not even in the same part of the world, communication latency can jump unpredictably as the service call may have to traverse networks, bridges, and routers, each of which adds to the round-trip time. As with timestamps, load balancers can exacerbate the problem when they try to balance compute load and inadvertently increase communication latency.
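The arithmetic behind this is worth making explicit. The sketch below uses assumed per-hop latencies (placeholders, not measurements) to show how a request that chains a few synchronous service calls accumulates network delay as those services move farther apart:

```python
# Hypothetical figures: a back-of-the-envelope latency budget for a
# request that passes through three synchronous service calls in sequence.
same_host_ms = 0.5      # assumed in-process / same-machine call
same_dc_ms = 2.0        # assumed same-data-center network hop
cross_region_ms = 80.0  # assumed intercontinental round trip

calls_per_request = 3

for label, per_call in [("same host", same_host_ms),
                        ("same data center", same_dc_ms),
                        ("cross-region", cross_region_ms)]:
    total = per_call * calls_per_request
    print(f"{label}: {total:.1f} ms of pure network latency per request")
```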
Distributing MVA Data
As noted in the Software Architecture and Design InfoQ Trends Report—April 2022:
Data plus architecture is the idea that, more frequently, software architecture is adapting to consider data. [...] We’re seeing a transition from data being considered only at the storage or transport layers in a system to data being a defining element of the system.
The location of data is a key consideration of most MVAs; distributing application logic while keeping data centralized, perhaps because most of the data required for the MVP is currently located in a centralized legacy data store, will most likely create latency and throughput problems such that the system may struggle to meet some of its Quality Attribute Requirements (QARs), such as performance or scalability.
As we explained in a previous article, a team's architectural decisions as it develops the MVA focus on how the product will satisfy its QARs. Persistence gives rise to many of the most important QARs, specifically those related to how the product stores and retrieves data.
To meet these QARs, a team will have to make decisions about the characteristics of the data, including its structure or lack thereof. They also have to select appropriate data storage technologies (e.g. SQL DBMS, NoSQL DBMS, etc.). These decisions almost always involve decisions about where the data will be located, at least relative to the location of application code (e.g. on the same server, on a different server in the same data center, in different data centers, including those in different time zones, or in a commercial cloud, and therefore without a fixed or known location).
In many ways, distributing data can be imagined to be the same as distributing processing, with an important difference: the messages returned by remote service calls that return data are potentially large enough to deserve special consideration. For example, consider an application that queries a database on a remote server. Queries that return many rows of data that are further analyzed in the application may need to be reconsidered; passing a lot of data over a network, no matter how fast, is inefficient. A better approach is to use appropriately designed views, stored procedures, or remote services located on the same machine as the database to do as much processing as possible at the same location as the data, so as to reduce the resulting network traffic. Doing so will reduce latency and unnecessary processing of information, greatly improving performance.
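As a minimal sketch of this idea, using an in-memory SQLite table with an invented schema, compare pulling raw rows into the application against letting the database aggregate and ship only the summary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 25.0), (2, 7.5);
""")

# Anti-pattern: pull every row across the network, then aggregate in the app.
rows = conn.execute("SELECT customer_id, total FROM orders").fetchall()
totals = {}
for customer_id, total in rows:
    totals[customer_id] = totals.get(customer_id, 0.0) + total

# Better: let the database aggregate, shipping only the summarized result.
summarized = conn.execute(
    "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id"
).fetchall()

print(totals)      # computed in the application, after a large transfer
print(summarized)  # computed next to the data, small result set
```

With a local in-memory database the difference is invisible, but when the database is remote, the first approach transfers every row while the second transfers one row per customer.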
Eliminating unnecessary data transfers can also benefit the environment: removing needless processing can greatly reduce an application's carbon footprint. The principles of green software engineering are worth considering; some of them address the location of your data and processing, not simply to make applications more efficient, which reduces their carbon impact, but also in terms of how green the data centers you use are.
In some cases, the choice of where the data lives may not be open to the team; some data may already exist in a legacy data store, but they will still have choices to make for the new data they need to create and store. And they will have to address concerns associated with data access latency across these different sources as they need to aggregate new and legacy data when answering inquiries, providing analytics, and preparing reports.
A microservice-based architecture can also create some data-related concerns. In the simplest view, each microservice has its own data store. If the microservice, its clients, and its data are all distributed, performance may suffer due to constraints such as network latency and bandwidth.
Consider the simple join operation in SQL, which typically takes place on a single server and returns a set of data from one or more entities (tables). If those entities are instead owned by separate microservices, the equivalent of that single-server join means iterating over multiple microservices to pull all the related data together. If those microservices are distributed, call overhead and latency will be substantially worse than in the SQL database case. The “one datastore per microservice” approach is great from a decoupling standpoint, but unfortunately, it loses some advantages of relational databases, which make aggregating data a relatively simple task. As with many architectural decisions, a trade-off between loose coupling, performance, and integrability is needed.
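A small sketch illustrates the shape of the problem. The service stubs below are hypothetical stand-ins for remote calls to an orders service and a catalog service; in a real deployment, each call in the loop would be a full network round trip:

```python
# Hypothetical client stubs for two microservices, each owning its own store.
def fetch_orders(customer_id):  # assumed Orders service endpoint
    return [{"order_id": 1, "product_id": 42},
            {"order_id": 2, "product_id": 43}]

def fetch_product(product_id):  # assumed Catalog service endpoint
    return {"product_id": product_id, "name": f"widget-{product_id}"}

# The equivalent of a single-server SQL join becomes an N+1 fan-out:
# one call for the orders, then one call per order for product details,
# each adding network overhead and latency.
def orders_with_products(customer_id):
    enriched = []
    for order in fetch_orders(customer_id):
        product = fetch_product(order["product_id"])  # remote call per row
        enriched.append({**order, "product_name": product["name"]})
    return enriched

print(orders_with_products(customer_id=7))
```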
For example, it may make sense to group microservices with similar responsibilities within the same bounded domain, and assign ownership of a data store to each group of microservices (sometimes referred to as a “component”), in order to alleviate performance and data integration issues. In addition, leveraging a data mesh approach that treats data as a ready-to-use, reliable product is an effective way to organize data by bounded domains to ensure that data and associated processing are distributed in the same fashion.
Another alternative might be to use separate databases for all cross-service reporting and to use the data stores owned by the services only for the transactional workload. This approach involves capturing data first and deciding how to analyze it later, and is sometimes referred to as “schema on read vs. schema on write”. The reporting databases can be updated asynchronously and at low priority if close-of-business reporting, rather than real-time reporting, is acceptable to the system stakeholders. This would be suitable for software systems that do not require real-time analytics, such as commercial insurance systems, but not for securities trading systems or bank cash desk support applications.
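As a toy illustration of this pattern (all names and the in-memory “stores” are invented for the sketch), transactional writes can complete immediately while a background worker replicates events into a separate reporting store at its own pace:

```python
import queue
import threading

# Toy stand-ins for the service-owned transactional store and the
# separate reporting database; a real system would use actual databases.
transactional_store = []
reporting_store = []
events = queue.Queue()

def write_transaction(record):
    transactional_store.append(record)  # synchronous, real-time path
    events.put(record)                  # queued for the reporting path

def reporting_worker():
    while True:
        record = events.get()
        reporting_store.append(record)  # could batch until close of business
        events.task_done()

threading.Thread(target=reporting_worker, daemon=True).start()

write_transaction({"account": "A-1", "amount": 100})
events.join()  # in this demo only: wait for replication to finish
print(reporting_store)
```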
Regardless of the design and distribution of the MVA data stores, you should try to locate the processing as close as possible to the location of the data. For similar reasons, data that tend to be accessed at the same time should be co-located to avoid network traffic and latency overhead.
For example, if you use several serverless functions hosted in a commercial cloud as part of your mobile application’s MVA, you may be challenged to meet performance QARs. Serverless functions that need to frequently access data stored in your company’s data center would require a network connection between that data center and the one hosting the serverless functions fast enough to give the mobile user a quick response, which is highly unlikely. It would be more efficient to either move the serverless functions in-house or move the data to the commercial cloud.
What distribution decisions should the MVA consider?
The concerns we have sketched can be expressed in terms of key questions that a team should answer when considering its MVA:
- Can applications or services be relocated, or must they run in a particular environment?
- Can data be relocated dynamically, or does it have to reside in a particular data store location? Some countries, for example, impose legal requirements dictating that data about their citizens cannot be stored outside their borders. Or there may be technical reasons why data cannot be relocated, such as application co-residency requirements. Constraints like these mean that you may be forced to accept a greater degree of distribution than you would ideally choose.
- Do certain services or applications have to be co-resident with other services, or with particular data stores?
- If load balancers automatically move data or processes, will QARs be affected? Generally speaking, how load balancers work often affects how the application will meet QARs, and so decisions related to load balancing tend to be architecturally significant.
This is not an exhaustive list; a team can form its own questions by considering how data and application processes interact, and how these interactions may affect the system’s ability to meet its QARs.
Conclusion
It’s easy to assume that adopting cloud technologies removes the distribution of processes and data from the list of concerns a team needs to worry about, but in some ways, it makes the problems harder because it’s more difficult to see what is really happening in the cloud. The cloud lulls the team into thinking that computing resources form one vast homogeneous pool, but in reality, there is always physical hardware and software running just below the surface, like hidden shoals that the team must navigate around. Considering data and process distribution issues will help them find their way.