Max Indelicato, a Software Development Director and former Chief Software Architect, has written a post on how to design a web application for scalability. He suggests choosing the right deployment and storage solutions, designing a scalable data storage and schema, and using abstraction layers.
The Right Tool for the Job
Indelicato’s first piece of advice is to choose “the right tool for the job” by selecting one of the following architectural solutions:
- Using a cloud deployment solution
- Using a scalable data storage solution such as MongoDB, CouchDB, Cassandra, or Redis
- Adding a caching layer like Memcached
This reporter considers that none of these solutions is mandatory from the start of the application, but it is wise to choose a scalable data storage solution from the beginning to avoid a switch later on. Deploying to the cloud brings some advantages, especially for startups that cannot accurately predict the usage of their application after its launch, because it allows the application to scale gracefully if the need arises. Many software architects have told the story of how their application had to grow and how introducing a caching layer solved a good part of the problem. A caching layer, however, does not necessarily have to be considered in the design phase. It can be added later without much difficulty.
A Scalable Data Storage
Indelicato continues by suggesting a data store that supports partitioning and replication and is elastic, such as one of the following: MongoDB, Cassandra, Redis, Tokyo Cabinet, Project Voldemort, or MySQL for a relational DB. This is desirable because partitioning is likely to become necessary at some point in the life of the application. Replication is needed not for scalability reasons but for “ensuring a high level of availability”. Elasticity is useful for quickly adding more nodes when peak traffic is encountered, but also when “maintenance is required on a node as a result of a hardware failure or upgrade, a large scale schema change, or any number of reasons that a node might require downtime.”
A Scalable Data Schema
Indelicato suggests creating a schema that easily allows data sharding, giving as an example two simple components, a User and a UserFeedEntry:
Collection (or Table, or Entries, etc) User {
    UserId : guid, unique, key
    Username : string
    PasswordHash : string
    LastModified : timestamp
    Created : timestamp
}
Collection (or Table, or Entries, etc) UserFeedEntry {
    UserFeedEntryId : guid, unique, key
    UserId : guid, unique, foreign key
    Body : string
    LastModified : timestamp
    Created : timestamp
}
He then suggests partitioning on UserId:
By partitioning on the UserId field, in both the User Collection and the UserFeedEntry Collection, we’ll be clumping the two related data chunks together on the same node. All UserFeedEntry entries with a UserId of xxx-xxx-xxx-xxx will be contained on the same shard as the User entry with a UserId value of xxx-xxx-xxx-xxx.
Why is this scalable? Because our requirement for this application is perfect for this distribution of data. As each visitor visits a User’s profile page, a request will be made to a single shard to retrieve a User to display that user’s details and then a second request will be made to that same shard to retrieve that user’s UserFeedEntries. Two requests, one for a single row and another for a number of rows all contained on the same shard. Assuming most user’s profile gets hit about the same amount throughout the day, we’ve designed a scalable schema that supports our web application’s requirements.
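The co-location idea quoted above can be illustrated with a minimal sketch. It assumes a simple hash-modulo routing scheme and an in-memory stand-in for the shards; the names `shard_for`, `ShardedStore`, and `NUM_SHARDS` are hypothetical and do not come from Indelicato's post, and a real data store would handle routing internally.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a UserId to a shard index with a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """In-memory stand-in for a partitioned data store."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.num_shards = num_shards
        # Each shard holds both collections, keyed by UserId.
        self.shards = [
            {"User": {}, "UserFeedEntry": defaultdict(list)}
            for _ in range(num_shards)
        ]

    def _shard(self, user_id: str) -> dict:
        return self.shards[shard_for(user_id, self.num_shards)]

    def put_user(self, user: dict) -> None:
        self._shard(user["UserId"])["User"][user["UserId"]] = user

    def put_feed_entry(self, entry: dict) -> None:
        # Routed by UserId, not UserFeedEntryId, so it lands next to its User.
        self._shard(entry["UserId"])["UserFeedEntry"][entry["UserId"]].append(entry)

    def fetch_profile(self, user_id: str):
        """Two lookups, both served by the same shard."""
        index = shard_for(user_id, self.num_shards)
        shard = self.shards[index]
        user = shard["User"].get(user_id)
        entries = list(shard["UserFeedEntry"].get(user_id, []))
        return index, user, entries
```

Because both collections are routed by the same key, rendering a profile page touches only one shard, which is the property the quoted passage relies on.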
Using Abstraction Layers
Indelicato’s last suggestion is to use the following abstraction layers, among others: Repository, Caching, and Service. When creating the Repository layer, he recommends the following:
- Don’t name methods in a manner particular to the data storage you’re abstracting. For example, if you’re abstracting relational storage, its common to see Select(), Insert(), Delete(), Update() functions defined for performing SQL queries and commands. Don’t do this. Instead, name your functions something less specific like Fetch(), Put(), Delete(), and Replace(). This will ensure you are more closely following the Repository Patterns intent and make life easier if you need to switch out the underlying storage.
- Use Interfaces (or abstract classes, etc) if possible. Pass these interfaces into higher layers of the application so that you’re never actually directly referencing a specific concrete implementation of the Repository. This is great for building for unit testability too because you can write alternate concrete implementations that are pre-filled with data for test cases.
- Wrap all of the storage specific code in a class (or module, etc) that the actual Repositories reference or inherit from. Only put the necessary specifics of an accessor function in each function (query text, etc).
- Always remember that not all Repositories need to abstract the same data storage solution. You can always have the Users stored in MySQL and the UserFeedEntries stored in MongoDB if you wish, and Repositories should be implemented in such a way that they support needing to do this down the road without much overhead. The previous three points indirectly help with this as well.
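The Repository recommendations above might look roughly like the following sketch. It assumes an abstract base class as the interface and an in-memory implementation of the kind suggested for unit tests; `UserRepository`, `InMemoryUserRepository`, and `ProfileService` are hypothetical names chosen for illustration, with method names following the storage-neutral Fetch/Put/Replace/Delete naming from the post.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class UserRepository(ABC):
    """Storage-neutral interface: no Select/Insert/Update naming."""

    @abstractmethod
    def fetch(self, user_id: str) -> Optional[dict]: ...

    @abstractmethod
    def put(self, user: dict) -> None: ...

    @abstractmethod
    def replace(self, user: dict) -> None: ...

    @abstractmethod
    def delete(self, user_id: str) -> None: ...


class InMemoryUserRepository(UserRepository):
    """Alternate concrete implementation, handy for pre-filled test data."""

    def __init__(self) -> None:
        self._users: Dict[str, dict] = {}

    def fetch(self, user_id: str) -> Optional[dict]:
        return self._users.get(user_id)

    def put(self, user: dict) -> None:
        self._users[user["UserId"]] = user

    def replace(self, user: dict) -> None:
        # Replace only succeeds for an existing entry.
        if user["UserId"] not in self._users:
            raise KeyError(user["UserId"])
        self._users[user["UserId"]] = user

    def delete(self, user_id: str) -> None:
        self._users.pop(user_id, None)


class ProfileService:
    """Higher layer: depends on the interface, never on a concrete store."""

    def __init__(self, users: UserRepository) -> None:
        self._users = users

    def display_name(self, user_id: str) -> Optional[str]:
        user = self._users.fetch(user_id)
        return user["Username"] if user else None
```

A MySQL-backed or MongoDB-backed class implementing the same interface could be passed to `ProfileService` without changing it, which is the switch the post has in mind.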
For the Caching layer Indelicato says that he often starts with a “simple page (or View, etc) level caching or Service Layer caching as these are two areas where it’s not uncommon to see state change infrequently.”
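As an illustration of Service Layer caching, here is a minimal sketch that assumes a time-to-live policy, which the post does not specify. `CachedProfileService`, `load_profile`, and `ttl_seconds` are hypothetical names; in practice the cache would more likely live in something like Memcached than in a local dictionary.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedProfileService:
    """Wraps a slow profile loader with a simple time-to-live cache."""

    def __init__(
        self,
        load_profile: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_profile
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def get_profile(self, user_id: str) -> Any:
        now = self._clock()
        hit = self._cache.get(user_id)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the underlying load
        value = self._load(user_id)
        self._cache[user_id] = (now + self._ttl, value)
        return value

    def invalidate(self, user_id: str) -> None:
        """Drop a cached entry, e.g. after the user's data changes."""
        self._cache.pop(user_id, None)
```

This fits data whose state changes infrequently, as the quote describes; data that changes often would need explicit invalidation or a much shorter lifetime.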
Indelicato considers that a Service layer needs to have enough abstraction so one can easily switch the internal implementation of the service with an out-of-process one when the need arises.
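One way to picture this kind of Service layer abstraction is sketched below. It assumes a JSON request/response format and an injected `transport` callable standing in for whatever out-of-process mechanism is used (HTTP, a message queue, and so on); all class and function names here are hypothetical and are not taken from the post.

```python
import json
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class FeedService(ABC):
    """What callers depend on, regardless of where the service runs."""

    @abstractmethod
    def entries_for(self, user_id: str) -> List[str]: ...


class InProcessFeedService(FeedService):
    """Initial implementation: runs inside the web application."""

    def __init__(self, entries_by_user: Dict[str, List[str]]) -> None:
        self._entries = entries_by_user

    def entries_for(self, user_id: str) -> List[str]:
        return list(self._entries.get(user_id, []))


class RemoteFeedService(FeedService):
    """Later implementation: forwards the call to another process.

    `transport` is a hypothetical callable that sends a JSON request
    string and returns a JSON response string.
    """

    def __init__(self, transport: Callable[[str], str]) -> None:
        self._transport = transport

    def entries_for(self, user_id: str) -> List[str]:
        request = json.dumps({"method": "entries_for", "user_id": user_id})
        response = json.loads(self._transport(request))
        return list(response["entries"])


def render_feed(service: FeedService, user_id: str) -> str:
    """Caller code stays the same whichever implementation is passed in."""
    return "\n".join(service.entries_for(user_id))
```

Since `render_feed` sees only the `FeedService` interface, moving the implementation out of process amounts to constructing a different object at startup.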
Some consider that an application can be built without worrying about scalability issues, because those can be addressed when necessary. But if one is to consider scalability from the beginning, what other suggestions can be added?