In his talk at QCon, Mark Nottingham, a “Principal Technical Yahoo!”, provided some insight into how the Yahoo! Media Group uses the Web, and not Web services, to build its SOA variant.
As can be expected from a Web company of Yahoo!’s size, the numbers are impressive: there are about 4 billion daily page views, up from about 65 million in 1996. Although Yahoo! presents a unified appearance to the outside, its internal environment is diverse. Integrating the different “properties” (as Yahoo! calls its offerings) can quickly become a nightmare: acquisitions, integration with partners, and integration with older pieces of Yahoo! infrastructure all pose significant challenges. This problem is intensified by Yahoo!’s own Slashdot effect: a link on the Yahoo! home page will trigger an enormous load on any of the existing applications.
The initial architecture at Y! Media Group for most of the properties consisted of independent frontend boxes, each including a database, with a master database in the backend. When one of the properties needed to be expanded to meet higher demand, this led to a number of problems, because “large datasets don’t push well”: if the News property had to be extended with 50 machines, all of them had to be initialized with the appropriate content from the backend master database. Adding to these problems were issues of synchronization after a failure, the fact that more and more content is generated by users, and the need for “cross-property integration”. Mark gave Yahoo! Tech as an example, which integrates products with answers from Yahoo! Answers; another new property burdened with this problem is Yahoo! Pipes. In the old architecture, the need for cross-property access was met by one frontend box requesting data from another frontend box, intensifying the problems.
The requirements for an improved architecture were thus pretty obvious: massive scalability, flexible deployment, high dynamism, and separation of concerns. As a result, Y! Media Group decided to move towards a service-oriented architecture. Because “Webby” solutions such as PHP have always been more prevalent at Yahoo! than “enterprise” technologies such as Java, and because it was felt that scalability, simplicity, reuse, and interoperability were better addressed this way, the decision was made to use a REST/HTTP-based solution instead of one relying on Web services and the WS-* stack.
Instead of replicating data between a backend master database and the frontend databases, the frontend boxes now issue requests through a cache to backend API servers, all via HTTP. Because of this, there is now a single source of truth. The cache replicates the data once it has been requested: a pull model instead of a push model. Questioned whether this is a RESTful API, Mark stressed that he views issues around REST as a philosophical discussion, but conceded that the backend APIs are, in fact, RESTful. (He has expressed this view before in a blog entry called “REST issues: Real and Imagined”.) User-generated content is pushed through to the backend, and adding capacity becomes easy.
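To make the pull model concrete, here is a minimal sketch in Python; the endpoint names and the JSON shape are invented for illustration, since Yahoo!’s internal topology is not public. The point is that a frontend box starts cold and fetches content over HTTP on demand, routed through a caching intermediary, instead of being seeded with a full copy of the dataset:

```python
import json
import urllib.request

# Hypothetical endpoints -- illustrative only, not Yahoo!'s actual setup.
CACHE_PROXY = "cache.internal:3128"                 # caching intermediary (e.g. Squid)
BACKEND_API = "http://news-api.internal/stories/"   # backend API server

def fetch_story(story_id: str) -> dict:
    """Pull a story from the backend on demand, routed through the cache.

    In the old push model, 50 new frontend boxes each had to be seeded
    with the full dataset; here the cache fills itself with whatever is
    actually requested.
    """
    request = urllib.request.Request(BACKEND_API + story_id)
    # Route the request through the caching intermediary.
    request.set_proxy(CACHE_PROXY, "http")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```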
As one example of simply using HTTP correctly, instead of getting into a philosophical REST discussion, Mark gave caching intermediaries. The caching features built into HTTP are quite advanced, and they become immediately usable for well-designed HTTP applications. Examples of advantages are freshness (because the data is pulled from the backend whenever it needs to be) and validation (asking “has this changed?” is a quick HTTP-based question to the backend). It is also possible to provide “recalculated” results, which are validated against the ETag of the calculation input. Having a standards-based cache also enables the collection of metrics and load balancing. (For a great introduction to HTTP caching, see Mark’s own Caching Tutorial for Web Authors and Webmasters.)
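A minimal sketch of what validation looks like on the wire, using only the Python standard library (the resource and its content are made up for illustration): the server tags the representation with an ETag and answers a conditional request with 304 Not Modified when nothing has changed, so the “has this changed?” round trip carries no body at all.

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical resource; its ETag is derived from the content itself.
CONTENT = b'{"headline": "example story"}'

class ValidatingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        etag = '"%s"' % hashlib.sha1(CONTENT).hexdigest()
        if self.headers.get("If-None-Match") == etag:
            # Validation: the cache's copy is still good, send no body.
            self.send_response(304)
            self.send_header("ETag", etag)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("ETag", etag)
        self.send_header("Cache-Control", "max-age=60")  # freshness: one minute
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(CONTENT)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ValidatingHandler).serve_forever()
```

Any standards-based cache sitting in front of such a server can serve the response for a minute without contacting the backend at all, and afterwards revalidate it with a cheap conditional GET.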
Mark also commented on some more advanced techniques used at Yahoo! Media Group. Multi-GB memory caches are not at all uncommon, and sometimes they are put into groups that are kept in sync via cache peering, i.e. the synchronization of more than one cache in a group. (There are numerous common cache peering protocols, such as ICP.) Another advanced concept is negative caching: if the API server returns an error, the cache will cache the error, reducing the load on the backend. Collapsed forwarding means that multiple requests from the frontend can be collapsed into a single one, which according to Mark is another great way to mitigate traffic overload from the frontend. While the cache is refreshing something from the backend, it can return a stale copy, a concept called stale-while-revalidate; a sketch of this behaviour follows below. Similarly, stale-if-error means that if there is a problem on the backend box, the cache can serve a stale copy, too. Another concept is an invalidation channel, an out-of-band mechanism to tell the cache that something has become stale.
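A toy sketch of the stale-serving behaviour may help; the class, TTL, and structure here are invented for illustration, and a real deployment would get these behaviours from the caching intermediary itself rather than from application code:

```python
import threading
import time

class TinyCache:
    """Toy illustration of stale-while-revalidate and stale-if-error.

    `fetch` stands in for the (possibly slow or failing) call to the
    backend API server.
    """

    def __init__(self, fetch, ttl=60.0):
        self.fetch = fetch
        self.ttl = ttl
        self.entries = {}        # key -> (value, fetched_at)
        self.in_flight = set()   # keys already being refreshed
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            entry = self.entries.get(key)
            if entry is not None:
                value, fetched_at = entry
                if time.time() - fetched_at < self.ttl:
                    return value  # fresh hit
                # stale-while-revalidate: serve the stale copy right away
                # and refresh in the background, starting at most one
                # refresh per key (a simple form of collapsed forwarding).
                if key not in self.in_flight:
                    self.in_flight.add(key)
                    threading.Thread(target=self._refresh, args=(key,)).start()
                return value
        # Cold miss: nothing to serve, so the caller waits for the backend.
        return self._refresh(key)

    def _refresh(self, key):
        try:
            value = self.fetch(key)
            with self.lock:
                self.entries[key] = (value, time.time())
            return value
        except Exception:
            with self.lock:
                entry = self.entries.get(key)
            if entry is not None:
                # stale-if-error: the backend is failing, serve the stale copy.
                return entry[0]
            raise
        finally:
            with self.lock:
                self.in_flight.discard(key)
```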
Currently, Yahoo! uses Squid, but Mark expressed his belief that one of the strengths of his approach is that caching is a commodity: Squid could easily be replaced by something else.
Mark also warned about some pitfalls. He questioned the merit of the “REST vs. WS-* wars” and mentioned that he prefers to focus on applying Web technologies in practice instead of talking about them in theory. Also interesting was his assertion that REST and HTTP are human-intuitive, but not programmer-intuitive: he finds it much harder to explain REST to programmers than to “normal” human beings. He also noted that there are different deployment and operational concerns, since people know how to handle single applications, but that knowledge is not directly transferable to such a large-scale deployment. According to Mark, formats are hard even when applying REST and HTTP, just as in the WS-* world. He also highlighted the risk of format/interface proliferation (choice quote: “if you give developers a new protocol construction toolkit, they’ll build protocols”), the problems with authentication (“HTTP authentication mechanisms are unbelievably primitive”), and mentioned that in his opinion, tools such as intermediaries still have a way to go, since they are optimized for the browsing case, not the service case.
He finished his talk by describing what he believes is needed: tools, a web-friendly description language (such as WADL), a data-oriented schema language (instead of something that describes markup), a significant investment in the Atom stack (according to him, Atom/RSS can be used in 80% of the cases to mitigate interface and format proliferation), and a standardized HTTP test suite.