Simon Eskildsen, senior product engineering lead at Shopify, gave an overview at GOTO Copenhagen 2017 of how Shopify is architected to support large sales. This included their OpenResty-configured NGINX instances, shop and pod isolation architecture, failover strategies and more.
Shopify powers a huge amount of online commerce, supporting over 500K merchants, explains Eskildsen. One of their main requirements is to support flash sales for their customers, many of whom are online-only shops. Whilst driving the sales traffic itself can be straightforward, the main challenge is creating an architecture that can cope with the spikes.
Eskildsen explained that one of Shopify's first optimisation challenges is at the DNS layer. Whilst it’s easy to optimise traffic at the DNS layer with a single domain, with multiple domains it’s not possible. This is the case with Shopify, where multiple customer domains all point to their IP. Instead, they made use of IP anycast, in which the same IP address is announced from several locations and routes are propagated by a “gossip algorithm where each ISP is telling each of its neighboring ISP which IPs it knows how to route”. Essentially, this leads to traffic being routed to the closest location announcing that IP.
Shopify also makes heavy use of OpenResty, a tool which allows an NGINX load balancer to be scripted with Lua to do whatever is required. Eskildsen emphasized that in the case of Shopify, OpenResty has been extremely powerful, and believes in general that it is underused within the industry. He listed some of their modules as:
- Rule banner: detects bots by looking at patterns such as irregular refreshes and suspicious IP addresses, and then bans them. The goal is to shut down secondary markets.
- Edge cache: a caching optimisation which allows serving content from cache at the load balancer tier rather than the application tier.
- Checkout throttle: throttles writes for some merchants in exceptional load circumstances by queuing them, preventing a shop from becoming a noisy neighbor to others on the same shard.
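The queuing behaviour of the checkout throttle can be illustrated with a minimal Python sketch. This is not Shopify's implementation (theirs runs as Lua inside OpenResty); the class name `CheckoutThrottle`, the per-tick limit and the method names are all assumptions made for illustration. Each shop may have a fixed number of checkout writes admitted per time window, and anything beyond that waits in that shop's own FIFO queue, so a busy shop does not affect its neighbours.

```python
from collections import defaultdict, deque


class CheckoutThrottle:
    """Hypothetical per-shop write throttle (illustrative only).

    Admits at most `limit_per_tick` checkouts per shop in each window
    and queues the rest in arrival order.
    """

    def __init__(self, limit_per_tick):
        self.limit = limit_per_tick
        self.admitted = defaultdict(int)    # shop_id -> admitted this window
        self.waiting = defaultdict(deque)   # shop_id -> queued request ids

    def submit(self, shop_id, request_id):
        # Admit only if the shop is under its limit and nobody is
        # already waiting, so queued requests keep their place in line.
        if self.admitted[shop_id] < self.limit and not self.waiting[shop_id]:
            self.admitted[shop_id] += 1
            return "admitted"
        self.waiting[shop_id].append(request_id)
        return "queued"

    def tick(self):
        """Start a new window and release queued requests, oldest first."""
        released = []
        self.admitted.clear()
        for shop_id, queue in self.waiting.items():
            while queue and self.admitted[shop_id] < self.limit:
                released.append((shop_id, queue.popleft()))
                self.admitted[shop_id] += 1
        return released
```

With a limit of two, a third checkout for the same shop in one window is queued while a different shop is still admitted immediately; calling `tick()` then releases the queued request.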
At the application or data tier, Eskildsen also introduced the concept of a pod, a fully isolated Shopify instance which can contain many shops. Inter-pod and inter-shop communication is not possible by design, as isolation between customers is key: “Shop isolation principle - all shops must be isolated from each other”.
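As a rough illustration of the pod idea, the sketch below routes each request to exactly one pod based on the shop's domain, and each pod holds only its own data. The names `Pod`, `PodRouter` and the example domains are hypothetical, and the in-memory dictionary merely stands in for a pod's stateful services.

```python
class Pod:
    """Hypothetical pod: an isolated unit with its own data store."""

    def __init__(self, pod_id):
        self.pod_id = pod_id
        self.database = {}  # stand-in for the pod's own stateful services


class PodRouter:
    """Hypothetical lookup from shop domain to the single pod that owns it."""

    def __init__(self):
        self.pods = {}          # pod_id -> Pod
        self.shop_to_pod = {}   # shop domain -> pod_id

    def add_pod(self, pod_id):
        self.pods[pod_id] = Pod(pod_id)

    def assign(self, domain, pod_id):
        if pod_id not in self.pods:
            raise KeyError(f"unknown pod {pod_id}")
        self.shop_to_pod[domain] = pod_id

    def route(self, domain):
        # A request only ever sees the pod its shop is assigned to.
        return self.pods[self.shop_to_pod[domain]]


router = PodRouter()
router.add_pod(1)
router.add_pod(2)
router.assign("shop-a.example", 1)
router.assign("shop-b.example", 2)
router.route("shop-a.example").database["order:1"] = "paid"
```

Because the router resolves a domain to one pod and nothing else, a write made on behalf of `shop-a.example` is never visible when handling `shop-b.example`.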
Whilst each pod contains its own stateful services (such as databases), stateless services are shared. Eskildsen explains that the main reason for this is that auto-scaling introduces too much of a bottleneck; traffic spikes would come faster than new infrastructure can be provisioned.
Eskildsen also gave an overview of their pod load balancer, which moves shops between pods to make sure that load is evenly distributed. To do this quickly and without loss of data consistency, they make use of the MySQL binary log (binlog) to stream database modification events from the old instance to the new one.
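The general pattern of copying a snapshot and then replaying logged changes can be sketched as follows. This is a simplified model, not Shopify's tooling: the binlog is represented by a plain Python list of `(shop_id, key, value)` tuples, and `replay` and `move_shop` are hypothetical helper names. A real migration would read the MySQL binary log and would also need to handle deletes, cutover and verification.

```python
def replay(binlog, start, shop_id, target_rows):
    """Apply every logged change for `shop_id` from position `start` onward.

    Returns the position to resume from on the next call.
    """
    for event_shop, key, value in binlog[start:]:
        if event_shop == shop_id:
            target_rows[key] = value
    return len(binlog)


def move_shop(shop_id, snapshot, snapshot_pos, binlog):
    """Build the shop's rows on the new pod.

    `snapshot` is the shop's data as copied at log position `snapshot_pos`;
    changes logged after that point are replayed on top of the copy.
    """
    target_rows = dict(snapshot)
    position = replay(binlog, snapshot_pos, shop_id, target_rows)
    return target_rows, position


# Stand-in log: two events exist when the snapshot is taken (position 2),
# and three more arrive while the bulk copy is in progress.
binlog = [
    ("shop1", "order:1", "paid"),
    ("shop2", "order:9", "new"),
    ("shop1", "order:2", "new"),
    ("shop2", "order:9", "paid"),
    ("shop1", "order:1", "refunded"),
]
rows, position = move_shop("shop1", {"order:1": "paid"}, 2, binlog)
```

Writes that land during the copy are not lost, because they are picked up from the log afterwards; events for other shops are skipped, so only the shop being moved is affected.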
For cross-region migration, Shopify also has a component called pod mover, which moves pods between regions with minimum downtime. This is invoked if a region is not functioning, but Eskildsen explains that the end goal is to be able to issue a Slack command in order to trigger this type of a failover whenever they want.
The full talk is available to watch online, with a more detailed exploration of the architecture.