
How Alibaba Catered To $3 Billion Sales In A Day

Chinese e-commerce giant Alibaba recently sold $3 billion worth of products in a single 24-hour period. InfoQ had a chance to ask Zhuang Zhuoran and Youtan, architects at Tmall and Taobao, about the challenges of handling such loads and how they meet them.

Tmall, China’s leading B2C e-commerce site, and Taobao, the largest C2C online shopping platform in China, are both subsidiaries of Alibaba Group and together have more than 500 million registered users. This year marked the fourth consecutive year of the Taobao “Double Sticks Promotion”, which generated a Gross Merchandise Volume of RMB 19.1 billion (roughly $3 billion) from a total of 147 million user visits.

On the challenges of making e-commerce work at "China Scale":

On 11 November 2012 (the Double Sticks promotion day), Tmall and Taobao saw 147 million user visits, purchases by 30 million people and nearly 100 million paid orders. At 0:00, more than 10 million users were online concurrently. The technical team faced several major challenges: how to satisfy the various functional needs of Double Sticks, how to assess the system completely and accurately during preparation, how to implement the various optimization and disaster recovery plans effectively, how to make the right decisions in emergencies, and how to ensure the stability, performance and user experience of the network under massive traffic.

The Tmall transaction system reached its processing peak in the first hour, when it successfully processed 13,000 order requests per second. The system peak was 40,000 QPS (queries per second) with an average response time of 200ms. The Tmall product details page received up to 1.6 billion visits, with peak throughput reaching 69,000 visits/sec and response time holding at 12ms at peak. Tmall's page views rose to 590 million, with peak throughput reaching 14,000 visits/sec.

Zhuang explains that at the application level, the Tmall and Taobao applications are all built on a self-developed service-oriented architecture together with an MVC framework and Spring. This is supported by a distributed file system, distributed caching, messaging middleware and CDN network bandwidth. The core database is accessed through self-developed data-access middleware, so the horizontal splitting and data migration of the underlying databases are completely transparent to the applications.
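To illustrate what "transparent to the applications" can mean in practice, here is a minimal sketch of key-based shard routing hidden behind a repository interface. It is not the API of Alibaba's middleware; `ShardRouter`, `OrderRepository`, the shard names and the choice of CRC32 hashing are all hypothetical, and in-memory dictionaries stand in for real databases.

```python
import zlib


class ShardRouter:
    """Hypothetical router: maps a sharding key to one of N database shards."""

    def __init__(self, shard_names):
        if not shard_names:
            raise ValueError("at least one shard is required")
        self.shard_names = list(shard_names)

    def shard_for(self, key):
        # CRC32 is stable across processes, unlike Python's built-in hash() for strings.
        digest = zlib.crc32(str(key).encode("utf-8"))
        return self.shard_names[digest % len(self.shard_names)]


class OrderRepository:
    """Application-facing interface; callers never see which shard holds the data."""

    def __init__(self, router):
        self.router = router
        # In-memory stand-in for one real database per shard (assumption for the sketch).
        self.storage = {name: {} for name in router.shard_names}

    def save(self, user_id, order):
        shard = self.router.shard_for(user_id)
        self.storage[shard].setdefault(user_id, []).append(order)

    def find_by_user(self, user_id):
        shard = self.router.shard_for(user_id)
        return list(self.storage[shard].get(user_id, []))


if __name__ == "__main__":
    repo = OrderRepository(ShardRouter(["db0", "db1", "db2", "db3"]))
    repo.save(42, "order-a")
    repo.save(42, "order-b")
    print(repo.find_by_user(42))
```

The application code only calls `save` and `find_by_user`. Simple modulo routing like this forces most keys to move when the shard count changes, which is one reason real middleware also has to handle data migration.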

Based on this scale-out architecture, Tmall and Taobao systems can flexibly add machines to cope with the traffic flow pressure caused by promotion activities. 

We spend a lot of time calculating capacity. We conduct in-depth analysis of the dependencies between all applications of the website, the proportions of traffic distribution and the call links within applications, and in the early stages we use online pressure tests to make an accurate QPS assessment of individual machines, so that we can judge the cluster's processing capacity objectively. This process is really challenging to run, because the Tmall and Taobao systems are not loosely coupled, and a pressure test of a single system cannot reflect the system bottleneck effectively. Nor can we completely copy the online environment and configuration to build a full pressure-testing environment. Instead, we have to rely more on online pressure tests to truly reveal the system's shortcomings.

Finally, we estimate the expected business target based on the site's natural growth trend and historical Double Sticks data, and then calculate the expansion goal of each system according to that estimated target.
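The expansion arithmetic described here can be sketched roughly as follows. This is a simplified illustration, not the teams' actual model: the function names, the 500 QPS single-machine figure, the current cluster size of 60 and the 80% utilization ceiling are assumptions made for the example. Only the 40,000 QPS peak comes from the figures reported above.

```python
def machines_needed(target_peak_qps, per_machine_qps, max_utilization_pct=70):
    """Machines required to serve target_peak_qps without pushing any machine
    above max_utilization_pct of its pressure-tested QPS (hypothetical model)."""
    if per_machine_qps <= 0:
        raise ValueError("per_machine_qps must be positive")
    if not 0 < max_utilization_pct <= 100:
        raise ValueError("max_utilization_pct must be in (0, 100]")
    numerator = target_peak_qps * 100
    denominator = per_machine_qps * max_utilization_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-numerator // denominator)


def expansion_goal(current_machines, target_peak_qps, per_machine_qps,
                   max_utilization_pct=70):
    """Extra machines to add; never negative if the cluster is already large enough."""
    required = machines_needed(target_peak_qps, per_machine_qps, max_utilization_pct)
    return max(0, required - current_machines)


if __name__ == "__main__":
    # 40,000 QPS target, 500 QPS per machine from pressure tests, 80% ceiling.
    print(machines_needed(40000, 500, 80))     # prints 100
    print(expansion_goal(60, 40000, 500, 80))  # prints 40
```

A per-machine figure measured in isolation is optimistic when applications depend on each other, which is why the teams stress online pressure tests over single-system tests.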

Relying on horizontal expansion alone lowers machine utilization after the sales peak and greatly increases the dependence on operations and maintenance staff to reallocate resources flexibly. Therefore, this year we tried an elastic computing framework for some applications, such as cloud.tmall.com, in which different applications of different merchants share the system resources of one cluster. On November 11, 2012, its bandwidth, VM and storage resources were flexibly upgraded. Many of our internal applications also adopt this mechanism, which marks a technical breakthrough in our preparation for this year's Double Sticks promotion.

In preparing for the Double Sticks promotion, the Taobao and Tmall teams carried out targeted optimizations of the system, including optimization of SQL and cache hit rates, adjustment of database connection and application server parameters, JVM parameter configuration, and code review and inspection. In addition, they employ a large number of solid state drives (SSDs) to improve the overall performance of the database storage.

The teams also have a business-downgrading and traffic restriction plan for shutting down non-core operations if the load increases beyond what is expected.

Business downgrading means cutting non-core business functions to ensure the stable operation of the core functions. To downgrade gracefully, we have to split functions into relatively separate code units, isolate them by priority, and then control them from the back end so that some non-core business functions can be switched off. This reduces system dependencies and performance loss and raises the overall throughput of the cluster.
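One common way to realize this kind of priority-based downgrading is a set of feature switches consulted at request time. The sketch below assumes that approach; it is not the teams' actual mechanism, and `DowngradeSwitchboard`, `render_product_page`, the feature names and the priority numbers are all hypothetical.

```python
class DowngradeSwitchboard:
    """Hypothetical switchboard: priority 0 is core, larger numbers are less important."""

    def __init__(self):
        self._priority = {}     # feature name -> priority
        self._disabled = set()  # features currently switched off

    def register(self, feature, priority):
        self._priority[feature] = priority

    def downgrade_to(self, max_priority):
        """Switch off every feature whose priority number is above max_priority."""
        self._disabled = {f for f, p in self._priority.items() if p > max_priority}
        return sorted(self._disabled)

    def restore_all(self):
        self._disabled.clear()

    def is_enabled(self, feature):
        return feature in self._priority and feature not in self._disabled


def render_product_page(switches, product_id):
    """Core content is always rendered; optional blocks check their switch first."""
    page = {"product": product_id}
    if switches.is_enabled("reviews"):
        page["reviews"] = ["sample review"]
    if switches.is_enabled("recommendations"):
        page["recommendations"] = ["sample item"]
    return page


if __name__ == "__main__":
    switches = DowngradeSwitchboard()
    switches.register("reviews", 1)
    switches.register("recommendations", 2)
    switches.downgrade_to(1)  # under load: keep reviews, drop recommendations
    print(render_product_page(switches, "p-1"))
```

Because each optional block is isolated behind its own check, switching it off removes the corresponding downstream calls without touching the core rendering path.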

If downgrading is not sufficient, we need to restrict the traffic flow. First, we control application traffic by queuing requests to the web applications at the front end: a custom web server module enforces a QPS limit set according to the maximum pressure that the protected web server can withstand, and users beyond that limit are sent to a waiting page. Second, to keep a traffic surge in a front-end web application from causing an avalanche effect in the back-end services, we restrict the traffic of low-priority business in the back-end services. This ensures that the back-end services are not overwhelmed by pressure from different business sources and guarantees access to the core business.
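The front-end QPS restriction can be pictured with a small fixed-window counter. This is only a sketch of the general technique in application code, whereas the article describes a custom web server module. `QpsLimiter`, `handle_request` and the `"waiting-page"` response are hypothetical names chosen for the example.

```python
import time


class QpsLimiter:
    """Fixed-window limiter: admits at most max_qps requests per one-second window."""

    def __init__(self, max_qps, clock=time.monotonic):
        self.max_qps = max_qps
        self.clock = clock   # injectable so the limiter can be tested deterministically
        self.window = None   # index of the current one-second window
        self.count = 0       # requests admitted in the current window

    def allow(self):
        current = int(self.clock())
        if current != self.window:
            # A new second has started: reset the counter.
            self.window = current
            self.count = 0
        if self.count < self.max_qps:
            self.count += 1
            return True
        return False


def handle_request(limiter, serve):
    """Serve the request if under the limit, otherwise send the user to a waiting page."""
    if limiter.allow():
        return serve()
    return "waiting-page"


if __name__ == "__main__":
    limiter = QpsLimiter(max_qps=2)
    for _ in range(3):
        print(handle_request(limiter, lambda: "ok"))
```

A fixed window is the simplest choice but can admit up to twice the limit across a window boundary. Sliding-window or token-bucket limiters smooth that out at the cost of a little more state.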

Tmall and Taobao prepared more than 400 system downgrading plans in total for the 2012 Double Sticks Promotion.

To ensure that all downgrading and flow restriction plans could be carried out accurately, we conducted several drills during preparation. We hope never to use these emergency plans, but we must make sure that each plan is accurate and easy to execute.

On the emergency decision-making process:

On November 11, more than 400 engineers worked together to ensure the smooth functioning of the whole event. To keep the decision-making process short, we first established a field intelligence sorting centre responsible for collecting and consolidating customer feedback and for eliminating duplicate and invalid feedback from different information sources, including customer service, operations, safety, product and merchants. This ensured that the technical team was not overloaded with information.

Secondly, although we have field headquarters, responsibility for decisions in case of emergencies lies with the front-line development engineers. The roles and responsibilities of all engineers working together are clearly defined. Each application is assigned 1-2 core owners, who make emergency decisions based on changes in the various system indicators in the monitoring center, so as to ensure a timely response. An emergency decision is escalated to headquarters only when it involves a large business impact or serious damage to the user experience.

Taobao and Tmall also have an effective open source strategy in place, with a lot of code open sourced at code.taobao.org. Several frameworks, such as the remote communication framework HSF, the messaging middleware Notify and the data access middleware TDDL, have been open sourced.
