This article is a brief summary of what was discussed at China Tech Day 2015, held at QCon San Francisco on Nov 18th 2015. Speakers from Alibaba, Tencent, Baidu, Ele.com, JD.com, Cheetah Mobile, and Ctrip gave presentations during the event.
Slides from these talks can be accessed now on the QConSF website.
Technological Innovation at Alibaba
Yuan Qi, the VP and Chief Data Scientist of the Ant Financial Services Group, shared the recent achievements of Alibaba technology.
In 2009, Alibaba faced serious problems related to data. Purchasing IBM servers, Oracle databases and EMC storage was no longer affordable due to the fast growth of the business, and isolated data storage between businesses without a common data standard made it difficult and expensive to share data within the company.
To solve this problem, Alibaba strengthened its technology teams and initiated various projects, including Aliyun (Alibaba Cloud) starting from Sept 2009 and ODPS (Open Data Processing Service) starting from Apr 2010. As a result, Aliyun and ODPS now carry hundreds of petabytes (PB) of data from dozens of business units. Aliyun is capable of running 5000 nodes in a single cluster, and ODPS is capable of processing 100 PB of data in 6 hours.
Alibaba now has datacenters in Beijing, Hangzhou, Qingdao, Shenzhen, Hong Kong, Shanghai, Singapore and Silicon Valley, and will add Zhangbei (in the Hebei province), Japan, Germany and Dubai to this list. The Hangzhou datacenter near Thousand Island Lake is using a water cooling system, and the Zhangbei datacenter will draw electricity from local wind and solar energy systems.
There are many experiments in applied data science beyond search engines, advertisements and recommendation engines. An AI-based customer service robot now serves over 95% of Alipay users; it relies heavily on speech recognition, deep learning, natural language processing and knowledge base construction. The micro-loans service provided to small enterprises requires a risk control model equivalent to those used in the financial industry, but with more agility and flexibility. A shipping insurance service for returned products on the Taobao platform constantly faces risks of incorrect pricing and fraud, and a dynamic pricing model has significantly increased its pricing accuracy.
Architecture @ Eleme
Haochuan Guo, the Chief Infrastructure Architect of Ele.com, shared the evolution history of their service.
Ele.com is the largest online food ordering and delivery service in China. The company started in 2009 and experienced 10x+ growth in 2014. It now serves over 3 million orders per day, and at lunch time the load can peak at 300 orders per second.
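As a rough back-of-the-envelope check on these figures, the sketch below compares the quoted lunch-time peak with the average rate the daily total would imply if orders were spread evenly over 24 hours. The even-spread baseline is an assumption made here for illustration; it was not part of the talk.

```python
# Figures quoted in the talk summary.
ORDERS_PER_DAY = 3_000_000
PEAK_ORDERS_PER_SECOND = 300

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Average rate if orders were spread evenly over the whole day (an assumption).
average_per_second = ORDERS_PER_DAY / SECONDS_PER_DAY  # about 34.7 orders/s

# How much higher the lunch-time peak is than that even-spread average.
peak_to_average = PEAK_ORDERS_PER_SECOND / average_per_second  # about 8.64x
```

Under that assumption the peak is roughly 8.6 times the average, which suggests why capacity has to be sized for the lunch-time peak rather than for the daily total.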
The initial stack was a simple PHP application running on Nginx on a single server. When the load became heavy, more servers were added with an HAProxy server in front of them. This structure was capable of serving up to a few hundred thousand orders per day. Then the 10x+ growth came.
To enhance scalability, Ele.com started by decoupling its PHP application. The single application was broken down into a User Service and an Ordering Service, and the User Service was rewritten in Python. More machines were added and HAProxy servers were put in between these machines.
Business quickly grew to a million orders per day and the structure could barely cope. An F5 gateway was introduced to replace the front-end HAProxy, and HHVM was used to replace FPM in order to improve the performance of the PHP application. A large number of caches were used. Heavy and unimportant APIs were downgraded. Queries were optimized. Databases (MySQL and PostgreSQL) were partitioned based on domains.
To prepare for further growth, the Ele.com team started new work ahead of time. A MySQL Proxy was introduced to allow connection reuse, rate limiting, query rejection, read/write separation and sharding. A web service orchestrator was introduced to allow front-end/back-end separation. The application was further decoupled into shopping, marketing, booking and ordering APIs, and a service registry was added, working together with an API proxy.
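The talk summary does not describe how the MySQL Proxy was built. As a minimal sketch of two of the listed features, read/write separation and sharding, the toy router below picks a shard from a key and then sends reads to a replica and everything else to the primary. The class name, the shard layout and the host names are hypothetical, and a real proxy would parse SQL properly rather than check for a leading SELECT.

```python
import zlib


class QueryRouter:
    """Toy router: choose a shard by key, then primary or replica by statement type."""

    def __init__(self, shards):
        # shards: list of {"primary": str, "replicas": [str, ...]} (hypothetical layout)
        self.shards = shards
        self._next_replica = [0] * len(shards)

    def shard_for(self, shard_key: str) -> int:
        # crc32 is stable across processes, unlike Python's built-in hash() for str.
        return zlib.crc32(shard_key.encode("utf-8")) % len(self.shards)

    def route(self, sql: str, shard_key: str) -> str:
        idx = self.shard_for(shard_key)
        shard = self.shards[idx]
        # Naive read detection; this is an illustration, not a SQL parser.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and shard["replicas"]:
            # Round-robin over the replicas of the chosen shard.
            i = self._next_replica[idx]
            self._next_replica[idx] = (i + 1) % len(shard["replicas"])
            return shard["replicas"][i]
        # Writes, and reads on shards without replicas, go to the primary.
        return shard["primary"]


router = QueryRouter([
    {"primary": "db0-primary", "replicas": ["db0-r1", "db0-r2"]},
    {"primary": "db1-primary", "replicas": ["db1-r1"]},
])
target = router.route("SELECT * FROM orders WHERE user_id = 42", "user:42")
```

Keeping this logic in a proxy means application code can issue ordinary queries while the routing policy is changed in one place.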
The Ele.com team is now planning to simplify its architecture by putting all APIs under the management of an API Service Orchestrator, and to remove the legacy code from the stack.
JD Internet+ transformation
Gang He, the VP of Technology for JD.com, shared the current architecture of JD Cloud and some recent achievements.
Currently, all JD.com applications run on top of their cloud. The cloud infrastructure has an IDC physical layer at the bottom, a software-defined datacenter layer (JDOS) above it, and a container cluster scheduling layer above the software-defined datacenter. JD Cloud uses OpenStack and Docker extensively, with 100K+ containers running in production. They also developed their own file system (the "Jingdong" file system) for storage.
JD OpenStack uses Docker as a hypervisor (docker virt driver) and uses the Nova scheduler as the Docker scheduler. On the network level, Neutron is used with integrated Open vSwitch (OVS). The JD team worked on OVS to improve its latency for small network packets by 20%. Glance is used for image management.
Search Technology on Mobile App @Baidu Mobile
Chao Han, the Chief Technical Architect of Baidu Mobile, shared their experiences on architecture design and mobile search.
The architecture of Baidu Mobile is a super client-server application for mobile app users. A client plug-in system is applied to both Web UI and native UI.
A super client-server application extensively re-uses web app concepts in the client UI implementation. The main design concept is a cross-platform template. As such, performance, compatibility and controllability work can easily be shared among different platforms.
Performance optimization focuses on two aspects: server performance, and user experience on the client side (display speed, for example). In the end four optimization points were used: server page cache, async loader and displayer, request merging, and cache negotiation.
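The summary does not say how cache negotiation was implemented at Baidu Mobile. One common reading, modeled on HTTP conditional requests, is that the client sends a tag for the copy it already holds and the server returns the body only when that tag is out of date. The sketch below follows that reading; the function names are hypothetical and the 200/304 status codes are borrowed from HTTP for illustration.

```python
import hashlib
from typing import Optional, Tuple


def content_tag(body: bytes) -> str:
    """Short fingerprint of a page body, used as its cache tag."""
    return hashlib.sha256(body).hexdigest()[:16]


def negotiate(body: bytes, client_tag: Optional[str]) -> Tuple[int, str, bytes]:
    """Return (status, tag, payload) for the current body.

    If the client's tag matches the current body, reply 304 with an empty
    payload so the client reuses its cached copy; otherwise send the body.
    """
    tag = content_tag(body)
    if client_tag == tag:
        return 304, tag, b""
    return 200, tag, body


# First request: the client has no cached copy, so it receives the full body.
status, tag, payload = negotiate(b"<html>results</html>", None)
# Repeat request with the tag: the server can skip sending the body again.
status_again, _, payload_again = negotiate(b"<html>results</html>", tag)
```

The saving comes on the repeat request, where only the short tag crosses the mobile network in each direction.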
Good compatibility requires good upgrade policies. Servers always update, but not so for clients. For code policy, it is required to be forward compatible on the client side, and backward compatible on the server side. For cache policy, it is required to distribute cache adjustment on the server side. For storage policy, it is required to optimize persistent storage on the server side.
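The code policy can be illustrated with a small sketch, under the assumption that client and server exchange dictionary-like payloads. The client ignores fields it does not know and falls back to defaults for fields it expects (forward compatible), while the server supplies defaults for request fields that older clients never send (backward compatible). All field names and default values here are hypothetical.

```python
import copy

# Fields this (hypothetical) client version understands, with safe defaults.
CLIENT_KNOWN_FIELDS = {"title": "", "items": []}


def client_parse(payload: dict) -> dict:
    """Forward compatible client: drop unknown fields, default missing ones."""
    return {
        name: payload.get(name, copy.deepcopy(default))
        for name, default in CLIENT_KNOWN_FIELDS.items()
    }


def server_handle(request: dict) -> dict:
    """Backward compatible server: tolerate requests from older clients."""
    # "page_size" stands for a field added in a later client release;
    # older clients omit it, so the server falls back to a default.
    page_size = request.get("page_size", 20)
    items = [f"result-{i}" for i in range(page_size)]
    # "layout_hint" stands for a newer response field that old clients ignore.
    return {"title": "results", "items": items, "layout_hint": "grid"}


# An old client (sends no page_size, knows nothing about layout_hint) still works.
old_client_view = client_parse(server_handle({}))
```

With both rules in place the server can ship new fields at any time, and clients that users never update keep working with the subset they understand.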
Controllability is about updating contents when clients are not updated by users. From no template, to static template, to dynamic template, to dynamic code, different levels of controllability could be achieved by several template methods, while complexity would be a trade-off.
Mobile Monetization: Scenario Design & Big Data
Arther Wu, the Director of Monetization and Business Operation of Cheetah Mobile, shared their experiences on mobile monetization, scenario design, and data analysis.
The Clean Master app has various ad formats. For user engagement without intruding into their behaviour flow, ads are usually placed after user actions as an extension of a behaviour flow (e.g. after a user takes a photo, edits it, and shares it).
Precise audience targeting is crucial in mobile ads, so building user profiles is important. Through data mining of raw data, user insights can be generated (such as Google Play app categories, interest tags, and click preferences). Various algorithms are used for supervised and unsupervised learning.
Transformation from call center, web to mobile
Eric Ye, the CTO and SVP of Technology of Ctrip, shared the evolution path and key utilization of the Ctrip Mobile App.
As the biggest travel service platform in China, Ctrip now serves over 3 million bookings per day, with 72% from mobile phones. Yet in 2011, Ctrip had only 25% of its bookings online, while the other 75% came from call centers. The transformation from call centers to mobile took a lot of effort.
Major work started in 2012, when the Ctrip website was re-architected and the UI was redesigned. The new architecture encouraged open APIs, and search was optimized. A centralized mobile business unit was built in 2013, but this caused a lot of trouble in 2014, since many new business units were added that year and each one was fighting for mobile resources. Lots of work was repetitive and inefficient. The dev infrastructure was also a bottleneck.
As such, a re-organization was initiated in 2014 and mobile resources were decentralized. Mobile app types were reduced to the iOS and Android platforms only, compared with the earlier support for multiple device types (iPhone/iPad/Android phone/Android tablet/Windows phone). Everything was decoupled into microservices. Each mobile app used to run on its own business modules, so for example the hotel app had its own business modules, as did the flight app and other apps. Now all services can be accessed via the data/url bus.
The Philosophy of Mass Services at Tencent
Bison Liao, the Director of Tencent Social Network Group, summarized their philosophy of mass services into four key points.
1. Flexible availability. All features are decoupled and graded according to importance. In QQ IM, feature importance is prioritized in the order of login > text message > send photo/file > get buddy's input status. Each module is set a proper timeout.
2. "Comfort notices". "Comfort notices" are especially important when a service becomes unavailable. A simple "login failed" or similar failure notice usually confuses users, who may even retry multiple times and further overload the backend servers. It is also important to make comfort notices automated.
3. Process crash. All processes need to be monitored, so that in the case of a crash they can be restarted in milliseconds. Tencent also has a SET model, which is a duplicate of core modules to serve important features and unimportant features independently.
4. Overload protection. The system is designed to handle the case when the buffer becomes full. Requests that stay in the buffer for a long time are discarded, and the number of retries is limited.
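The overload protection point can be sketched as follows. This is a minimal illustration, not Tencent's implementation: a bounded buffer rejects new requests when it is full and drops requests that have waited longer than a limit, and a helper caps the number of retries. The class name, function name and parameters are all hypothetical.

```python
import time
from collections import deque


class BoundedRequestBuffer:
    """Bounded FIFO buffer that sheds load instead of growing without limit."""

    def __init__(self, capacity, max_wait_seconds, clock=time.monotonic):
        self.capacity = capacity
        self.max_wait_seconds = max_wait_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._queue = deque()

    def offer(self, request) -> bool:
        """Enqueue a request; reject it immediately if the buffer is full."""
        if len(self._queue) >= self.capacity:
            return False
        self._queue.append((self._clock(), request))
        return True

    def take(self):
        """Return the oldest request that is still fresh, or None."""
        now = self._clock()
        while self._queue:
            enqueued_at, request = self._queue.popleft()
            if now - enqueued_at <= self.max_wait_seconds:
                return request
            # Stale request: its caller has likely given up, so drop it.
        return None


def call_with_retry_limit(send, max_retries):
    """Call send() until it succeeds, making at most 1 + max_retries attempts.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1 + max_retries):
        if send():
            return attempt + 1
    return None
```

Dropping stale requests means the server spends its capacity on callers that are still waiting, and the retry cap keeps failed clients from multiplying the load during an outage.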