Key takeaways
Yahoo uses Hadoop across a wide range of big data and machine learning use cases. The team also applies deep learning techniques in products such as Flickr and Esports.
InfoQ spoke with Peter Cnudde, VP of Engineering at Yahoo, about how the company leverages Hadoop and big data platform technologies.
InfoQ: What use cases or applications at Yahoo are currently using Hadoop?
Peter Cnudde: Since Hadoop was created at Yahoo 10 years ago, it’s been one of the most critical underlying technologies that powers our business and enables core product experiences. We initially applied Hadoop to web search, but over the years, it’s become central to everything we do for our 1B+ users worldwide. Whether it’s content personalization for increasing engagement, ad targeting and optimization for serving the right ad to the right consumer, new revenue streams from native ads and mobile search monetization, mail anti-spam or fun features like Flickr’s Magic View -- Hadoop touches them all. We have nearly 300 unique use cases of the Hadoop platform today across our different businesses.
InfoQ: Does your team also use Apache Spark for Big Data processing and analytics requirements?
Cnudde: Yes, we have several teams using and experimenting with Spark. In fact, Spark now accounts for 12% of monthly compute usage on our Hadoop clusters (as of July 2016). Yahoo was actually an early sponsor of Spark when it was being developed at UC Berkeley, and we continue to use and evolve it to this day. Our biggest challenges are still around scale and performance, but improvements are constantly being made.
One thing to note is that we don’t use Spark for traditional analytics workloads or ETL processes, as we have found Hive and Pig on Tez, respectively, to be better solutions for us today. Spark’s traction is predominantly around more advanced, memory-heavy use cases like graph computing and machine learning.
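To make the distinction concrete, here is a minimal sketch of the kind of iterative, memory-heavy machine learning job that tends to suit Spark, training a classifier with MLlib on data already stored in HDFS. The paths, column names, and parameters are illustrative assumptions, not Yahoo's actual pipeline.

```python
# Minimal PySpark sketch of an iterative, memory-heavy ML job (illustrative only).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-on-hadoop-sketch").getOrCreate()

# Read feature data straight from HDFS (hypothetical path and schema).
df = spark.read.parquet("hdfs:///data/engagement/features")

# Assemble raw columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

# Caching matters here: the optimizer makes many passes over the same data.
train.cache()

model = LogisticRegression(maxIter=50, regParam=0.01).fit(train)
model.write().overwrite().save("hdfs:///models/engagement-lr")

spark.stop()
```

Because the dataset stays cached in executor memory across iterations, this style of workload benefits from Spark in a way that a single-pass ETL job typically does not.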
InfoQ: How do you use Hadoop for Machine Learning? What use cases or business problems are solved by ML programs?
Cnudde: Like Hadoop, machine learning is key to every part of our business, from image recognition, to ad targeting, to search rankings, to abuse detection, to personalization. We’re continuously looking for better machine learning solutions to data-intensive problems. We developed scalable machine learning algorithms on Hadoop clusters for classification, ranking, and word embedding based on a home-grown parameter server. These clusters have now become the preferred platform for large-scale machine learning at Yahoo. One example is how we’re implementing personalized algorithms to better track what stories or properties (News, Finance, etc.) our users are more likely to read. Instead of just using a “click” as the basic unit of engagement, machine learning enables us to track exactly how long a person spends reading an article, or whether they are reading related stories. Another example is a distributed word embedding algorithm we developed to match user queries against ads with similar semantic vectors, instead of relying on traditional syntactic matching. Our word embedding algorithm handles hundreds of millions of vocabulary words, 10x larger than alternative implementations in the industry. Through these algorithms, we are able to better understand user needs and interests, enhance our products and properties, and tailor our search service to better serve users and advertisers.
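As a rough illustration of the semantic-matching idea (not Yahoo's distributed, parameter-server-based implementation, and with a toy vocabulary in place of their hundreds of millions of words), a query and candidate ads can be embedded as averaged word vectors and ranked by cosine similarity:

```python
# Toy sketch of semantic query/ad matching with word embeddings (illustrative only).
import numpy as np

embeddings = {                              # hypothetical pre-trained word vectors
    "cheap": np.array([0.1, 0.8, 0.2]),
    "flights": np.array([0.7, 0.1, 0.6]),
    "airfare": np.array([0.6, 0.2, 0.7]),
    "deals": np.array([0.2, 0.7, 0.3]),
}

def embed(text):
    """Average the embeddings of known words in the text."""
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

query = "cheap flights"
ads = ["airfare deals", "cheap deals"]

# Rank ads by semantic similarity rather than exact word overlap:
# "airfare deals" can match "cheap flights" even with no shared words.
ranked = sorted(ads, key=lambda ad: cosine(embed(query), embed(ad)), reverse=True)
print(ranked)
```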
InfoQ: How do you use Deep Learning in products like Flickr and Esports? Can you discuss the algorithms and techniques you are using?
Cnudde: Deep learning powers Flickr’s scene detection, object recognition, and computational aesthetics, which make it easier to categorize and organize photos automatically with better results. We employ a deep convolutional neural network that transforms an input image into a short floating-point vector. We pass this floating-point vector into more than 1000 binary classifiers, each of which is trained to give us a yes/no answer for a specific object/scene class. CaffeOnSpark has enabled Flickr to train on millions of photos on Hadoop clusters and significantly improve classification accuracy. The improved accuracy has benefited Flickr users with better image search results.
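The following is a toy sketch of the structure described above: a CNN reduces an image to a short float vector, and more than 1000 independent binary classifiers each give a yes/no answer for one object or scene class. The feature dimension, the randomly initialized weights, and the stand-in for the trained CNN are placeholder assumptions, not Flickr's production values.

```python
# Toy sketch: CNN feature vector feeding 1000+ per-class binary classifiers.
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 1000          # one binary classifier per object/scene class
FEATURE_DIM = 256           # length of the CNN's output vector (assumed)

def cnn_features(image):
    """Stand-in for the trained CNN; in practice this vector comes from the network."""
    return rng.standard_normal(FEATURE_DIM).astype(np.float32)

# Trained per-class weights and biases (randomly initialized here for illustration).
W = rng.standard_normal((NUM_CLASSES, FEATURE_DIM)).astype(np.float32)
b = rng.standard_normal(NUM_CLASSES).astype(np.float32)

def classify(image, threshold=0.5):
    """Return the indices of classes whose binary classifier says 'yes'."""
    feats = cnn_features(image)
    scores = 1.0 / (1.0 + np.exp(-(W @ feats + b)))   # independent per-class sigmoid
    return np.flatnonzero(scores > threshold)

print(classify(image=None)[:10])
```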
With Esports, we detect game highlights automatically, in real time, from live-streamed videos. Our solution is based on computer vision and deep learning: we train a model to “watch” the game and predict whether or not any given moment in a video is a highlight, based on hundreds of hours of game videos annotated by domain experts. We are currently using our solution for two applications: automatic tweet generation and match summary generation.
In general, detecting highlights from any type of video is very challenging because of the subjective nature of the problem: how do we define a highlight? Instead of building a system with multiple visual recognizers for specific visual characteristics (like a big splash of lights or turrets in League of Legends), our solution is based on convolutional neural networks, a class of models composed of multiple layers where each layer extracts increasingly high-level information from the previous layer. These networks can be trained end-to-end with labeled examples: the network takes an image or a short video segment as input, reads it in the form of pixel values, and successively transforms that information into a semantic understanding of what is shown, so that the highest layer produces an output value close to the given label. Simply put, we can train a model to learn which visual characteristics define a game highlight.
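To make the end-to-end setup concrete, here is a hedged sketch of a per-frame binary highlight classifier: a small convolutional network maps a frame to a probability of being a highlight. This is not Yahoo's actual model or toolchain; the network shape, input size, and the random stand-in for expert-annotated frames are all placeholder assumptions.

```python
# Hedged sketch of a per-frame binary highlight classifier (illustrative only).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),            # one video frame (assumed size)
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # P(highlight) for this frame
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stand-in for expert-annotated frames: random pixels and random yes/no labels.
frames = np.random.rand(256, 128, 128, 3).astype("float32")
labels = np.random.randint(0, 2, size=(256, 1))

model.fit(frames, labels, epochs=2, batch_size=32)

# At runtime, any frame scoring above a threshold would be flagged as a highlight.
scores = model.predict(frames[:8])
print((scores > 0.5).astype(int).ravel())
```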
Our solution brings us multiple benefits. First, our system requires no human intervention at runtime because, once trained properly, the model detects game highlights from video automatically; this allows us to scale up to multiple games and matches day and night. Second, we can standardize the development process across game titles -- the only thing that differs from game to game is the training dataset, which we annotate with help from domain experts.
InfoQ: Can you talk about best practices in implementing a Machine Learning solution in terms of scalability, performance, and security?
Cnudde: Scaling and evolving any platform without sacrificing speed and stability is hard, and everyone should expect challenges. Implementing scalable machine learning algorithms directly on top of Hadoop clusters has made things easier for us in many ways, particularly when it comes to data movement and security. We run algorithms directly on HDFS datasets present on Hadoop clusters and make use of Hadoop’s native security features. We have also enhanced our Hadoop clusters with high-memory and GPU servers to run parameter servers for large-scale machine learning and deep learning applications, respectively. We make extensive use of YARN features to operate these heterogeneous clusters. Networking has also been enhanced with 100G InfiniBand connections for direct server-to-server communication between GPU servers, in addition to the traditional 10G Ethernet that most Hadoop clusters have today. The primary purpose of these enhancements is to avoid scale bottlenecks and speed up learning.
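As a conceptual illustration only (a single-process toy, not Yahoo's home-grown, distributed system), the parameter-server pattern mentioned above has workers pull the current parameters, compute gradients on their own data shard, and push updates back to a central store:

```python
# Toy, single-process sketch of the parameter-server pattern (illustrative only).
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)          # shared model parameters
        self.lr = lr
    def pull(self):
        return self.w.copy()            # workers fetch the latest parameters
    def push(self, grad):
        self.w -= self.lr * grad        # apply a worker's gradient update

def worker_step(ps, X, y):
    """One logistic-regression gradient step on this worker's data shard."""
    w = ps.pull()
    p = 1.0 / (1.0 + np.exp(-X @ w))
    ps.push(X.T @ (p - y) / len(y))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = (X @ rng.standard_normal(20) > 0).astype(float)

ps = ParameterServer(dim=20)
shards = np.array_split(np.arange(1000), 4)     # data partitioned across 4 "workers"
for _ in range(100):
    for idx in shards:
        worker_step(ps, X[idx], y[idx])
```

In a real deployment the server state would live on the dedicated high-memory nodes and workers would run as YARN containers, which is why fast networking between those machines matters.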
We also expect deep learning to carry machine learning forward. Deep learning has drawn intense interest from academia, and deep learning algorithms are now beating traditional machine learning algorithms across many benchmarks. Deep learning today still requires far more technical expertise than something like Spark, but that’s rapidly changing. There’s a strong interest from the Spark community to make it better integrated and just as easy to run a deep learning algorithm as a standard one. CaffeOnSpark is one such approach we have developed; it allows organizations to turn their existing Hadoop or Spark clusters into a powerful platform for deep learning that is fully distributed and supports incremental learning. A high-level API allows users of CaffeOnSpark to launch it on any cloud, such as AWS EC2.
InfoQ: What are the challenges your team has encountered in this implementation?
Cnudde: We’re proud to have one of the world’s largest Hadoop footprints, with over 36,000 servers and 680 petabytes of HDFS storage across 17 YARN clusters running 40 million jobs monthly. This footprint gets close to 45,000 servers if you include the additional 23 multi-tenant HBase and Storm clusters.
We’ve also consistently been one of the first to adopt emerging Hadoop technologies and stabilize them to production quality. Being first has its obvious challenges and rewards -- you uncover issues before others and have to fix them or get them fixed, but you are also the first to reap the benefits. We believe that our approach, along with contributions from numerous companies and individuals around the world, moves Hadoop forward as a technology.
Second, we run these technologies at web scale and observe a lot of unique problems at that scale that we need to deal with. Oftentimes, these are the harder problems to detect and fix, as they do not surface at smaller scale.
And lastly, we operate all our clusters as shared multi-tenant clusters to keep our costs down and utilization high. Security and resource management/isolation become important, and we spend time and effort solving the challenges we encounter. For example, we run our clusters with security enabled and highly protected, enabling multiple teams to share the same infrastructure. We achieve good isolation by running each task in a separate cgroup container, tightly controlling the memory and CPU resources it can consume.
About the Interviewee
Peter Cnudde is Vice President of Engineering at Yahoo, where he oversees the company’s big data and machine learning platforms. He is particularly interested in large-scale machine learning and its impact on society. In the past, Peter has worked at several wireless telecommunications companies, including Alcatel and RF Micro Devices. He received his master’s degree in Electrotechnical Engineering from the University of Ghent in Belgium.