Google recognizes that the need for labeled data is a significant bottleneck in machine learning (ML) and recently adapted the open-source Snorkel framework to address the problem at scale. The research was a collaboration with Stanford University and Brown University. Google documented the results on its AI blog and in a research paper titled "Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale."
Snorkel uses software algorithms, rather than hand labeling, to create labels for training data. This technique is known as weak supervision. The algorithms, called labeling functions, can draw on any available organizational knowledge, including knowledge graphs, rules, and statistics. Multiple algorithms can provide training labels for the same data, and each algorithm can assign one or more labels or abstain from assigning a label. Snorkel then weights the algorithms automatically based on an estimate of their accuracy, which it derives by comparing where the labels from the different algorithms agree and disagree. Finally, Snorkel combines the algorithm weights and their associated labels into a single probabilistic label for each data point.
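The weighting and combining steps described above can be illustrated with a minimal Python sketch. It is not Snorkel's implementation: the labeling functions, example texts, and function names are hypothetical, and Snorkel's statistical label model is replaced here by a much simpler heuristic that scores each function by how often it agrees with the majority vote of the other functions.

```python
import math

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical labeling functions for a "is this a complaint?" task.
def lf_refund(text):
    return POSITIVE if "refund" in text.lower() else ABSTAIN

def lf_thanks(text):
    return NEGATIVE if "thank" in text.lower() else ABSTAIN

def lf_angry_punctuation(text):
    return POSITIVE if "!!" in text else ABSTAIN

def lf_praise(text):
    lowered = text.lower()
    return NEGATIVE if "great" in lowered or "love" in lowered else ABSTAIN

LABELING_FUNCTIONS = [lf_refund, lf_thanks, lf_angry_punctuation, lf_praise]

def apply_labeling_functions(texts):
    """Build a label matrix: one row per text, one vote per labeling function."""
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]

def majority(votes):
    """Majority label among non-abstaining votes; ABSTAIN on a tie or no votes."""
    positives = sum(v == POSITIVE for v in votes)
    negatives = sum(v == NEGATIVE for v in votes)
    if positives == negatives:
        return ABSTAIN
    return POSITIVE if positives > negatives else NEGATIVE

def estimate_accuracies(label_matrix):
    """Simplified stand-in for Snorkel's label model: smoothed rate of
    agreement between each function and the majority of the other functions."""
    accuracies = []
    for j in range(len(label_matrix[0])):
        agree = total = 0
        for votes in label_matrix:
            consensus = majority(votes[:j] + votes[j + 1:])
            if votes[j] == ABSTAIN or consensus == ABSTAIN:
                continue
            total += 1
            agree += votes[j] == consensus
        # Laplace smoothing keeps the estimate strictly between 0 and 1.
        accuracies.append((agree + 1) / (total + 2))
    return accuracies

def probabilistic_labels(label_matrix, accuracies):
    """Combine votes into P(label == POSITIVE) using log-odds weights."""
    weights = [math.log(a / (1 - a)) for a in accuracies]
    probabilities = []
    for votes in label_matrix:
        score = 0.0
        for vote, weight in zip(votes, weights):
            if vote == POSITIVE:
                score += weight
            elif vote == NEGATIVE:
                score -= weight
        probabilities.append(1 / (1 + math.exp(-score)))
    return probabilities

EXAMPLE_TEXTS = [
    "I want a refund!! This is broken",
    "Thank you, great service",
    "Thanks, I love it",
    "Refund now!!",
    "Great product, but I need a refund",
    "Where is my order?",
]

label_matrix = apply_labeling_functions(EXAMPLE_TEXTS)
accuracies = estimate_accuracies(label_matrix)
for text, p in zip(EXAMPLE_TEXTS, probabilistic_labels(label_matrix, accuracies)):
    print(f"{p:.2f}  {text}")
```

In this toy run, texts on which the functions agree receive probabilities well away from 0.5, while the text with conflicting votes and the text on which every function abstains both stay at 0.5. No hand-labeled data is used at any point.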
Google adapted Snorkel to create Snorkel DryBell with the express intent of handling web-scale data. Google enhanced the original design, which assumed a single node with shared-memory computation, by integrating Snorkel with TensorFlow. Google also dropped the strict context hierarchy that Snorkel originally enforced on the data model representing training data. In addition, Google moved away from storing data in a database and instead uses a distributed file system to share data. Lastly, Google made the independent labeling functions separate executables that share data through the file system. These changes allowed Google to scale the open-source Snorkel project so that large amounts of organizational knowledge can be used to label web-scale data with weak supervision algorithms.
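The pattern of independent labeling functions sharing results through a file system, rather than through shared memory or a database, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the JSON-lines vote files, the function names, and the local temporary directory standing in for a distributed file system are all hypothetical and do not reflect DryBell's actual formats or APIs.

```python
import json
import tempfile
from pathlib import Path

ABSTAIN = -1

def run_labeling_function(name, label_fn, examples, shared_dir):
    """Stand-in for one independent executable: it reads the examples and
    writes one vote per example to its own file in the shared directory."""
    out_path = Path(shared_dir) / f"{name}.jsonl"
    with out_path.open("w", encoding="utf-8") as out:
        for example_id, text in examples.items():
            record = {"id": example_id, "vote": label_fn(text)}
            out.write(json.dumps(record) + "\n")
    return out_path

def collect_votes(shared_dir):
    """Later pipeline step: join all vote files on the example id,
    producing {example id: {labeling function name: vote}}."""
    votes = {}
    for path in sorted(Path(shared_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as vote_file:
            for line in vote_file:
                record = json.loads(line)
                votes.setdefault(record["id"], {})[path.stem] = record["vote"]
    return votes

# Hypothetical examples and labeling functions.
examples = {"a": "Refund now!!", "b": "Thanks, I love it"}

with tempfile.TemporaryDirectory() as shared_dir:
    run_labeling_function(
        "lf_refund", lambda t: 1 if "refund" in t.lower() else ABSTAIN,
        examples, shared_dir)
    run_labeling_function(
        "lf_thanks", lambda t: 0 if "thank" in t.lower() else ABSTAIN,
        examples, shared_dir)
    votes = collect_votes(shared_dir)

print(votes)
```

Because each function only reads the examples and writes its own output file, the functions have no dependencies on one another and could, in principle, run as separate processes on separate machines.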
With Snorkel DryBell, Google trained two separate models that achieved predictive accuracy similar to that of models trained on 12,000 and 80,000 hand-labeled data points. Additionally, Google improved model performance by an average of 52% on a benchmark dataset by applying Snorkel DryBell to offline, slow-to-obtain features to support the training of models that use separate but associated online features.
Snorkel, the original open-source version, existed before Google's research and was created by the Stanford DAWN project. The DAWN homepage states, "DAWN is a five-year research project to democratize AI by making it dramatically easier to build AI-powered applications." Snorkel is one project in the DAWN portfolio. The vision of DAWN and the use of weak supervision in Software 2.0 are described in "Infrastructure for Usable Machine Learning: The Stanford DAWN Project" and "The Role of Massively Multi-Task and Weak Supervision in Software 2.0," respectively.