Beginning in 2015, Condé Nast created a natural-language-processing and content-analysis engine to improve the metadata around content created across their 22 brands. The new system has led to a 30% increase in click-through rates. Antonino Rau, a software engineer and technology manager at Condé Nast US, recently described the motivation behind the project, the system architecture, and the evolution of their NLP-as-a-service system named HAL in a two-part blog post, "Natural Language Processing and Content Analysis at Condé Nast". According to the post, the goal was to replace simple categorization and tagging with a system that could "automatically 'reverse engineer' the knowledge that [their] world-class editors put in there."
Named after HAL 9000 from the movie 2001: A Space Odyssey, HAL integrates with Copilot, Condé Nast's proprietary content management system (CMS). Built in Java, HAL runs a set of analyzers using pre-trained or custom-trained models, both in-JVM and out-of-JVM.
HAL's processing engine is built upon a parallelizable, directed acyclic graph (DAG) of analyzers that analyze and annotate content. Each analyzer examines a different aspect of the content and extracts features. For example, the engine may recognize known people mentioned in the text and annotate the response with linked resources about each individual. Other features include topics and categories, or locations and news stories, all annotated with additional pertinent information.
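The annotation model behind such a graph can be pictured as analyzers that each consume the text plus annotations produced upstream and contribute their own. The following is a minimal sketch under that assumption; the names Analyzer, Annotation, and their fields are illustrative, not HAL's actual API:

```java
import java.util.List;
import java.util.Map;

// Hypothetical annotation model, loosely following the article's description.
// None of these names are taken from HAL's actual codebase.
record Annotation(String type,          // e.g. "person", "topic", "location"
                  int start, int end,   // character offsets in the text
                  Map<String, String> metadata) {} // linked resources, scores, etc.

interface Analyzer {
    // Each analyzer reads the text plus the annotations produced upstream
    // and contributes its own annotations, which flow to downstream nodes.
    List<Annotation> analyze(String text, List<Annotation> upstream);
}
```

Under this model, a "people" analyzer would emit person annotations, and a downstream entity-linking node could enrich them with linked resources, matching the example above.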
The results of the analysis are curated in a way inspired by Uber's Michelangelo, both to improve and retrain models and to avoid repeated calls to HAL for content that has remained static.
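Read that way, the curated results double as a cache: keyed by a digest of the content, previously computed annotations can be returned without re-running the analyzers. A minimal sketch of that idea, reusing the hypothetical Analyzer and Annotation types from the sketch above; the SHA-256 keying is an assumption, not HAL's actual scheme:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache of analysis results, keyed by a digest of the text,
// so content that has not changed never triggers a second full analysis.
class AnalysisCache {
    private final Map<String, List<Annotation>> cache = new ConcurrentHashMap<>();

    List<Annotation> analyzeOrReuse(String text, Analyzer pipeline) {
        return cache.computeIfAbsent(sha256(text),
                key -> pipeline.analyze(text, List.of()));
    }

    private static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```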
InfoQ caught up with Rau to ask him about his work on HAL.
InfoQ: In your blog post, you say, "A few years ago, in 2015, we decided to go to the next level". What was the driver to change how it worked? Were editors manually tagging their articles previously?
Antonino Rau: The main driver was to have automatic insights (topics, entities, etc.) into what the editors were producing, for different use cases. This content intelligence would then be crossed with user behavior to build segments, recommendations, and other features. Yes, previously editors were manually tagging. Afterwards, they still have the ability to remove automatic tags or add manual tags from a controlled vocabulary.
InfoQ: You decided to build your own natural language processing system in HAL. Did you look at third-party options? If so, what made you choose to build in-house?
Rau: Yes, we looked at third parties at the time, but we decided to use a mixture of custom and open-source models. Initially, HAL was needed only for English, and for that language there are plenty of open-source, pre-trained models; for the features not supported by the OSS models, we built custom models for that one language pretty easily. Very recently, in November 2018, Condé decided to join Condé Nast US and Condé Nast International in a global platform, hence the need to support eight other languages. We are investigating the integration of third-party models into HAL to speed up its availability for all the Condé markets globally in all those languages. The nice part of HAL is that it also acts as an anti-corruption layer: even if we integrate vendors, thanks to its framework we can easily operate with a mixture of OSS, custom, and vendor models/analyzers and still have the same abstracted and standardized output.
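That anti-corruption layer can be pictured as an adapter: vendor, OSS, and custom models each sit behind the same analyzer abstraction, so callers always receive the standardized annotation output. A hypothetical sketch, again reusing the illustrative Analyzer and Annotation types; VendorNlpClient and its methods are invented stand-ins for any third-party SDK:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Stand-in for a third-party SDK; entirely hypothetical.
interface VendorNlpClient {
    List<VendorEntity> extractEntities(String text);
    record VendorEntity(String type, int begin, int end,
                        Map<String, String> attributes) {}
}

// Hypothetical adapter: wraps the vendor client behind the same Analyzer
// interface as OSS and custom models, translating the vendor's response
// into the standardized Annotation output.
class VendorAnalyzerAdapter implements Analyzer {
    private final VendorNlpClient client;

    VendorAnalyzerAdapter(VendorNlpClient client) {
        this.client = client;
    }

    @Override
    public List<Annotation> analyze(String text, List<Annotation> upstream) {
        return client.extractEntities(text).stream()
                .map(e -> new Annotation(e.type(), e.begin(), e.end(), e.attributes()))
                .collect(Collectors.toList());
    }
}
```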
InfoQ: Why did you choose Java?
Rau: Running NLP models is very CPU- and memory-intensive. Moreover, from our benchmarks, the best of the above-mentioned OSS models in terms of features and performance were available in Java. Finally, for CPU- and memory-intensive apps, Java seemed to us the best choice in terms of system performance and robustness.
InfoQ: The design of HAL, and the directed acyclic graph in particular, is impressive in how it abstracts the analysis for generic use. Were there many iterations before you decided on this approach? What other approaches did you consider?
Rau: Initially, it was a straight "pipe and filter" approach using the annotation model, which is pretty common in the literature, as mentioned in the blog post. But the more out-of-JVM analyzers we used, the more we noticed that we could build a graph of analyzers passing annotations to each other to speed up and parallelize the processing.
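The gain over a linear pipeline is that analyzers with no dependency on one another can run concurrently, which matters most when some of them are out-of-JVM calls. A hedged sketch of one such fan-out/join step using CompletableFuture; the three-analyzer topology is illustrative, not HAL's actual graph:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical DAG stage: independent analyzers fan out in parallel, and a
// dependent analyzer (e.g. an entity linker that needs the extracted people)
// joins on that branch's result while the other branch proceeds concurrently.
class DagStage {
    List<Annotation> run(String text, Analyzer people, Analyzer topics, Analyzer linker) {
        CompletableFuture<List<Annotation>> peopleF =
                CompletableFuture.supplyAsync(() -> people.analyze(text, List.of()));
        CompletableFuture<List<Annotation>> topicsF =
                CompletableFuture.supplyAsync(() -> topics.analyze(text, List.of()));

        // The linker depends only on the people annotations, so it chains
        // onto that branch; the topics branch keeps running in parallel.
        CompletableFuture<List<Annotation>> linkedF =
                peopleF.thenApplyAsync(found -> linker.analyze(text, found));

        List<Annotation> all = new ArrayList<>();
        all.addAll(peopleF.join());
        all.addAll(topicsF.join());
        all.addAll(linkedF.join());
        return all;
    }
}
```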
InfoQ: Is anything you produced available as open source for others to use?
Rau: Currently, no. Maybe in the future.
InfoQ: You mentioned the use of your in-house CMS called Copilot. Did having your own CMS help in producing HAL or do you think this could have been done with any CMS?
Rau: Copilot is backed by a set of APIs named the Formation Platform. We realized that the right place for HAL was in the content-production pipeline; this way the automatic enrichment is part of the content types and content models served by the APIs. But the reverse is also true: one of the HAL components, the Copilot-linker, an instance of an entity-linker, mines Copilot content types daily, like restaurants, people, venues, etc., to "learn" the knowledge that the editors put into the system, so as to automatically extract those entities from articles and propose links between them. So, I would say that in the context of Condé Nast, and of publishers in general, content analysis and NLP are highly synergetic with the CMS. If the CMS is proprietary, it is easier to make it part of the internal flow and hence streamline the downstream usage of this enrichment, but I guess one could also augment an OSS CMS if extension points are available in the right places.
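One plausible shape for the Copilot-linker's proposal step is a dictionary lookup: entity names mined from Copilot content types are matched against article text and surfaced as candidate links. The sketch below is entirely illustrative and reuses the hypothetical types from earlier; production entity linking would also need disambiguation and fuzzier matching:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary-based entity linker: names mined daily from Copilot
// content types (restaurants, people, venues, ...) are matched against article
// text and proposed as links between pieces of content.
class CopilotLinkerSketch implements Analyzer {
    private final Map<String, String> nameToContentId; // mined from the CMS

    CopilotLinkerSketch(Map<String, String> nameToContentId) {
        this.nameToContentId = nameToContentId;
    }

    @Override
    public List<Annotation> analyze(String text, List<Annotation> upstream) {
        List<Annotation> proposed = new ArrayList<>();
        nameToContentId.forEach((name, contentId) -> {
            int idx = text.indexOf(name);
            if (idx >= 0) {
                proposed.add(new Annotation("copilot-link", idx, idx + name.length(),
                        Map.of("contentId", contentId)));
            }
        });
        return proposed;
    }
}
```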
InfoQ: What sort of volumes go through HAL?
Rau: Around 30 million requests per month. We process all the revisions with changed text, and sometimes also content that is not from Condé.
InfoQ: What metrics other than click-through rate do you measure, and have there been any improvements in those metrics due to HAL?
Rau: The HAL topics feature has been the most predictive feature in the data science team's predictive models, which have been used both for audience targeting and consumer-subscription propensity.