Meta has released NotebookLlama, an open-source toolkit that converts PDF documents into podcasts and gives developers a structured, accessible PDF-to-audio workflow. As an open-source alternative to Google’s NotebookLM, NotebookLlama guides users through a four-step process that turns PDF text into audio content and requires no prior experience with large language models (LLMs) or audio processing. The toolkit offers a practical way for users to experiment with LLMs and text-to-speech (TTS) models to create conversational, audio-ready content.
NotebookLlama's workflow includes:
- PDF Pre-processing: Using the Llama-3.2-1B-Instruct model, the toolkit cleans and formats PDF content into plain text, maintaining structural integrity.
- Transcript Generation: The Llama-3.1-70B-Instruct model, selected for its ability to produce engaging, conversational text, turns the plain text into a script suitable for a podcast.
- Dramatize Podcast: The Llama-3.1-8B-Instruct model further adjusts the transcript, enhancing its conversational appeal for audio audiences.
- Text-to-Speech (TTS) Conversion: The final audio is produced using the Parler-TTS and Bark TTS models, with prompts tailored to simulate distinct speakers.
(Source: NotebookLlama GitHub Repository)
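The four steps above can be pictured as a simple chain in which each stage consumes the previous stage's output. The following is a minimal sketch of that structure only, not NotebookLlama's actual code: the `Stage` class, the `placeholder` helper, and `run_pipeline` are hypothetical names, and each stage merely tags its input instead of calling a model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Stage:
    """One step of the workflow (hypothetical structure for illustration)."""
    name: str
    model: str                  # model the article names for this step
    run: Callable[[str], str]   # stand-in; a real stage would call the model


def placeholder(label: str) -> Callable[[str], str]:
    """Return a stand-in stage function that only tags its input."""
    def _run(text: str) -> str:
        return f"[{label}] {text}"
    return _run


# Stage order and model names follow the article; everything else is assumed.
# The real TTS stage would emit audio; a string stands in for it here.
PIPELINE: List[Stage] = [
    Stage("pdf_preprocessing", "Llama-3.2-1B-Instruct", placeholder("clean")),
    Stage("transcript_generation", "Llama-3.1-70B-Instruct", placeholder("script")),
    Stage("dramatize_podcast", "Llama-3.1-8B-Instruct", placeholder("rewrite")),
    Stage("tts_conversion", "Parler-TTS / Bark", placeholder("audio")),
]


def run_pipeline(text: str, stages: List[Stage] = PIPELINE) -> str:
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        text = stage.run(text)
    return text


if __name__ == "__main__":
    for stage in PIPELINE:
        print(f"{stage.name}: {stage.model}")
    print(run_pipeline("raw PDF text"))
```

Keeping each stage behind the same callable signature is one way to make it easy to swap in a different model or prompt for a single step, which is the kind of experimentation the project invites.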
Running NotebookLlama requires a GPU server or an API provider for the larger models. The 70B model, for instance, needs around 140 GB of aggregate memory. The toolkit is available on GitHub, and users must log in to Hugging Face to access the models.
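The 140 GB figure is consistent with a back-of-the-envelope estimate: 70 billion parameters at 2 bytes each. The sketch below assumes 16-bit weights (for example bfloat16), which the article does not state, and counts weights only; activations and the key-value cache would add to the total. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes.

    Assumes 2 bytes per parameter (16-bit weights) by default; this is an
    assumption for illustration, not a figure taken from the article.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Parameter counts are the nominal sizes in the model names.
    for name, params in [("1B", 1e9), ("8B", 8e9), ("70B", 70e9)]:
        print(f"{name}: ~{weight_memory_gb(params):.0f} GB")
```

Under the same assumption, the 8B and 1B models come out at roughly 16 GB and 2 GB of weights, which is why only the 70B step typically calls for multiple GPUs or a hosted API.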
NotebookLlama has received significant community feedback since its launch. While users appreciate the flexibility of the open-source model, several have pointed out limitations compared with Google’s proprietary system, particularly in voice quality.
Commenting on the quality of AI-generated text, John K. Moran added:
“While NotebookLlama offers exciting features, the ongoing issue of hallucinations in AI-generated content is a real concern. Accuracy is paramount, especially when it comes to generating documentation or analysis for code. Both NotebookLlama and NotebookLM will need to prioritize this to gain trust among developers and users alike.”
Planned improvements for NotebookLlama include refining the text-to-speech model to produce more natural-sounding audio and exploring the use of two LLMs to write interactive podcast scripts with a stronger conversational feel. The developers are also experimenting with larger models, such as the 405B model, to improve transcript quality. Other planned updates include broader input options, such as website or YouTube links, and better prompt design.
Meta encourages experimentation with model selection and prompt tuning, and invites the community to contribute by opening pull requests.