DeepSeek has released Janus-Pro, an updated version of its multimodal model, Janus. The new model refines the training strategy, scales up the training data, and increases model size, improving both multimodal understanding and text-to-image generation.
Janus-Pro follows an autoregressive framework that decouples visual encoding into separate pathways for multimodal understanding and for generation, while keeping a single transformer architecture. This design increases flexibility and reduces the conflict between the visual encoder's two roles, which addresses stability and performance issues and lets the model achieve performance competitive with task-specific models while remaining unified. The model also incorporates synthetic aesthetic data to enhance text-to-image generation.
Janus-Pro improves multimodal understanding and visual generation performance. Multimodal understanding is measured using the average accuracy of POPE, MME-Perception (scaled), GQA, and MMMU. Visual generation is evaluated using GenEval and DPG-Bench. Janus-Pro outperforms previous unified multimodal models and some task-specific models.
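The averaged understanding score described above can be sketched in a few lines. This is only an illustration: it assumes MME-Perception is scaled by dividing by 20 (its maximum score is 2000, so this maps it onto the same 0-100 range as the other benchmarks), and the scores in the example are placeholder values, not results reported for Janus-Pro or any other model.

```python
def average_understanding_score(scores: dict, mme_divisor: float = 20.0) -> float:
    """Average benchmark accuracies after rescaling MME-Perception.

    mme_divisor is an assumption: MME-Perception tops out at 2000,
    so dividing by 20 puts it on a 0-100 scale like the others.
    """
    scaled = dict(scores)
    scaled["MME-Perception"] = scores["MME-Perception"] / mme_divisor
    return sum(scaled.values()) / len(scaled)


# Placeholder numbers for illustration only (not real benchmark results).
placeholder_scores = {
    "POPE": 80.0,
    "MME-Perception": 1500.0,
    "GQA": 60.0,
    "MMMU": 40.0,
}

print(average_understanding_score(placeholder_scores))
```

Rescaling before averaging keeps a single benchmark with a larger numeric range from dominating the combined score.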
Janus-Pro comes in two sizes, built on DeepSeek-LLM-1.5B and DeepSeek-LLM-7B. The larger model performs better on benchmarks like MMBench and GenEval. For understanding, Janus-Pro uses SigLIP-L as its vision encoder and supports 384x384 image inputs. Image generation relies on a tokenizer with a downsampling rate of 16.
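To make the downsampling rate concrete, the short sketch below works out how many discrete tokens represent one image. It assumes the generation tokenizer operates on the same 384x384 resolution as the understanding input and that the rate of 16 applies to each spatial dimension; the function name is illustrative and not part of the Janus-Pro code.

```python
def image_token_grid(image_size: int = 384, downsample: int = 16) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image.

    Assumes the downsampling rate applies to both height and width.
    """
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by the downsampling rate")
    side = image_size // downsample
    return side, side * side


side, total = image_token_grid()
print(f"{side} x {side} grid, {total} tokens per image")
```

Under these assumptions a 384x384 image maps to a 24 x 24 grid, so the autoregressive model would predict 576 image tokens per generated picture.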
DeepSeek's Janus-Pro-7B and OpenAI's DALL-E 3 are both advanced models in text-to-image generation. According to DeepSeek, Janus-Pro-7B outperforms DALL-E 3 in benchmarks such as GenEval and DPG-Bench. This performance is attributed to Janus-Pro-7B's improved training processes, data quality, and model size, contributing to more stable and detailed images.
The release of Janus-Pro has generated significant buzz and commentary. Vedang Vatsa FRSA shared:
DeepSeek’s Janus-Pro-7B is here. Outperforms DALL-E 3 & Stable Diffusion on GenEval/DPG-Bench. Separates understanding/generation, scales data/models for stable image gen. Unified, flexible, cost-efficient. Open-source win!
AI expert Huzaifa Shoukat posted:
DeepSeek's new Janus Pro model is impressive. It's a multimodal LLM that understands images and generates them too. The 1B model runs in the browser using WebGPU via Transformers.js.
The Janus-Pro code is available on GitHub under the MIT License, while use of the model weights is governed by the DeepSeek Model License. Users can refer to the repository for setup instructions.