Computer Vision Content on InfoQ
-
HuggingGPT: Leveraging LLMs to Solve Complex AI Tasks with Hugging Face Models
A recent paper by researchers at Zhejiang University and Microsoft Research Asia explores the use of large language models (LLMs) as a controller to manage existing AI models available in communities like Hugging Face.
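The paper describes a four-stage pipeline: task planning, model selection, task execution, and response generation. A minimal sketch of that control loop, where `llm` and `run_model` are hypothetical callables standing in for the LLM prompts and Hugging Face model calls (none of these names come from the paper's code):

```python
import json
from typing import Any, Callable

def hugginggpt(user_request: str,
               llm: Callable[[str], str],
               run_model: Callable[[str, dict], Any]) -> str:
    # 1. Task planning: the LLM decomposes the request into structured subtasks.
    tasks = json.loads(llm(
        f"Decompose into a JSON list of AI tasks with their inputs: {user_request}"))

    # 2. Model selection and 3. task execution, one subtask at a time.
    results = []
    for task in tasks:
        model_id = llm(f"Name the best Hugging Face model id for: {task}")
        results.append(run_model(model_id, task))

    # 4. Response generation: the LLM composes the final answer from the results.
    return llm(f"Answer {user_request!r} using these results: {results}")
```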
-
Carnegie Mellon Researchers Develop AI Model for Human Detection via WiFi
Researchers from the Human Sensing Laboratory at Carnegie Mellon University (CMU) have published a paper on DensePose From WiFi, an AI model that can detect the pose of multiple humans in a room using only the signals from WiFi transmitters. In experiments on real-world data, the algorithm achieves an average precision of 87.2 at the 50% intersection-over-union (IOU) threshold.
-
Microsoft Brings Its Cloud Services and AI to the Edge
Microsoft recently announced the open-source release of Azure DeepStream Accelerator (ADA), developed in collaboration with Neal Analytics and NVIDIA, which lets developers quickly build edge AI solutions with native Azure services integration.
-
Salesforce Open-Sources Language-Vision AI Toolkit LAVIS
Salesforce Research recently open-sourced LAnguage-VISion (LAVIS), a unified library for deep-learning language-vision research. LAVIS supports more than 10 language-vision tasks on 20 public datasets and includes pre-trained model weights for over 30 fine-tuned models.
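As an illustration, image captioning with LAVIS follows a load-preprocess-generate pattern; the sketch below uses the library's `load_model_and_preprocess` entry point (the model name and image path are assumptions for the example):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a pre-trained BLIP captioning model plus its matching image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device)

# Preprocess a local image (placeholder path) and generate a caption.
raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image})[0])
```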
-
Microsoft Introduces New UI Experience for Trying out Computer Vision with Vision Studio
Microsoft recently introduced Vision Studio, a new user interface (UI) that lets developers try out its Computer Vision API.
-
Microsoft Previews Computer Vision Image Analysis API 4.0
Microsoft recently announced the public preview of a new version of the Computer Vision Image Analysis API, which makes all visual image features, ranging from optical character recognition (OCR) to object detection, available through a single endpoint.
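Because everything sits behind one endpoint, a single call can request several features at once. A sketch using Python's requests library (the resource name, key, api-version string, and feature list are assumptions; check the current Azure documentation):

```python
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"

response = requests.post(
    f"{endpoint}/computervision/imageanalysis:analyze",
    # The api-version below is an assumption based on the preview timeframe.
    params={"api-version": "2022-10-12-preview",
            "features": "caption,read,objects"},  # OCR and detection in one call
    headers={"Ocp-Apim-Subscription-Key": "<your-key>",
             "Content-Type": "application/json"},
    json={"url": "https://example.com/sample.jpg"},
)
print(response.json())
```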
-
Meta Announces Video Generation AI Model Make-A-Video
Meta AI recently announced Make-A-Video, a text-to-video generation AI model. Make-A-Video is trained using publicly available image-text pairs and video-only data and achieves state-of-the-art performance on the UCF-101 video-generation benchmark.
-
Microsoft Trains Two Billion Parameter Vision-Language AI Model BEiT-3
Researchers from Microsoft's Natural Language Computing (NLC) group announced the latest version of Bidirectional Encoder representation from Image Transformers: BEiT-3, a 1.9B parameter vision-language AI model. BEiT-3 models images as another language and achieves state-of-the-art performance on a wide range of downstream tasks.
-
Stability AI Open-Sources Image Generation Model Stable Diffusion
Stability AI released the pre-trained model weights for Stable Diffusion, a text-to-image AI model, to the general public. Given a text prompt, Stable Diffusion can generate photorealistic 512x512 pixel images depicting the scene described in the prompt.
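Because the weights are public, generation can run locally with the Hugging Face diffusers library; a minimal sketch (the fp16/CUDA setup is an assumption; drop it to run on CPU):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the released v1.4 weights; half precision on a CUDA GPU is assumed here.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]  # a 512x512 PIL image
image.save("astronaut.png")
```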
-
Google Releases CameraX 1.2 Beta with MLKit Integration
Now available in beta, CameraX 1.2 brings out-of-the-box integration with some of the ML Kit vision APIs and a new feature aimed at reducing shutter-button lag when taking pictures.
-
Google AI Open-Sources a New ML Tool for Conceptual and Subjective Queries over Images
Google AI open-sourced Mood Board Search, a new ML-powered tool that lets users run conceptual and subjective queries, such as "peaceful" or "beautiful", over image collections.
-
Google's Image-Text AI LIMoE Outperforms CLIP on ImageNet Benchmark
Researchers at Google Brain recently trained Language-Image Mixture of Experts (LIMoE), a 5.6B parameter image-text AI model. In zero-shot learning experiments on ImageNet, LIMoE outperforms CLIP and performs comparably to state-of-the-art models while using fewer compute resources.
-
Adobe Researchers Open-Source Image Captioning AI CLIP-S
Researchers from Adobe and the University of North Carolina (UNC) have open-sourced CLIP-S, an image-captioning AI model that produces fine-grained descriptions of images. In evaluations with captions generated by other models, human judges preferred those generated by CLIP-S a majority of the time.
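CLIP-S trains its captioner against a CLIP-based image-text similarity reward. The underlying scoring idea can be illustrated with the off-the-shelf CLIP model in Hugging Face transformers (this is not the authors' released code, and the image path and captions are placeholders):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
captions = ["a dog leaping to catch a frisbee in a park", "a dog outside"]

# Score each candidate caption against the image; a higher score means a
# better match, the kind of signal CLIP-S-style training uses as a reward.
inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
scores = model(**inputs).logits_per_image[0]
print(dict(zip(captions, scores.tolist())))
```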
-
DeepMind Trains 80 Billion Parameter AI Vision-Language Model Flamingo
DeepMind recently trained Flamingo, an 80B parameter vision-language model (VLM). Flamingo combines separately pre-trained vision and language models and outperforms all other few-shot learning models on 16 vision-language benchmarks. It can also chat with users, answering questions about input images and videos.
-
LAION Releases Five Billion Image-Text Pair Dataset LAION-5B
The Large-scale Artificial Intelligence Open Network (LAION) released LAION-5B, an AI training dataset containing over five billion image-text pairs. LAION-5B contains images and captions scraped from the internet and is 14x larger than its predecessor LAION-400M, making it the largest freely available image-text dataset.
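The dataset ships as URL-caption metadata rather than raw images. One way to inspect it is to stream a subset with the Hugging Face datasets library; the sketch below assumes the English LAION-2B slice published on the Hub and its uppercase column names:

```python
from datasets import load_dataset

# Stream the metadata so the multi-terabyte dataset is never fully downloaded.
ds = load_dataset("laion/laion2B-en", split="train", streaming=True)

for row in ds.take(3):
    # Each record pairs an image URL with its scraped alt-text caption.
    print(row["URL"], "->", row["TEXT"])
```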