Mistral Releases Pixtral 12B, Its First-Ever Multimodal AI Model

Mistral AI has launched Pixtral 12B, its first multimodal model with both language and vision processing capabilities, positioning the company to compete with AI leaders like OpenAI and Anthropic. The model is available for download from Hugging Face, GitHub, or via a torrent link. VentureBeat reports: While the official details of the new model, including the data it was trained on, remain under wraps, the core idea appears to be that Pixtral 12B will let users analyze images and combine them with text prompts. So, ideally, one would be able to upload an image or provide a link to one and ask questions about the subjects in the file. The move is a first for Mistral, but it is important to note that multiple other models, including those from competitors like OpenAI and Anthropic, already have image-processing capabilities.

When an X user asked [Sophia Yang, the head of developer relations at the company] what makes the Pixtral 12-billion-parameter model unique, she said it will natively support an arbitrary number of images of arbitrary sizes. As shared by initial testers on X, the 24GB model’s architecture appears to have 40 layers, a hidden dimension size of 14,336, and 32 attention heads. On the vision front, it has a dedicated vision encoder with support for 1024×1024 image resolution and 24 hidden layers. This, however, could change when the company makes the model available via API.
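
The 24GB download size is roughly what a 12-billion-parameter model would occupy if each weight were stored in 16 bits. The article does not state the storage precision, so the 2-bytes-per-parameter figure below is an assumption, and the helper name `weights_size_gb` is made up for illustration. A minimal Python sketch of the arithmetic:

```python
def weights_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of a model's weights in decimal gigabytes.

    bytes_per_param=2 assumes 16-bit weights, which the article
    does not confirm for Pixtral 12B.
    """
    return num_params * bytes_per_param / 1e9


# "12-billion parameter" figure quoted in the article.
PIXTRAL_PARAMS = 12e9

size_16bit = weights_size_gb(PIXTRAL_PARAMS, bytes_per_param=2)
size_32bit = weights_size_gb(PIXTRAL_PARAMS, bytes_per_param=4)

print(f"16-bit weights: ~{size_16bit:.0f} GB")
print(f"32-bit weights: ~{size_32bit:.0f} GB")
```

Under that assumption the estimate comes to about 24 GB, in line with the reported size, while 32-bit storage would roughly double it. The estimate ignores tokenizer files and other metadata shipped alongside the weights.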