
Pixtral-12B-2409 is now available on Amazon Bedrock Marketplace


Today, we are excited to announce that Pixtral 12B (pixtral-12b-2409), a state-of-the-art 12-billion-parameter vision language model (VLM) from Mistral AI that excels in both text-only and multimodal tasks, is available for customers through Amazon Bedrock Marketplace. Amazon Bedrock Marketplace is a new capability in Amazon Bedrock that enables developers to discover, test, and use over 100 popular, emerging, and specialized foundation models (FMs) alongside the current selection of industry-leading models in Amazon Bedrock. You can also use this model with Amazon SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms and models that can be deployed with one click for running inference.

In this post, we walk through how to discover, deploy, and use the Pixtral 12B model for a variety of real-world vision use cases.

Overview of Pixtral 12B

Pixtral 12B, Mistral’s inaugural VLM, delivers robust performance across a range of benchmarks, surpassing other open models and rivaling larger counterparts, according to Mistral’s evaluation. Designed for both image and document comprehension, Pixtral demonstrates advanced capabilities in vision-related tasks, including chart and figure interpretation, document question answering, multimodal reasoning, and instruction following—several of which are illustrated with examples later in this post. The model processes images at their native resolution and aspect ratio, providing high-fidelity input handling. Unlike many open source alternatives, Pixtral 12B achieves strong results in text-based benchmarks—such as instruction following, coding, and mathematical reasoning—without sacrificing its proficiency in multimodal tasks.

Mistral developed a novel architecture for Pixtral 12B, optimized for both computational efficiency and performance. The model consists of two main components: a 400-million-parameter vision encoder, responsible for tokenizing images, and a 12-billion-parameter multimodal transformer decoder, which predicts the next text token based on a sequence of text and images. The vision encoder was specifically trained to natively handle variable image sizes, enabling Pixtral to accurately interpret high-resolution diagrams, charts, and documents while maintaining fast inference speeds for smaller images such as icons, clipart, and equations. This architecture supports processing an arbitrary number of images of varying sizes within a large context window of 128k tokens.
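To build intuition for what native-resolution processing means for token budgets, the following sketch estimates the vision-token cost of an image. It assumes, per Mistral’s published description of Pixtral, that the encoder emits roughly one token per 16x16 pixel patch plus one break token per patch row; these constants are illustrative assumptions, not values exposed by Amazon Bedrock.

import math

# Illustrative constants based on Mistral's published description of
# Pixtral's vision encoder (16x16 pixel patches); these are assumptions
# for estimation only, not values exposed by the Bedrock API.
PATCH_SIZE = 16

def estimate_image_tokens(width: int, height: int) -> int:
    cols = math.ceil(width / PATCH_SIZE)
    rows = math.ceil(height / PATCH_SIZE)
    # One token per patch, plus one row-break token per patch row
    return rows * cols + rows

print(estimate_image_tokens(1024, 1024))  # ~4160 tokens for a large diagram
print(estimate_image_tokens(64, 64))      # ~20 tokens for a small icon

Under these assumptions, a single high-resolution document page consumes a few thousand tokens, which is why the 128k-token context window matters when you pass several pages or charts in one request.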

License agreements are a critical decision factor when using open-weights models. Similar to other Mistral models, such as Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, and Mistral Nemo 12B, Pixtral 12B is released under the commercially permissive Apache 2.0 license, providing enterprise and startup customers with a high-performing VLM option for building complex multimodal applications.

Performance metrics and benchmarks

Pixtral 12B is trained to understand both natural images and documents, achieving 52.5% on the Massive Multitask Language Understanding (MMLU) reasoning benchmark and surpassing a number of larger models, according to Mistral. MMLU evaluates a language model’s ability to understand and use language across more than 10,000 multiple-choice questions spanning academic subjects such as mathematics, philosophy, law, and medicine. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning, and instruction following. Pixtral ingests images at their natural resolution and aspect ratio, giving the user flexibility over the number of tokens used to process an image, and it can process multiple images in its long context window of 128,000 tokens. Unlike previous open source models, Pixtral doesn’t compromise on text benchmark performance to excel in multimodal tasks, according to Mistral.

You can review the benchmarks published by Mistral for more detail.

Prerequisites

To try out Pixtral 12B in Amazon Bedrock Marketplace, you need an AWS account with access to the Amazon Bedrock console, along with sufficient service quota for the GPU instance type you plan to deploy to (for example, ml.g6.12xlarge).

Deploy Pixtral 12B in Amazon Bedrock Marketplace

On the Amazon Bedrock console, you can search for models that help you with a specific use case or language. The results of the search include both serverless models and models available in Amazon Bedrock Marketplace. You can filter results by provider, modality (such as text, image, or audio), or task (such as classification or text summarization).

To access Pixtral 12B in Amazon Bedrock Marketplace, follow these steps:

  1. On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane.
  2. Filter for Hugging Face as a provider and choose the Pixtral 12B model, or search for Pixtral in the Filter for a model input box.

The model detail page provides essential information about the model’s capabilities, pricing structure, and implementation guidelines. You can find detailed usage instructions, including sample API calls and code snippets for integration.

The page also includes deployment options and licensing information to help you get started with Pixtral 12B in your applications.

  3. To begin using Pixtral 12B, choose Deploy.

You will be prompted to configure the deployment details for Pixtral 12B. The model ID will be prepopulated.

  4. Carefully read and accept the End User License Agreement (EULA).
  5. The Endpoint Name is automatically populated; you can rename the endpoint if you want.
  6. For Number of instances, enter a number of instances (between 1 and 100).
  7. For Instance type, choose your instance type. For optimal performance with Pixtral 12B, a GPU-based instance type such as ml.g6.12xlarge is recommended.

Optionally, you can configure advanced security and infrastructure settings, including virtual private cloud (VPC) networking, service role permissions, and encryption settings. For most use cases, the default settings will work well. However, for production deployments, you might want to review these settings to align with your organization’s security and compliance requirements.

  8. Choose Deploy to begin using the model.

When the deployment is complete, Endpoint status should change to In Service. After the endpoint is in service, you can test Pixtral 12B capabilities directly in the Amazon Bedrock playground.

  9. Choose Open in playground to access an interactive interface where you can experiment with different prompts and adjust model parameters like temperature and maximum length.

This is an excellent way to explore the model’s reasoning and text generation abilities before integrating it into your applications. The playground provides immediate feedback, helping you understand how the model responds to various inputs and letting you fine-tune your prompts for optimal results.

You can quickly test the model in the playground through the UI. However, to invoke the deployed model programmatically with Amazon Bedrock APIs, you need to use the endpoint Amazon Resource Name (ARN) as the model ID in the Amazon Bedrock SDK, as sketched below.
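The following is a minimal sketch of such a call with boto3; the Region and endpoint ARN are placeholders that you would replace with the values from your own Marketplace deployment:

import boto3

# Placeholder Region and endpoint ARN: copy the ARN shown on the
# Marketplace deployments page for your Pixtral 12B endpoint.
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-west-2')
endpoint_arn = 'arn:aws:sagemaker:us-west-2:111122223333:endpoint/your-pixtral-endpoint'

response = bedrock_runtime.converse(
    modelId=endpoint_arn,
    messages=[{'role': 'user', 'content': [{'text': 'Describe Pixtral 12B in one sentence.'}]}],
    inferenceConfig={'maxTokens': 256, 'temperature': 0.6}
)
print(response['output']['message']['content'][0]['text'])

The helper functions in the next section use this same pattern.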

Pixtral 12B use cases

In this section, we provide example use cases of Pixtral 12B using sample prompts. We have defined helper functions to invoke the Pixtral 12B model using the Amazon Bedrock Converse API:

import boto3
from PIL import Image

# Amazon Bedrock Runtime client used by the helper functions below
bedrock_runtime = boto3.client('bedrock-runtime')

def get_image_format(image_path):
    with Image.open(image_path) as img:
        # Normalize the format to a known valid one
        fmt = img.format.lower() if img.format else 'jpeg'
        # Convert 'jpg' to 'jpeg'
        if fmt == 'jpg':
            fmt = 'jpeg'
    return fmt

def call_bedrock_model(model_id=None, prompt="", image_paths=None, system_prompt="", temperature=0.6, top_p=0.9, max_tokens=3000):
    if isinstance(image_paths, str):
        image_paths = [image_paths]
    if image_paths is None:
        image_paths = []

    # Start building the content array for the user message
    content_blocks = []

    # Include a text block if a prompt is provided
    if prompt.strip():
        content_blocks.append({"text": prompt})

    # Add images as raw bytes (the Converse API accepts raw bytes, no base64 encoding)
    for img_path in image_paths:
        fmt = get_image_format(img_path)
        with open(img_path, 'rb') as f:
            image_raw_bytes = f.read()

        content_blocks.append({
            "image": {
                "format": fmt,
                "source": {
                    "bytes": image_raw_bytes
                }
            }
        })

    # Construct the messages structure
    messages = [
        {
            "role": "user",
            "content": content_blocks
        }
    ]

    # Include a system prompt only if one is provided
    kwargs = {}
    if system_prompt.strip():
        kwargs["system"] = [{"text": system_prompt}]

    # Build the arguments for the `converse` call
    converse_kwargs = {
        "modelId": model_id,
        "messages": messages,
        "inferenceConfig": {
            "maxTokens": max_tokens,
            "temperature": temperature,
            "topP": top_p
        },
        **kwargs
    }

    # Call the Converse API and parse the assistant response
    try:
        response = bedrock_runtime.converse(**converse_kwargs)
        assistant_message = response.get('output', {}).get('message', {})
        assistant_content = assistant_message.get('content', [])
        result_text = "".join(block.get('text', '') for block in assistant_content)
    except Exception as e:
        result_text = f"Error message: {e}"
    return result_text

Visual logical reasoning

One of the interesting use cases of vision models is solving logical reasoning problems or visual puzzles. Pixtral 12B is highly capable at solving logical reasoning questions. Let’s explore an example.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='You are solving logical reasoning problems.'
task = 'Which of these figures differ from the other four?'
image_path = './Pixtral_data/logical_reasoning.jpg'

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

The image shows five figures labeled A, B, C, D, and E. Each figure consists of a head, body, and legs. The task is to identify which figure differs from the others.

**Reasoning:**
- Figure A: The head is connected directly to the body, and the legs are straight.
- Figure B: The head is connected directly to the body, and the legs are straight.
- Figure C: The head is connected directly to the body, and the legs are straight.
- Figure D: The head is connected directly to the body, and the legs are straight.
- Figure E: The head is connected directly to the body, but the legs are bent at the knees.

**Conclusion:** Figure E is the one that differs from the others.

Structured product information

Extracting product information is crucial for the retail industry, especially on sites that host third-party sellers, where product images are the most accessible resource. Accurately capturing relevant details from these images is vital for a product’s success in ecommerce. For instance, using advanced visual models like Pixtral 12B, retailers can efficiently extract key attributes from clothing product images, such as color, style, and patterns. This capability not only streamlines inventory management but also enhances customer experiences by providing essential information that aids in informed purchasing decisions.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='''You are a catalog manager for an ecommerce portal. You have an image of a product. 
Think very carefully in a step-by-step manner to extract product information. 
Always provide response in prescribed JSON format.
'''
image_path = './Pixtral_data/cap.png'
task = '''
Organize product information in JSON format to store in a database. 

Output json schema:

{
"product_name": "",
"description": "",
"category": "",
"sub_category": "",
"color": "",
"size": "",
"brand": "",
"material": "",
"features": [],
"image_alt_text": ""
}
'''

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

Certainly! Based on the image provided, here is the extracted product information in JSON format:

```json
{
  "product_name": "Nike Blue Cap",
  "description": "A stylish blue cap from Nike with a sleek design and adjustable fit.",
  "category": "Apparel",
  "sub_category": "Headwear",
  "color": "Blue",
  "size": "Adjustable",
  "brand": "Nike",
  "material": "Nylon",
  "features": [
    "Adjustable fit",
    "Lightweight",
    "Breathable material",
    "Embroidered Nike logo"
  ],
  "image_alt_text": "A blue Nike cap with a white Nike swoosh logo on the front."
}
```

Vehicle damage assessment

In the insurance industry, image analysis plays a crucial role in claims processing. For vehicle damage assessment, vision models like Pixtral 12B can be used to compare images taken at policy issuance with those submitted during a claim. This approach can streamline the evaluation process, potentially reducing loss adjustment expenses and expediting claim resolution. By automating the identification and characterization of automobile damage, insurers can enhance efficiency, improve accuracy, and ultimately provide a better experience for policyholders.

We use the following input images.

Our prompt and input payload are as follows:

image_paths = [
    "Pixtral_data/car_image_before.png",
    "Pixtral_data/car_image_after.png"
]

system_prompt='''
You are a helpful ai assistant for an insurance agent. 
Insurance agent has received a claim for a vehicle damage. 
'''

task = '''This claim includes two images. 
One of the image was taken before the incident and another was taken after the incident.

Analyse these images and answer below questions:
1. describe if there is any damage to the vehicle
2. should insurance agent accept or reject the claim

'''

print('Input Images:\n\n')
Image.open(image_paths[0]).show()
Image.open(image_paths[1]).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_paths)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

### Analysis of Vehicle Damage

#### Image 1 (Before Incident)
- **Vehicle Condition**: The vehicle appears to be in good condition with no visible damage.
- **Paint and Body**: The paint is uniform, and there are no visible scratches, dents, or other forms of damage.
- **Tires and Wheels**: The tires and wheels look intact and in good condition.

#### Image 2 (After Incident)
- **Vehicle Condition**: The vehicle shows noticeable damage.
- **Paint and Body**: There is a visible dent or scratch on the rear side of the vehicle.
- **Tires and Wheels**: The tires and wheels appear to be intact, but the focus is on the body damage.

### Questions and Answers

1. **Describe if there is any damage to the vehicle:**
   - **Before Incident**: No visible damage.
   - **After Incident**: There is visible damage on the rear side of the vehicle, likely a dent or scratch.

2. **Should the insurance agent accept or reject the claim:**
   - **Recommendation**: The insurance agent should accept the claim. The images clearly show that the vehicle sustained damage after the incident. The before and after images provide sufficient evidence of the damage, which was likely caused by an incident.

### Conclusion
Based on the analysis of the images, the insurance agent should accept the claim as the damage to the vehicle is evident and appears to have occurred after the incident.

Handwriting recognition

Another strength of vision language models is their ability to recognize handwriting and extract handwritten text. Pixtral 12B performs well at extracting content from complex and poorly handwritten notes.

We use the following input image.


Our prompt and input payload are as follows:

system_prompt='You are a Graphologist'
task = '''
Analyze the image and transcribe any handwritten text present. 
Convert the handwriting into a single, continuous string of text. 
Maintain the original spelling, punctuation, and capitalization as written. Ignore any printed text, drawings, or other non-handwritten elements in the image.
'''

image_path = './Pixtral_data/a01-000u-04.png'


print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

put down a resolution on the subject

Reasoning of complex figures

VLMs excel at interpreting and reasoning about complex figures, charts, and diagrams. In this use case, we use Pixtral 12B to analyze an intricate image containing GDP data. Pixtral 12B’s advanced capabilities in document understanding and complex figure analysis make it well suited for extracting insights from visual representations of economic data. By processing both the visual elements and accompanying text, Pixtral 12B can provide detailed interpretations and reasoned analysis of the GDP figures presented in the image.

We use the following input image.

Our prompt and input payload are as follows:

system_prompt='You are a Global Economist.'
task = 'List the top 5 countries in Europe with the highest GDP'
image_path = './Pixtral_data/gdp.png'

print('Input Image:\n\n')
Image.open(image_path).show()

response = call_bedrock_model(model_id=endpoint_arn, 
                   prompt=task, 
                   system_prompt=system_prompt,
                   image_paths = image_path)

print(f'\nResponse from the model:\n\n{response}')

We get the following response:

The top 5 countries in Europe with the highest GDP, based on the provided diagram, are:

1. **Germany**
   - GDP: $3.99 trillion
   - GDP Percentage: 4.65%

2. **United Kingdom**
   - GDP: $2.82 trillion
   - GDP Percentage: 3.29%

3. **France**
   - GDP: $2.78 trillion
   - GDP Percentage: 3.24%

4. **Italy**
   - GDP: $2.07 trillion
   - GDP Percentage: 2.42%

5. **Spain**
   - GDP: $1.43 trillion
   - GDP Percentage: 1.66%

These countries are highlighted in green on the diagram.

Clean up

To avoid unwanted charges, clean up your resources. If you deployed the model using Amazon Bedrock Marketplace, complete the following steps:

Delete the Amazon Bedrock Marketplace deployment

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Marketplace deployments.
  2. In the Managed deployments section, locate the endpoint you want to delete.
  3. Verify the endpoint details to make sure you’re deleting the correct deployment:
    1. Endpoint name
    2. Model name
    3. Endpoint status
  4. Select the endpoint, and choose Delete.
  5. In the deletion confirmation dialog, review the warning message, enter confirm, and choose Delete to permanently remove the endpoint.
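Alternatively, if you prefer to clean up programmatically, the AWS SDK exposes Amazon Bedrock Marketplace endpoint operations. The following is a minimal sketch, assuming a boto3 version recent enough to include these APIs; the endpoint ARN is a placeholder for your own deployment’s ARN:

import boto3

# Assumes a boto3 version recent enough to include the Amazon Bedrock
# Marketplace endpoint APIs; the ARN below is a placeholder.
bedrock = boto3.client('bedrock')

bedrock.delete_marketplace_model_endpoint(
    endpointArn='arn:aws:sagemaker:us-west-2:111122223333:endpoint/your-pixtral-endpoint'
)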

Conclusion

In this post, we showed you how to get started with the Pixtral 12B model in Amazon Bedrock and deploy the model for inference. The Pixtral 12B vision model enables you to solve multiple use cases, including document understanding, logical reasoning, handwriting recognition, image comparison, entity extraction, extraction of structured data from scanned images, and caption generation. These capabilities can drive productivity in a number of enterprise use cases, including ecommerce (retail), marketing, financial services (FSI), and more.

For more Mistral resources on AWS, check out the GitHub repo. The complete code for the samples featured in this post is available on GitHub. Pixtral 12B is also available in Amazon SageMaker JumpStart; refer to Pixtral 12B is now available on Amazon SageMaker JumpStart for details.


About the Authors

Deepesh Dhapola is a Senior Solutions Architect at AWS India, where he assists financial services and fintech clients in scaling and optimizing their applications on the AWS platform. He specializes in core machine learning and generative AI. Outside of work, Deepesh enjoys spending time with his family and experimenting with various cuisines.

Preston Tuggle is a Sr. Specialist Solutions Architect working on generative AI.

Shane Rai is a Principal GenAI Specialist with the AWS World Wide Specialist Organization (WWSO). He works with customers across industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services including model offerings from top tier foundation model providers.

John Liu has 14 years of experience as a product executive and 10 years of experience as a portfolio manager. At AWS, John is a Principal Product Manager for Amazon Bedrock. Previously, he was the Head of Product for AWS Web3 / Blockchain. Prior to AWS, John held various product leadership roles at public blockchain protocols and fintech companies, and also spent 9 years as a portfolio manager at various hedge funds.