Vision fine-tuning

03/04/2025

You can also fine-tune with images in your JSONL training files. Just as you can send one or many image inputs to chat completions, you can include those same message types in your training data. Images can be provided either as publicly accessible URLs or as data URIs containing base64-encoded images.
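As an illustration of the data URI option, here is a minimal sketch that base64-encodes raw image bytes and wraps the result in the image_url content part shown later in this article. The helper names to_data_uri and image_part are hypothetical and not part of any SDK.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a data URI that embeds the image bytes as base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_part(url: str) -> dict:
    """Wrap a public URL or a data URI in an image_url content part."""
    return {"type": "image_url", "image_url": {"url": url}}
```

In practice you would read the bytes from a file, for example with pathlib.Path("photo.png").read_bytes(), and pass the MIME type that matches the file.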

Model support

Vision fine-tuning is supported for gpt-4o version 2024-08-06 models only.

Image dataset requirements

Your training file can contain a maximum of 50,000 examples that contain images. Text-only examples don't count toward this limit.
Each example can have at most 64 images.
Each image can be at most 10 MB.
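The limits above can be checked locally before you upload a file. The following is a rough sketch, not an official validator: the function names are hypothetical, 10 MB is assumed to mean 10 * 1024 * 1024 bytes, and only images embedded as base64 data URIs are measured, because remote URLs would have to be downloaded to learn their size.

```python
import base64
import json
from typing import Iterable, List, Optional

MAX_IMAGE_EXAMPLES = 50_000
MAX_IMAGES_PER_EXAMPLE = 64
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumption: 10 MB read as 10 MiB


def image_urls(example: dict) -> List[str]:
    """Collect the URL of every image_url content part in one example."""
    urls = []
    for message in example.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    urls.append(part["image_url"]["url"])
    return urls


def data_uri_size(url: str) -> Optional[int]:
    """Decoded size in bytes of a base64 data URI, or None for other URLs."""
    header, _, payload = url.partition(",")
    if not (header.startswith("data:") and header.endswith(";base64")):
        return None
    return len(base64.b64decode(payload))


def check_limits(jsonl_lines: Iterable[str]) -> List[str]:
    """Return human-readable descriptions of any limit violations."""
    problems = []
    image_examples = 0
    for number, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue
        urls = image_urls(json.loads(line))
        if urls:
            image_examples += 1
        if len(urls) > MAX_IMAGES_PER_EXAMPLE:
            problems.append(f"line {number}: {len(urls)} images in one example")
        for url in urls:
            size = data_uri_size(url)
            if size is not None and size > MAX_IMAGE_BYTES:
                problems.append(f"line {number}: image of {size} bytes is too large")
    if image_examples > MAX_IMAGE_EXAMPLES:
        problems.append(f"{image_examples} examples contain images")
    return problems
```

Calling check_limits(open("train.jsonl", encoding="utf-8")) on a hypothetical training file returns an empty list when none of these limits is exceeded.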

Format

Images must be:

JPEG
PNG
WEBP

Images must be in the RGB or RGBA image mode.
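One way to catch an unsupported file type early is to look at the leading bytes of each image. The sketch below recognizes the three supported container formats by their well-known file signatures; sniff_image_format is a hypothetical helper. It does not check the RGB or RGBA mode requirement, which would need an image library to decode the file.

```python
from typing import Optional


def sniff_image_format(data: bytes) -> Optional[str]:
    """Guess the container format from the file signature, or return None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    # WEBP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "WEBP"
    return None
```

A result of None means the bytes match none of the three signatures, so the file is likely in an unsupported format such as GIF or BMP.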

You cannot include images as output from messages with the assistant role.

As with all fine-tuning training, your example file requires at least 10 examples.
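The two rules above, no images in assistant messages and at least 10 examples per file, are straightforward to check before upload. This is a minimal sketch with hypothetical function names; it assumes each non-empty line of the file is one JSON object with a "messages" list.

```python
import json
from typing import Iterable, List

MIN_EXAMPLES = 10


def assistant_has_image(example: dict) -> bool:
    """True if any assistant message carries an image_url content part."""
    for message in example.get("messages", []):
        if message.get("role") != "assistant":
            continue
        content = message.get("content")
        if isinstance(content, list) and any(
            isinstance(part, dict) and part.get("type") == "image_url"
            for part in content
        ):
            return True
    return False


def check_examples(jsonl_lines: Iterable[str]) -> List[str]:
    """Return descriptions of rule violations found in the training lines."""
    examples = [json.loads(line) for line in jsonl_lines if line.strip()]
    problems = [
        f"example {index}: assistant message contains an image"
        for index, example in enumerate(examples, start=1)
        if assistant_has_image(example)
    ]
    if len(examples) < MIN_EXAMPLES:
        problems.append(f"only {len(examples)} examples, need at least {MIN_EXAMPLES}")
    return problems
```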

Example file format

{
  "messages": [
    { "role": "system", "content": "You are a helpful AI assistant." },
    { "role": "user", "content": "Describe the image?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://raw.githubusercontent.com/MicrosoftDocs/azure-ai-docs/main/articles/ai-services/openai/media/how-to/generated-seattle.png"
          }
        }
      ]
    },
    { "role": "assistant", "content": "The image appears to be a watercolor painting of a city skyline, featuring tall buildings and a recognizable structure often associated with Seattle, like the Space Needle. The artwork uses soft colors and brushstrokes to create a somewhat abstract and artistic representation of the cityscape" }
  ]
}
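The example above is pretty-printed for readability, but in a JSONL file each example occupies a single line. The following sketch builds an example with the same message layout and serializes a list of examples to JSONL text. The names make_example and to_jsonl are hypothetical, and the URL in the usage note is a placeholder.

```python
import json
from typing import Iterable


def make_example(system_prompt: str, question: str, image_url: str, answer: str) -> dict:
    """Build one training example with the message layout shown above."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {
                "role": "user",
                "content": [{"type": "image_url", "image_url": {"url": image_url}}],
            },
            {"role": "assistant", "content": answer},
        ]
    }


def to_jsonl(examples: Iterable[dict]) -> str:
    """Serialize examples as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(example, ensure_ascii=False) + "\n" for example in examples)
```

For instance, to_jsonl([make_example("You are a helpful AI assistant.", "Describe the image?", "https://example.com/seattle.png", "A watercolor city skyline.")]) yields one line of text that can be written to the training file.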

Content moderation policy

We scan your images before training to ensure that they comply with our usage policy (see the Transparency Note). This scan may add latency to file validation before fine-tuning begins.

Images containing the following will be excluded from your dataset and not used for training:

People
Faces
CAPTCHAs

Important

Face screening for vision fine-tuning: We screen for faces and people so that images containing them are skipped during training. The screening uses face detection without face identification, which means that we don't create facial templates or measure specific facial geometry, and the technology used to screen for faces can't uniquely identify individuals. To learn more about data and privacy for Face, see Data and privacy for Face - Azure AI services | Microsoft Learn.

Next steps

Deploy a fine-tuned model.
Review fine-tuning model regional availability.
Learn more about Azure OpenAI quotas.
