What is Multi-Modal Prompting?
Multi-modal prompting is a technique that enables AI models to process and respond to multiple types of input media, such as combining text with images, audio, or documents. Unlike traditional text-only prompts, multi-modal prompting allows for richer interactions by incorporating visual context, enabling applications like image analysis, document processing, and visual reasoning.
Why Use Multi-Modal Prompting?
- Enhanced Context: Visual content provides information that’s difficult to express in text alone
- Improved Accuracy: Models can “see” what they’re analyzing rather than relying on descriptions
- Complex Reasoning: Combines visual perception with textual reasoning for sophisticated tasks
- Natural Interaction: Mimics human ability to process multiple sensory inputs simultaneously
- Application Versatility: Enables new use cases like visual QA, document analysis, and content moderation
- Reduced Ambiguity: Images provide concrete references that minimize misinterpretations
Basic Implementation in Latitude
Here’s a simple multi-modal prompting example using Latitude:
Image Analysis
The `input_image` parameter will automatically be configured with an image upload button, while the `analysis_request` parameter will be a standard text field. This configuration is based on the parameter types specified in the YAML header.
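A sketch of what such a prompt might look like. The parameter names and `type` fields follow the description above; the exact frontmatter layout and the provider/model values are assumptions:

```
---
provider: OpenAI        # assumed values; any vision-capable provider/model
model: gpt-4o
parameters:
  input_image:
    type: image
  analysis_request:
    type: text
---

{{ input_image }}

{{ analysis_request }}
```

In the Playground, `input_image` then renders as an upload button and `analysis_request` as a plain text field.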
Advanced Implementation with Multiple Media Types
This example shows how to work with both images and text in a more sophisticated workflow:
Setting Up Multi-Modal Parameters
Latitude offers several ways to configure and use multi-modal parameters:
Parameter Configuration in YAML
You can define parameter types directly in your prompt’s configuration YAML:
Parameter Input Methods in the Playground
When testing multi-modal prompts in the Playground, you can use these input methods:
- Manual Upload: Directly upload images and files through the parameter input fields
- Dataset Integration: Load images from a dataset for batch testing across multiple visual examples
- History Reuse: Access previously used images from your parameter history
Working with Multi-Modal Inputs in the Playground
When testing multi-modal prompts in the Latitude Playground, you’ll encounter specific controls for image and file inputs:
- Image Parameters:
  - Click the upload button to select an image from your device
  - Optionally, use the image preview to verify you’ve selected the correct file
  - For vision models, images will be appropriately encoded and embedded in the prompt
- File Parameters:
  - Upload PDF documents or other supported file types
  - The Playground will process these files according to the provider’s requirements
  - Some providers may have file size or type limitations
- Dataset Testing:
  - Create datasets that include image or file URLs for batch testing
  - Test your multi-modal prompts across a variety of visual inputs
  - Compare performance with different visual content types
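As an illustration, a batch-testing dataset can be a simple table whose columns map to the prompt’s parameters, with image parameters supplied as URLs. The column names and URLs below are hypothetical:

```
input_image,analysis_request
https://example.com/photos/chair.jpg,"Describe the product's condition"
https://example.com/photos/lamp.jpg,"Describe the product's condition"
```

Each row becomes one run of the prompt, letting you compare outputs across many visual inputs at once.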
Document Processing Implementation
This example demonstrates how to process document files (PDFs) using multi-modal capabilities:Best Practices for Multi-Modal Prompting
Image Input Optimization
Image Quality Guidelines:
- Resolution: Provide images with sufficient resolution (min. 512px on shortest side)
- Clarity: Ensure images are well-lit and in focus
- Framing: Center the relevant subject in the frame
- Compression: Minimize compression artifacts that might affect analysis
- Aspect Ratio: Use standard aspect ratios when possible
Image Type Considerations:
- Photographs: Work best for general object/scene identification
- Screenshots: Useful for UI analysis, but text may be analyzed separately
- Diagrams/Charts: Effective for technical analysis
- Illustrations: May be interpreted differently than photographs
Prompt Structure
Text-Image Integration:
- Place image reference before the main question/instruction
- Clearly separate image and text with formatting
- Reference the image explicitly in your instructions
- Be precise about what aspect of the image to focus on
Question Formulation:
- Ask specific questions rather than open-ended ones
- Break complex visual tasks into simpler components
- Use bullet points for multiple questions about the same image
- Provide context before questions to prime the model’s attention
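The guidelines above can be combined in a single prompt body, for example (the parameter name is hypothetical):

```
The image below is a screenshot of our checkout page.

{{ checkout_screenshot }}

Looking at the screenshot above, answer each question separately:
- What text appears on the primary call-to-action button?
- Is the total price visible without scrolling?
- Which form field, if any, is missing a label?
```

Note how context comes first, the image precedes the questions, the instructions reference the image explicitly, and the questions are specific and bulleted.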
Task-Specific Guidance
Visual Classification:
- Specify the taxonomy or categories you want used
- For ambiguous cases, ask for confidence levels
- Request alternative classifications when appropriate
Visual Question Answering:
- Frame questions to focus attention on relevant image areas
- Provide examples of desired answer format/detail level
- For subjective questions, specify the perspective to take
Image Description:
- Indicate desired detail level (brief vs. comprehensive)
- Specify aspects to focus on (colors, shapes, text, etc.)
- Request structured output for consistent descriptions
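For instance, a classification prompt following these guidelines might fix the taxonomy, ask for a confidence level, and request structured output (the parameter name and category labels are illustrative):

```
{{ product_photo }}

Classify the product in the image into exactly one of: furniture, lighting, textiles, decor.
Respond in this format:
- Category: <one of the four labels>
- Confidence: <high / medium / low>
- Alternative: <second-best label, or "none">
```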
Technical Considerations
Model Selection:
- Choose models specifically designed for multi-modal tasks
- Consider context window limitations when sending large images
- Test smaller/lighter models for simpler visual tasks
Image Pre-Processing:
- Resize very large images before sending
- Consider black and white conversion for non-color tasks
- Use image cropping to focus on relevant regions
- Test with multiple prompt variations for optimal results
Advanced Techniques
Visual Chain-of-Thought
Implement a step-by-step visual reasoning process:
Visual CoT
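A sketch of a visual chain-of-thought prompt that forces the model to reason before answering (the parameter name is hypothetical):

```
{{ scene_image }}

Analyze the image step by step:
1. List the objects you can identify.
2. Describe the spatial relationships between them.
3. Infer what activity is taking place.
4. Only then, answer: is this scene indoors or outdoors, and why?
```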
Multi-Image Comparison
Compare and analyze multiple images simultaneously:
Visual Information Extraction
Extract structured data from visually rich content:
Visual Extraction
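A sketch of a visual extraction prompt that requests structured data from a receipt-style image (the parameter name and JSON schema are illustrative):

```
{{ receipt_image }}

Extract the following fields from the receipt and return them as JSON:
{
  "merchant": "...",
  "date": "...",
  "total": "...",
  "line_items": [{ "description": "...", "amount": "..." }]
}
If a field is not visible in the image, use null rather than guessing.
```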
Related Techniques
Multi-modal prompting works well when combined with other prompting techniques:
- Chain-of-Thought: Break down visual reasoning into explicit steps for complex image analysis.
- Self-Consistency: Generate multiple interpretations of an image and select the most consistent one.
- Template-Based Prompting: Use templates to standardize visual analysis across different images.
- Retrieval-Augmented Generation: Combine image analysis with retrieved textual information for contextual understanding.
- Few-Shot Learning: Provide examples of image-text pairs to guide the model’s visual interpretation.
Real-World Applications
Multi-modal prompting is particularly valuable in these domains:
- Content Moderation: Analyzing images for policy violations or inappropriate content
- E-commerce: Automated product photo analysis, comparison, and description generation
- Healthcare: Reviewing medical images alongside patient records (with appropriate regulatory compliance)
- Document Processing: Extracting information from forms, receipts, and ID documents
- Accessibility: Generating detailed image descriptions for vision-impaired users
- Education: Creating interactive learning experiences with visual elements
- Quality Control: Inspecting products or materials for defects or compliance issues
Advanced Configuration for Multi-Modal Parameters
Parameter Definition Options
Parameters in multi-modal prompts can be configured with several advanced options:
Parameter Types for Multi-Modal Inputs
Latitude supports these parameter types for multi-modal prompting:
- Image Parameters (`type: image`):
  - Supported by models with vision capabilities (e.g., GPT-4o, Claude 3)
  - Rendered in the prompt using `{{ parameter_name }}`
  - Appear as an image upload button in the Playground
  - Most models support common formats: JPEG, PNG, WebP, GIF
- File Parameters (`type: file`):
  - Supported by models with document processing capabilities
  - Different providers support different file types:
    - Claude: PDF documents
    - GPT-4o: Various document formats
  - Enable document analysis, data extraction, and PDF processing
- Text Parameters (`type: text`):
  - Standard text input that can be used alongside multi-modal inputs
  - Can contain instructions for processing the visual content
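Putting the three parameter types together, a sketch of a full prompt might look like the following. The frontmatter layout, provider/model values, and parameter names are illustrative; the `type` fields and `{{ }}` rendering follow the list above:

```
---
provider: OpenAI          # assumed; any vision-capable provider/model
model: gpt-4o
parameters:
  product_photo:
    type: image           # rendered below with {{ product_photo }}
  spec_sheet:
    type: file            # e.g., a PDF, where the provider supports it
  instructions:
    type: text
---

{{ product_photo }}

{{ spec_sheet }}

{{ instructions }}
```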