Evaluations are crucial for understanding and improving the quality of your AI prompt responses. Latitude provides a comprehensive evaluation framework to assess performance against various criteria.

Why Evaluate Prompts?

  • Measure Quality: Objectively assess if prompts meet desired standards (accuracy, relevance, tone, safety, etc.).
  • Identify Weaknesses: Pinpoint scenarios where prompts underperform.
  • Compare Versions: Quantify the impact of prompt changes (A/B testing).
  • Drive Improvement: Gather data to refine prompts using Prompt Suggestions.
  • Ensure Reliability: Build confidence in production-deployed prompts.

Evaluation Types in Latitude

Latitude supports three main approaches to evaluation, each suited for different needs:

  1. LLM-as-Judge:

    • How it works: Uses another language model (the “judge”) to score or critique the output of your target prompt based on specific criteria (e.g., helpfulness, clarity, adherence to instructions).
    • Best for: Subjective criteria, complex assessments, evaluating nuanced qualities like creativity or tone.
    • Requires: Defining evaluation criteria (often via templates or custom instructions for the judge LLM). See the judge sketch after this list.
  2. Programmatic Rules:

    • How it works: Applies code-based rules and metrics to check outputs against objective criteria.
    • Best for: Objective checks, ground truth comparisons (using datasets), format validation (JSON, regex), safety checks (keyword detection), length constraints.
    • Requires: Defining specific rules (e.g., exact match, contains keyword, JSON schema validation) and potentially providing a Dataset with expected outputs. See the rule sketch after this list.
  3. Human-in-the-Loop (Manual Evaluations):

    • How it works: Team members manually review prompt outputs (logs) and assign scores or labels based on their judgment.
    • Best for: Capturing nuanced human preferences, evaluating criteria difficult for LLMs to judge, initial quality assessment, creating golden datasets for other evaluation types.
    • Requires: Setting up manual review workflows and criteria for reviewers.
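To make the LLM-as-judge approach concrete, the sketch below shows what judge criteria and verdict parsing typically look like. It is only an illustration: the criteria text, the JudgeVerdict shape, and the parseVerdict helper are assumptions for this example, not Latitude's built-in templates or API.

```typescript
// Illustrative LLM-as-judge setup. The criteria text, JudgeVerdict shape,
// and parseVerdict helper are hypothetical, not Latitude's templates or API.

// Instructions sent to the judge model alongside the target prompt's output.
const judgeInstructions = `
You are an impartial evaluator. Score the assistant's answer from 1 to 5
for helpfulness and adherence to the instructions in the original prompt.
Respond with JSON only: {"score": <1-5>, "reason": "<one sentence>"}.
`;

interface JudgeVerdict {
  score: number;  // 1 (poor) to 5 (excellent)
  reason: string; // short justification from the judge
}

// Parse and validate the judge model's raw JSON reply.
function parseVerdict(raw: string): JudgeVerdict {
  const parsed = JSON.parse(raw);
  if (typeof parsed.score !== "number" || parsed.score < 1 || parsed.score > 5) {
    throw new Error(`Judge returned an invalid score: ${raw}`);
  }
  return { score: parsed.score, reason: String(parsed.reason ?? "") };
}

console.log(parseVerdict('{"score": 4, "reason": "Clear and on-topic"}'));
```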
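Programmatic rules, by contrast, are plain deterministic checks over the output. The rule names and result shape below are likewise illustrative, not Latitude's rule engine:

```typescript
// Illustrative programmatic rules; the rule names and result shape are
// hypothetical, not Latitude's built-in rule definitions.

interface RuleResult {
  passed: boolean;
  detail: string;
}

// Safety check: fail if the output contains any banned keyword.
function bannedKeywordsRule(output: string, banned: string[]): RuleResult {
  const hit = banned.find((word) => output.toLowerCase().includes(word.toLowerCase()));
  return hit
    ? { passed: false, detail: `Found banned keyword: "${hit}"` }
    : { passed: true, detail: "No banned keywords found" };
}

// Format check: pass only if the output parses as JSON.
function validJsonRule(output: string): RuleResult {
  try {
    JSON.parse(output);
    return { passed: true, detail: "Output is valid JSON" };
  } catch {
    return { passed: false, detail: "Output is not valid JSON" };
  }
}

console.log(bannedKeywordsRule("Here is your summary.", ["password", "ssn"]));
console.log(validJsonRule('{"summary": "ok"}'));
```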

How Evaluations Connect to Prompts

  • Per-Prompt Basis: Evaluations are configured individually for each prompt within a project.
  • Target Logs: Evaluations run on the Logs generated by their associated prompt.
  • Triggering: Evaluations can be run manually on batches of logs/datasets or automatically on incoming logs (live mode). See Running Evaluations.
  • Results: Evaluation results (scores, labels, feedback) are stored alongside the corresponding logs, providing a rich dataset for analysis and improvement.
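Conceptually, each evaluation result can be thought of as a small record linked to a log. The shape below is a hypothetical illustration, not Latitude's actual schema:

```typescript
// Hypothetical shape of an evaluation result attached to a log;
// field names are illustrative, not Latitude's actual data model.
interface EvaluationResult {
  logId: string;        // the log (prompt run) this result belongs to
  evaluationId: string; // which configured evaluation produced the result
  score?: number;       // numeric score, e.g. 1-5 from an LLM judge
  label?: string;       // categorical label, e.g. "pass" / "fail"
  feedback?: string;    // free-text reasoning or reviewer comments
  createdAt: string;    // ISO timestamp of when the evaluation ran
}

const example: EvaluationResult = {
  logId: "log_123",
  evaluationId: "eval_456",
  score: 4,
  feedback: "Answer was accurate but slightly verbose.",
  createdAt: new Date().toISOString(),
};
```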

Actual Outputs

The actual output is the response generated by the model in the conversation. This is the output that evaluations run against.

Selecting the Actual Output to Evaluate Against

By default, the actual output is the last assistant message in the conversation, parsed as a simple string. However, some use cases require more complex parsing, such as evaluating tool calls or intermediate chain-of-thought (CoT) reasoning. To change how the actual output is selected:

  1. Go to the evaluation’s settings by clicking on the right-side button in the evaluation’s dashboard.
  2. Click on Advanced configuration.
  3. Configure:
    • Message selection: The last message or all messages in the conversation.
    • Content filter: Filter the messages by content type (e.g., text, images, tool calls…).
    • Parsing format: The format to parse the actual output into (e.g., string, JSON, YAML…).
    • Field accessor: The field to access in the actual output (e.g., ‘answer’, ‘arguments.recommendations[2]’…). See the sketch after these steps.
  4. Test the configuration by clicking on the “Test” button.
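To illustrate what a field accessor does, the hypothetical getByPath helper below extracts a nested value such as arguments.recommendations[2] from a JSON-parsed output. Latitude's own parsing is configured entirely in the UI; this is only a sketch of the concept:

```typescript
// Hypothetical helper illustrating a field accessor; not Latitude's parser.
function getByPath(value: unknown, path: string): unknown {
  // Split "arguments.recommendations[2]" into ["arguments", "recommendations", "2"].
  const keys = path.replace(/\[(\d+)\]/g, ".$1").split(".").filter(Boolean);
  return keys.reduce<unknown>((current, key) => {
    if (current == null || typeof current !== "object") return undefined;
    return (current as Record<string, unknown>)[key];
  }, value);
}

// Example: a tool call's arguments, parsed as JSON from the actual output.
const actualOutput = {
  arguments: { recommendations: ["a", "b", "c"] },
};
console.log(getByPath(actualOutput, "arguments.recommendations[2]")); // "c"
```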

Expected Outputs

The expected output, also known as the label, refers to the correct or ideal response that the language model should generate for a given prompt. You can create Datasets with Expected Output Columns to evaluate prompts against ground truth.
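For example, a ground-truth comparison pairs each dataset row's expected output with the actual output of a prompt run. The row shape and exactMatch helper below are illustrative only, not Latitude's dataset format:

```typescript
// Illustrative dataset rows with an expected output column; the shape and
// exactMatch helper are hypothetical, not Latitude's dataset format.
interface DatasetRow {
  input: Record<string, string>; // parameters fed into the prompt
  expectedOutput: string;        // the ground-truth label for this input
}

const rows: DatasetRow[] = [
  { input: { country: "France" }, expectedOutput: "Paris" },
  { input: { country: "Japan" }, expectedOutput: "Tokyo" },
];

// Exact-match rule: the actual output must equal the expected output
// after trimming whitespace and ignoring case.
function exactMatch(actual: string, expected: string): boolean {
  return actual.trim().toLowerCase() === expected.trim().toLowerCase();
}

console.log(exactMatch("  paris ", rows[0].expectedOutput)); // true
```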

Negative Evaluations

Sometimes, you want to measure undesirable traits (e.g., toxicity, hallucination presence), where a lower score is better. Latitude allows you to mark evaluations as “negative”.

  1. Go to the evaluation’s settings by clicking on the right-side button in the evaluation’s dashboard.
  2. Click on Advanced configuration.
  3. Select “optimize for a lower score” to indicate that high scores are undesirable.

The Prompt Suggestions feature uses this setting to optimize in the right direction.
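Conceptually, marking an evaluation as negative flips the direction in which scores are considered better. The normalization below is an assumption for illustration, not Latitude's internal formula:

```typescript
// Hypothetical normalization showing how a "lower is better" evaluation
// can be folded into a single "higher is better" optimization signal.
function optimizationScore(
  score: number,
  minScore: number,
  maxScore: number,
  lowerIsBetter: boolean,
): number {
  // For negative evaluations (e.g. toxicity), invert the scale so that the
  // best possible raw score maps to the highest optimization score.
  return lowerIsBetter ? maxScore + minScore - score : score;
}

console.log(optimizationScore(1, 1, 5, true)); // 5: low toxicity is best
console.log(optimizationScore(5, 1, 5, true)); // 1: high toxicity is worst
```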

Next Steps

Dive deeper into each evaluation type: