Evaluations are crucial for understanding and improving the quality of your AI prompt responses. Latitude provides a comprehensive evaluation framework to assess performance against various criteria.

Why Evaluate Prompts?

  • Measure Quality: Objectively assess if prompts meet desired standards (accuracy, relevance, tone, safety, etc.).
  • Identify Weaknesses: Pinpoint scenarios where prompts underperform.
  • Compare Versions: Quantify the impact of prompt changes (A/B testing).
  • Drive Improvement: Gather data to refine prompts using Prompt Suggestions.
  • Ensure Reliability: Build confidence in production-deployed prompts.

Evaluation Types in Latitude

Latitude supports three main approaches to evaluation, each suited for different needs:

  1. LLM-as-Judge:

    • How it works: Uses another language model (the “judge”) to score or critique the output of your target prompt based on specific criteria (e.g., helpfulness, clarity, adherence to instructions).
    • Best for: Subjective criteria, complex assessments, evaluating nuanced qualities like creativity or tone.
    • Requires: Defining evaluation criteria (often via templates or custom instructions for the judge LLM); see the judge sketch after this list.
  2. Programmatic Rules:

    • How it works: Applies code-based rules and metrics to check outputs against objective criteria.
    • Best for: Objective checks, ground truth comparisons (using datasets), format validation (JSON, regex), safety checks (keyword detection), length constraints.
    • Requires: Defining specific rules (e.g., exact match, contains keyword, JSON schema validation) and potentially providing a Dataset with expected outputs; see the rule sketch after this list.
  3. Human-in-the-Loop (Manual Evaluations):

    • How it works: Team members manually review prompt outputs (logs) and assign scores or labels based on their judgment.
    • Best for: Capturing nuanced human preferences, evaluating criteria difficult for LLMs to judge, initial quality assessment, creating golden datasets for other evaluation types.
    • Requires: Setting up manual review workflows and criteria for reviewers.
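To make the LLM-as-judge pattern concrete, here is a minimal sketch of a judge call. The rubric wording, 1–5 score scale, judge model, and use of the OpenAI SDK are illustrative assumptions; in Latitude the judge criteria are configured on the evaluation itself rather than hand-coded.

```typescript
// Illustrative only: the rubric, model name, and score scale are assumptions,
// not Latitude's built-in judge templates.
import OpenAI from "openai";

const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface JudgeVerdict {
  score: number;  // 1 (poor) to 5 (excellent)
  reason: string; // short critique explaining the score
}

async function judgeHelpfulness(
  userQuestion: string,
  promptOutput: string
): Promise<JudgeVerdict> {
  const completion = await judge.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "You are an evaluation judge. Rate the assistant answer for " +
          "helpfulness on a 1-5 scale. Respond with JSON: " +
          '{"score": <1-5>, "reason": "<one sentence>"}',
      },
      {
        role: "user",
        content: `Question:\n${userQuestion}\n\nAnswer:\n${promptOutput}`,
      },
    ],
  });

  // Parse the judge's JSON verdict (null-safe for an empty response).
  return JSON.parse(completion.choices[0].message.content ?? "{}");
}
```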
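Likewise, the kinds of programmatic rules listed above boil down to simple code checks. The rule names, expected-output shape, and use of `zod` below are assumptions for illustration; in Latitude these rules are configured per evaluation rather than written by hand.

```typescript
// Illustrative only: rule names and the expected-output schema are assumptions.
import { z } from "zod";

// Ground-truth comparison against a dataset row's expected output.
function exactMatch(output: string, expected: string): boolean {
  return output.trim() === expected.trim();
}

// Safety check: flag outputs containing any banned keyword.
function containsBannedKeyword(output: string, banned: string[]): boolean {
  const lowered = output.toLowerCase();
  return banned.some((word) => lowered.includes(word.toLowerCase()));
}

// Format validation: the output must parse as JSON matching a schema.
const AnswerSchema = z.object({
  answer: z.string(),
  confidence: z.number().min(0).max(1),
});

function isValidJson(output: string): boolean {
  try {
    return AnswerSchema.safeParse(JSON.parse(output)).success;
  } catch {
    return false; // not even valid JSON
  }
}

// Length constraint: keep responses under a character budget.
function withinLength(output: string, maxChars: number): boolean {
  return output.length <= maxChars;
}
```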

How Evaluations Connect to Prompts

  • Per-Prompt Basis: Evaluations are configured individually for each prompt within a project.
  • Target Logs: Evaluations run on the Logs generated by their associated prompt.
  • Triggering: Evaluations can be run manually on batches of logs/datasets or automatically on incoming logs (live mode). See Running Evaluations.
  • Results: Evaluation results (scores, labels, feedback) are stored alongside the corresponding logs, providing a rich dataset for analysis and improvement.
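As a rough mental model of this relationship, assuming hypothetical `PromptLog` and `EvaluationResult` shapes (not Latitude's actual schema), a batch run simply maps an evaluation over a set of logs and keys each result to the log it came from:

```typescript
// Conceptual sketch of the data model, not Latitude's actual schema:
// every evaluation result is keyed to the log it was computed from.
interface PromptLog {
  id: string;
  promptPath: string; // which prompt produced this log
  input: Record<string, unknown>;
  output: string;
}

interface EvaluationResult {
  logId: string;          // links the result back to its log
  evaluationName: string;
  score: number;
  reason?: string;        // judge critique or reviewer feedback, if any
}

// Batch mode: run one evaluation over a set of existing logs.
function runBatch(
  logs: PromptLog[],
  evaluate: (log: PromptLog) => EvaluationResult
): EvaluationResult[] {
  return logs.map(evaluate);
}
```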

Negative Evaluations

Sometimes you want to measure an undesirable trait (e.g., toxicity or the presence of hallucinations), where a lower score is better. Latitude allows you to mark such evaluations as “negative”.

  • Go to the evaluation’s settings.
  • Configure whether a higher or lower score indicates better performance.
  • The Prompt Suggestions feature will use this setting to optimize correctly.
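For intuition, here is a small sketch of how a lower-is-better score can be normalized so that “higher is better” holds downstream. The configuration fields are assumptions, not Latitude's actual settings schema.

```typescript
// Illustrative sketch: normalize a "negative" evaluation's score so that
// downstream tooling can always optimize "higher is better".
interface EvaluationConfig {
  name: string;
  minScore: number;
  maxScore: number;
  lowerIsBetter: boolean; // e.g. true for a toxicity score
}

function normalizedScore(config: EvaluationConfig, rawScore: number): number {
  const { minScore, maxScore, lowerIsBetter } = config;
  const scaled = (rawScore - minScore) / (maxScore - minScore); // 0..1
  return lowerIsBetter ? 1 - scaled : scaled;
}

// Example: a toxicity score of 0.2 on a 0..1 scale normalizes to 0.8 (good).
```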

Next Steps

Dive deeper into each evaluation type: