Evaluations
Overview
Understand the different ways to evaluate prompt performance in Latitude.
Evaluations are crucial for understanding and improving the quality of your AI prompt responses. Latitude provides a comprehensive evaluation framework to assess performance against various criteria.
Why Evaluate Prompts?
- Measure Quality: Objectively assess whether prompts meet desired standards (accuracy, relevance, tone, safety, etc.).
- Identify Weaknesses: Pinpoint scenarios where prompts underperform.
- Compare Versions: Quantify the impact of prompt changes (A/B testing).
- Drive Improvement: Gather data to refine prompts using Prompt Suggestions.
- Ensure Reliability: Build confidence in production-deployed prompts.
Evaluation Types in Latitude
Latitude supports three main approaches to evaluation, each suited for different needs:
LLM-as-Judge Evaluations:
- How it works: Uses another language model (the “judge”) to score or critique the output of your target prompt based on specific criteria (e.g., helpfulness, clarity, adherence to instructions). See the LLM-as-judge sketch after this list.
- Best for: Subjective criteria, complex assessments, and nuanced qualities like creativity or tone.
- Requires: Defining evaluation criteria (often via templates or custom instructions for the judge LLM).
Programmatic Rule Evaluations:
- How it works: Applies code-based rules and metrics to check outputs against objective criteria. See the rule-check sketch after this list.
- Best for: Objective checks, ground truth comparisons (using datasets), format validation (JSON, regex), safety checks (keyword detection), and length constraints.
- Requires: Defining specific rules (e.g., exact match, contains keyword, JSON schema validation) and, for ground truth comparisons, a Dataset with expected outputs.
Human-in-the-Loop (Manual Evaluations):
- How it works: Team members manually review prompt outputs (logs) and assign scores or labels based on their judgment.
- Best for: Capturing nuanced human preferences, evaluating criteria difficult for LLMs to judge, initial quality assessment, creating golden datasets for other evaluation types.
- Requires: Setting up manual review workflows and criteria for reviewers.
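To make the LLM-as-judge idea concrete, here is a minimal sketch. It assumes the OpenAI Python SDK as a stand-in judge client; the model name, scoring instructions, and the `judge_output` helper are illustrative assumptions, not Latitude's built-in judge templates.

```python
# Minimal LLM-as-judge sketch (illustrative; not Latitude's built-in judge).
# Assumes the OpenAI Python SDK with an API key in the environment.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_INSTRUCTIONS = (
    "You are an evaluator. Score the assistant answer from 1 (poor) to 5 (excellent) "
    "for helpfulness and adherence to the user's request. "
    'Reply with JSON: {"score": <1-5>, "reason": "<one sentence>"}.'
)

def judge_output(user_prompt: str, answer: str) -> dict:
    """Ask a judge model to score one prompt/answer pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model; any capable model works
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {"role": "user", "content": f"User prompt:\n{user_prompt}\n\nAssistant answer:\n{answer}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(judge_output(
    "Summarize our refund policy in one paragraph.",
    "Refunds are available within 30 days of purchase with proof of payment.",
))
```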
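For programmatic rules, the checks are plain code. The sketch below shows a few typical rules (ground truth match, keyword detection, JSON validation, length limit); the rule names, the banned-word pattern, and the `expected` field are illustrative assumptions, not Latitude's exact rule set.

```python
# Programmatic-rule sketch: objective, code-based checks on a prompt output.
# Rule names and thresholds are illustrative, not Latitude's exact rule set.
import json
import re

def evaluate_output(output: str, expected: str | None = None) -> dict:
    results = {}
    # Ground truth comparison (needs a dataset row with an expected output).
    if expected is not None:
        results["exact_match"] = output.strip() == expected.strip()
    # Safety check: flag banned keywords.
    results["contains_banned_word"] = bool(re.search(r"\b(password|ssn)\b", output, re.I))
    # Format validation: is the output valid JSON?
    try:
        json.loads(output)
        results["valid_json"] = True
    except ValueError:
        results["valid_json"] = False
    # Length constraint.
    results["within_length"] = len(output) <= 500
    return results

print(evaluate_output('{"refund_days": 30}', expected='{"refund_days": 30}'))
```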
How Evaluations Connect to Prompts
- Per-Prompt Basis: Evaluations are configured individually for each prompt within a project.
- Target Logs: Evaluations run on the Logs generated by their associated prompt.
- Triggering: Evaluations can be run manually on batches of logs or datasets, or automatically on incoming logs (live mode). See Running Evaluations and the sketch after this list.
- Results: Evaluation results (scores, labels, feedback) are stored alongside the corresponding logs, providing a rich dataset for analysis and improvement.
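As a rough illustration of batch versus live triggering (hypothetical helpers, not the Latitude SDK API):

```python
# Hypothetical sketch of batch vs. live evaluation triggering.
# `evaluate_batch`, `evaluate_live`, and the log dictionaries are illustrative
# stand-ins, not Latitude SDK calls.
from typing import Callable, Iterable

def evaluate_batch(logs: Iterable[dict], evaluation: Callable[[dict], dict]) -> list[dict]:
    """Manually run an evaluation over an existing batch of logs or dataset rows."""
    return [{"log_id": log["id"], **evaluation(log)} for log in logs]

def evaluate_live(log_stream: Iterable[dict], evaluation: Callable[[dict], dict]) -> None:
    """Live mode: evaluate each incoming log as it arrives, storing the result with it."""
    for log in log_stream:
        log["evaluation_result"] = evaluation(log)  # result stays alongside the log
```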
Negative Evaluations
Sometimes, you want to measure undesirable traits (e.g., toxicity, hallucination presence), where a lower score is better. Latitude allows you to mark evaluations as “negative”.
- Go to the evaluation’s settings.
- Configure whether a higher or lower score indicates better performance.
- The Prompt Suggestions feature uses this setting to optimize in the right direction (see the sketch below).
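Purely as an illustration of what the setting means (not Latitude's internal logic), a “lower is better” flag simply flips which score counts as the better one:

```python
# Illustrative only: how a "lower score is better" flag changes which result wins.
def better_score(a: float, b: float, lower_is_better: bool = False) -> float:
    """Return the better of two evaluation scores given the evaluation's direction."""
    return min(a, b) if lower_is_better else max(a, b)

# A toxicity evaluation marked as negative: a score of 0.1 beats 0.7.
print(better_score(0.7, 0.1, lower_is_better=True))  # -> 0.1
```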
Next Steps
Dive deeper into each evaluation type: