Understand the different ways to evaluate prompt performance in Latitude.
Evaluations are crucial for understanding and improving the quality of your AI prompt responses. Latitude provides a comprehensive evaluation framework to assess performance against various criteria, with three main approaches:
LLM-as-judge evaluations
How it works: Uses another language model (the “judge”) to score or critique the output of your target prompt based on specific criteria (e.g., helpfulness, clarity, adherence to instructions). See the sketch below for how this flow fits together.
Best for: Subjective criteria, complex assessments, and nuanced qualities such as creativity or tone.
Requires: Defining evaluation criteria, often via templates or custom instructions for the judge LLM.
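To make the flow concrete, here is a minimal sketch of the LLM-as-judge pattern in TypeScript. It is purely illustrative: `completeWithLLM` is a placeholder for whatever model client you use, and the verdict format is an assumption, not Latitude's internal implementation. Latitude handles all of this for you when you configure the evaluation.

```typescript
// Conceptual sketch only: `completeWithLLM` is a placeholder, not part of Latitude's SDK.
type JudgeVerdict = { score: number; reason: string };

async function judgeOutput(
  actualOutput: string,
  criteria: string,
  completeWithLLM: (prompt: string) => Promise<string>,
): Promise<JudgeVerdict> {
  // The judge receives the criteria plus the target prompt's output
  // and is asked to reply with a structured verdict.
  const judgePrompt = [
    `You are an impartial evaluator. Criteria: ${criteria}`,
    'Rate the following response from 1 (poor) to 5 (excellent) and explain why.',
    'Respond only with JSON: {"score": <number>, "reason": "<string>"}',
    '--- Response to evaluate ---',
    actualOutput,
  ].join('\n');

  const raw = await completeWithLLM(judgePrompt);
  return JSON.parse(raw) as JudgeVerdict;
}
```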
Programmatic rule evaluations
How it works: Applies code-based rules and metrics to check outputs against objective criteria (see the sketch below for what these rules amount to).
Best for: Objective checks, ground-truth comparisons (using datasets), format validation (JSON, regex), safety checks (keyword detection), length constraints.
Requires: Defining specific rules (e.g., exact match, contains keyword, JSON schema validation) and potentially providing a Dataset with expected outputs.
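Conceptually, each programmatic rule is a deterministic function over the output. The TypeScript sketch below shows a few hypothetical checks; Latitude lets you configure these rules without writing code, so this is only meant to illustrate what checks like exact match, keyword detection, format validation, and length constraints do.

```typescript
// Illustrative rule checks; each rule is a pure function over the actual output.
const containsKeyword = (output: string, keyword: string): boolean =>
  output.toLowerCase().includes(keyword.toLowerCase());

const exactMatch = (output: string, expected: string): boolean =>
  output.trim() === expected.trim();

const isValidJson = (output: string): boolean => {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
};

const withinLength = (output: string, maxChars: number): boolean =>
  output.length <= maxChars;

// Example: every rule must pass for the log to be marked as passing.
const output = '{"answer": "Paris"}';
const passed =
  isValidJson(output) && containsKeyword(output, 'paris') && withinLength(output, 200);
console.log(passed); // true
```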
Human-in-the-loop (manual) evaluations
How it works: Team members manually review prompt outputs (logs) and assign scores or labels based on their judgment.
Best for: Capturing nuanced human preferences, evaluating criteria difficult for LLMs to judge, initial quality assessment, creating golden datasets for other evaluation types.
Requires: Setting up manual review workflows and criteria for reviewers.
How evaluations work in Latitude
Per-Prompt Basis: Evaluations are configured individually for each prompt within a project.
Target Logs: Evaluations run on the Logs generated by their associated prompt.
Triggering: Evaluations can be run manually on batches of logs/datasets or automatically on incoming logs (live mode). See Running Evaluations.
Results: Evaluation results (scores, labels, feedback) are stored alongside the corresponding logs, providing a rich dataset for analysis and improvement.
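As an illustration of what “stored alongside the corresponding logs” means in practice, the shape below sketches how a result might reference the log it evaluated. The field names are hypothetical and are not Latitude's actual schema; the point is that every result is tied back to the original prompt run.

```typescript
// Hypothetical shape for illustration only; this is not Latitude's actual schema.
interface EvaluationResult {
  evaluationId: string; // which evaluation produced this result
  logId: string;        // the prompt log that was evaluated
  score?: number;       // numeric score (e.g., 1-5), when applicable
  label?: string;       // categorical label (e.g., "pass" / "fail")
  feedback?: string;    // free-text reasoning from a judge or reviewer
  createdAt: Date;
}
```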
Configuring the actual output
By default, the actual output is the last assistant message in the conversation, parsed as a plain string. However, some use cases require more complex parsing, such as evaluating tool calls or intermediate chain-of-thought reasoning.
Go to the evaluation’s settings by clicking on the right-side button in the evaluation’s dashboard.
Click on Advanced configuration.
Configure:
Message selection: The last message or all messages in the conversation.
Content filter: Filter the messages by content type (e.g., text, images, tool calls…).
Parsing format: The format to parse the actual output into (e.g., string, JSON, YAML…).
Field accessor: The field to access in the actual output (e.g., ‘answer’, ‘arguments.recommendations[2]’…).
Test the configuration by clicking on the “Test” button. The sketch below illustrates how these options combine to produce the actual output.
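The following TypeScript sketch shows how the four options conceptually fit together: select the last assistant message, parse its content as JSON, and walk a field accessor such as `arguments.recommendations[2]`. The helper and message shapes are illustrative assumptions; Latitude performs this resolution for you.

```typescript
// Conceptual sketch of how the advanced configuration options combine.
type Message = { role: 'user' | 'assistant' | 'tool'; content: string };

function resolveActualOutput(messages: Message[], fieldAccessor?: string): unknown {
  // Message selection: take the last assistant message.
  const last = [...messages].reverse().find((m) => m.role === 'assistant');
  if (!last) throw new Error('No assistant message found');

  // Parsing format: parse the content as JSON instead of keeping a plain string.
  const parsed = JSON.parse(last.content);

  // Field accessor: walk a path such as "arguments.recommendations[2]".
  if (!fieldAccessor) return parsed;
  return fieldAccessor
    .split(/[.\[\]]/)
    .filter(Boolean)
    .reduce((value: any, key) => value?.[key], parsed);
}

// Example: extract the third recommendation from a JSON assistant reply.
const messages: Message[] = [
  { role: 'user', content: 'Suggest some books' },
  {
    role: 'assistant',
    content: '{"arguments": {"recommendations": ["Dune", "Hyperion", "Foundation"]}}',
  },
];
console.log(resolveActualOutput(messages, 'arguments.recommendations[2]')); // "Foundation"
```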
Setting an expected output
The expected output, also known as the label, is the correct or ideal response that the language model should generate for a given prompt. You can create Datasets with expected output columns to evaluate prompts against ground truth.
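For example, a dataset for a question-answering prompt might pair each input with its expected output, as in the hypothetical CSV below. The column names are illustrative; you indicate which column holds the expected output when connecting the dataset to an evaluation.

```csv
input,expected_output
"What is the capital of France?","Paris"
"Convert 10 km to miles","6.21 miles"
"What year did Apollo 11 land on the Moon?","1969"
```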
Negative evaluations
Sometimes you want to measure undesirable traits (e.g., toxicity, hallucination presence), where a lower score is better. Latitude allows you to mark these evaluations as “negative”.
Go to the evaluation’s settings by clicking on the right-side button in the evaluation’s dashboard.
Click on Advanced configuration.
Select “optimize for a lower score” to indicate that high scores are undesirable.
The Prompt Suggestions feature uses this setting to optimize in the right direction, aiming to lower the score rather than raise it.
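As a toy illustration of what the flag changes (not Latitude's implementation), any tooling that compares evaluations can invert the raw score of a negative evaluation so that “better” always points in the same direction:

```typescript
// Toy illustration: when "optimize for a lower score" is set, treat lower raw
// scores as better by inverting them against the maximum possible score.
function normalizedQuality(score: number, maxScore: number, lowerIsBetter: boolean): number {
  return lowerIsBetter ? maxScore - score : score;
}

// A toxicity score of 1/5 is good news: it normalizes to a quality of 4.
console.log(normalizedQuality(1, 5, true)); // 4
// A helpfulness score of 1/5 on a regular evaluation stays a 1.
console.log(normalizedQuality(1, 5, false)); // 1
```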