LLM-as-Judges
Use language models to evaluate the quality, style, and correctness of prompt outputs.
LLM-as-Judge evaluations leverage the reasoning capabilities of language models to assess the quality of your target prompt’s outputs based on defined criteria. This approach is powerful for evaluating subjective, complex, or nuanced aspects that are hard to capture with simple rules.
How it Works
- Target Prompt Run: Your main prompt generates an output (log).
- Evaluation Trigger: The LLM-as-Judge evaluation is run on that log.
- Judge Prompt Execution: Latitude sends the original input, the target prompt’s output, and your evaluation prompt (which defines the criteria) to a separate “judge” LLM.
- Judgment: The judge LLM analyzes the target output based on the criteria and returns a score, classification, or textual feedback.
- Result Storage: The judgment is stored alongside the original log.
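As a concrete, purely illustrative example of this flow, a helpfulness check on a customer-support prompt might look roughly like the sketch below; the exact fields Latitude sends and stores depend on your evaluation configuration:

```
What the judge LLM receives:
  Original input:    "How do I reset my password?"
  Target output:     "Go to Settings > Security and click 'Reset password'. You'll receive an email with a link."
  Evaluation prompt: "Rate how helpful the response is to the user's question on a scale from 0 to 10. Return JSON."

What the judge LLM returns (stored alongside the log):
  { "score": 9, "reason": "Gives clear, actionable steps that directly answer the question." }
```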
Use Cases
LLM-as-Judge is ideal for:
- Subjective Qualities: Evaluating helpfulness, coherence, creativity, tone, safety, faithfulness to instructions.
- Complex Criteria: Assessing responses based on multiple factors or contextual understanding.
- Flexible Outputs: Evaluating scenarios where valid responses can vary significantly.
- Natural Language Feedback: Generating detailed textual feedback on why an output is good or bad.
Trade-offs
- Cost: Each evaluation run consumes tokens from the judge model provider.
- Latency: Can be slower than programmatic rules.
- Consistency: Judge LLM responses might have slight variations between runs.
- Potential Bias: The judge LLM might have its own biases.
For objective checks (format, keywords), consider Programmatic Rules. For direct human feedback, use Manual Evaluations.
Setting Up an LLM-as-Judge Evaluation
- Navigate to your prompt’s “Evaluations” tab.
- Click “New Evaluation”.
- Choose “LLM as Judge” as the type.
- Configure the Judge: Select the AI Provider and Model that will act as the judge for this evaluation.
- Define Criteria: Choose either an Evaluation Template or create a Custom Evaluation.
Using Evaluation Templates
Latitude provides pre-built templates for common evaluation criteria (e.g., Factuality, Conciseness, Toxicity). Using a template automatically populates the evaluation prompt for the judge LLM.
- Select “Use a template” during setup.
- Browse and choose the template that fits your needs.
- You can often customize the template prompt further.
See the full list of available templates.
Creating Custom Evaluations
For unique criteria, create a custom evaluation from scratch.
- Select “Custom Evaluation” during setup.
- Choose the desired output type for the judgment:
- Numeric: For scores within a range.
- Boolean: For simple pass/fail checks.
- Text: For categorical labels or free-form feedback.
- Write the Evaluation Prompt: This is the crucial step. You need to write a prompt that instructs the judge LLM how to evaluate the target prompt’s output. Include:
- Clear criteria for evaluation.
- Placeholders for the target prompt’s input ({{input}}) and output ({{output}}).
- Instructions for the desired JSON output format based on the chosen type.
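For instance, a custom helpfulness judge prompt might look like the sketch below. The wording, scale, and JSON field names (score, reason) are illustrative assumptions; adapt them to the output type and range you configured:

```
You are an impartial evaluator. Judge how helpful the assistant's response is to the user's request.

User request:
{{input}}

Assistant response:
{{output}}

Score the response from 0 (not helpful at all) to 10 (fully resolves the request).
Respond only with JSON in exactly this format:
{ "score": <number from 0 to 10>, "reason": "<one or two sentences explaining the score>" }
```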
Custom Evaluation Output Formats
Your evaluation prompt must instruct the judge LLM to return JSON in one of these formats:
1. Numeric Evaluation:
- Use Case: Scoring on a scale (e.g., 0-10).
- Setup: Define min/max values (e.g., 0 and 10).
- Judge Prompt Instructions: Tell the judge to evaluate based on criteria and return a score within the defined range.
- Required JSON Output: a JSON object containing the numeric score within the configured range and your reasoning (see the sketches after this list).
- Example Prompt Snippet: “…Evaluate the output on a scale from 0 to 10 based on its helpfulness. Return your score and reasoning in the specified JSON format.”
2. Boolean Evaluation:
- Use Case: Checking if a specific condition is met (true/false).
- Setup: No specific range needed.
- Judge Prompt Instructions: Tell the judge to check for a specific condition.
- Required JSON Output: a JSON object containing the true/false result and your reasoning (see the sketches after this list).
- Example Prompt Snippet: “…Does the provided output directly answer the user’s question? Return true or false and your reasoning in the specified JSON format.”
3. Text Evaluation:
- Use Case: Categorization or detailed qualitative feedback.
- Setup: Define allowed category labels (optional) or allow free text.
- Judge Prompt Instructions: Tell the judge to provide a textual assessment or category.
- Required JSON Output: a JSON object containing the label or feedback and your reasoning (see the sketches after this list).
- Example Prompt Snippet (Categorization): “…Classify the output’s tone as ‘Formal’, ‘Informal’, or ‘Neutral’. Return the label and reasoning in the specified JSON format.”
- Example Prompt Snippet (Feedback): “…Provide detailed feedback on the clarity and coherence of the output. Return your feedback in the specified JSON format.”
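As a rough guide, the judge’s JSON typically pairs the judgment itself with a short justification. The shapes below are illustrative sketches for each output type, not Latitude’s exact schema; follow the format shown in the evaluation editor for your configured type:

```
Numeric:  { "score": 8, "reason": "Mostly helpful, but omits the follow-up step." }
Boolean:  { "passed": true, "reason": "The output directly answers the user's question." }
Text:     { "result": "Formal", "reason": "Uses complete sentences and no colloquialisms." }
```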
Best Practices for Custom Judge Prompts
- Be Specific: Clearly define the criteria.
- Use Examples (Few-Shot): Include examples of good/bad outputs and desired judgments in the prompt (see the sketch after this list).
- Explicitly Request JSON: Remind the judge model to output in the correct JSON format.
- Iterate: Test your evaluation prompt in the playground or on sample logs.
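For example, a few-shot section of a numeric judge prompt might look like the sketch below; the requests, responses, and judgments are invented for illustration:

```
Example 1
User request:       "What's your refund policy?"
Assistant response: "We offer refunds."
Judgment:           { "score": 3, "reason": "Answers the question but gives no conditions or timeframe." }

Example 2
User request:       "What's your refund policy?"
Assistant response: "You can request a full refund within 30 days of purchase from your account page."
Judgment:           { "score": 9, "reason": "Specific, actionable, and directly answers the question." }
```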
Viewing Evaluation Results
Once evaluations run (either live or in batch), you can see the results:
- Logs View: Each log entry will show the scores/results from applicable evaluations.
- Evaluations Tab: View aggregated results, score distributions, and trends for each evaluation over time.
- Prompt Suggestions: Evaluation results feed into the Prompt Suggestions feature.
Next Steps
- Explore Programmatic Rule Evaluations
- Learn about Manual Evaluations
- Understand how to Run Evaluations