Humans-in-the-Loop
Incorporate manual reviews and direct human feedback into your evaluation workflow.
Human-in-the-Loop (HITL) evaluations, also known as Manual Evaluations, involve direct human review and assessment of prompt outputs. This method is essential for capturing nuanced judgments, user preferences, and criteria that are difficult for automated systems to evaluate.
How it Works
- Target Prompt Run: Your main prompt generates an output (log).
- Evaluation Setup: You define a HITL evaluation in Latitude, specifying what kind of feedback is needed (e.g., a numeric score, a boolean flag, a category label, free-form text).
- Manual Review: A team member (or designated reviewer) examines the prompt’s input and output within a specific log entry.
- Feedback Submission: The reviewer submits their judgment (score, label, text feedback) for that log through the Latitude dashboard or via the API/SDK.
- Result Storage: The human-provided evaluation result is stored alongside the corresponding log.
Because HITL evaluations require manual input, they do not support automatic live or batch execution like LLM-as-Judge or Programmatic Rules. Feedback must be submitted individually for each log reviewed.
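Conceptually, each submitted review becomes a small evaluation-result record tied to one log. The sketch below illustrates that idea in TypeScript; the field names are illustrative assumptions for explanation and do not mirror Latitude's actual schema.

```typescript
// Illustrative shape of a human-submitted evaluation result.
// Field names are assumptions for explanation, not Latitude's real schema.
type HitlResult = {
  logId: string; // the log (prompt run) that was reviewed
  evaluationId: string; // the HITL evaluation this result belongs to
  value: number | boolean | string; // numeric score, boolean flag, or text/label
  reviewer?: string; // who submitted the judgment (optional)
  submittedAt: Date; // when the feedback was recorded
};

// Example: a reviewer scores a response 2/5 for tone.
const example: HitlResult = {
  logId: "log_123",
  evaluationId: "eval_hitl_tone",
  value: 2,
  reviewer: "reviewer@example.com",
  submittedAt: new Date(),
};
```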
Use Cases
Manual evaluations are best for:
- Nuanced Judgments: Assessing qualities like subtle tone issues, specific domain accuracy, or overall user satisfaction that automated methods might miss.
- Capturing User Feedback: Integrating direct feedback (e.g., thumbs up/down, ratings) collected from end-users of your application.
- Ground Truth Creation: Building a “golden dataset” of high-quality, human-verified examples that can be used to train or test automated evaluation methods (like LLM-as-Judge or fine-tuning models).
- Safety & Compliance Reviews: Having humans verify adherence to critical safety or ethical guidelines.
- Exploratory Analysis: Understanding failure modes or quality issues before setting up automated evaluations.
Trade-offs
- Scalability: Cannot be fully automated; requires manual effort for each review.
- Time & Cost: Can be slower and more resource-intensive than automated methods.
- Subjectivity & Consistency: Judgments can vary between reviewers. Clear guidelines and reviewer calibration are important.
How to Perform Manual Reviews & Capture Feedback
1. Setting Up the HITL Evaluation
- Navigate to your prompt’s “Evaluations” tab.
- Click “New Evaluation”.
- Choose “Human-in-the-Loop” as the type.
- Configure the expected feedback type: Numeric (define range), Boolean, Text (define categories or allow free text).
- Provide clear instructions or criteria for reviewers if needed (in the description field).
2. Reviewing Logs in the Dashboard
- Go to the “Logs” section or the specific HITL evaluation’s dashboard.
- Select a log entry to review.
- Examine the prompt input, output, and context.
- Find the section for the HITL evaluation.
- Enter the score, select the label, or type the feedback according to the evaluation setup.
- Submit the evaluation result for that log.
3. Capturing Feedback via API/SDK
- Use the Latitude API or SDK endpoints/methods for submitting evaluation results.
- Pass the `log_id`, the `evaluation_id` (for the specific HITL evaluation), and the evaluation `result` (score, boolean, or text).
- This allows you to build custom review interfaces or pipe in feedback collected directly from your application’s users. A sketch of what such a call could look like follows below.
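As a rough sketch, a custom review interface could forward user feedback to Latitude over HTTP. The endpoint path, payload fields, and auth header below are assumptions made for illustration; consult the Latitude API reference or SDK documentation for the exact routes and method names.

```typescript
// Minimal sketch: submit a human evaluation result for a log over HTTP.
// NOTE: the endpoint path and body fields are assumptions, not the documented API.
async function submitHitlResult(
  logId: string,
  evaluationId: string,
  result: number | boolean | string,
  reason?: string
): Promise<void> {
  const response = await fetch(
    // Hypothetical route; check the Latitude API reference for the real one.
    `https://gateway.latitude.so/api/evaluations/${evaluationId}/results`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.LATITUDE_API_KEY}`, // project API key
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ log_id: logId, result, reason }),
    }
  );

  if (!response.ok) {
    throw new Error(`Failed to submit HITL result: ${response.status}`);
  }
}

// Example usage: pipe a thumbs-down from your app's UI into the HITL evaluation.
submitHitlResult("log_123", "eval_hitl_satisfaction", false, "User said the answer was off-topic")
  .catch(console.error);
```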
Viewing Evaluation Results
Manually submitted results appear alongside other evaluation results:
- Logs View: Attached to the individual log entry.
- Evaluations Tab: Aggregated statistics and distributions for the HITL evaluation.
Next Steps
- Learn about LLM-as-Judge Evaluations
- Explore Programmatic Rule Evaluations
- Understand how to Run Automated Evaluations