Incorporate manual reviews and direct human feedback into your evaluation workflow.
Human-in-the-Loop (HITL) evaluations involve direct human review and assessment of prompt outputs. This method is essential for capturing nuanced judgments, user preferences, and criteria that are difficult for automated systems to evaluate.
How it works: Team members manually review prompt outputs (logs) and assign scores or labels based on their judgment.
Best for: Capturing nuanced human preferences, evaluating criteria difficult for LLMs to judge, initial quality assessment, creating golden datasets for other evaluation types.
Requires: Setting up manual review workflows and criteria for reviewers.
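To make the workflow above concrete, here is a minimal sketch (not tied to any specific SDK; the class and function names are illustrative) of how manual annotations might be modeled: a reviewer attaches a label or score to each log, and approved annotations can later seed a golden dataset for other evaluation types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Log:
    """A single prompt run: the input that was sent and the output it produced."""
    id: str
    prompt_input: str
    output: str


@dataclass
class Annotation:
    """A human judgment attached to one log: a label, an optional score, and notes."""
    log_id: str
    reviewer: str
    label: str                   # e.g. "good" or "bad"
    score: Optional[int] = None  # e.g. 1-5, if your review criteria use a scale
    notes: str = ""


def build_golden_dataset(logs: list[Log], annotations: list[Annotation]) -> list[dict]:
    """Keep only logs that reviewers labeled 'good', as input/expected-output pairs."""
    approved_ids = {a.log_id for a in annotations if a.label == "good"}
    return [
        {"input": log.prompt_input, "expected_output": log.output}
        for log in logs
        if log.id in approved_ids
    ]
```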
Because HITL evaluations require manual input, they do not support automatic
live or batch execution like LLM-as-Judge or Programmatic Rules. Feedback
must be submitted individually for each log reviewed.
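A sketch of what that per-log submission could look like in practice, reusing the `Log` dataclass from the sketch above; `submit_feedback` is a hypothetical helper standing in for whatever annotation endpoint or SDK your platform provides. Each judgment is sent on its own, since there is no batch execution.

```python
def submit_feedback(log_id: str, label: str, score: int | None = None) -> None:
    """Hypothetical helper: in a real setup this would call your platform's
    annotation endpoint or SDK for a single log."""
    print(f"Submitted feedback for {log_id}: label={label}, score={score}")


def review_logs_interactively(logs: list[Log]) -> None:
    """Walk a reviewer through pending logs one at a time; each judgment is
    submitted individually, because HITL evaluations cannot run automatically."""
    for log in logs:
        print(f"\nLog {log.id}\nInput:  {log.prompt_input}\nOutput: {log.output}")
        label = input("Label this output (good/bad): ").strip().lower()
        score_raw = input("Optional score 1-5 (press Enter to skip): ").strip()
        score = int(score_raw) if score_raw else None
        submit_feedback(log.id, label, score)
```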
Check how to annotate a log. A log is the result of running your prompt; the reviewer annotates that result to indicate whether it was good or bad, or to assign a score.
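For example, if your platform exposes an HTTP endpoint for annotations (the URL and payload shape below are assumptions, not a documented API), annotating a single log result could look like this:

```python
import requests

# Hypothetical endpoint and payload: adjust to whatever your evaluation
# platform actually exposes for submitting human feedback on a log.
ANNOTATIONS_URL = "https://example.com/api/logs/{log_id}/annotations"


def annotate_log(log_id: str, passed: bool, score: int | None = None, notes: str = "") -> None:
    """Send one reviewer judgment for one log: good/bad plus an optional score."""
    payload = {
        "label": "good" if passed else "bad",
        "score": score,
        "notes": notes,
    }
    response = requests.post(ANNOTATIONS_URL.format(log_id=log_id), json=payload, timeout=10)
    response.raise_for_status()


# Example: mark a log as good with a 4/5 score.
# annotate_log("log_123", passed=True, score=4, notes="Accurate and well formatted.")
```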