Execute evaluations on datasets or continuously on live production logs.
(Include expected_output columns in your dataset if needed by your evaluation metrics.)
Evaluations that require expected_output (like Exact Match, Lexical Overlap, Semantic Similarity) or Manual Evaluations cannot run in live mode, as they need pre-existing ground truth or human input.
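
The live-mode restriction can be pictured as a simple filter over the metrics you have configured. The sketch below is illustrative only: the `Metric` class, its flags, the `runnable_in_live_mode` helper, and the `response_length` metric are all hypothetical names invented for this example and are not part of any product API. It assumes each metric declares whether it needs ground truth or human input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Hypothetical description of an evaluation metric (not a real API)."""
    name: str
    needs_expected_output: bool = False  # requires ground truth in the dataset
    needs_human_input: bool = False      # requires a manual reviewer


def runnable_in_live_mode(metric: Metric) -> bool:
    """Live logs carry no ground truth and no reviewer, so both flags must be off."""
    return not (metric.needs_expected_output or metric.needs_human_input)


METRICS = [
    Metric("exact_match", needs_expected_output=True),
    Metric("lexical_overlap", needs_expected_output=True),
    Metric("semantic_similarity", needs_expected_output=True),
    Metric("manual_evaluation", needs_human_input=True),
    Metric("response_length"),  # hypothetical reference-free metric
]

# Split the configured metrics by the modes they can run in.
live_metrics = [m.name for m in METRICS if runnable_in_live_mode(m)]
dataset_only = [m.name for m in METRICS if not runnable_in_live_mode(m)]

print("live mode:", live_metrics)
print("dataset mode only:", dataset_only)
```

Under these assumptions, only the reference-free metric is eligible for live mode; the other four need a dataset with expected_output values or a human reviewer.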