Execute evaluations on datasets or continuously on live production logs.
Once you have defined evaluation criteria (LLM-as-Judge, Programmatic Rules), you need to run them against your prompt’s logs to generate results. Latitude supports two primary modes for running automated evaluations:
Batch evaluations allow you to assess prompt performance across a predefined set of inputs and expected outputs contained within a Dataset.
Use Cases: Testing a prompt against a fixed set of inputs, comparing experiment variants side by side, and checking outputs against ground-truth expected_output values.
How to Run:
1. Ensure you have a Dataset prepared with relevant inputs (and expected_output columns if needed by your evaluation metrics; see the example dataset after these steps).
2. Navigate to the specific Evaluation you want to run (within your prompt’s “Evaluations” tab).
3. Click the “Run experiment” button.
4. Define the experiment variants.
5. Select the Dataset you want to run the experiment against.
6. You will be redirected to the experiments tab, where the results are displayed.
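For illustration, a Dataset for batch evaluation is typically a table whose columns match your prompt’s parameters, plus an optional ground-truth column. The sketch below assumes a prompt with a single question parameter and a metric that compares against an expected_output column; the column names are examples, so adjust them to your own prompt and metric:

```csv
question,expected_output
"What is the capital of France?","Paris"
"How many continents are there?","7"
"Who wrote Don Quixote?","Miguel de Cervantes"
```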
Live evaluations automatically run on new logs as they are generated by your prompt in production (via API calls or the Playground). This provides continuous monitoring of prompt quality.
Use Cases: Monitoring real user traffic without preparing a Dataset, and spotting quality regressions as soon as they show up in new logs.
How to Enable:
Note: Evaluations requiring an expected_output (like Exact Match, Lexical Overlap, or Semantic Similarity) or Manual Evaluations cannot run in live mode, as they need pre-existing ground truth or human input.
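Because live evaluations attach to the logs your prompt already produces, no evaluation-specific code is needed on the calling side. As a rough sketch, generating a production log might look like the following; the `Latitude` client and `prompts.run` method reflect the TypeScript SDK (`@latitude-data/sdk`), and the project ID, prompt path, and parameters are placeholders, so check the SDK reference for the exact signatures:

```typescript
import { Latitude } from '@latitude-data/sdk'

// Assumed SDK surface (client constructor and prompts.run); verify the exact
// names and options against the SDK reference. The project ID, prompt path,
// and parameters below are placeholders.
const latitude = new Latitude(process.env.LATITUDE_API_KEY ?? '', {
  projectId: 123,
})

async function main() {
  // Running the prompt produces a production log. Any evaluation on this
  // prompt with live mode enabled will score the new log automatically.
  const result = await latitude.prompts.run('onboarding/welcome-email', {
    parameters: { user_name: 'Ada' },
  })

  console.log(result)
}

main().catch(console.error)
```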
Whether generated by batch experiments or live mode, evaluation results are accessible in Latitude and provide the data needed to understand performance, identify issues, and drive improvements using Prompt Suggestions.