Running Evaluations
Execute evaluations on datasets or continuously on live production logs.
Once you have defined evaluation criteria (LLM-as-Judge, Programmatic Rules), you need to run them against your prompt’s logs to generate results. Latitude supports two primary modes for running automated evaluations:
Running Evaluations on Datasets (Run experiment)
Batch evaluations allow you to assess prompt performance across a predefined set of inputs and expected outputs contained within a Dataset.
Use Cases:
- Testing prompt changes against a golden dataset (regression testing).
- Comparing different prompt versions (A/B testing) on the same inputs.
- Evaluating performance on specific edge cases or scenarios defined in the dataset.
- Generating scores for metrics that require ground truth (e.g., Exact Match, Semantic Similarity).
How to Run:
- Ensure you have a Dataset prepared with relevant inputs (and `expected_output` columns if needed by your evaluation metrics; see the example dataset below).
- Navigate to the specific Evaluation you want to run (within your prompt’s “Evaluations” tab).
- Click the “Run experiment” button.
- Define the experiment variants.
- Select the Dataset you want to run the experiment against.
- You will be redirected to the Experiments tab with the results.
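For reference, a dataset for a batch experiment is simply a table of inputs plus an optional ground-truth column. The sketch below is illustrative only: the `question` column stands in for whatever parameters your prompt expects, and the `expected_output` column is only required by ground-truth metrics such as Exact Match or Semantic Similarity.

```csv
question,expected_output
"What is the capital of France?","Paris"
"What is 2 + 2?","4"
"Who wrote Hamlet?","William Shakespeare"
```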
Running Evaluations Continuously (Live Mode / Ongoing)
Live evaluations automatically run on new logs as they are generated by your prompt in production (via API calls or the Playground). This provides continuous monitoring of prompt quality.
Use Cases:
- Real-time monitoring of key quality metrics (e.g., validity, safety, basic helpfulness).
- Quickly detecting performance regressions caused by model updates or unexpected inputs.
- Tracking overall prompt performance trends over time.
How to Enable:
- Navigate to the specific Evaluation you want to run live.
- Go to its settings.
- Toggle the “Live Evaluation” option ON.
- Save the settings.
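Once the toggle is on, every new log produced by the prompt is evaluated automatically. As a rough sketch of where those logs come from, the snippet below runs a prompt through the Latitude TypeScript SDK; the project id, prompt path, and parameter names are assumptions for illustration, and method names may differ between SDK versions, so check the SDK reference for your version.

```typescript
// Sketch only: producing production logs that live evaluations then score.
// The project id, prompt path, and parameter names below are made up for illustration.
import { Latitude } from '@latitude-data/sdk'

const latitude = new Latitude(process.env.LATITUDE_API_KEY!, {
  projectId: 123, // assumption: replace with your project id
})

async function answerQuestion(question: string) {
  // Each call creates a log for the prompt; any evaluation with
  // "Live Evaluation" toggled ON runs against that log automatically.
  return latitude.prompts.run('support/answer-question', {
    parameters: { question },
  })
}

answerQuestion('How do I reset my password?').then((result) => {
  console.log(result)
})
```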
Evaluations requiring an `expected_output` (like Exact Match, Lexical Overlap, or Semantic Similarity) or Manual Evaluations cannot run in live mode, as they need pre-existing ground truth or human input.
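To see why, consider how ground-truth metrics are computed. The toy scorers below (illustrative only, not Latitude’s implementation) both take an `expectedOutput` argument; a live production log only carries the actual output, so there is nothing to compare against unless the log came from a dataset row.

```typescript
// Illustration only: ground-truth metrics compare the actual output against a
// known expected_output, which live production logs do not include.
function exactMatch(actualOutput: string, expectedOutput: string): number {
  return actualOutput.trim() === expectedOutput.trim() ? 1 : 0
}

function lexicalOverlap(actualOutput: string, expectedOutput: string): number {
  const actualTokens = new Set(actualOutput.toLowerCase().split(/\s+/))
  const expectedTokens = expectedOutput.toLowerCase().split(/\s+/)
  if (expectedTokens.length === 0) return 0
  const hits = expectedTokens.filter((token) => actualTokens.has(token)).length
  return hits / expectedTokens.length
}

console.log(exactMatch('Paris', 'Paris'))                    // 1
console.log(lexicalOverlap('The capital is Paris', 'Paris')) // 1
```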
Viewing Evaluation Results
Whether you run evaluations as experiments or continuously in live mode, the results are accessible in several places:
- Logs View: Individual logs show scores/results from all applicable evaluations that have run on them.
- Evaluations Tab (Per Prompt): View aggregated statistics, score distributions, success rates, and time-series trends for each specific evaluation.
- Experiments: When you run evaluations as experiments, you can view detailed results and compare the different variants side by side.
These results provide the data needed to understand performance, identify issues, and drive improvements using Prompt Suggestions.
Next Steps
- Learn how to prepare data using Datasets
- Understand how evaluation results power Prompt Suggestions
- Explore the different Evaluation Types