Running Evaluations on Datasets (Run experiment)
Batch evaluations allow you to assess prompt performance across a predefined set of inputs and expected outputs contained within a Dataset.
Use Cases:
- Testing prompt changes against a golden dataset (regression testing).
- Comparing different prompt versions (A/B testing) on the same inputs.
- Evaluating performance on specific edge cases or scenarios defined in the dataset.
- Generating scores for metrics that require ground truth (e.g., Exact Match, Semantic Similarity).
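As a rough illustration of what such ground-truth metrics compute, the sketch below implements an exact-match check and a simple embedding-based semantic similarity score. It is not tied to any particular SDK; the `embed` helper is an assumption and must be supplied by you.

```python
# Minimal sketch of two ground-truth metrics; function names and the
# embedding helper are illustrative, not part of any platform SDK.
import math
from typing import Callable, Sequence

def exact_match(output: str, expected_output: str) -> float:
    """Return 1.0 if the normalized strings match exactly, else 0.0."""
    return 1.0 if output.strip().lower() == expected_output.strip().lower() else 0.0

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_similarity(output: str, expected_output: str,
                        embed: Callable[[str], Sequence[float]]) -> float:
    """Score outputs by cosine similarity of their embeddings (higher = more similar)."""
    return cosine_similarity(embed(output), embed(expected_output))
```

Both checks need an `expected_output` to compare against, which is why they require a Dataset rather than live traffic.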
To run a batch evaluation:
- Ensure you have a Dataset prepared with relevant inputs (and `expected_output` columns if needed by your evaluation metrics); see the example dataset after these steps.
- Navigate to the specific Evaluation you want to run (within your prompt’s “Evaluations” tab).
- Click the “Run experiment” button.
- Define the experiment variants.
- Select the Dataset you want to run the experiment against.
- You will be redirected to the Experiments tab with the results.
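For reference, here is a minimal, hypothetical example of such a dataset for a translation prompt; the input column names (`text`, `target_language`) are placeholders and should match your prompt's own input variables.

```python
# Hypothetical dataset rows for a translation prompt. The input columns
# ("text", "target_language") feed the prompt, while "expected_output"
# supplies the ground truth used by metrics such as Exact Match.
dataset = [
    {"text": "Good morning", "target_language": "French",  "expected_output": "Bonjour"},
    {"text": "Thank you",    "target_language": "Spanish", "expected_output": "Gracias"},
]
```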
Running Evaluations Continuously (Live Mode / Ongoing)
Live evaluations automatically run on new logs as they are generated by your prompt in production (via API calls or the Playground). This provides continuous monitoring of prompt quality.
Use Cases:
- Real-time monitoring of key quality metrics (e.g., validity, safety, basic helpfulness).
- Quickly detecting performance regressions caused by model updates or unexpected inputs.
- Tracking overall prompt performance trends over time.
To enable live evaluation:
- Navigate to the specific Evaluation you want to run live.
- Go to its settings.
- Toggle the “Live Evaluation” option ON.
- Save the settings.
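Conceptually, enabling live evaluation means every new production log is scored as it arrives by checks that need no ground truth. The sketch below only illustrates that idea with a hypothetical `on_new_log` hook and a JSON-validity check; it is not the platform's actual mechanism.

```python
import json

def is_valid_json(output: str) -> float:
    """Reference-free check: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

def on_new_log(log: dict) -> dict:
    """Hypothetical hook: score each incoming production log as it arrives.
    Live evaluations can only use checks like this one, which need no
    expected_output (ground truth) to compute a score."""
    log["scores"] = {"json_validity": is_valid_json(log["output"])}
    return log
```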
Evaluations requiring an `expected_output` (like Exact Match, Lexical Overlap, Semantic Similarity) or Manual Evaluations cannot run in live mode, as they need pre-existing ground truth or human input.
Viewing Evaluation Results
Whether evaluations are run as experiments or in live mode, the results are accessible in several places:
- Logs View: Individual logs show scores/results from all applicable evaluations that have run on them.
- Evaluations Tab (Per Prompt): View aggregated statistics, score distributions, success rates, and time-series trends for each specific evaluation.
- Experiments: When you run evaluations as experiments, you can view detailed results and compare the different variants.
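To make the variant comparison concrete, here is a small sketch of how per-variant results can be aggregated; the variant names and scores are made up for illustration.

```python
from statistics import mean

# Hypothetical experiment results: one list of per-row scores per prompt variant.
results = {
    "variant_a": [1.0, 0.0, 1.0, 1.0],
    "variant_b": [1.0, 1.0, 1.0, 0.0],
}

# Comparing variants boils down to aggregating per-row scores, e.g. the mean.
for variant, scores in results.items():
    print(f"{variant}: mean score = {mean(scores):.2f} over {len(scores)} rows")
```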
Next Steps
- Learn how to prepare data using Datasets
- Understand how evaluation results power Prompt Suggestions
- Explore the different Evaluation Types