Once you have defined evaluation criteria (LLM-as-Judge, Programmatic Rules), you need to run them against your prompt’s logs to generate results. Latitude supports two primary modes for running automated evaluations:

Running Evaluations on Datasets (Batch Mode)

Batch evaluations allow you to assess prompt performance across a predefined set of inputs and expected outputs contained within a Dataset.

Use Cases:

  • Testing prompt changes against a golden dataset (regression testing).
  • Comparing different prompt versions (A/B testing) on the same inputs.
  • Evaluating performance on specific edge cases or scenarios defined in the dataset.
  • Generating scores for metrics that require ground truth (e.g., Exact Match, Semantic Similarity).

How to Run:

  1. Ensure you have a Dataset prepared with relevant inputs (and an expected_output column if your evaluation metrics need ground truth); see the sample dataset sketch after this list.
  2. Navigate to the specific Evaluation you want to run (within your prompt’s “Evaluations” tab).
  3. Click the “Run Batch Evaluation” button.
  4. Select the Dataset you want to run the evaluation against.
  5. Confirm and start the batch job.
  6. Latitude will process each row in the dataset, run the prompt with that row’s inputs, and apply the selected evaluation to the resulting log.
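
As a reference for step 1, here is a minimal sketch of what a golden dataset for a ground-truth metric such as Exact Match might look like when uploaded as CSV. The column names (`user_question`, `expected_output`) are illustrative; match whatever columns your prompt parameters and evaluation settings actually expect.

```csv
user_question,expected_output
"What is the capital of France?","Paris"
"Convert 2 km to meters.","2000 meters"
"Is 17 a prime number?","Yes"
```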

Running Evaluations Continuously (Live Mode / Ongoing)

Live evaluations automatically run on new logs as they are generated by your prompt in production (via API calls or the Playground). This provides continuous monitoring of prompt quality.
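
As a sketch of where those production logs come from: the example below assumes the TypeScript SDK (`@latitude-data/sdk`) exposes a `Latitude` client with a `prompts.run` method; the API key, project ID, prompt path, and parameters are placeholders, so check the current SDK reference for exact names. Every such run creates a log, and any evaluation with live mode enabled scores that log automatically.

```typescript
// Minimal sketch (assumed SDK surface, placeholder values): running a prompt
// through Latitude so its logs are captured and live evaluations can score them.
import { Latitude } from '@latitude-data/sdk'

const latitude = new Latitude(process.env.LATITUDE_API_KEY ?? '', {
  projectId: 123, // placeholder: your Latitude project ID
})

async function main() {
  // Each run produces a log in Latitude; evaluations with "Live Evaluation"
  // enabled are applied to that log automatically.
  const result = await latitude.prompts.run('onboarding/welcome-email', {
    parameters: { user_name: 'Ada' }, // placeholder prompt path and parameters
  })

  console.log(result)
}

main().catch(console.error)
```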

Use Cases:

  • Real-time monitoring of key quality metrics (e.g., validity, safety, basic helpfulness).
  • Quickly detecting performance regressions caused by model updates or unexpected inputs.
  • Tracking overall prompt performance trends over time.

How to Enable:

  1. Navigate to the specific Evaluation you want to run live.
  2. Go to its settings.
  3. Toggle the “Live Evaluation” option ON.
  4. Save the settings.

Evaluations requiring an expected_output (like Exact Match, Lexical Overlap, Semantic Similarity) or Manual Evaluations cannot run in live mode, as they need pre-existing ground truth or human input.

Viewing Evaluation Results

Whether generated in batch or live mode, evaluation results are accessible in several places:

  • Logs View: Individual logs show scores/results from all applicable evaluations that have run on them.
  • Evaluations Tab (Per Prompt): View aggregated statistics, score distributions, success rates, and time-series trends for each specific evaluation.
  • Batch Evaluation Results Page: After a batch job completes, you can view a dedicated summary page showing overall statistics and a detailed breakdown of results for each item in the dataset.

These results provide the data needed to understand performance, identify issues, and drive improvements using Prompt Suggestions.

Next Steps