Programmatic Rule evaluations apply objective, code-based rules and metrics to assess prompt outputs. They are ideal for validating specific requirements, checking against ground truth, and enforcing constraints automatically.

How It Works

  1. Target Prompt Run: Your main prompt generates an output (log).
  2. Evaluation Trigger: The Programmatic Rule evaluation is run on that log.
  3. Rule Execution: Latitude applies the configured rule/metric algorithm to the prompt’s output (and potentially input or dataset values).
  4. Result Calculation: The algorithm calculates a score (e.g., length, similarity score) or a classification (e.g., matched/unmatched, valid/invalid).
  5. Result Storage: The result is stored alongside the original log.
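A minimal, self-contained sketch of this flow, using a simple length check as the code-based rule. The names (Log, length_rule, evaluate) are illustrative only, not Latitude's internals:

```python
from dataclasses import dataclass

@dataclass
class Log:
    id: str
    output: str

def length_rule(output: str, max_chars: int = 200) -> dict:
    # Rule execution: apply an objective, code-based check to the output.
    score = len(output)
    return {"score": score, "passed": score <= max_chars}

def evaluate(log: Log, results_store: dict) -> dict:
    # Result calculation and storage alongside the original log.
    result = length_rule(log.output)
    results_store[log.id] = result
    return result

store: dict = {}
log = Log(id="log-1", output="The quick brown fox jumps over the lazy dog.")
print(evaluate(log, store))  # {'score': 44, 'passed': True}
```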

Use Cases

Programmatic rules excel at:

  • Objective Checks: Evaluating criteria that are clearly defined and measurable (e.g., output length, format adherence).
  • Ground Truth Comparison: Checking if the output matches an expected answer provided in a Dataset.
  • Format Validation: Ensuring the output conforms to a specific structure (e.g., JSON schema, regular expression).
  • Constraint Enforcement: Verifying constraints like maximum length or absence of specific keywords.
  • Speed & Cost: Evaluating large volumes of logs quickly and at no additional LLM cost.

Trade-offs

  • Rigidity: May fail valid responses that differ slightly from the expected format or ground truth.
  • Limited Nuance: Cannot easily evaluate subjective qualities like tone, style, or creativity.
  • Setup: Requires defining specific rules or providing datasets with expected outputs.

For subjective criteria, use LLM-as-Judge. For human preferences, use Manual Evaluations.

Available Metrics & Rules

Latitude provides a complete suite of built-in metrics for evaluating prompts programmatically.

Exact Match

Checks if the response is exactly the same as the expected output. The resulting score is “matched” or “unmatched”.

Exact Match evaluations require an expected output, so they do not support live evaluation.
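As a rough illustration of the check itself (not Latitude's implementation):

```python
def exact_match(response: str, expected_output: str) -> str:
    # Strict string equality: any difference yields "unmatched".
    return "matched" if response == expected_output else "unmatched"

print(exact_match("Paris", "Paris"))  # matched
print(exact_match("paris", "Paris"))  # unmatched
```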

Regular Expression

Checks if the response matches a configured regular expression. The resulting score is “matched” or “unmatched”.
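A rough sketch of the check, here applying the pattern anywhere in the response (Latitude's exact matching semantics may differ):

```python
import re

def regex_match(response: str, pattern: str) -> str:
    # Search for the pattern anywhere in the response text.
    return "matched" if re.search(pattern, response) else "unmatched"

# e.g. require the response to contain an ISO-formatted date
print(regex_match("Shipped on 2024-05-01.", r"\d{4}-\d{2}-\d{2}"))  # matched
print(regex_match("Shipped yesterday.", r"\d{4}-\d{2}-\d{2}"))      # unmatched
```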

Schema Validation

Checks if the response conforms to the configured schema. The resulting score is “valid” or “invalid”. Currently, only JSON schemas are supported.
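One way to picture the check, using the third-party `jsonschema` package as a stand-in (Latitude's built-in validator is independent of this sketch):

```python
import json
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

def schema_check(response: str) -> str:
    # Invalid JSON or a schema violation both count as "invalid".
    try:
        validate(instance=json.loads(response), schema=schema)
        return "valid"
    except (ValidationError, json.JSONDecodeError):
        return "invalid"

print(schema_check('{"name": "Ada", "age": 36}'))  # valid
print(schema_check('{"name": "Ada"}'))             # invalid (missing "age")
```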

Length Count

Checks if the response is of a certain length. The resulting score is the length of the response, counted in characters, words, or sentences.
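A rough sketch of counting by each unit (the sentence splitting here is naive and only illustrative):

```python
import re

def length_count(response: str, unit: str = "characters") -> int:
    if unit == "characters":
        return len(response)
    if unit == "words":
        return len(response.split())
    if unit == "sentences":
        # Naive split on sentence-ending punctuation.
        return len([s for s in re.split(r"[.!?]+", response) if s.strip()])
    raise ValueError(f"unknown unit: {unit}")

text = "Hello there. How are you?"
print(length_count(text, "characters"))  # 25
print(length_count(text, "words"))       # 5
print(length_count(text, "sentences"))   # 2
```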

Lexical Overlap

Checks if the response contains the expected output. The resulting score is the percentage of overlap. Overlap can be measured with substring matching, Levenshtein distance, or ROUGE.

Lexical Overlap evaluations require an expected output, so they do not support live evaluation.
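A rough sketch of two of these strategies, using substring containment and a `difflib.SequenceMatcher` ratio as a stand-in for edit-distance-based overlap (Latitude's Levenshtein and ROUGE scoring may differ in detail):

```python
from difflib import SequenceMatcher

def substring_overlap(response: str, expected_output: str) -> float:
    # All-or-nothing containment check, as a percentage.
    return 100.0 if expected_output in response else 0.0

def similarity_ratio(response: str, expected_output: str) -> float:
    # Character-level similarity ratio, as a percentage.
    return 100.0 * SequenceMatcher(None, response, expected_output).ratio()

print(substring_overlap("The capital is Paris.", "Paris"))  # 100.0
print(similarity_ratio("The capital is Paris.", "Paris"))   # partial overlap, well below 100
```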

Semantic Similarity

Checks if the response is semantically similar to the expected output. The resulting score is the percentage of similarity, measured by computing the cosine similarity between the response and the expected output.

Semantic Similarity evaluations require an expected output, so they do not support live evaluation.
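Cosine similarity is computed over vector representations of the two texts; how those vectors are produced is not shown in this toy sketch, which uses placeholder vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_similarity_score(response_vec: np.ndarray, expected_vec: np.ndarray) -> float:
    # Expressed as a percentage, matching the metric's resulting score.
    return 100.0 * cosine_similarity(response_vec, expected_vec)

# Toy vectors standing in for real embeddings of the response and expected output:
print(round(semantic_similarity_score(np.array([1.0, 0.5]), np.array([0.9, 0.6]))))  # 99
```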

Using Datasets for Ground Truth

Many programmatic rules (Exact Match, Lexical Overlap, Semantic Similarity) require comparing the model’s output against a known correct answer (expected_output). This is typically done by:

  1. Creating a Dataset containing input examples and their corresponding desired outputs.
  2. Configuring the evaluation rule to use the expected_output column from that dataset.
  3. Running the evaluation in batch mode on that dataset.
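A rough sketch of what a batch run over such a dataset boils down to, using Exact Match as the rule; `run_prompt` is a hypothetical stand-in for generating the prompt's output for each row:

```python
import csv

def run_prompt(input_text: str) -> str:
    # Hypothetical placeholder; in practice this is your prompt's output.
    return input_text.upper()

def batch_exact_match(dataset_path: str) -> float:
    with open(dataset_path, newline="") as f:
        rows = list(csv.DictReader(f))
    matched = sum(run_prompt(row["input"]) == row["expected_output"] for row in rows)
    return 100.0 * matched / len(rows)  # success rate in percent

# Example dataset.csv:
#   input,expected_output
#   hello,HELLO
#   goodbye,GOODBYE
# print(batch_exact_match("dataset.csv"))
```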

Setting Up a Programmatic Rule Evaluation

  1. Navigate to your prompt’s “Evaluations” tab.
  2. Click “New Evaluation”.
  3. Choose “Programmatic Rule” as the type.
  4. Select the desired Metric from the available options (e.g., Exact Match, Regex, Length Count).
  5. Configure the metric’s specific parameters (e.g., provide the regex pattern, select the dataset column for expected output, set length constraints).
  6. Save the evaluation.

Viewing Evaluation Results

Results are visible in:

  • Logs View: Individual log entries show pass/fail status or scores.
  • Evaluations Tab: Aggregated results, success rates, score distributions.
  • Batch Evaluation Results: Detailed breakdown when run on datasets.

Next Steps