What are programmatic rule evaluations?

Programmatic rule evaluations use algorithmic metrics to evaluate the quality of LLM responses. This approach is particularly powerful when you need to enforce specific requirements, validate formats, or check against predefined criteria.


Use cases

Programmatic rule evaluations are best suited for scenarios where:

  • Format Validation: You need to ensure responses follow specific formats or schemas
  • Expected Output Comparison: You want to compare responses to expected outputs
  • Constraint Requirements: You need to enforce specific constraints on responses, such as length limits or avoiding certain words
  • Fast, Zero-Cost Evaluation: You need to evaluate a large number of responses quickly and without incurring additional model costs

Trade-offs

While programmatic rule evaluations are powerful, they come with some considerations:

  • Rigidity: Rules may not account for valid variations in responses
  • Limited Context: Rules may not understand nuanced differences or contextual similarities
  • Simple Criteria: Built-in rules handle simple checks well, but complex criteria may require custom rules

For subjective or complex criteria that require understanding, consider using LLM-as-judge evaluations instead. For user-provided feedback or cases that require human verification, consider using Human-in-the-loop evaluations.

Available metrics

Latitude provides a complete suite of built-in metrics for evaluating prompts programmatically.

Exact Match

Checks if the response is exactly the same as the expected output. The resulting score is “matched” or “unmatched”.

Exact Match evaluations require an expected output, so they do not support live evaluation.
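
As a rough illustration of what this metric computes (not Latitude's actual implementation), an exact-match check reduces to a plain string comparison. The helper name and the optional whitespace stripping below are assumptions made for the sketch:

```python
def exact_match(response: str, expected: str, strip_whitespace: bool = True) -> str:
    """Return "matched" if the response equals the expected output, else "unmatched"."""
    if strip_whitespace:
        response, expected = response.strip(), expected.strip()
    return "matched" if response == expected else "unmatched"


print(exact_match("Paris", "Paris "))  # matched (after stripping whitespace)
print(exact_match("paris", "Paris"))   # unmatched (comparison here is case-sensitive)
```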

Regular Expression

Checks if the response matches a given regular expression. The resulting score is “matched” or “unmatched”.
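
For illustration only, here is a minimal sketch of a regex rule; whether the real metric uses search or full-match semantics is an assumption of this sketch:

```python
import re


def regex_match(response: str, pattern: str) -> str:
    """Return "matched" if the pattern is found anywhere in the response."""
    return "matched" if re.search(pattern, response) else "unmatched"


# Example: require the response to contain an ISO 8601 date.
print(regex_match("Your order shipped on 2024-05-01.", r"\d{4}-\d{2}-\d{2}"))  # matched
print(regex_match("Your order shipped yesterday.", r"\d{4}-\d{2}-\d{2}"))      # unmatched
```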

Schema Validation

Checks if the response conforms to a given schema. The resulting score is “valid” or “invalid”. Currently, only JSON schemas are supported.
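
Conceptually, this metric parses the response as JSON and checks it against a JSON Schema. The sketch below uses the third-party jsonschema package to show the idea; it is not Latitude's implementation:

```python
import json

from jsonschema import ValidationError, validate  # pip install jsonschema


def schema_valid(response: str, schema: dict) -> str:
    """Return "valid" if the response parses as JSON and conforms to the schema."""
    try:
        validate(instance=json.loads(response), schema=schema)
        return "valid"
    except (json.JSONDecodeError, ValidationError):
        return "invalid"


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
print(schema_valid('{"name": "Ada", "age": 36}', schema))  # valid
print(schema_valid('{"name": "Ada"}', schema))             # invalid (missing "age")
```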

Length Count

Checks if the response is of a certain length. The resulting score is the length of the response, which can be counted in characters, words, or sentences.
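
Here is a minimal sketch of the three counting modes; the sentence splitter below is a naive approximation and not necessarily how Latitude segments sentences:

```python
import re


def length_count(response: str, unit: str = "characters") -> int:
    """Count the length of the response in characters, words, or sentences."""
    if unit == "characters":
        return len(response)
    if unit == "words":
        return len(response.split())
    if unit == "sentences":
        # Naive split on ., ! or ? followed by whitespace or end of string.
        return len([s for s in re.split(r"[.!?]+(?:\s+|$)", response) if s.strip()])
    raise ValueError(f"unknown unit: {unit}")


text = "Keep it short. Two sentences max!"
print(length_count(text, "characters"))  # 33
print(length_count(text, "words"))       # 6
print(length_count(text, "sentences"))   # 2
```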

Lexical Overlap

Checks if the response contains the expected output. The resulting score is the percentage of overlap. Overlap can be measured with substring matching, Levenshtein distance, or ROUGE algorithms.

Lexical Overlap evaluations require an expected output, so they do not support live evaluation.
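
As an illustration of one of these strategies, the sketch below derives an overlap percentage from Levenshtein edit distance. Normalizing by the longer string's length is an assumption of this sketch, and the substring and ROUGE variants are not shown:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def overlap_score(response: str, expected: str) -> float:
    """Edit-distance-based overlap as a percentage (100 = identical strings)."""
    if not response and not expected:
        return 100.0
    distance = levenshtein(response, expected)
    return 100.0 * (1 - distance / max(len(response), len(expected)))


print(round(overlap_score("The capital is Paris", "The capital of France is Paris"), 1))  # 66.7
```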

Semantic Similarity

Checks if the response is semantically similar to the expected output. The resulting score is the percentage of similarity. Similarity is measured by computing the cosine similarity between the response and the expected output.

Semantic Similarity evaluations require an expected output, so they do not support live evaluation.
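
For intuition, cosine similarity compares the direction of two embedding vectors. The sketch below assumes the response and expected output have already been embedded by some model (not specified here), and the clamp-to-zero mapping onto a percentage is an assumption of the sketch, not necessarily how Latitude computes the score:

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_percentage(response_embedding: list[float], expected_embedding: list[float]) -> float:
    """Map cosine similarity into a 0-100 score, clamping negative values to zero."""
    return 100.0 * max(0.0, cosine_similarity(response_embedding, expected_embedding))


# Toy 3-dimensional "embeddings"; real embeddings come from an embedding model.
print(round(similarity_percentage([0.2, 0.8, 0.1], [0.25, 0.7, 0.15]), 1))  # ≈ 99.2
```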

Creating a programmatic rule evaluation

You can create programmatic rule evaluations by clicking the “Add evaluation” button in the Evaluations tab of your prompt. Select “Programmatic Rule” as the evaluation type, then configure the metric and its parameters.

To learn more about how to run evaluations, check out the Running evaluations guide.