Use code-based metrics and rules to objectively evaluate prompt outputs.
Programmatic Rule evaluations apply objective, code-based rules and metrics to assess prompt outputs. They are ideal for validating specific requirements, checking against ground truth, and enforcing constraints automatically.
How it works: Applies code-based rules and metrics to check outputs against objective criteria.
Best for: Objective checks, ground truth comparisons (using datasets), format validation (JSON, regex), safety checks (keyword detection), length constraints.
Requires: Defining specific rules (e.g., exact match, contains keyword, JSON schema validation) and potentially providing a Dataset with expected outputs.
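In practice, a rule is just a deterministic check that maps a model output to a score or a pass/fail result. Below is a minimal sketch of two such checks as plain Python functions; the function names and signatures are illustrative, not the platform's API.

```python
import json
import re

def json_format_rule(response: str) -> bool:
    """Pass if the response parses as valid JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def keyword_safety_rule(response: str, banned_keywords: list[str]) -> bool:
    """Fail if any banned keyword appears in the response (case-insensitive)."""
    if not banned_keywords:
        return True
    pattern = re.compile("|".join(re.escape(k) for k in banned_keywords), re.IGNORECASE)
    return pattern.search(response) is None
```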
Length: Checks whether the response meets a length constraint. The resulting score is the length of the response, counted in characters, words, or sentences.
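A rough sketch of how such a length score could be computed; the sentence splitting below is a naive regex used only for illustration.

```python
import re

def length_score(response: str, unit: str = "words") -> int:
    """Return the length of the response in characters, words, or sentences."""
    if unit == "characters":
        return len(response)
    if unit == "words":
        return len(response.split())
    if unit == "sentences":
        # Naive split on terminal punctuation, for illustration only.
        return len([s for s in re.split(r"[.!?]+", response) if s.strip()])
    raise ValueError(f"Unknown unit: {unit!r}")
```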
Lexical Overlap: Checks whether the response contains the expected output. The resulting score is the percentage of overlap, which can be measured with longest common substring, Levenshtein distance, or ROUGE.
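For illustration, here is one way a longest-common-substring overlap could be turned into a percentage score. Levenshtein- and ROUGE-based variants would normalize differently; this is a sketch, not the exact formula used by the platform.

```python
from difflib import SequenceMatcher

def lexical_overlap_score(response: str, expected: str) -> float:
    """Percentage of the expected output covered by the longest common substring."""
    if not expected:
        return 0.0
    matcher = SequenceMatcher(None, response, expected)
    match = matcher.find_longest_match(0, len(response), 0, len(expected))
    return 100.0 * match.size / len(expected)
```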
Semantic Similarity: Checks whether the response is semantically similar to the expected output. The resulting score is the percentage of similarity, measured by computing the cosine distance.
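A sketch of the cosine computation over precomputed embedding vectors. Which embedding model produces the vectors, and how the cosine value is mapped to a 0-100 score, are assumptions here.

```python
import numpy as np

def semantic_similarity_score(response_emb: np.ndarray, expected_emb: np.ndarray) -> float:
    """Percentage similarity from the cosine between two embedding vectors."""
    cosine = float(np.dot(response_emb, expected_emb) /
                   (np.linalg.norm(response_emb) * np.linalg.norm(expected_emb)))
    # Assumed mapping: cosine in [-1, 1] rescaled to a 0-100 score.
    return 100.0 * (cosine + 1.0) / 2.0
```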
Numeric Similarity: Checks whether the response is numerically similar to the expected output. The resulting score is the percentage of similarity, measured by computing the relative difference.
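One plausible way to turn the relative difference into a percentage score; clamping negative values to zero is an assumption made for this sketch.

```python
def numeric_similarity_score(response_value: float, expected_value: float) -> float:
    """Percentage similarity based on the relative difference from the expected value."""
    if expected_value == 0:
        return 100.0 if response_value == 0 else 0.0
    relative_diff = abs(response_value - expected_value) / abs(expected_value)
    return max(0.0, 100.0 * (1.0 - relative_diff))
```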
The expected output, also known as the label, is the correct or ideal response that the language model should generate for a given prompt. You can create datasets with an expected output column to evaluate prompts against ground truth.
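A ground-truth dataset can be as small as a table of inputs paired with labels. An illustrative shape written as Python rows: the input column name is an assumption, while expected_output matches the column referenced below.

```python
# Illustrative dataset rows: each pairs a prompt input with its expected output (label).
dataset = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2 + 2?", "expected_output": "4"},
    {"input": "Translate 'hello' to Spanish.", "expected_output": "hola"},
]
```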
Many programmatic rules (Exact Match, Lexical Overlap, Semantic Similarity) require comparing the model's output against a known correct answer (expected_output). This is typically done by the following steps, sketched in code below:
1. Creating a Dataset containing input examples and their corresponding desired outputs.
2. Configuring the evaluation rule to use the expected_output column from that dataset.
3. Running the evaluation in an experiment on that dataset.
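Tying these steps together, a minimal evaluation loop over such a dataset might look like the sketch below. The run_prompt function is purely hypothetical and stands in for whatever actually calls the model under evaluation.

```python
def run_prompt(prompt_input: str) -> str:
    """Hypothetical stand-in for calling the prompt / model under evaluation."""
    raise NotImplementedError

def evaluate_dataset(dataset: list[dict], rule) -> float:
    """Average a rule's score over every (input, expected_output) pair in the dataset."""
    scores = []
    for row in dataset:
        output = run_prompt(row["input"])
        scores.append(rule(output, row["expected_output"]))
    return sum(scores) / len(scores) if scores else 0.0

# Example: average lexical overlap across the dataset defined above.
# mean_overlap = evaluate_dataset(dataset, lexical_overlap_score)
```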