Numeric evaluations allow you to assess the quality of your LLM outputs using a numerical scale. This type of evaluation is helpful when you want to score outputs within a specific range, for example, a score between 0 and 10.

Creating a numeric evaluation

To create a numeric evaluation:

  1. Go to the Evaluations tab in your project.
  2. Click on the Create evaluation button.
  3. Provide a name for your evaluation.
  4. Select Number as the evaluation type.
  5. Specify the minimum and maximum values for your evaluation range.

Writing the evaluation prompt

When creating a numeric evaluation, you need to ensure that your evaluation prompt instructs the model to return a score within the specified range. The output should be a JSON object with the following format:

{
  "result": <numeric_score>,
  "reason": <explanation_for_the_score>
}
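As a minimal sketch of consuming this format, the snippet below parses a response and checks that the score falls inside the configured range. The function name and its arguments are illustrative, not part of any platform API:

```python
import json

def parse_numeric_evaluation(raw: str, min_value: float, max_value: float) -> dict:
    """Parse an evaluation response and verify the score is in range.

    Expects the {"result": ..., "reason": ...} format described above.
    This helper is a hypothetical example, not a built-in function.
    """
    data = json.loads(raw)
    score = data["result"]
    if not min_value <= score <= max_value:
        raise ValueError(f"Score {score} is outside [{min_value}, {max_value}]")
    return data

# Example response from the evaluating model
response = '{"result": 7, "reason": "Clear and mostly accurate answer."}'
evaluation = parse_numeric_evaluation(response, 0, 10)
print(evaluation["result"])  # 7
```

Validating the score on your side catches cases where the model drifts outside the scale you asked for.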

In your prompt, restate the evaluation range you selected when creating the evaluation. This ensures that the LLM understands the scale it should use when providing a score. For example, if you set the range from 0 to 10, your prompt might include a line like:

“Please evaluate the following output on a scale from 0 to 10, where 0 is the lowest quality and 10 is the highest quality.”

This helps maintain consistency between your evaluation settings and the actual scoring process.
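One way to keep the prompt and the configured range in sync is to build the prompt from the same minimum and maximum values. The helper below is a hypothetical sketch, assuming you assemble prompts in code; none of these names come from the platform:

```python
def build_evaluation_prompt(output_text: str, min_value: int, max_value: int) -> str:
    """Build a numeric evaluation prompt that embeds the configured range.

    Hypothetical helper: the wording and structure are an example only.
    """
    return (
        f"Please evaluate the following output on a scale from {min_value} to "
        f"{max_value}, where {min_value} is the lowest quality and "
        f"{max_value} is the highest quality.\n"
        'Respond with a JSON object: {"result": <numeric_score>, '
        '"reason": <explanation_for_the_score>}\n\n'
        f"Output to evaluate:\n{output_text}"
    )

print(build_evaluation_prompt("The capital of France is Paris.", 0, 10))
```

Because the range appears only once, changing the evaluation settings cannot silently disagree with the scale described in the prompt.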

Make sure to include this format in your evaluation prompt. If you’re not sure how to structure your prompt, you can use one of the provided templates as a reference.

Best Practices

  1. Choose an appropriate range: Select a range that provides enough granularity for your evaluation needs.
  2. Be consistent: Use the same numeric scale across similar evaluations for easier comparison.
  3. Provide clear criteria: In your evaluation prompt, clearly define what each score represents to ensure consistent scoring.
  4. Use alongside other evaluation types: Combine numeric evaluations with boolean and text evaluations for a more comprehensive assessment of your LLM outputs.
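To illustrate the last point, the snippet below combines the results of a numeric, a boolean, and a text evaluation into a single report. The dictionary shapes and field names are assumptions for the sake of the example, not a platform data model:

```python
# Hypothetical evaluation results, one per evaluation type
numeric_eval = {"type": "number", "result": 8, "reason": "Accurate and concise."}
boolean_eval = {"type": "boolean", "result": True, "reason": "No policy violations."}
text_eval = {"type": "text", "result": "Tone is friendly but slightly verbose."}

# Index the results by type so each dimension of quality is visible at a glance
report = {e["type"]: e["result"] for e in (numeric_eval, boolean_eval, text_eval)}
print(report)
# {'number': 8, 'boolean': True, 'text': 'Tone is friendly but slightly verbose.'}
```

A combined view like this makes it easier to see, for a single output, both how well it scored and whether it passed any hard requirements.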

By using numeric evaluations effectively, you can quantitatively measure the performance of your prompts and make data-driven decisions to improve your LLM applications.