Running Evaluations
Learn how to run evaluations on your prompts.
Once you’ve defined evaluations for your prompts, you can run them in the playground, live on incoming logs, or in batch mode through datasets.
Running evaluations in the playground
The Playground is a powerful tool where you can see the results of your evaluations after running a prompt. This simplifies the workflow of iterating on both your prompts and your evaluations. You can also open the evaluation’s editor directly with the resulting log already imported.
Running evaluations in live mode
Evaluations in live mode run automatically on every new log generated in your project, including logs from the playground and the API, but not logs generated from datasets. This is useful if you want to monitor the performance of your prompts in real time.
We recommend keeping a few key evaluations running in live mode to spot degradations in response quality as soon as they happen. Sometimes new model releases or changes in parameters can lead to a drop in response quality, so this is a good way to catch those issues early.
You can enable or disable live evaluation in the evaluation’s settings at any time.
Evaluations that require an expected output or human verification do not support live evaluation.
Running evaluations in batch mode
To assess the performance of your prompt over a larger set of predefined use cases, you can run evaluations in batch mode. You can run batch evaluations from an existing dataset, or from production logs by first creating a dataset from the logs you want to use.
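Log exports and dataset schemas vary between projects, so as a rough illustration only, here is a minimal Python sketch of turning exported production logs into a CSV dataset for batch evaluation. The log fields, column names, and file names are assumptions for the example, not the platform’s actual export or dataset format.

```python
# Minimal sketch: build a batch-evaluation dataset from exported logs.
# The log structure and CSV columns below are illustrative assumptions;
# match them to your project's actual export format and dataset schema.
import csv

# Assumed shape of an exported log entry: prompt parameters plus the response.
exported_logs = [
    {"parameters": {"question": "How do I reset my password?"},
     "response": "Go to Settings > Security and click 'Reset password'."},
    {"parameters": {"question": "Can I export my data?"},
     "response": "Yes, use the Export button on the dashboard."},
]

# Write one dataset row per log: the prompt input plus an expected-output
# column that evaluations requiring a ground truth can compare against.
with open("batch_evaluation_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_output"])
    writer.writeheader()
    for log in exported_logs:
        writer.writerow({
            "question": log["parameters"]["question"],
            "expected_output": log["response"],
        })
```

Once the dataset is uploaded to your project, any batch evaluation can iterate over its rows instead of ad-hoc inputs.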
On the evaluation’s dashboard, click the “Run batch evaluation” button to start the process. The status of the batch evaluation appears just above the results table. Once it finishes, the statistics update with the new results, and you can check the evaluation logs to drill down into individual results.
Evaluations that require human intervention do not support batch evaluation.