1. Define Your Quality Standards
- Identify Key Metrics: What defines a “good” response for this prompt? (e.g., Accuracy, Helpfulness, Conciseness, Safety, Format Adherence).
- Set Acceptance Criteria: Define minimum acceptable scores or pass rates for your key evaluations (see the sketch after this list).
- Choose Evaluation Types: Select the right mix of LLM-as-Judge, Programmatic Rules, and Manual Evaluations to cover your criteria.
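One way to make these standards enforceable is to encode them as data. The following is a minimal Python sketch, assuming illustrative metric names and thresholds (none of the values below are prescribed): a criteria config plus a small gate function that later steps, such as a pre-merge check, could call.

```python
# A minimal sketch of codifying acceptance criteria so they can be checked
# automatically. Metric names, scales, and thresholds are illustrative
# placeholders, not recommended values.

ACCEPTANCE_CRITERIA = {
    "accuracy": 0.85,          # LLM-as-Judge score, 0-1 scale
    "format_adherence": 0.99,  # programmatic rule pass rate
    "safety": 1.00,            # safety check pass rate
}

def meets_criteria(results: dict) -> bool:
    """Return True only if every metric clears its configured minimum."""
    return all(
        results.get(metric, 0.0) >= minimum
        for metric, minimum in ACCEPTANCE_CRITERIA.items()
    )

# Example:
# meets_criteria({"accuracy": 0.9, "format_adherence": 1.0, "safety": 1.0}) -> True
```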
2. Establish Golden Datasets
- Create and maintain a representative Dataset (a “golden dataset”) containing diverse inputs and, where applicable, expected outputs (one possible record format is sketched after this list).
- This dataset serves as your benchmark for regression testing.
- Include challenging edge cases and examples representing different user intents.
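A golden dataset can be as simple as a versioned JSONL file. The sketch below assumes one possible record format, with `input`, `expected_output`, and `tags` fields chosen for illustration rather than mandated by any tool.

```python
# A minimal sketch of a golden dataset stored as JSONL: one record per test
# case. Field names ("input", "expected_output", "tags") are illustrative.

import json

golden_examples = [
    {
        "input": "Summarize this refund policy in two sentences.",
        "expected_output": None,  # scored by a rubric, no single reference answer
        "tags": ["summarization", "policy"],
    },
    {
        "input": "What is 0 divided by 0?",
        "expected_output": "Explain that the expression is undefined.",
        "tags": ["edge-case", "math"],
    },
]

with open("golden_dataset.jsonl", "w", encoding="utf-8") as f:
    for example in golden_examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```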
3. Evaluation During Development
- Playground Testing: Use the Playground to get immediate evaluation feedback while iterating on prompts.
- Draft Evaluations: Run experiments on your golden dataset before merging changes from a draft version (a pre-merge regression check is sketched after this list).
- Peer Review: Include evaluation results (especially for failing cases) as part of the review process for prompt changes.
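The draft-evaluation step can be wired into CI so a prompt change is blocked if it regresses on the golden dataset. In the sketch below, `run_prompt` and `judge_score` are hypothetical placeholders for your model call and your evaluator, and the threshold is illustrative.

```python
# A minimal sketch of a pre-merge regression check over the golden dataset.
# run_prompt and judge_score are placeholders for your model call and your
# LLM-as-Judge or programmatic evaluator; the threshold is illustrative.

import json
import sys
from typing import Optional

THRESHOLD = 0.85

def run_prompt(prompt_version: str, user_input: str) -> str:
    raise NotImplementedError("call the draft prompt version here")

def judge_score(user_input: str, output: str, expected: Optional[str]) -> float:
    raise NotImplementedError("return an evaluation score between 0 and 1")

def regression_check(prompt_version: str, dataset_path: str) -> None:
    scores = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            case = json.loads(line)
            output = run_prompt(prompt_version, case["input"])
            scores.append(judge_score(case["input"], output, case.get("expected_output")))
    mean = sum(scores) / len(scores)
    print(f"{prompt_version}: mean score {mean:.3f} over {len(scores)} cases")
    if mean < THRESHOLD:
        sys.exit(1)  # fail the CI job and block the merge
```

Attaching the per-case scores to the pull request also makes the peer-review step concrete: reviewers can see exactly which golden examples regressed.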
4. Continuous Monitoring in Production
- Live Evaluations: Enable live evaluations for critical metrics (e.g., format validation, safety checks, basic relevance) to monitor real-time performance; a sketch of such programmatic checks follows below.
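Live evaluations work best for checks that are cheap and deterministic enough to run on every response. The sketch below shows two such programmatic checks, JSON format validation and a crude term blocklist; the expected response schema and the blocklist are assumptions for illustration, and real safety checks would normally rely on a dedicated classifier or judge evaluation.

```python
# A minimal sketch of lightweight live checks run on every production response.
# The expected JSON schema and the blocklist are illustrative assumptions.

import json

BLOCKED_TERMS = {"internal use only", "api_key"}  # placeholder list

def live_checks(response_text: str) -> dict:
    checks = {}
    # Format validation: the response is expected to be a JSON object
    # containing an "answer" field (an assumed schema).
    try:
        payload = json.loads(response_text)
        checks["valid_json"] = isinstance(payload, dict) and "answer" in payload
    except json.JSONDecodeError:
        checks["valid_json"] = False
    # Basic safety / leakage check against a term blocklist.
    lowered = response_text.lower()
    checks["no_blocked_terms"] = not any(term in lowered for term in BLOCKED_TERMS)
    return checks

# Example: live_checks('{"answer": "Refunds take 5-7 days."}')
# -> {"valid_json": True, "no_blocked_terms": True}
```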
5. Feedback Loops and Improvement
- Regular Review Meetings: Discuss evaluation trends and results as a team.
- Analyze Failures: Dig into logs with poor evaluation scores to understand the root causes.
- Leverage Suggestions: Use the Prompt Suggestions feature to guide improvements.
- Update Golden Dataset: Periodically add new challenging examples or successful edge cases from production logs to your golden dataset (sketched after this list).
- Refine Evaluations: Adjust evaluation criteria or prompts as your understanding of quality evolves.
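Closing the loop can be partially automated by harvesting low-scoring production logs into the golden dataset ahead of the next regression run. In the sketch below, `fetch_low_scoring_logs` is a hypothetical stand-in for whatever logging or observability API you use, and the record fields follow the earlier dataset sketch.

```python
# A minimal sketch of feeding poorly scored production logs back into the
# golden dataset. fetch_low_scoring_logs is a placeholder for your logging
# or observability API; field names match the earlier dataset sketch.

import json

def fetch_low_scoring_logs(metric: str, max_score: float) -> list:
    raise NotImplementedError("query production logs with low evaluation scores")

def append_to_golden_dataset(dataset_path: str, metric: str = "accuracy",
                             max_score: float = 0.5) -> int:
    added = 0
    with open(dataset_path, "a", encoding="utf-8") as f:
        for log in fetch_low_scoring_logs(metric, max_score):
            case = {
                "input": log["input"],
                "expected_output": None,  # fill in during the team's review
                "tags": ["from-production", metric],
            }
            f.write(json.dumps(case, ensure_ascii=False) + "\n")
            added += 1
    return added
```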