Why Use a Golden Dataset?
- Prevent Regressions: Ensure that changes to your prompt (or underlying models) don’t break previously working functionality or degrade quality on important cases.
- Consistent Benchmarking: Provide a stable baseline for comparing the performance of different prompt versions.
- Confidence in Deployment: Increase confidence that a new prompt version meets quality standards before publishing.
- Capture Edge Cases: Explicitly test how your prompt handles known difficult or important scenarios.
Creating a Golden Dataset
- Identify Critical Scenarios: Determine the most important inputs or use cases your prompt must handle correctly.
- Gather Examples: Collect representative examples for these scenarios. Sources include:
- Real production logs (especially successful responses or interesting failures).
- Manually crafted edge cases.
- Existing test suites.
- Define Expected Outputs (Ground Truth): For each input, define the ideal or minimally acceptable output. This might be:
- An exact string.
- A specific JSON structure.
- Key information that must be present.
- A classification label.
- Format as CSV: Structure this data into a CSV file with appropriate input columns (matching prompt parameters) and output columns (e.g., `expected_output`, `expected_category`); see the sketch after this list for one way to generate such a file.
- Upload to Latitude: Upload the CSV as a new Dataset in Latitude and give it a clear name (e.g., “Chatbot v2 - Golden Regression Set”).
- Mark the expected output column: You can mark the expected output column as a ‘label’ by clicking on the column name and editing its role.
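As a rough illustration of the CSV step, here is a minimal sketch that builds a small golden dataset with Python’s standard csv module. The input column name (`user_message`) is a hypothetical placeholder — replace it with the parameters your prompt actually takes — while `expected_output` and `expected_category` follow the example column names above.

```python
import csv

# Hypothetical golden-dataset rows: `user_message` stands in for whatever
# parameters your prompt expects; `expected_output` and `expected_category`
# hold the ground truth used by evaluations.
rows = [
    {
        "user_message": "How do I reset my password?",
        "expected_output": "Go to Settings > Security and choose 'Reset password'.",
        "expected_category": "account",
    },
    {
        "user_message": "Cancel my subscription immediately!",
        "expected_output": "I can help with that. Please confirm the email on the account.",
        "expected_category": "billing",
    },
]

# Write the CSV that will be uploaded to Latitude as the golden dataset.
with open("golden_regression_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["user_message", "expected_output", "expected_category"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

Keeping this script (or the CSV it produces) alongside your prompt code also makes the “Version Control” suggestion below straightforward.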
Using the Golden Dataset in Workflows
- During Development: When iterating on a prompt in a draft version, run batch evaluations using relevant Programmatic Rules (like Exact Match, Semantic Similarity, JSON Validation) against the golden dataset to check for regressions before considering the draft ready.
- CI/CD Pipeline: Integrate automated batch evaluations against the golden dataset into your pre-deployment checks. Fail the build if key metrics on the golden dataset drop below a threshold (see the sketch after this list).
- Version Comparison: When comparing two prompt versions (e.g., A/B testing), run both against the golden dataset using the same evaluations to get a standardized performance comparison.
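To make the development and CI/CD gates above concrete, the sketch below shows one possible pass/fail check. It assumes you have already run a batch evaluation and exported per-row results to a local `results.csv` with `actual_output` and `expected_output` columns; the file name, column names, the 0.95 threshold, and the exact-match rule are all assumptions to adapt to your own setup and evaluation types.

```python
import csv
import sys

RESULTS_FILE = "results.csv"   # hypothetical export of batch-evaluation results
PASS_THRESHOLD = 0.95          # fail the build if fewer than 95% of rows pass

def row_passes(row: dict) -> bool:
    # Exact match stands in for whichever programmatic rule you configured
    # (Semantic Similarity, JSON Validation, ...).
    return row["actual_output"].strip() == row["expected_output"].strip()

with open(RESULTS_FILE, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

passed = sum(row_passes(r) for r in rows)
pass_rate = passed / len(rows) if rows else 0.0
print(f"Golden dataset pass rate: {pass_rate:.2%} ({passed}/{len(rows)})")

# A non-zero exit code makes the CI job fail when quality drops below the threshold.
if pass_rate < PASS_THRESHOLD:
    sys.exit(1)
```

Run as the last step of your pre-deployment pipeline, this keeps a regression on the golden dataset from reaching production unnoticed.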
Maintaining the Golden Dataset
- Review Periodically: Regularly review the golden dataset to ensure it still represents the most critical scenarios.
- Add New Cases: As new important use cases or failure modes are discovered in production, consider adding them to the golden dataset.
- Version Control (Implicit): While datasets themselves aren’t directly versioned within Latitude like prompts, you can manage your source CSV files in your own version control system (like Git) if needed.
Next Steps
- Learn more about Creating and Using Datasets
- Set up Programmatic Rule Evaluations to use with your dataset.
- Integrate checks into your Team Workflows.