A “Golden Dataset” is a carefully curated collection of inputs and expected outputs that represents critical test cases and desired behaviors for your prompt. It serves as a benchmark to prevent regressions when making changes.

Why Use a Golden Dataset?

  • Prevent Regressions: Ensure that changes to your prompt (or underlying models) don’t break previously working functionality or degrade quality on important cases.
  • Consistent Benchmarking: Provide a stable baseline for comparing the performance of different prompt versions.
  • Confidence in Deployment: Increase confidence that a new prompt version meets quality standards before publishing.
  • Capture Edge Cases: Explicitly test how your prompt handles known difficult or important scenarios.

Creating a Golden Dataset

  1. Identify Critical Scenarios: Determine the most important inputs or use cases your prompt must handle correctly.
  2. Gather Examples: Collect representative examples for these scenarios. Sources include:
    • Real production logs (especially successful ones or interesting failures).
    • Manually crafted edge cases.
    • Existing test suites.
  3. Define Expected Outputs (Ground Truth): For each input, define the ideal or minimally acceptable output. This might be:
    • An exact string.
    • A specific JSON structure.
    • Key information that must be present.
    • A classification label.
  4. Format as CSV: Structure this data into a CSV file with appropriate input columns (matching prompt parameters) and output columns (e.g., expected_output, expected_category); see the example after this list.
  5. Upload to Latitude: Upload the CSV as a new Dataset in Latitude and give it a clear name (e.g., “Chatbot v2 - Golden Regression Set”).
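
For example, a golden dataset for a support chatbot prompt that takes a user_message parameter might look like the sample below. The column names are illustrative; your input columns should match your own prompt’s parameters, and the expected columns should match whatever your evaluations check.

```csv
user_message,expected_category,expected_output
"How do I reset my password?",account,"Guide the user through the password reset flow, including the 'Forgot password' link."
"Cancel my subscription immediately",billing,"Acknowledge the request and explain the cancellation steps without upselling."
"asdf !!! ???",unclear,"Ask a clarifying question instead of guessing the intent."
```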

Using the Golden Dataset in Workflows

  • During Development: When iterating on a prompt in a draft version, run batch evaluations using relevant Programmatic Rules (like Exact Match, Semantic Similarity, JSON Validation) against the golden dataset to check for regressions before considering the draft ready.
  • CI/CD Pipeline: Integrate automated batch evaluations against the golden dataset into your pre-deployment checks. Fail the build if key metrics on the golden dataset drop below a threshold (see the sketch after this list).
  • Version Comparison: When comparing two prompt versions (e.g., A/B testing), run both against the golden dataset using the same evaluations to get a standardized performance comparison.
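
As a minimal sketch of the CI/CD idea, the script below loads the golden dataset, runs each input through the current prompt version, scores the results with a simple Exact Match rule, and fails the build if accuracy drops below a threshold. The run_prompt function is a placeholder for however you invoke your prompt (for example, via the Latitude SDK or an HTTP call), and the column names assume the illustrative CSV shown earlier.

```python
import csv
import sys

ACCURACY_THRESHOLD = 0.9  # fail the build below 90% exact-match accuracy


def run_prompt(user_message: str) -> dict:
    """Placeholder: call your prompt here (e.g., via the Latitude SDK or your
    own HTTP client) and return its parsed output."""
    raise NotImplementedError


def main() -> None:
    with open("golden_dataset.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    passed = 0
    for row in rows:
        result = run_prompt(row["user_message"])
        # Exact Match on the predicted category; swap in Semantic Similarity
        # or JSON Validation checks as appropriate for your prompt.
        if result.get("category") == row["expected_category"]:
            passed += 1

    accuracy = passed / len(rows)
    print(f"Golden dataset accuracy: {accuracy:.2%} ({passed}/{len(rows)})")
    if accuracy < ACCURACY_THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job


if __name__ == "__main__":
    main()
```

The same script can back a version comparison: run it once per prompt version against the identical golden dataset and compare the reported accuracies.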

Maintaining the Golden Dataset

  • Review Periodically: Revisit the golden dataset regularly to ensure it still represents your most critical scenarios.
  • Add New Cases: As new important use cases or failure modes are discovered in production, consider adding them to the golden dataset.
  • Version Control (Implicit): While datasets themselves aren’t directly versioned within Latitude like prompts, you can manage your source CSV files in your own version control system (like Git) if needed.

By establishing and maintaining golden datasets, you create a robust safety net for your prompt development lifecycle.

Next Steps