Where this fits: Datasets are part of Refine, after Signals. They turn real traces into reusable test cases for regression testing.
A dataset is a collection of rows you curate for testing and improving your agent. Each row holds an input, the agent’s output, an optional expected output, and arbitrary metadata. Teams use them as golden datasets: stable, known-good test sets that a fix has to keep passing.
Figure: The Datasets page, listing golden datasets with name, description, and last updated.

What a dataset row contains

| Column | Description |
| --- | --- |
| Input | The input your agent received, for example the user message. |
| Output | What your agent actually returned. |
| Expected output | The correct or desired answer, used to check the agent. Optional, see Add expected output. |
| Metadata | Arbitrary fields carried alongside the row. |
Figure: A dataset detail view, showing rows with input, output, and expected output columns.
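
The four columns above map naturally onto a small record type. The following is a minimal sketch for reasoning about rows in your own code, not the product's schema: the class name `DatasetRow`, the field names, and the `has_expectation` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DatasetRow:
    """One dataset row, mirroring the four columns described above.

    Field names are illustrative assumptions, not an official schema.
    """
    input: str                             # what the agent received
    output: str                            # what the agent actually returned
    expected_output: Optional[str] = None  # optional known-good answer
    metadata: dict[str, Any] = field(default_factory=dict)  # arbitrary extra fields

    def has_expectation(self) -> bool:
        """Return True when the row can be checked against an expected output."""
        return self.expected_output is not None


# Example row with made-up content.
row = DatasetRow(
    input="What is your refund window?",
    output="Refunds are available for 14 days.",
    expected_output="Refunds are available for 30 days.",
    metadata={"source": "trace", "topic": "billing"},
)
```

Keeping `expected_output` optional reflects the table: a row is still useful for replaying inputs even before anyone has written down the correct answer.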

Create a dataset

You can build a dataset three ways:

From real traces

Select traces from the trace list, search results, or a signal, and add them to a dataset. The most realistic test cases come straight from production.

Manually

Open Datasets in your project and create a new dataset. Then either use Import a CSV to load cases in bulk, or use Add row to enter them by hand.
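
If you prepare cases outside the UI, a short script can write them to a CSV for import. This is a sketch under stated assumptions: the header names (`input`, `output`, `expected_output`, `metadata`) and the choice to encode metadata as a JSON string are guesses for illustration, so check them against what the import dialog actually expects.

```python
import csv
import io
import json

# Hypothetical header names; confirm them against the CSV import before use.
FIELDNAMES = ["input", "output", "expected_output", "metadata"]


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text, encoding metadata as a JSON string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "input": row["input"],
            "output": row.get("output", ""),
            "expected_output": row.get("expected_output", ""),
            "metadata": json.dumps(row.get("metadata", {})),
        })
    return buffer.getvalue()


# Made-up cases; the second has no expected output yet.
cases = [
    {
        "input": "What is your refund window?",
        "output": "Refunds are available for 14 days.",
        "expected_output": "Refunds are available for 30 days.",
        "metadata": {"topic": "billing"},
    },
    {"input": "Hi, can you help me?", "output": "Of course, what do you need?"},
]

csv_text = rows_to_csv(cases)
```

Using `csv.DictWriter` rather than string concatenation means commas and quotes inside a user message are escaped correctly.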

From your coding agent

Through the MCP server, an agent like Claude or Cursor can create datasets and pull in the traces behind a signal for you.

How datasets are used

  • Regression testing: replay a dataset’s inputs against your agent and compare results to the expected outputs and your evaluations. See Regression testing.
  • Curating test sets: collect representative traces from Search and Signals into a stable, reusable set.
  • Sharing with your harness: export a dataset as CSV to drive tests in your own pipeline.
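
To make the regression-testing and harness bullets concrete, here is a minimal sketch of replaying an exported CSV against an agent in your own pipeline. Everything named here is an assumption for illustration: the column names, the `run_regression` helper, and `toy_agent`, which stands in for a real agent call. It uses exact string matching, which is a simplification of whatever evaluations you run in practice.

```python
import csv
import io
from typing import Callable

# Hypothetical export; real column names may differ.
EXPORTED_CSV = """input,output,expected_output
What is 2 + 2?,5,4
Say hello,hello,hello
Tell me a joke,Why did the chicken cross the road?,
"""


def run_regression(csv_text: str, agent: Callable[[str], str]) -> list[dict]:
    """Replay each input through `agent` and compare with the expected output."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        expected = (row.get("expected_output") or "").strip()
        if not expected:
            continue  # no expected output, so there is nothing to compare against
        actual = agent(row["input"]).strip()
        results.append({
            "input": row["input"],
            "expected": expected,
            "actual": actual,
            "passed": actual == expected,  # exact match; a deliberate simplification
        })
    return results


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call; returns canned answers."""
    answers = {"What is 2 + 2?": "4", "Say hello": "hello"}
    return answers.get(prompt, "")


results = run_regression(EXPORTED_CSV, toy_agent)
failures = [r for r in results if not r["passed"]]
```

Rows without an expected output are skipped rather than counted as failures, which matches the idea that the expected output is optional. The recorded `output` column is ignored here because the point of a replay is to test the agent's current behavior, not what it returned when the trace was captured.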
