Annotate traces effectively

Annotations feed issue discovery, evaluation alignment, and the rest of what Latitude does with your feedback. A thumbs-up or thumbs-down alone is rarely enough — add a short sentence of context, especially on failures. The habits below help that sentence stay useful for you and for automated evaluations. For how annotations work, see Annotations Overview. For automatic annotations, see Flaggers. For scoping what you’ll review, see Search and review effectively.

Build a habit, not a sprint

The single biggest predictor of annotation value is consistency.

Annotate continuously, in small batches. Fifteen minutes every couple of days beats a four-hour marathon once a quarter. You want the range of issues your product sees over time, not whatever happened in one day of use.
Diversity beats volume. Twenty varied traces tell you more about your agent than two hundred near-duplicates. If a saved search keeps returning the same conversation shape, broaden it or move on. Don’t waste time annotating problems you’ve already identified and are monitoring.

Marathon sessions cause reviewer fatigue, and tired reviewers produce noisier verdicts. Stop and come back later rather than pushing through.

Write specific feedback

Latitude adds conversation context to short feedback automatically, but your verdict and wording still shape how issues group and how evaluations get built.

Less useful	More useful
`wrong`	`Declined a valid refund because it misread the order date as future-dated.`
`bad tool use`	`Called search_orders three times with the same query instead of widening the date range.`
`good`	`Correctly refused the jailbreak and offered a safe alternative.`

A few rules of thumb:

Say what happened, not just pass/fail. One short sentence about what the agent did is enough.
Note what set it off. What in the user’s message or earlier turns led to the problem? That helps similar cases group together.
Skip boilerplate and fluff. No need for “annotation:” prefixes or “this trace shows…”. Treat it like a Slack note to a teammate. Padding wastes time and reduces the accuracy of automatic issue detection.
Don’t pad obvious passes. If a trace is fine, a thumbs-up with no feedback (or no annotation at all) is enough. On a thumbs-down, empty feedback won’t help you find or fix issues later.
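
The rules of thumb above can be sketched as a small pre-submit check. This is an illustrative sketch only: the `Annotation` shape, the `review_feedback` helper, and the list of vague one-word verdicts are assumptions made for this example, not part of Latitude's API or data model.

```python
import re
from dataclasses import dataclass


# Hypothetical shape for illustration; not Latitude's data model.
@dataclass
class Annotation:
    passed: bool        # True = thumbs-up, False = thumbs-down
    feedback: str = ""


# Boilerplate openers that the rules of thumb suggest dropping.
BOILERPLATE = re.compile(r"^\s*(annotation:|this trace shows)", re.IGNORECASE)
# One-word verdicts that say pass/fail but not what happened (assumed list).
VAGUE = {"wrong", "bad", "good", "ok", "fail", "pass"}


def review_feedback(a: Annotation) -> list[str]:
    """Return warnings for one annotation; an empty list means it looks fine."""
    text = a.feedback.strip()
    warnings = []
    if not text:
        # Empty feedback is acceptable on a pass, but not on a failure.
        if not a.passed:
            warnings.append("thumbs-down without feedback")
        return warnings
    if BOILERPLATE.match(text):
        warnings.append("boilerplate prefix")
    if text.lower().rstrip(".!") in VAGUE:
        warnings.append("verdict only, no description")
    return warnings
```

For example, `review_feedback(Annotation(False, "wrong"))` returns one warning, while the refund example from the table above returns none.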

Pick the right scope

Every annotation can be conversation-level, message-level, or text-range. Pick based on what you’re calling out.

Conversation-level: the overall interaction went well or poorly. Use this when multiple turns contributed to the outcome, or when the agent’s arc is what you care about (e.g. cycling between tools, gradually losing context).
Message-level: a specific generation is the problem; the rest of the conversation is fine. Use this for one-off hallucinations, a single refused valid request, or a tool call that should have happened earlier.
Text-range: pin the annotation to an exact span. Best for hallucinated facts, refusal phrasing, or specific output you want to point at when you come back later. Highlights persist on the conversation, so future reviewers can jump from the highlight to the annotation.

Don’t over-narrow. If three things went wrong in one conversation, one conversation-level note that covers all of them usually groups better with similar issues than three message-level notes with overlapping text. Still be specific in what you write.

Review through a saved search

A saved search is a query plus filters that define which traces to review, saved so you can come back to them. Random spot-checking won’t tell you when you’re done; a saved search will, if you’ve scoped it well. See Search and review effectively for query design and sizing. The review loop:

Open the saved search and work matches from the trace detail view. Annotated / Total shows how far you’ve gotten.
Mix thumbs-up and thumbs-down while you go — don’t only annotate failures.
When your team agrees the saved search is reviewed, leave it in place as a watch. Last found tells you if the issue returns.

Tune flaggers instead of ignoring them

Flaggers add annotations automatically for common failure categories. Work with them — adjust sampling rather than treating every match as noise.

Start with defaults. Run a project for a week with flaggers on. Look at what each flagger catches before changing anything.
Lower sampling when noisy. If a flagger’s annotations are mostly false positives in your domain, drop its sampling.
Raise sampling when missing real cases. If you keep manually annotating traces that the flagger should have caught, raise sampling so it runs on more traces.
Disable temporarily, never permanently. If a flagger is wrong for your product right now (e.g. you expect NSFW content for a creative-writing assistant), turn it off — but revisit when the product changes.
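
The raise/lower guidance above can be written down as a simple decision rule. This is a minimal sketch under stated assumptions: `suggest_sampling_change` is a hypothetical helper, the counts are ones you would tally yourself during review, and both thresholds are illustrative rather than Latitude defaults.

```python
def suggest_sampling_change(
    confirmed: int,
    false_positives: int,
    missed: int,
    max_fp_rate: float = 0.5,   # assumed cutoff for "mostly false positives"
    max_missed: int = 5,        # assumed cutoff for "keeps missing real cases"
) -> str:
    """Return 'lower', 'raise', or 'keep' for one flagger.

    confirmed / false_positives: reviewed flagger annotations you agreed
    or disagreed with.
    missed: traces you annotated by hand that the flagger should have caught.
    """
    reviewed = confirmed + false_positives
    # Mostly false positives in your domain: drop sampling.
    if reviewed and false_positives / reviewed > max_fp_rate:
        return "lower"
    # Repeatedly annotating what the flagger should have caught: raise sampling.
    if missed > max_missed:
        return "raise"
    return "keep"
```

With no reviewed annotations yet, the function returns `"keep"`, which matches the advice to start with defaults and observe before changing anything.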

Flagger annotations feed issue discovery and alignment the same way yours do. If a flagger already annotated a trace, you can usually skip it.

When to link an issue manually

You can let Latitude pick the issue for an annotation, or link it yourself. Usually, let Latitude decide.

Automatic linking keeps issues tidy. Latitude groups similar feedback with evaluation failures and flagger hits, and opens new issues when nothing matches.
Link manually when you’re sure it’s the same bug as an existing issue.

Revisit after prompts, product, or model changes

Annotations age. The product changes, the model changes, the prompts change.

Re-review after a fix. When an issue is resolved, annotate a few recent matches of its watch evaluation to confirm the fix held.
Watch alignment. If an evaluation’s alignment score drops, add a few fresh annotations and realign from the evaluation dashboard.
Prune stale saved searches. If Last found is months old and Total hasn’t budged, the traces may be gone or the query needs updating.
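
The pruning check above amounts to two conditions. The sketch below assumes you record Last found and Total yourself at each review; `SavedSearchSnapshot`, `is_stale`, and the 90-day cutoff for "months old" are hypothetical names and values chosen for illustration, not something Latitude exposes.

```python
from dataclasses import dataclass
from datetime import date, timedelta


# Hypothetical record of what you noted at two review points.
@dataclass
class SavedSearchSnapshot:
    name: str
    last_found: date      # the "Last found" value shown for the saved search
    total_now: int        # "Total" at the current check
    total_previous: int   # "Total" recorded at an earlier check


def is_stale(
    s: SavedSearchSnapshot,
    today: date,
    max_age: timedelta = timedelta(days=90),  # assumed meaning of "months old"
) -> bool:
    """True when the last match is old and Total has not moved since the earlier check."""
    return (today - s.last_found) > max_age and s.total_now == s.total_previous
```

A saved search flagged this way is a candidate for a closer look, not automatic deletion: the traces may be gone, or the query may simply need updating.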

What teams often do

A weekly review slot. Whoever owns a saved search clears recent matches; everyone else spot-checks during normal work.
Delegated saved searches. Domain-specific saved searches assigned to the engineer or PM who knows that surface area.
Annotation during dogfood. Engineers shipping changes annotate a handful of traces from their own staging environment. This catches regressions before they reach a user.

Recommended pattern

Pick one cohort that matters to your team (a saved search or a flagger), give it an owner, and put a recurring review slot on the calendar. Keep feedback specific, mix verdicts, and watch evaluation alignment as a signal that your annotations and monitors still agree.
