Evaluations overview

An evaluation is an automated detector that scores sessions as they arrive. It watches for one behavior or quality criterion, runs on completed traffic, and produces a score each time it checks a session. Those scores feed the same analytics, signal, and alignment workflows as annotations and flaggers. Every signal is backed by an evaluation. When a signal’s evaluation matches a session, that session joins the signal.

What an evaluation has

A name and description: the behavior being detected.
A detection method: how it decides whether a session matches. See Detection methods.
A trigger: which sessions it runs on, and at what sampling rate. See Triggers.
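
As a rough mental model, the three parts above can be pictured as a single record. The sketch below is illustrative only: the type and field names (EvaluationSketch, scope, sampling, and so on) and the 0-to-1 sampling scale are assumptions made for this example, not Latitude's actual schema.

```typescript
// Hypothetical shape, for illustration only; not Latitude's real schema.
type DetectionMethod = "conditions" | "llm-judge" | "custom-script";

interface Trigger {
  // Which sessions the evaluation runs on (hypothetical key/value filter).
  scope: Record<string, string>;
  // Share of in-scope sessions to score, assumed here to range from 0 to 1.
  sampling: number;
}

interface EvaluationSketch {
  name: string;
  description: string; // the behavior being detected
  method: DetectionMethod;
  trigger: Trigger;
}

// Example record: a deterministic detector sampled on a quarter of sessions.
const emptyResponses: EvaluationSketch = {
  name: "Empty response",
  description: "The assistant returned no content to the user.",
  method: "conditions",
  trigger: { scope: { environment: "production" }, sampling: 0.25 },
};
```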

How an evaluation runs

A session completes in your project.
Latitude checks it against each active evaluation’s scope and sampling.
Each evaluation whose scope and sampling select the session scores it.
Each returns a pass or fail verdict with feedback, stored as a score.
A passing score adds the session to the evaluation’s signal.

passed = true means the behavior is present, not that the session was good. An evaluation that detects a bad behavior passes when that behavior happens, and the session then joins that behavior's signal.
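
The flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Latitude's implementation: the Session, Evaluation, and Score shapes, the runEvaluations function, and the 0-to-1 sampling scale are all hypothetical names chosen for this example.

```typescript
// Hypothetical types and names, for illustration only.
interface Session { id: string; environment: string; toolErrors: number; }

interface Score { evaluation: string; sessionId: string; passed: boolean; feedback: string; }

interface Evaluation {
  name: string;
  active: boolean;
  inScope: (s: Session) => boolean; // trigger scope
  sampling: number;                 // trigger sampling, assumed 0 to 1
  detect: (s: Session) => { passed: boolean; feedback: string };
}

// Runs one completed session through every evaluation.
// `random` is injectable so the sampling step can be made deterministic.
function runEvaluations(
  session: Session,
  evaluations: Evaluation[],
  signals: Record<string, string[]>,
  random: () => number = Math.random,
): Score[] {
  const scores: Score[] = [];
  for (const ev of evaluations) {
    if (!ev.active || !ev.inScope(session)) continue; // scope check
    if (random() >= ev.sampling) continue;            // sampling check
    const verdict = ev.detect(session);
    scores.push({ evaluation: ev.name, sessionId: session.id, ...verdict });
    if (verdict.passed) {
      // A pass means the behavior is present, so the session joins the signal.
      const members = signals[ev.name] ?? [];
      members.push(session.id);
      signals[ev.name] = members;
    }
  }
  return scores;
}

// A detector for a bad behavior: it passes when tool errors are present.
const toolErrorEval: Evaluation = {
  name: "Tool errors",
  active: true,
  inScope: (s) => s.environment === "production",
  sampling: 1,
  detect: (s) => ({
    passed: s.toolErrors > 0,
    feedback: s.toolErrors > 0 ? "Session contains tool errors." : "No tool errors found.",
  }),
};

const signalMembers: Record<string, string[]> = {};
const sessionScores = runEvaluations(
  { id: "s1", environment: "production", toolErrors: 2 },
  [toolErrorEval],
  signalMembers,
);
```

Here the session with two tool errors gets a passing score and is added to the "Tool errors" signal, even though tool errors are a bad outcome for the session itself.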

Where evaluations come from

An evaluation can be created in two ways.

Generated from a signal

When Latitude discovers a signal, or when you choose to monitor one, it can generate an evaluation from the signal’s description, example traces, annotations, and scores. You don’t pick the method. Latitude builds a detector from the evidence and keeps it aligned to human judgment over time.

Defined by you

When you create a signal yourself, you define its evaluation directly. You choose one of three detection methods:

Set of conditions: deterministic checks, free and instant.
LLM as judge: describe the behavior and let an LLM decide.
Custom script: JavaScript for anything the other two can’t express.

A detector you define runs exactly as written. It is not automatically realigned to annotations the way a generated one is. See Alignment.

Choosing a detection method

Clear structural failures, such as tool errors, empty responses, or latency over a limit, are a good fit for a set of conditions. Semantic behavior, such as relevance, tone, or whether an answer resolved the request, usually needs an LLM judge. When neither fits, a custom script gives you full control. See Detection methods for the full catalog.
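
To make the first case concrete, here is a minimal sketch of what a set of deterministic conditions amounts to. It is not Latitude's condition syntax: the SessionSummary fields, the 5000 ms limit, and the choice to pass when any one condition holds are assumptions made for this example.

```typescript
// Hypothetical session fields, for illustration only.
interface SessionSummary {
  toolErrors: number;
  responseText: string;
  latencyMs: number;
}

type Condition = { label: string; test: (s: SessionSummary) => boolean };

const LATENCY_LIMIT_MS = 5000; // example limit, not a product default

const structuralFailures: Condition[] = [
  { label: "tool error", test: (s) => s.toolErrors > 0 },
  { label: "empty response", test: (s) => s.responseText.trim().length === 0 },
  { label: "latency over limit", test: (s) => s.latencyMs > LATENCY_LIMIT_MS },
];

// Passes when at least one condition holds, meaning the behavior is present.
function detectStructuralFailure(s: SessionSummary): { passed: boolean; feedback: string } {
  const hits = structuralFailures.filter((c) => c.test(s)).map((c) => c.label);
  return {
    passed: hits.length > 0,
    feedback: hits.length > 0 ? "Matched: " + hits.join(", ") : "No structural failure found",
  };
}

// A blank, slow response matches two conditions; a normal one matches none.
const slowAndEmpty = detectStructuralFailure({ toolErrors: 0, responseText: "  ", latencyMs: 7200 });
const healthy = detectStructuralFailure({ toolErrors: 0, responseText: "Done.", latencyMs: 800 });
```

Checks like these need no model call, which is why they are cheap and immediate. Judging relevance or tone has no equivalent one-line predicate, which is where an LLM judge comes in.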

Evaluation lifecycle

Active: scoring matching sessions in real time.
Paused: sampling set to 0, configuration preserved.
Archived: read-only and no longer scoring new sessions.
Deleted: removed from management views, while historical results stay in analytics.
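
The four states can be read as a small decision rule. The sketch below is one way to model it and is not taken from Latitude's code: the EvaluationStatus fields and the precedence of deleted over archived over paused are assumptions made for this example.

```typescript
// Hypothetical model of the lifecycle states, for illustration only.
type LifecycleState = "active" | "paused" | "archived" | "deleted";

interface EvaluationStatus {
  sampling: number;  // assumed 0 to 1; 0 means paused
  archived: boolean;
  deleted: boolean;
}

// Assumed precedence: deleted, then archived, then the sampling rate.
function lifecycleState(e: EvaluationStatus): LifecycleState {
  if (e.deleted) return "deleted";
  if (e.archived) return "archived";
  return e.sampling === 0 ? "paused" : "active";
}

// Only an active evaluation scores new sessions.
function scoresNewSessions(e: EvaluationStatus): boolean {
  return lifecycleState(e) === "active";
}

const pausedEvaluation: EvaluationStatus = { sampling: 0, archived: false, deleted: false };
const pausedState = lifecycleState(pausedEvaluation);
```

Pausing changes only the sampling rate, so restoring a non-zero rate resumes scoring with the same configuration.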

Next steps

Detection methods: the three ways an evaluation decides
Custom scripts: the scripting reference
Triggers: scope and sampling
Alignment: how evaluations stay calibrated to human judgment
Signals: how evaluation matches become tracked signals
