Prerequisites
Clone the SDK repository and run all commands from the repository root (agent-sdk-go/), not from examples/.
For runtime: temporal, a Temporal server must be reachable at localhost:7233.
What it verifies
Each run exercises the real SDK agent loop (agent.NewAgent, Run) and outputs:
| Field | Use |
|---|---|
| content | Final assistant response text |
| llm_usage | Token counts from the mock LLM |
| telemetry | Run lifecycle, tool breakdown, storage counts (see Telemetry) |
Assertions typically target telemetry.run.finish_reason, telemetry.tools.breakdown, telemetry.tools.failed_calls, and llm_usage.total_tokens.
Run the harness
Run the harness from the repository root.
In CI, the harness runs in the eval-harness job in .github/workflows/ci.yml.
CLI flags
| Flag | Default | Description |
|---|---|---|
| -config | eval-harness/runner/config.yaml | Path to config file |
| -prompt | from config | Override user_prompt |
| -runtime | from config | local or temporal |
| -tools | from config | Override agent.tool_count |
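The flag table above can be sketched with Go's standard flag package. This is an illustration of the documented flags and defaults, not the runner's actual source: the Options struct, the parseOptions function, and the convention that an empty string or -1 means "take the value from config" are all assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Options holds the four documented runner flags. The struct and its field
// names are hypothetical; only the flag names and defaults come from the table.
type Options struct {
	ConfigPath string // -config
	Prompt     string // -prompt; empty means "use user_prompt from config"
	Runtime    string // -runtime; empty means "use runtime from config"
	Tools      int    // -tools; -1 means "use agent.tool_count from config"
}

// parseOptions parses args (without the program name) into Options.
func parseOptions(args []string) (Options, error) {
	var o Options
	fs := flag.NewFlagSet("runner", flag.ContinueOnError)
	fs.StringVar(&o.ConfigPath, "config", "eval-harness/runner/config.yaml", "path to config file")
	fs.StringVar(&o.Prompt, "prompt", "", "override user_prompt")
	fs.StringVar(&o.Runtime, "runtime", "", "local or temporal")
	fs.IntVar(&o.Tools, "tools", -1, "override agent.tool_count")
	if err := fs.Parse(args); err != nil {
		return Options{}, err
	}
	// Reject runtimes other than the two documented values.
	if o.Runtime != "" && o.Runtime != "local" && o.Runtime != "temporal" {
		return Options{}, fmt.Errorf("invalid -runtime %q: want local or temporal", o.Runtime)
	}
	return o, nil
}

func main() {
	opts, err := parseOptions(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", opts)
}
```

Using sentinel values for "not set" keeps the override rule simple: a flag replaces the config value only when the user actually passed it.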
config.yaml
Default path: eval-harness/runner/config.yaml
| Field | Default | Description |
|---|---|---|
| runtime | local | local or temporal |
| user_prompt | (none) | User message (required) |
| agent.name | eval-agent | Agent name |
| agent.system_prompt | built-in eval prompt | System instructions |
| agent.tool_count | 3 | Number of mock tools |
| temporal.host | localhost | Temporal host when runtime: temporal |
| temporal.port | 7233 | Temporal port |
| temporal.namespace | default | Temporal namespace |
| temporal.task_queue | eval-harness | Task queue |
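The defaults in the table can be expressed as a small Go sketch. The type names, withDefaults, and validate are hypothetical and do not come from the runner; only the field names and default values are taken from the table. YAML loading is left out so the example needs nothing beyond the standard library.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical types mirroring the documented config fields.
type AgentConfig struct {
	Name         string // agent.name
	SystemPrompt string // agent.system_prompt; empty means the built-in eval prompt
	ToolCount    int    // agent.tool_count
}

type TemporalConfig struct {
	Host      string // temporal.host
	Port      int    // temporal.port
	Namespace string // temporal.namespace
	TaskQueue string // temporal.task_queue
}

type Config struct {
	Runtime    string // runtime
	UserPrompt string // user_prompt (required)
	Agent      AgentConfig
	Temporal   TemporalConfig
}

// withDefaults returns a copy of c with unset fields replaced by the
// documented defaults. Zero values count as "unset", so this sketch cannot
// express an explicit tool count of 0.
func withDefaults(c Config) Config {
	if c.Runtime == "" {
		c.Runtime = "local"
	}
	if c.Agent.Name == "" {
		c.Agent.Name = "eval-agent"
	}
	if c.Agent.ToolCount == 0 {
		c.Agent.ToolCount = 3
	}
	if c.Temporal.Host == "" {
		c.Temporal.Host = "localhost"
	}
	if c.Temporal.Port == 0 {
		c.Temporal.Port = 7233
	}
	if c.Temporal.Namespace == "" {
		c.Temporal.Namespace = "default"
	}
	if c.Temporal.TaskQueue == "" {
		c.Temporal.TaskQueue = "eval-harness"
	}
	return c
}

// validate enforces the one required field and the two allowed runtimes.
func validate(c Config) error {
	if c.UserPrompt == "" {
		return errors.New("user_prompt is required")
	}
	if c.Runtime != "local" && c.Runtime != "temporal" {
		return fmt.Errorf("runtime %q: want local or temporal", c.Runtime)
	}
	return nil
}

func main() {
	c := withDefaults(Config{UserPrompt: "example prompt"})
	fmt.Printf("%+v valid=%v\n", c, validate(c) == nil)
}
```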
Memory scenarios
Enable memory tests in config or via helper scripts:
| Field | Default | Description |
|---|---|---|
| memory.enabled | false | Enable memory tests |
| memory.store_mode | ondemand | ondemand or always |
| memory.scenario | store_recall | Two-run store then recall |
When memory.scenario: store_recall is active, the output includes a memory_scenario object with separate store and recall results.
Output format
Stdout is always JSON.
Sample output
A successful run prints structured JSON to stdout. Useful assertions on that output: finish_reason === "complete", failed_calls === 0, all mock tools called exactly once, and token counts matching the mock config.
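A minimal Go sketch of those checks follows. It decodes only the fields named in this document. The Result type and the check function are hypothetical, and two things are assumptions: that telemetry.tools.breakdown maps each tool name to a call count, and that a positive total_tokens is an adequate stand-in for "matches the mock config" (the mock's exact token values are not given here).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result mirrors only the output fields mentioned in this document.
type Result struct {
	Content  string `json:"content"`
	LLMUsage struct {
		TotalTokens int `json:"total_tokens"`
	} `json:"llm_usage"`
	Telemetry struct {
		Run struct {
			FinishReason string `json:"finish_reason"`
		} `json:"run"`
		Tools struct {
			// Assumption: tool name -> number of calls.
			Breakdown   map[string]int `json:"breakdown"`
			FailedCalls int            `json:"failed_calls"`
		} `json:"tools"`
	} `json:"telemetry"`
}

// check returns one message per failed assertion; an empty slice means pass.
func check(stdout []byte, wantTools int) []string {
	var r Result
	if err := json.Unmarshal(stdout, &r); err != nil {
		return []string{"stdout is not the expected JSON: " + err.Error()}
	}
	var failures []string
	if r.Telemetry.Run.FinishReason != "complete" {
		failures = append(failures, fmt.Sprintf("finish_reason = %q", r.Telemetry.Run.FinishReason))
	}
	if r.Telemetry.Tools.FailedCalls != 0 {
		failures = append(failures, fmt.Sprintf("failed_calls = %d", r.Telemetry.Tools.FailedCalls))
	}
	if len(r.Telemetry.Tools.Breakdown) != wantTools {
		failures = append(failures, fmt.Sprintf("%d tools called, want %d", len(r.Telemetry.Tools.Breakdown), wantTools))
	}
	for name, calls := range r.Telemetry.Tools.Breakdown {
		if calls != 1 {
			failures = append(failures, fmt.Sprintf("tool %s called %d times", name, calls))
		}
	}
	if r.LLMUsage.TotalTokens <= 0 {
		failures = append(failures, "total_tokens not reported")
	}
	return failures
}

func main() {
	// Hand-written sample with made-up values and tool names, not real harness output.
	sample := []byte(`{"content":"done","llm_usage":{"total_tokens":120},
	  "telemetry":{"run":{"finish_reason":"complete"},
	  "tools":{"breakdown":{"tool_a":1,"tool_b":1,"tool_c":1},"failed_calls":0}}}`)
	fmt.Println("failures:", len(check(sample, 3)))
}
```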
Run against a real LLM provider
The default harness uses a mock LLM for reproducible, zero-cost runs. To validate against a real provider:
- Copy and adapt the runner setup, swapping the mock client for a real one.
- Set your API key and run the harness.
- Assert on content text quality with your scoring framework (PromptFoo / DeepEval) alongside the behavioral telemetry assertions.
PromptFoo integration
Config: eval-harness/promptfoo/config.yaml
PromptFoo runs the harness as an exec provider. Each test invokes the runner once, parses JSON stdout, and asserts with JavaScript.
PromptFoo is run with npx.
| Piece | Role |
|---|---|
| Provider | exec:../run_agent.sh — wrapper in eval-harness/ |
| Output | Runner JSON on stdout; assertions use JSON.parse(output) |
| Agent settings | From eval-harness/runner/config.yaml |
The tests assert finish_reason === "complete", zero failed tool calls, and that token usage is reported.
DeepEval integration
Python tests in eval-harness/deepeval/:
- harness.run_agent() calls eval-harness/run_agent.sh and parses JSON
- Tests assert on content, llm_usage, and telemetry
- ToolCorrectnessMetric uses telemetry.tools.breakdown keys as tools_called
Build your own evals
The harness pattern applies to any framework:
- Run the agent with deterministic mocks (or a fixed LLM in staging)
- Capture AgentRunResult fields: Content, LLMUsage, Telemetry
- Assert on behavioral signals, not just text match
To evaluate a real model, replace the mock setup in eval-harness/runner/setup/ with a real provider and keep the same output envelope.
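The run, capture, assert pattern can be sketched in Go without tying it to a particular scoring framework. Every name here (Runner, execRunner, Expect, lookup, evaluate) is hypothetical. The sketch reads the JSON envelope from stdout instead of using the SDK's AgentRunResult type, and how the wrapper script expects its arguments is not covered in this document, so main uses a canned runner.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
	"strings"
)

// Runner produces the agent's JSON stdout for one run.
type Runner func() ([]byte, error)

// execRunner wraps an external command, for example the harness wrapper
// script. The command and arguments to pass are an assumption left to the caller.
func execRunner(name string, args ...string) Runner {
	return func() ([]byte, error) { return exec.Command(name, args...).Output() }
}

// Expect is one behavioral assertion: the value at Path must equal Want.
// Want must be a scalar; JSON numbers decode as float64.
type Expect struct {
	Path string
	Want any
}

// lookup walks a dotted path such as "telemetry.run.finish_reason".
func lookup(doc map[string]any, path string) (any, bool) {
	var cur any = doc
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		if cur, ok = m[key]; !ok {
			return nil, false
		}
	}
	return cur, true
}

// evaluate runs the agent once, parses stdout, and returns failed expectations.
func evaluate(run Runner, expects []Expect) ([]string, error) {
	out, err := run()
	if err != nil {
		return nil, fmt.Errorf("run agent: %w", err)
	}
	var doc map[string]any
	if err := json.Unmarshal(out, &doc); err != nil {
		return nil, fmt.Errorf("parse stdout: %w", err)
	}
	var failures []string
	for _, e := range expects {
		got, ok := lookup(doc, e.Path)
		switch {
		case !ok:
			failures = append(failures, e.Path+": missing")
		case got != e.Want:
			failures = append(failures, fmt.Sprintf("%s: got %v, want %v", e.Path, got, e.Want))
		}
	}
	return failures, nil
}

func main() {
	// Canned output stands in for a real run; swap in execRunner(...) to call a script.
	canned := func() ([]byte, error) {
		return []byte(`{"telemetry":{"run":{"finish_reason":"complete"},"tools":{"failed_calls":0}}}`), nil
	}
	failures, err := evaluate(canned, []Expect{
		{Path: "telemetry.run.finish_reason", Want: "complete"},
		{Path: "telemetry.tools.failed_calls", Want: float64(0)},
	})
	fmt.Println(failures, err)
}
```

Injecting the Runner keeps the assertions identical whether the output comes from the mock harness or from a setup that uses a real provider.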
Related
- Telemetry: Fields available for assertions
- Benchmarks: Load and concurrency testing
- Tools: Tool registration and execution
- Readiness Checklist: Production deployment guidelines