The eval harness runs a single agent execution with a mock LLM and mock tools, then prints structured JSON to stdout. Use it to catch breaking changes in CI and as a reference for wiring your own agents into eval tools like PromptFoo and DeepEval. The default tests do not require an LLM API key.

Prerequisites

Clone the SDK repository and run commands from the repository root (agent-sdk-go/), not from examples/:
git clone https://github.com/agenticenv/agent-sdk-go.git
cd agent-sdk-go
go run ./eval-harness/runner
Requires Go 1.26+ (same as the SDK). Temporal mode needs a running Temporal server on localhost:7233.

What it verifies

Each run exercises the real SDK agent loop (agent.NewAgent, Run) and outputs:
| Field | Use |
| --- | --- |
| content | Final assistant response text |
| llm_usage | Token counts from the mock LLM |
| telemetry | Run lifecycle, tool breakdown, storage counts (see Telemetry) |
Assertions typically check telemetry.run.finish_reason, telemetry.tools.breakdown, telemetry.tools.failed_calls, and llm_usage.total_tokens.
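The dotted field paths above can be resolved from the parsed JSON with a small helper. This is a minimal sketch: `get_path` is a hypothetical helper, not part of the repo, and the sample values are copied from the sample output shown later on this page.

```python
from typing import Any


def get_path(result: dict, dotted: str) -> Any:
    """Resolve a dotted path such as 'telemetry.run.finish_reason'."""
    node: Any = result
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"missing field: {dotted}")
        node = node[key]
    return node


# Values mirror the documented sample output; a real run supplies its own.
result = {
    "llm_usage": {"total_tokens": 1000},
    "telemetry": {
        "run": {"finish_reason": "complete"},
        "tools": {"failed_calls": 0, "breakdown": {"mock_tool_1": 1}},
    },
}

assert get_path(result, "telemetry.run.finish_reason") == "complete"
assert get_path(result, "telemetry.tools.failed_calls") == 0
assert get_path(result, "llm_usage.total_tokens") > 0
```

Raising `KeyError` on a missing field makes a schema change fail loudly instead of silently passing an assertion against `None`.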

Run the harness

From the repository root:
go run ./eval-harness/runner
go run ./eval-harness/runner -prompt "custom prompt"
go run ./eval-harness/runner -runtime temporal
go run ./eval-harness/runner -tools 2
go run ./eval-harness/runner -config eval-harness/runner/config.yaml
Or use the Makefile shortcut:
make eval-harness
CI runs PromptFoo and DeepEval on pull requests — see the eval-harness job in .github/workflows/ci.yml.

CLI flags

| Flag | Default | Description |
| --- | --- | --- |
| -config | eval-harness/runner/config.yaml | Path to config file |
| -prompt | from config | Override user_prompt |
| -runtime | from config | local or temporal |
| -tools | from config | Override agent.tool_count |
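When the runner is driven from a test script, it can help to build the argument list programmatically rather than concatenating shell strings. The sketch below uses only the four flags documented above; `build_command` is a hypothetical helper name.

```python
from typing import List, Optional

RUNNER = ["go", "run", "./eval-harness/runner"]


def build_command(
    config: Optional[str] = None,
    prompt: Optional[str] = None,
    runtime: Optional[str] = None,
    tools: Optional[int] = None,
) -> List[str]:
    """Build the runner argv; omitted options fall back to config values."""
    if runtime is not None and runtime not in ("local", "temporal"):
        raise ValueError("runtime must be 'local' or 'temporal'")
    cmd = list(RUNNER)
    if config is not None:
        cmd += ["-config", config]
    if prompt is not None:
        cmd += ["-prompt", prompt]
    if runtime is not None:
        cmd += ["-runtime", runtime]
    if tools is not None:
        cmd += ["-tools", str(tools)]
    return cmd
```

Passing the resulting list to `subprocess.run` with the working directory set to the repository root avoids quoting problems with prompts that contain spaces.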

config.yaml

Default path: eval-harness/runner/config.yaml
| Field | Default | Description |
| --- | --- | --- |
| runtime | local | local or temporal |
| user_prompt | | User message (required) |
| agent.name | eval-agent | Agent name |
| agent.system_prompt | built-in eval prompt | System instructions |
| agent.tool_count | 3 | Number of mock tools |
| temporal.host | localhost | Temporal host when runtime: temporal |
| temporal.port | 7233 | Temporal port |
| temporal.namespace | default | Temporal namespace |
| temporal.task_queue | eval-harness | Task queue |
Temporal mode uses an embedded local worker — no separate worker process required.
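The interaction between defaults, the config file, and the CLI flags can be modelled as three layers. The sketch below is a model of that behavior and not the runner's actual code. It assumes that dotted field names such as agent.tool_count correspond to nested keys, and that flags take precedence over the file, which in turn takes precedence over defaults.

```python
import copy

# Defaults as listed in the table above. user_prompt has no default, and the
# built-in system prompt text is not documented here, so both are omitted.
DEFAULTS = {
    "runtime": "local",
    "agent": {"name": "eval-agent", "tool_count": 3},
    "temporal": {
        "host": "localhost",
        "port": 7233,
        "namespace": "default",
        "task_queue": "eval-harness",
    },
}


def merge(base: dict, override: dict) -> dict:
    """Recursively overlay override on base without mutating either."""
    out = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = copy.deepcopy(value)
    return out


def effective_config(file_cfg: dict, prompt=None, runtime=None, tools=None) -> dict:
    """Layer defaults, then the config file, then CLI flag overrides (assumed order)."""
    cfg = merge(DEFAULTS, file_cfg)
    if prompt is not None:
        cfg["user_prompt"] = prompt
    if runtime is not None:
        cfg["runtime"] = runtime
    if tools is not None:
        cfg["agent"]["tool_count"] = tools
    if not cfg.get("user_prompt"):
        raise ValueError("user_prompt is required")
    return cfg
```

For example, `effective_config({"user_prompt": "hi"}, tools=2)` keeps the default agent name and runtime while changing only the tool count.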

Memory scenarios

Enable memory tests in config or via helper scripts:
| Field | Default | Description |
| --- | --- | --- |
| memory.enabled | false | Enable memory tests |
| memory.store_mode | ondemand | ondemand or always |
| memory.scenario | store_recall | Two-run store then recall |
./eval-harness/run_agent_memory.sh ondemand
./eval-harness/run_agent_memory.sh always
When memory.scenario: store_recall is active, output includes a memory_scenario object with separate store and recall results.
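A check on the two-run output might look like the sketch below. The key names `store` and `recall` inside memory_scenario, and the idea that each sub-result carries the same telemetry envelope as a single run, are assumptions; confirm them against real runner output before relying on this.

```python
def check_memory_scenario(output: dict) -> list:
    """Return a list of problems found in a store_recall result; empty means OK."""
    scenario = output.get("memory_scenario")
    if not isinstance(scenario, dict):
        return ["memory_scenario object missing"]
    problems = []
    for phase in ("store", "recall"):  # assumed key names, not confirmed by the docs
        run = scenario.get(phase)
        if not isinstance(run, dict):
            problems.append(f"{phase} result missing")
            continue
        # Assumes each phase reports telemetry.run.finish_reason like a normal run.
        reason = run.get("telemetry", {}).get("run", {}).get("finish_reason")
        if reason != "complete":
            problems.append(f"{phase} finish_reason is {reason!r}")
    return problems
```

Returning a list of problems rather than raising on the first one shows whether the store run, the recall run, or both went wrong.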

Output format

Stdout is always JSON:
{
  "content": "eval complete",
  "llm_usage": {
    "prompt_tokens": 600,
    "completion_tokens": 400,
    "total_tokens": 1000
  },
  "telemetry": {
    "run": { "total_llm_calls": 2, "finish_reason": "complete" },
    "tools": { "total_calls": 3, "failed_calls": 0, "breakdown": { "...": 1 } },
    "storage": { }
  }
}
Parse this in your eval framework and assert on telemetry fields — the same contract used by PromptFoo and DeepEval integrations in the repo.
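Before asserting on individual fields, it is worth validating the envelope itself so that a crash or a stray log line on stdout produces a clear error. A minimal sketch, with `parse_result` as a hypothetical helper name:

```python
import json

REQUIRED_KEYS = ("content", "llm_usage", "telemetry")


def parse_result(stdout: str) -> dict:
    """Parse runner stdout and verify the top-level envelope keys are present."""
    try:
        result = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise ValueError(f"runner stdout is not valid JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise ValueError("runner output must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise ValueError(f"runner output missing keys: {missing}")
    return result
```

The three required keys are the top-level fields documented above; any additional keys pass through untouched.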

Sample output

A successful run prints structured JSON to stdout:
$ go run ./eval-harness/runner
{
  "content": "eval complete",
  "llm_usage": {
    "prompt_tokens": 600,
    "completion_tokens": 400,
    "total_tokens": 1000
  },
  "telemetry": {
    "run": {
      "total_llm_calls": 2,
      "finish_reason": "complete",
      "started_at": "2026-06-27T10:00:00Z",
      "completed_at": "2026-06-27T10:00:00.045Z"
    },
    "tools": {
      "total_calls": 3,
      "failed_calls": 0,
      "breakdown": {
        "mock_tool_1": 1,
        "mock_tool_2": 1,
        "mock_tool_3": 1
      }
    },
    "storage": {}
  }
}
Good results: finish_reason === "complete", failed_calls === 0, all mock tools called exactly once, token counts match mock config.
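These criteria can be collected into one check, sketched below. Because the mock token configuration is not documented on this page, the sketch substitutes a weaker test for "token counts match mock config": it only verifies that prompt and completion tokens sum to the total. `check_good_run` and its `expected_tools` parameter are hypothetical names.

```python
def check_good_run(result: dict, expected_tools: int) -> list:
    """Return failed criteria as strings; an empty list means a good run."""
    failures = []
    run = result["telemetry"]["run"]
    tools = result["telemetry"]["tools"]
    usage = result["llm_usage"]

    if run["finish_reason"] != "complete":
        failures.append("finish_reason is not 'complete'")
    if tools["failed_calls"] != 0:
        failures.append("failed_calls is not 0")
    breakdown = tools["breakdown"]
    if len(breakdown) != expected_tools or any(n != 1 for n in breakdown.values()):
        failures.append("not every mock tool was called exactly once")
    # Internal consistency only; this does not compare against the mock config.
    if usage["prompt_tokens"] + usage["completion_tokens"] != usage["total_tokens"]:
        failures.append("token counts are inconsistent")
    return failures
```

With the sample output above and `expected_tools=3`, the function returns an empty list.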

Run against a real LLM provider

The default harness uses a mock LLM for reproducible, zero-cost runs. To validate against a real provider:
  1. Copy and adapt the runner setup — swap the mock client for a real one:
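// eval-harness/runner/setup/setup.go (copy and modify)
import (
    "os"
    "github.com/agenticenv/agent-sdk-go/pkg/llm" // assumed import path for the llm option helpers
    "github.com/agenticenv/agent-sdk-go/pkg/llm/openai"
)

// Read the key from the environment; never hard-code it.
llmClient, err := openai.NewClient(
    llm.WithAPIKey(os.Getenv("OPENAI_API_KEY")),
    llm.WithModel("gpt-4o"),
)
if err != nil {
    return err // or handle it the way the surrounding setup code does
}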
  2. Set your API key and run:
export OPENAI_API_KEY=sk-your-key
go run ./eval-harness/runner -prompt "What is 7 times 8?"
  3. Assert on content text quality with your scoring framework (PromptFoo / DeepEval) alongside the behavioral telemetry assertions.
To compare providers, run the same prompt against multiple configs and diff the JSON output:
# Run with OpenAI
OPENAI_API_KEY=... go run ./eval-harness/runner > result_openai.json

# Run with Anthropic
ANTHROPIC_API_KEY=... go run ./eval-harness/runner > result_anthropic.json

diff result_openai.json result_anthropic.json
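A plain text diff will flag the started_at and completed_at timestamps on every run, which can bury the differences that matter. One alternative is a structural comparison that skips those fields, sketched below. `flatten` and `diff_results` are hypothetical helpers, and the set of volatile keys is taken from the sample output on this page.

```python
VOLATILE = {"started_at", "completed_at"}  # timestamps change on every run


def flatten(node: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {'a.b.c': value}, skipping volatile keys."""
    flat = {}
    for key, value in node.items():
        if key in VOLATILE:
            continue
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_results(left: dict, right: dict) -> dict:
    """Map each differing path to its (left, right) values."""
    a, b = flatten(left), flatten(right)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }
```

Load the two result files with `json.load` and pass the dictionaries to `diff_results`; a path present on only one side shows up with `None` for the other.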

PromptFoo integration

Config: eval-harness/promptfoo/config.yaml
PromptFoo runs the harness as an exec provider. Each test invokes the runner once, parses JSON stdout, and asserts with JavaScript.
cd eval-harness/promptfoo
npx promptfoo eval -c config.yaml
npx promptfoo view   # web UI for results
Requires Node.js. PromptFoo installs on demand via npx.
| Piece | Role |
| --- | --- |
| Provider | exec:../run_agent.sh (wrapper in eval-harness/) |
| Output | Runner JSON on stdout; assertions use JSON.parse(output) |
| Agent settings | From eval-harness/runner/config.yaml |
Example assertions: all mock tools called once, finish_reason === "complete", zero failed tool calls, token usage reported.

DeepEval integration

Python tests in eval-harness/deepeval/:
cd eval-harness/deepeval
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pytest test_agent.py -v
Requires Python 3.10+ and Go.
  1. harness.run_agent() calls eval-harness/run_agent.sh and parses JSON
  2. Tests assert on content, llm_usage, and telemetry
  3. ToolCorrectnessMetric uses telemetry.tools.breakdown keys as tools_called
Example extraction:
agent_res = run_agent()
tools = list(agent_res["telemetry"]["tools"]["breakdown"].keys())
finish_reason = agent_res["telemetry"]["run"]["finish_reason"]
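For readers wiring up a different test framework, a helper in the spirit of harness.run_agent() can be sketched as follows. This is a stand-in written for illustration; the repo's actual implementation may differ. It takes the command as a parameter, and the `fake` command here substitutes a Python one-liner for the wrapper script so the example runs without Go.

```python
import json
import subprocess
import sys
from typing import Sequence


def run_agent(cmd: Sequence[str], timeout: float = 120.0) -> dict:
    """Run a command that prints the runner JSON to stdout and parse it."""
    proc = subprocess.run(
        list(cmd), capture_output=True, text=True, timeout=timeout
    )
    if proc.returncode != 0:
        raise RuntimeError(f"runner exited {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


# Stand-in command (an assumption for the demo); in practice, pass the path
# to eval-harness/run_agent.sh instead.
fake = [sys.executable, "-c", "print('{\"content\": \"eval complete\"}')"]
print(run_agent(fake)["content"])
```

Surfacing stderr in the exception keeps Go build errors visible in the pytest failure output instead of hiding them behind a JSON decode error.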

Build your own evals

The harness pattern applies to any framework:
  1. Run the agent with deterministic mocks (or a fixed LLM in staging)
  2. Capture AgentRunResult fields — Content, LLMUsage, Telemetry
  3. Assert on behavioral signals, not just text match
For live LLM evals, swap the mock client in eval-harness/runner/setup/ with a real provider and keep the same output envelope.
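The three steps above can be sketched as a small loop over prompts. Everything here is illustrative: `fake_run` is a hypothetical stand-in for an agent run that returns the harness envelope shape, and the `lookup` tool name and check names are made up for the example.

```python
from typing import Callable, Dict, List


def fake_run(prompt: str) -> dict:
    """Hypothetical stand-in for an agent run; returns the harness envelope shape."""
    return {
        "content": f"echo: {prompt}",
        "llm_usage": {"prompt_tokens": 6, "completion_tokens": 4, "total_tokens": 10},
        "telemetry": {
            "run": {"total_llm_calls": 1, "finish_reason": "complete"},
            "tools": {"total_calls": 1, "failed_calls": 0, "breakdown": {"lookup": 1}},
            "storage": {},
        },
    }


# Behavioral checks keyed by name; each takes the envelope and returns a bool.
CHECKS: Dict[str, Callable[[dict], bool]] = {
    "finished": lambda r: r["telemetry"]["run"]["finish_reason"] == "complete",
    "no_tool_failures": lambda r: r["telemetry"]["tools"]["failed_calls"] == 0,
    "used_lookup": lambda r: "lookup" in r["telemetry"]["tools"]["breakdown"],
}


def evaluate(run: Callable[[str], dict], prompts: List[str]) -> Dict[str, List[str]]:
    """Run each prompt and return the names of failed checks per prompt."""
    report = {}
    for prompt in prompts:
        result = run(prompt)
        report[prompt] = [name for name, check in CHECKS.items() if not check(result)]
    return report
```

Because the checks read only telemetry, they keep passing when the response wording changes, which is the point of asserting on behavior rather than on exact text.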
