The benchmark harness runs agent.NewAgent execution loops under configurable load, with a mock LLM and mock tools by default, so you can measure latency, memory, CPU, token counts, and success rate without external API keys.
Use it to stress-test orchestration behavior (multi-turn runs, tool batches, sub-agents, local vs Temporal runtime) before pointing the same harness at real LLMs and tools.
Prerequisites
Clone the SDK repository and run all commands from the repository root.
What it measures
Most agent benchmarks focus on token throughput alone. Production workloads also depend on orchestration scaling: concurrent runs, multi-turn tool loops, sub-agent delegation, durable Temporal workflows, and stable memory use over hundreds of executions. Each benchmark session reports:
| Metric | Description |
|---|---|
| Latency p50 / p95 / p99 / avg | Wall-clock per Run() |
| Heap and total allocation | Memory delta over the session |
| Process CPU time | CPU usage |
| Input / output tokens | From mock LLM stats (includes sub-agent LLM calls) |
| Success rate | Run() completed without error |
| Memory recalls / stores | When memory.enabled: true — from run telemetry |
| est_cost_usd | Placeholder 0 until pricing is configured |
Reports are written to benchmarks/reports/ (JSON or text). Optional SDK logs go to benchmarks/logs/.
Quick start
Run the harness from the repository root. It reads benchmarks/config.yaml by default: 100 sequential runs, local runtime, 3 tools, 2 sub-agents.
Custom config:
How each run works
Each run calls agent.Run() once on a shared root agent. The mock LLM follows a fixed two-turn script:
- Turn 1 — returns tool calls for all registered tools (and sub-agent tools when configured)
- Turn 2 — returns final text after tool results are applied
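The two-turn script can be pictured with a small stand-in. This is a sketch only: `Message`, `Response`, and `scriptedLLM` are hypothetical types invented for illustration, not the SDK's LLM client interface, and it assumes the mock decides which turn it is on by checking whether tool results are already in the history.

```go
package main

import "fmt"

// Message and Response are hypothetical stand-ins, not SDK types.
type Message struct {
	Role    string // "user", "assistant", or "tool"
	Content string
}

type Response struct {
	ToolCalls []string // names of tools the model asks to run
	Text      string   // final answer; empty while tools are pending
}

// scriptedLLM replays a fixed two-turn script over a set of tool names.
type scriptedLLM struct {
	tools []string
}

// Complete returns tool calls for every registered tool until a tool result
// appears in the history, then returns final text.
func (m *scriptedLLM) Complete(history []Message) Response {
	for _, msg := range history {
		if msg.Role == "tool" {
			return Response{Text: "done"} // turn 2
		}
	}
	return Response{ToolCalls: append([]string(nil), m.tools...)} // turn 1
}

func main() {
	llm := &scriptedLLM{tools: []string{"benchmark_tool_1", "benchmark_tool_2"}}
	history := []Message{{Role: "user", Content: "run benchmark"}}

	turn1 := llm.Complete(history)
	fmt.Println("turn 1 tool calls:", turn1.ToolCalls)

	// Apply one tool result per requested call, as the agent loop would.
	for _, name := range turn1.ToolCalls {
		history = append(history, Message{Role: "tool", Content: name + " ok"})
	}
	turn2 := llm.Complete(history)
	fmt.Println("turn 2 text:", turn2.Text)
}
```

Because the script is fixed, every run exercises the same orchestration path, which is what keeps results comparable between sessions.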
| Component | Behavior |
|---|---|
| Mock LLM | Base latency + jitter; fixed token usage per call |
| Mock tools | benchmark_tool_1 … benchmark_tool_N with latency + jitter |
| Sub-agents | Real SDK sub-agents (subagent-1, subagent-1.1, …) with the same mock script |
| Tool execution mode | sequential or parallel via WithAgentToolExecutionMode |
When concurrent: true, runs execute in batches of concurrent_count goroutines.
Example scenarios
Fast local smoke test:
With workers_count set to 1 or more, the harness spawns separate worker processes and enables EnableRemoteWorkers() on the root agent.
Configuration reference
All paths in config are relative to the repository root unless absolute.
runtime
| Value | Description |
|---|---|
| local | In-process SDK runtime; no Temporal required |
| temporal | Durable execution; Temporal server must be running |
agent
| Field | Description |
|---|---|
| runs | Number of Run() calls |
| concurrent | Sequential vs batched parallel runs |
| concurrent_count | Max parallel runs per batch |
| tools.count | Mock tool count |
| tools.execution | sequential or parallel |
| subagents.count | Sub-agents per level (0 to disable) |
| subagents.levels | Max nesting depth (1–5) |
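To get a feel for how subagents.count and subagents.levels interact, the sketch below enumerates sub-agent names for a given pair of values. It assumes every agent above the deepest level gets `count` children, which is one plausible reading of "sub-agents per level" and of the subagent-1, subagent-1.1 naming; the real harness may build the tree differently. The function `subAgentNames` is hypothetical.

```go
package main

import "fmt"

// subAgentNames lists names for a tree in which every agent down to
// `levels` deep has `count` children. The fan-out rule is an assumption.
func subAgentNames(count, levels int) []string {
	var names []string
	var walk func(prefix string, depth int)
	walk = func(prefix string, depth int) {
		if depth > levels {
			return
		}
		for i := 1; i <= count; i++ {
			name := fmt.Sprintf("%s.%d", prefix, i)
			if depth == 1 {
				name = fmt.Sprintf("subagent-%d", i) // top level has no dotted prefix
			}
			names = append(names, name)
			walk(name, depth+1)
		}
	}
	walk("", 1)
	return names
}

func main() {
	for _, name := range subAgentNames(2, 2) {
		fmt.Println(name)
	}
}
```

Under this assumption the tree grows geometrically with depth, so raising subagents.levels increases per-run work much faster than raising tools.count.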
llm / tool
| Field | Description |
|---|---|
| latency_ms | Base delay per call |
| jitter_ms | Random extra delay [0, jitter_ms] |
| mock_tokens | Total tokens reported per LLM call (~60% input / ~40% output) |
memory
In-process inmem backend — no Docker required. Disabled by default.
| Field | Description |
|---|---|
| enabled | Wire WithMemory: recall before run, store after |
| store_mode | ondemand or always |
| user_id | Scope user ID via memory.WithContextUserID |
output
| Field | Description |
|---|---|
| console | Print report to stdout |
| file | Write timestamped report file |
| dir | Report directory (default benchmarks/reports) |
| format | json or text |
logger
| Field | Description |
|---|---|
| enabled | Write JSON SDK logs to files |
| dir | Log directory (default benchmarks/logs) |
| level | debug, info, warn, or error |
Sample output
A default run (100 sequential runs, 3 tools, 2 sub-agents, local runtime) produces a text report like:
JSON output (output.format: json) emits the same data as a structured object, suitable for CI assertion scripts and trend dashboards.
Good results: success_rate: 100%, p99 latency within your SLO, heap stable across the run (not growing linearly with run count).
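A CI gate on the first two criteria might look like the sketch below. The JSON field names are hypothetical (only success_rate is mentioned in this page, and its exact encoding in the report is not shown), so adjust the struct tags to match a real report. The heap-stability criterion is left out because it needs samples from more than one point in the session.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report mirrors two fields of a JSON report. The field names and units
// are assumptions; check them against an actual report file.
type report struct {
	SuccessRate  float64 `json:"success_rate"`   // percent, 0 to 100
	P99LatencyMs float64 `json:"p99_latency_ms"` // milliseconds
}

// check parses a report and returns one message per violated criterion.
func check(raw []byte, sloMs float64) ([]string, error) {
	var r report
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, fmt.Errorf("parse report: %w", err)
	}
	var problems []string
	if r.SuccessRate < 100 {
		problems = append(problems, fmt.Sprintf("success rate %.1f%% is below 100%%", r.SuccessRate))
	}
	if r.P99LatencyMs > sloMs {
		problems = append(problems, fmt.Sprintf("p99 %.0fms exceeds SLO %.0fms", r.P99LatencyMs, sloMs))
	}
	return problems, nil
}

func main() {
	raw := []byte(`{"success_rate": 98.5, "p99_latency_ms": 250}`)
	problems, err := check(raw, 200)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range problems {
		fmt.Println("FAIL:", p)
	}
}
```

In a pipeline, a non-empty problem list would typically translate into a non-zero exit code so the build fails.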
Real LLM and tools
LLM and tool calls are mocked by default for reproducible, zero-cost runs. To benchmark with real providers, replace the mock client and tool registry in benchmarks/setup/; the harness structure (runs, concurrency, reporting, Temporal workers) stays the same.
Latency, token counts, and cost will follow your provider; configure pricing separately for est_cost_usd.
Related
- Eval Harness: Behavioral regression without live LLM
- Telemetry: Per-run metrics in benchmark telemetry
- Temporal Runtime: Temporal benchmark configuration
- Multiple Agents: Sub-agent trees in benchmarks