The benchmarks directory contains a standalone performance utility for Agent SDK for Go. It runs real agent.NewAgent execution loops under configurable load — mock LLM and tools by default — so you can measure latency, memory, CPU, token counts, and success rate without external API keys. Use it to stress-test orchestration behavior (multi-turn runs, tool batches, sub-agents, local vs Temporal runtime) before pointing the same harness at real LLMs and tools.

Prerequisites

Clone the SDK repository and run from the repository root:
git clone https://github.com/agenticenv/agent-sdk-go.git
cd agent-sdk-go
go run ./benchmarks/
Requires Go 1.26+. Default mock runs need no LLM API key. Temporal scenarios need a running Temporal server.

What it measures

Most agent benchmarks focus on token throughput alone. Production workloads also depend on orchestration scaling: concurrent runs, multi-turn tool loops, sub-agent delegation, durable Temporal workflows, and stable memory use over hundreds of executions. Each benchmark session reports:
| Metric | Description |
| --- | --- |
| Latency p50 / p95 / p99 / avg | Wall-clock time per Run() |
| Heap and total allocation | Memory delta over the session |
| Process CPU time | CPU usage |
| Input / output tokens | From mock LLM stats (includes sub-agent LLM calls) |
| Success rate | Run() completed without error |
| Memory recalls / stores | From run telemetry, when memory.enabled: true |
| est_cost_usd | Placeholder 0 until pricing is configured |
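To make the latency row concrete, here is a minimal sketch of how p50 / p95 / p99 / avg can be derived from per-run wall-clock samples. It uses the nearest-rank method, which is an assumption; the harness may interpolate differently. The names `percentile`, `summarize`, and `demoSamples` are hypothetical and not part of the SDK.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile (0 < p <= 100).
// It assumes sorted is non-empty and in ascending order.
func percentile(sorted []time.Duration, p float64) time.Duration {
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// summarize sorts a copy of the samples and returns p50, p95, p99 and the mean.
func summarize(samples []time.Duration) (p50, p95, p99, avg time.Duration) {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	var total time.Duration
	for _, d := range sorted {
		total += d
	}
	avg = total / time.Duration(len(sorted))
	return percentile(sorted, 50), percentile(sorted, 95), percentile(sorted, 99), avg
}

// demoSamples returns hypothetical per-run latencies: 1ms, 2ms, ..., 100ms.
func demoSamples() []time.Duration {
	samples := make([]time.Duration, 0, 100)
	for i := 1; i <= 100; i++ {
		samples = append(samples, time.Duration(i)*time.Millisecond)
	}
	return samples
}

func main() {
	p50, p95, p99, avg := summarize(demoSamples())
	fmt.Println("p50:", p50, "p95:", p95, "p99:", p99, "avg:", avg)
}
```

With only 100 runs, p99 is effectively the second-slowest sample, so a single outlier can move it noticeably; larger run counts give steadier tail numbers.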
Reports are written to benchmarks/reports/ (JSON or text). Optional SDK logs go to benchmarks/logs/.
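The heap and total-allocation figures can be approximated with the standard library alone. The sketch below assumes the deltas come from `runtime.ReadMemStats` snapshots taken before and after the workload; that is a guess about the approach, not a description of the harness internals. `measureAlloc` and `allocDemo` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps allocations reachable so they count toward the live-heap delta.
var sink [][]byte

// measureAlloc runs fn and returns the change in live heap bytes and the
// total bytes allocated while it ran.
func measureAlloc(fn func()) (heapDelta int64, totalAlloc uint64) {
	var before, after runtime.MemStats
	runtime.GC() // settle the heap so the baseline is less noisy
	runtime.ReadMemStats(&before)
	fn()
	runtime.ReadMemStats(&after)
	heapDelta = int64(after.HeapAlloc) - int64(before.HeapAlloc)
	totalAlloc = after.TotalAlloc - before.TotalAlloc
	return heapDelta, totalAlloc
}

// allocDemo is a stand-in workload: it allocates n buffers of 1 MiB each.
func allocDemo(n int) func() {
	return func() {
		for i := 0; i < n; i++ {
			sink = append(sink, make([]byte, 1<<20))
		}
	}
}

func main() {
	heap, total := measureAlloc(allocDemo(4))
	fmt.Printf("heap delta: %d bytes, total allocated: %d bytes\n", heap, total)
}
```

The heap delta is signed because a garbage collection during the workload can leave the live heap smaller than the baseline, while total allocation only ever grows.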

Quick start

From the repository root:
go run ./benchmarks/
This uses benchmarks/config.yaml by default: 100 sequential runs, local runtime, 3 tools, 2 sub-agents. To use a different config, pass -config:
go run ./benchmarks/ -config benchmarks/config.yaml
go run ./benchmarks/ -config /path/to/my-benchmark.yaml

How each run works

Each run calls agent.Run() once on a shared root agent. The mock LLM follows a fixed two-turn script:
  1. Turn 1 — returns tool calls for all registered tools (and sub-agent tools when configured)
  2. Turn 2 — returns final text after tool results are applied
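The two-turn script above can be pictured as a tiny state machine. This is an illustrative sketch only: `toolCall`, `response`, and `scriptedLLM` are hypothetical stand-ins, not the SDK's LLM client interface or the harness's actual mock.

```go
package main

import "fmt"

// toolCall and response are hypothetical stand-ins, not SDK types.
type toolCall struct{ Name string }

type response struct {
	ToolCalls []toolCall // populated on turn 1
	Text      string     // populated on turn 2
}

// scriptedLLM replays a fixed two-turn script for a set of tool names.
type scriptedLLM struct{ tools []string }

// Next returns turn 1 (one call per registered tool) while no tool results
// are present, and turn 2 (final text) once they are.
func (m scriptedLLM) Next(haveToolResults bool) response {
	if !haveToolResults {
		calls := make([]toolCall, 0, len(m.tools))
		for _, name := range m.tools {
			calls = append(calls, toolCall{Name: name})
		}
		return response{ToolCalls: calls}
	}
	return response{Text: "final answer"}
}

func main() {
	// Sub-agents appear to the root agent as additional tools in this sketch.
	llm := scriptedLLM{tools: []string{"benchmark_tool_1", "benchmark_tool_2", "subagent-1"}}

	first := llm.Next(false)
	fmt.Println("turn 1 tool calls:", len(first.ToolCalls))

	second := llm.Next(true)
	fmt.Println("turn 2 text:", second.Text)
}
```

Because the script is fixed, every run issues the same number of LLM and tool calls, which is what makes latency and token totals comparable between sessions.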
Mock components apply configurable latency and jitter:
| Component | Behavior |
| --- | --- |
| Mock LLM | Base latency + jitter; fixed token usage per call |
| Mock tools | benchmark_tool_1 … benchmark_tool_N with latency + jitter |
| Sub-agents | Real SDK sub-agents (subagent-1, subagent-1.1, …) with the same mock script |
| Tool execution mode | sequential or parallel via WithAgentToolExecutionMode |
Concurrency: one root agent is reused. When concurrent: true, runs execute in batches of concurrent_count goroutines.
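The batching behavior described above can be sketched with a `sync.WaitGroup` per batch. This is a standard-library illustration under the assumption that each batch finishes before the next one starts; `runInBatches` and `errDemo` are hypothetical names, and the `run` callback stands in for a call to the shared agent.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errDemo is a placeholder failure used by the example.
var errDemo = errors.New("demo failure")

// runInBatches calls run(i) for i in [0, runs), at most batchSize at a time,
// and waits for each batch to finish before starting the next.
// It returns how many runs completed without error.
func runInBatches(runs, batchSize int, run func(i int) error) int64 {
	if batchSize < 1 {
		batchSize = 1
	}
	var succeeded int64
	for start := 0; start < runs; start += batchSize {
		end := start + batchSize
		if end > runs {
			end = runs
		}
		var wg sync.WaitGroup
		for i := start; i < end; i++ {
			wg.Add(1)
			go func(i int) {
				defer wg.Done()
				if err := run(i); err == nil {
					atomic.AddInt64(&succeeded, 1)
				}
			}(i)
		}
		wg.Wait() // batch barrier: the slowest run gates the next batch
	}
	return succeeded
}

func main() {
	// 100 runs in batches of 10; every tenth run fails in this demo.
	ok := runInBatches(100, 10, func(i int) error {
		if i%10 == 0 {
			return errDemo
		}
		return nil
	})
	fmt.Println("succeeded:", ok)
}
```

With a barrier between batches, one slow run delays the whole next batch, so batch-mode throughput tends to track the tail latency of each batch rather than the average.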

Example scenarios

Fast local smoke test:
runtime: local
llm:
  latency_ms: 5
  jitter_ms: 0
tool:
  latency_ms: 2
  jitter_ms: 0
agent:
  runs: 10
  concurrent: false
  tools:
    count: 2
    execution: parallel
  subagents:
    count: 0
    levels: 0
Concurrent batches:
agent:
  runs: 100
  concurrent: true
  concurrent_count: 10
Temporal runtime — requires a running Temporal server:
runtime: temporal
temporal:
  host: localhost
  port: 7233
  namespace: default
  task_queue: agent-sdk-go
  workers_count: 0   # embedded worker in agent process
With external root workers (workers_count of 1 or more), the harness spawns separate worker processes and enables EnableRemoteWorkers() on the root agent:
runtime: temporal
temporal:
  workers_count: 2
To start a worker manually:
go run ./benchmarks/worker -config benchmarks/config.yaml -worker-id 1

Configuration reference

All paths in config are relative to the repository root unless absolute.

runtime

| Value | Description |
| --- | --- |
| local | In-process SDK runtime; no Temporal required |
| temporal | Durable execution; Temporal server must be running |

agent

| Field | Description |
| --- | --- |
| runs | Number of Run() calls |
| concurrent | Sequential vs batched parallel runs |
| concurrent_count | Max parallel runs per batch |
| tools.count | Mock tool count |
| tools.execution | sequential or parallel |
| subagents.count | Sub-agents per level (0 to disable) |
| subagents.levels | Max nesting depth (1–5) |

llm / tool

| Field | Description |
| --- | --- |
| latency_ms | Base delay per call |
| jitter_ms | Random extra delay in [0, jitter_ms] |
| mock_tokens | Total tokens reported per LLM call (~60% input / ~40% output) |
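As a rough illustration of these fields, the sketch below computes a mock delay as the base latency plus a uniform random extra in [0, jitter_ms], and splits mock_tokens 60/40. The uniform distribution and the integer rounding are assumptions; `mockDelay`, `splitTokens`, and `newDemoRNG` are hypothetical helpers, not harness functions.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newDemoRNG returns a seeded generator so the example is repeatable.
func newDemoRNG() *rand.Rand {
	return rand.New(rand.NewSource(1))
}

// mockDelay returns the base latency plus a random extra in [0, jitterMs].
func mockDelay(latencyMs, jitterMs int, rng *rand.Rand) time.Duration {
	extra := 0
	if jitterMs > 0 {
		// Intn(n) yields [0, n), so n = jitterMs+1 makes jitterMs reachable.
		extra = rng.Intn(jitterMs + 1)
	}
	return time.Duration(latencyMs+extra) * time.Millisecond
}

// splitTokens divides mockTokens into roughly 60% input and 40% output.
// Any rounding remainder is assigned to output (an assumption).
func splitTokens(mockTokens int) (input, output int) {
	input = mockTokens * 60 / 100
	return input, mockTokens - input
}

func main() {
	rng := newDemoRNG()
	fmt.Println("llm delay:", mockDelay(5, 3, rng))

	in, out := splitTokens(1000)
	fmt.Println("input tokens:", in, "output tokens:", out)
}
```

Setting jitter_ms to 0 makes every call take exactly latency_ms in this model, which is useful when you want run-to-run differences to reflect orchestration overhead only.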

memory

In-process inmem backend — no Docker required. Disabled by default.
| Field | Description |
| --- | --- |
| enabled | Wire WithMemory: recall before run, store after |
| store_mode | ondemand or always |
| user_id | Scope user ID via memory.WithContextUserID |

output

| Field | Description |
| --- | --- |
| console | Print report to stdout |
| file | Write timestamped report file |
| dir | Report directory (default benchmarks/reports) |
| format | json or text |

logger

| Field | Description |
| --- | --- |
| enabled | Write JSON SDK logs to files |
| dir | Log directory (default benchmarks/logs) |
| level | debug, info, warn, or error |

Sample output

A default run (100 sequential runs, 3 tools, 2 sub-agents, local runtime) produces a text report like:
=== Benchmark Report ===
Config:        benchmarks/config.yaml
Runtime:       local
Runs:          100 (sequential)
Tools:         3 (parallel)
Sub-agents:    2 x 1 level

--- Latency (ms) ---
  p50:   42.1
  p95:   68.4
  p99:   91.2
  avg:   44.7
  min:    8.3
  max:   103.6

--- Throughput ---
  runs/sec:  22.4

--- Resources ---
  heap_alloc_mb:  18.2
  total_alloc_mb: 412.1
  cpu_time_sec:    4.8

--- LLM Usage ---
  input_tokens:   60000
  output_tokens:  40000
  total_tokens:  100000

--- Results ---
  success_rate:  100.0%
  failed:        0
  est_cost_usd:  0.00
With output.format: json, the report carries the same data as a structured object, suitable for CI assertion scripts and trend dashboards.
A healthy run shows success_rate at 100%, p99 latency within your SLO, and a heap that stays stable across the run rather than growing linearly with run count.
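A CI assertion over the JSON report might look like the sketch below. The JSON field names (`success_rate`, `p99_ms`) are hypothetical: the actual report schema is not shown here, so check a real report file and adjust the struct tags before relying on this. `checkReport` is likewise an illustrative helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// report mirrors a HYPOTHETICAL JSON layout; real field names may differ.
type report struct {
	SuccessRate float64 `json:"success_rate"` // percent, 0-100 (assumed)
	P99Ms       float64 `json:"p99_ms"`       // milliseconds (assumed)
}

// checkReport returns an error when the report misses either target.
func checkReport(data []byte, p99SLOMs float64) error {
	var r report
	if err := json.Unmarshal(data, &r); err != nil {
		return fmt.Errorf("parse report: %w", err)
	}
	if r.SuccessRate < 100 {
		return fmt.Errorf("success rate %.1f%% is below 100%%", r.SuccessRate)
	}
	if r.P99Ms > p99SLOMs {
		return fmt.Errorf("p99 %.1f ms exceeds SLO of %.1f ms", r.P99Ms, p99SLOMs)
	}
	return nil
}

func main() {
	// Inline sample standing in for a file read from benchmarks/reports/.
	sample := []byte(`{"success_rate": 100.0, "p99_ms": 91.2}`)
	if err := checkReport(sample, 150); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("OK")
}
```

In a pipeline, the same check would read the newest report file and exit non-zero on failure so the build stops when latency or success rate regresses.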

Real LLM and tools

LLM and tool calls are mocked by default for reproducible, zero-cost runs. To benchmark with real providers, replace the mock client and tool registry in benchmarks/setup/ — the harness structure (runs, concurrency, reporting, Temporal workers) stays the same. Latency, token counts, and cost will follow your provider; configure pricing separately for est_cost_usd.

Related pages

Eval Harness: behavioral regression without a live LLM
Telemetry: per-run metrics in benchmark telemetry
Temporal Runtime: Temporal benchmark configuration
Multiple Agents: sub-agent trees in benchmarks