7 · Temporal Validity, Benchmarks & Invalidation
This cluster is what turns AgentMark from documentation into a living architecture contract. In AI, every claim expires — so the language makes expiry, evidence, and automatic invalidation first-class.
Temporal validity — every claim has a clock. Any time-sensitive assertion carries a validity window.
assert: Claude Code best MCP support
valid:
from: 2026-06-01
review: 2026-06-15
The ecosystem changes weekly; without as_of / review_by / valid, a document silently lies.
Benchmarks (bench) and tests (test) — the killer feature. Claims should expire unless backed by fresh tests. A bench defines candidates, tasks, metrics, and a repeat schedule.
bench CODE-HARNESS-2026-06:
title: Coding harness comparison for Hermes
run_on: 2026-06-02
owner: Joel Tong
candidates: [Claude Code, Codex, GLM, Opus, Qwen Coder]
tasks:
- edit multi-file repo
- run tests
- fix failing test
- connect to MCP server
- choose correct MCP tool
- summarize patch
metrics:
- accepted_patch_rate
- cost_per_accepted_patch
- mcp_tool_success_rate
- latency
- human_interventions
- context_loss_events
schedule:
repeat: weekly
Claims then bind to that evidence and declare their own kill conditions:
claim C-003:
text: GLM is cheapest among credible coding-agent substitutes.
evidence: [bench#CODE-HARNESS-2026-06]
invalid_if:
- cost_per_accepted_patch is no longer lowest
review_by: 2026-06-16
Invalidation logic — the document tells you when it's stale. A decision lists the exact conditions under which it stops being valid.
decision D-001:
title: Use Codex for Hermes
chosen: Codex
supported_by: [C-002, C-003]
constrained_by: [K-001]
invalid_if:
- K-001 is false
- C-002 expires
- C-003 confidence < medium
- MCP-SMOKE success_rate for Codex < 0.8
- GLM cost_per_accepted_patch < Codex by 40%
Now tooling can emit: "This architecture is stale. Decision D-001 depends on claim C-003, which expired on 2026-06-16." That is the difference between a diagram that quietly rots and a contract that announces its own decay.