Taming a Legacy Codebase with Claude: A Field Report on Refactoring, Race Conditions, and Technical Debt

Why This Article Exists

Every engineering team eventually inherits a codebase that has outgrown its original design. Features were shipped, deadlines were met, and somewhere along the way the foundations quietly cracked. Hardcoded secrets found their way into source control. async void crept into timer callbacks. Collections were shared across threads without locks. A comment saying // TODO: fix this properly turned into a permanent resident.

Continue reading →

Reverb: A Semantic Cache That Knows When Its Answers Go Stale

Caching LLM responses seems, at first glance, like a simple optimization: record the prompt, record the answer, serve the answer the next time the same prompt comes in. In practice it is a surprisingly deep problem, and the two standard approaches both fail in characteristic ways. Exact-match caches miss on anything short of a byte-identical prompt, which is almost never how users actually ask questions. TTL-based caches serve confidently stale answers for hours after the underlying knowledge base has changed: the classic hallucination vector dressed up as “we cached it.”

Reverb is a Go library and standalone service that addresses both failure modes. It combines a two-tier cache (exact SHA-256 match, then embedding-cosine similarity) with knowledge-aware invalidation: every cached entry tracks the source documents it was derived from, and a change-data-capture pipeline evicts entries by causality when their sources change. TTLs become a backstop, not the primary correctness mechanism.
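To make the two-tier lookup and causal eviction concrete, here is a minimal Python sketch of the idea (Reverb itself is written in Go). All class and method names, and the toy bag-of-words embedding, are illustrative assumptions, not Reverb's actual API.

```python
import hashlib
import math


class SemanticCache:
    """Sketch of a two-tier cache: exact SHA-256 match first, then
    embedding-cosine similarity, with per-source causal eviction."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: prompt -> vector
        self.threshold = threshold  # minimum cosine similarity for tier 2
        self.exact = {}             # sha256 hex digest -> entry
        self.entries = []           # (embedding vector, entry)
        self.by_source = {}         # source document id -> [digest, ...]

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode()).hexdigest()

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def put(self, prompt, answer, sources):
        key = self._key(prompt)
        entry = {"key": key, "answer": answer}
        self.exact[key] = entry
        self.entries.append((self.embed(prompt), entry))
        for source in sources:  # remember provenance for later eviction
            self.by_source.setdefault(source, []).append(key)

    def get(self, prompt):
        # Tier 1: byte-identical prompt.
        entry = self.exact.get(self._key(prompt))
        if entry:
            return entry["answer"]
        # Tier 2: nearest live entry above the similarity threshold.
        vec = self.embed(prompt)
        best, best_sim = None, self.threshold
        for stored_vec, e in self.entries:
            sim = self._cosine(vec, stored_vec)
            if sim >= best_sim and e["key"] in self.exact:  # skip evicted
                best, best_sim = e, sim
        return best["answer"] if best else None

    def invalidate_source(self, source):
        # Causal eviction: drop every entry derived from this document.
        for key in self.by_source.pop(source, []):
            self.exact.pop(key, None)


def toy_embed(text):
    # Stand-in for a real embedding model: bag-of-words counts.
    vocab = ["refund", "policy", "return", "shipping", "window"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("refund policy", "Returns accepted within 30 days.",
          sources=["docs/returns.md"])
```

A paraphrased prompt like "what is the refund policy" hits tier 2, and invalidating docs/returns.md evicts the entry regardless of any remaining TTL, which is the sense in which TTLs become a backstop rather than the correctness mechanism.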

Continue reading →

MultiTrust: Subjective Logic as a Runtime for Multi-Agent Trust

In multi-agent systems, trust is a valuable asset: it lets agents reason about future collaboration, coordination, and planning. Yet most “trust score” implementations in agentic systems are a single float between 0 and 1. That number is doing two jobs at once, representing how much positive evidence an agent has accumulated and how confident the system is in that judgment, and it collapses them into a value that makes the two indistinguishable. A brand-new agent with no history and a seasoned agent that has run 10,000 tasks with an even win/loss record both land at 0.5. The scalar has no room to say “I don’t know yet.”

MultiTrust fixes this by reaching for the right math. It represents trust as a Subjective Logic opinion — a triple of (belief, disbelief, uncertainty) that sums to one — and exposes the whole machinery as an MCP server, so any Model Context Protocol-aware agent can consult it as a standard tool call.
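The newcomer-versus-veteran example above can be worked through with a small Python sketch. The evidence-to-opinion mapping follows Jøsang's standard Subjective Logic (belief, disbelief, and uncertainty derived from positive and negative evidence with a non-informative prior weight of 2); the class and method names are illustrative, not MultiTrust's actual API.

```python
from dataclasses import dataclass

W = 2.0  # non-informative prior weight, standard in Subjective Logic


@dataclass
class Opinion:
    belief: float       # share of the mass backed by positive evidence
    disbelief: float    # share backed by negative evidence
    uncertainty: float  # share not yet backed by any evidence
    base_rate: float = 0.5

    @classmethod
    def from_evidence(cls, positive, negative, base_rate=0.5):
        total = positive + negative + W
        return cls(positive / total, negative / total, W / total, base_rate)

    def expected(self):
        # Projected probability: belief plus the base rate's share of
        # the uncertainty mass.
        return self.belief + self.base_rate * self.uncertainty


newcomer = Opinion.from_evidence(0, 0)        # no history at all
veteran = Opinion.from_evidence(5000, 5000)   # 10,000 tasks, even record
```

Both opinions project to the same 0.5 expected probability, but the newcomer carries uncertainty 1.0 while the veteran's is roughly 0.0002, so a consumer of the trust score can finally distinguish "no idea" from "well-evidenced coin flip".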

Continue reading →

Tangle: Deadlock and Livelock Detection for LangGraph Agents

Multi-agent LLM workflows are, from a concurrency standpoint, small distributed systems. They hold resources, they wait on each other, and — like every other distributed system we have ever built — they can get stuck. The failure mode is worse than an outright crash: no exception is raised, no timer fires, no agent knows anything is wrong. The workflow just stops producing tokens. The operator sees a spinner.

Tangle is a small Python library that catches this class of failure in real time for LangGraph workflows (and, via OpenTelemetry, for anything else). It reuses the Wait-For Graph, an idea that has been sitting in operating-systems textbooks since 1972, and applies it at the agent layer, where the same topology has quietly reappeared. In its current implementation, to be specific, Tangle provides repeated-pattern detection over message digests.
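The textbook technique is easy to state in a few lines of Python: record who is waiting on whom as a directed graph, and report deadlock when a depth-first search finds a cycle. This is a toy sketch of the classic algorithm, not Tangle's actual API; the agent names and method names are hypothetical.

```python
from collections import defaultdict


class WaitForGraph:
    """Minimal wait-for graph: an edge A -> B means agent A is blocked
    waiting on agent B. A cycle in the graph means deadlock."""

    def __init__(self):
        self.edges = defaultdict(set)

    def wait(self, waiter, holder):
        self.edges[waiter].add(holder)

    def resume(self, waiter, holder):
        self.edges[waiter].discard(holder)

    def find_cycle(self):
        # Depth-first search; a back edge to a node still on the current
        # path (GRAY) closes a cycle, which we return as the deadlock set.
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)

        def dfs(node, path):
            color[node] = GRAY
            path.append(node)
            for nxt in self.edges[node]:
                if color[nxt] == GRAY:
                    return path[path.index(nxt):]  # the cycle itself
                if color[nxt] == WHITE:
                    cycle = dfs(nxt, path)
                    if cycle:
                        return cycle
            color[node] = BLACK
            path.pop()
            return None

        for node in list(self.edges):
            if color[node] == WHITE:
                cycle = dfs(node, [])
                if cycle:
                    return cycle
        return None


g = WaitForGraph()
g.wait("planner", "researcher")  # planner blocked on researcher
g.wait("researcher", "critic")   # researcher blocked on critic
g.wait("critic", "planner")      # closes the loop: nobody can proceed
```

No single agent can observe this condition locally, which is exactly why a central (or at least shared) graph is needed: each agent is legitimately waiting, and only the global topology reveals that the waits form a ring.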

Continue reading →