Juho Choi

Project bootstrap

Sun, 10 May 2026 03:29:18 GMT

Project bootstrap

Start a project with a deliberately minimal harness — interactive USER gates, docs-first, no autonomous loop — and graduate to autonomy only after customer evidence (personas, interviews, simulation reports) appears.

When to use

A brand-new agentic project with no existing codebase and no customer data.
Solo or small team where the user holds the domain knowledge and the AI handles implementation.
The project will run long enough that docs (architecture, ADRs) get re-read by future sessions.
You want the AI to implement, not author the spec.

When not to use

Retrofitting an existing codebase — docs already exist or the code is the source of truth; the docs-creation step doesn't fit.
A one-off script or short experiment — boilerplate overhead exceeds value.
You already have a complete written spec — skip docs-creation, go straight to per-feature planning.
Mature project with persona / customer-evidence inputs — switch to an autonomous-loop pattern with a critic agent.

A vague product idea handed to an agent fails in two ways: endless clarifying questions that stall, or silent invention that ships code missing the point. The bootstrap shape forces a documented intermediate step — the spec discussion happens once, gets crystallized into a fixed set of docs, and every downstream phase session reads those docs instead of re-discussing.

The "Stage 0" qualifier is deliberate. Mature harness patterns (autonomous loops, persona simulation, evidence-based critic agents) need inputs that do not exist at project start. Forcing them in too early either rejects everything (no evidence to evaluate against) or rubber-stamps (the critic abandons its own rule). USER gates are the only honest substitute until customer evidence accumulates.

Pattern

Three structural decisions:

1. Docs are the first AI-generated artifact. Before any feature code, the user discusses the product in an interactive session, then triggers a docs-creation prompt. The AI writes a fixed set (e.g. PRD, user flow, data schema, code architecture, ADRs) that becomes the single source of truth for everything downstream.

2. USER replaces the critic agent at three gates. The build skill pauses for user input at clarification, plan approval, and task-creation approval. No agent autonomously approves its own plans. The cost is latency; the benefit is preventing intent-drift in the absence of customer evidence.

3. Phase isolation in ephemeral sessions. Once a task is approved, an external runner spawns one fresh claude -p per phase, sequentially. Each phase reads docs + previous phase outputs from disk. No long-lived session that accumulates context across phases.

Minimal file shape:

prompts/
  docs-create.md          # guidance for the foundational docs
  task-create.md          # guidance for task / phase files
.claude/skills/
  plan-and-build/         # docs → discuss → plan → run phases
  commit/                 # disciplined commit playbook
scripts/
  run-phases.py           # one ephemeral claude -p per phase

User flow:

claude                                              # interactive
  ↓ user describes product
  ↓ user invokes prompts/docs-create.md
docs/{prd, flow, data-schema, code-architecture, adr}.md
  ↓ user invokes plan-and-build skill
plan-and-build  (3 user gates: clarify → plan → task-create)
  ↓
tasks/N-name/{index.json, phase{0..K}.md}
  ↓ scripts/run-phases.py N-name
phase 0 → phase 1 → ... → phase K                  # each in its own claude -p
  ↓
async notification (Discord or equivalent)

Trade-offs

Setup cost. You cannot just claude and start coding; the boilerplate must exist first. Mitigated by publishing it once as a GitHub template repo so cloning is one command.
Three blocking gates per task. Each plan-and-build invocation pauses for user response three times. Fine for solo synchronous work; latency for async teams.
Doc-code drift risk. Docs are authoritative only if maintained. Phase 0 of every task is a docs-update step that catches most drift, but stale ADRs can sneak through if not pruned.
Five docs is opinionated. PRD + flow + data-schema + code-architecture + ADR is a default, not a prescription. Tiny projects may only need PRD + ADR; complex ones may need more. Adjust the docs-creation prompt.
Async notification is a hard prerequisite for unattended runs. Without Discord or equivalent, the user must watch the terminal during phase execution (which can take hours). Set this up before relying on the pattern for anything longer than a single short task.

Example

claude-harness-template is a concrete Stage 0 boilerplate implementing this pattern. Nine files. Published as a GitHub template repo:

gh repo create new-project \
  --template jh-choi98/claude-harness-template \
  --clone

The template was extracted from a longer-running internal harness (argos) by reading its git log to identify what existed at MVP era versus what was added later for Stage 2+ features (persona simulation, autonomous loops, read-only critic agents). The bootstrap shape was the residue after stripping the Stage 2+ assets that depend on customer evidence.

None yet. Future entries in this category will likely cover the Stage 2 graduation (persona-driven ideation, autonomous loops with rollback, read-only critic agents) and per-task git workflow choices (main-direct vs. feature branch).

Thin hook manifest

Mon, 04 May 2026 18:29:44 GMT

Thin hook manifest

Have the hook config file declare only what events to subscribe to and route every one of them to a single handler binary; keep all branching, transformation, and policy in that handler.

When to use

Building external tooling (telemetry, audit, policy enforcement) on top of a harness that exposes events through a user-edited config file (e.g. Claude Code's .claude/settings.json).
You expect the integration's logic to evolve faster than the user's machine can be re-configured.
You need typed payload handling, unit tests, or per-event branching that the manifest schema ({matcher, command}) can't express.
Many event kinds want the same treatment ("log everything"), so per-event configuration would just be repetition.

When not to use

A one-off shell hook that does one thing and never changes — "command": "echo done >> ~/log" directly in the manifest is fine; wrapping it in a CLI is overkill.
The manifest schema is already expressive enough for the full policy.
You don't control a fast distribution channel for the handler. If updating the handler is harder than updating the manifest, the leverage flips and the manifest is the right place for logic.
Hot paths where per-event handler-process spawn cost matters more than the cost of fanning out matchers in the manifest — then push filtering into the manifest.

Context

The pattern is illustrative, not a law. It earns its keep when both of these are true: the manifest schema is too narrow for the real policy, AND the handler ships through a faster channel than the manifest does.

Anything baked into the manifest becomes a deployment artifact — changing it forces every user to re-install or hand-edit. Anything in the handler propagates the next time the user updates the package. Hook config schemas are also typically narrow on purpose: matcher + command. Real policy ("truncate this field, parse the transcript file, drop sub-agent events, count tokens") has nowhere to live in JSON.

Pattern

Two parts, with the asymmetry on purpose.

Manifest (thin). Every event of interest points at the same command. No per-event command, no matcher cleverness. Adding a new event type is one line.

{
  "hooks": {
    "PreToolUse":   [{ "matcher": "", "hooks": [{ "type": "command", "command": "myagent hook" }] }],
    "PostToolUse":  [{ "matcher": "", "hooks": [{ "type": "command", "command": "myagent hook" }] }],
    "SessionStart": [{ "matcher": "", "hooks": [{ "type": "command", "command": "myagent hook" }] }],
    "Stop":         [{ "matcher": "", "hooks": [{ "type": "command", "command": "myagent hook" }] }]
  }
}

Handler (thick). One entry point. Every event lands here and branches on the event name in the payload. This is where typed parsing, transcript reads, network sends, truncation, and the never-block safeties live.

myagent hook  (single binary)
  ├─ read JSON from stdin (with short timeout)
  ├─ branch on payload.hook_event_name
  │     ├─ PreToolUse → record tool args
  │     ├─ Stop       → parse transcript, summarize
  │     └─ ...
  ├─ ship to backend (detached, fire-and-forget)
  └─ exit 0 always (try/finally)

Corollaries:

Policy changes ship through the handler's package manager (npm publish, pip upload), not user re-configuration.
Branching is a single-source-of-truth file; reading it tells the whole story of what the integration does.
"Never block the harness" safeties — stdin timeout, detached I/O, catch-all exit 0 — are implemented once at the single sink, not duplicated per event.
The handler's core functions (e.g. buildPayload, convertEventType) are ordinary pure functions, easy to unit-test.

Trade-offs

Process spawn per event. Every tool call spawns the handler. The cost is real (tens of ms each) and adds up on busy sessions. If measurement shows it matters, push a coarse matcher into the manifest to skip uninteresting events at the source.
Opaque manifest. A reader looking at the user's settings.json sees only myagent hook and learns nothing about what runs. You're trading manifest legibility for handler legibility — document the payload contract somewhere reachable.
Single point of failure. If the handler binary isn't on PATH, every event fails. A common mitigation is to make the SessionStart entry self-bootstrapping (command -v myagent || install ...; myagent hook).
Coarse user controls. Users can't easily say "track Bash but not Read" — the manifest doesn't know about the distinction. If user-tunable filtering matters, expose it via the handler's own config or env vars, not the manifest.

Example

The same shape recurs across infrastructure: webhook receivers (URL registration vs. receiver code), Kubernetes (Service/Ingress YAML vs. controller), AWS Lambda + API Gateway (route vs. function), GitHub Actions (on: vs. step script), systemd (ExecStart= vs. binary). The manifest answers what/when, the handler answers how.

A textbook instance for Claude Code is the Argos telemetry CLI: its .claude/settings.json injects identical argos hook entries for every event type, and a single hook command in the CLI does all routing internally.

(none yet)

Human intervention log

Sun, 03 May 2026 01:10:12 GMT

Human intervention log

A single append-only file where an autonomous harness records every task it could not automate, who handled it manually, and the condition under which automation could resume.

When to use

The harness runs an autonomous loop (e.g. ideate → plan → build → commit → check) that is meant to keep going without a human in each iteration.
The loop will inevitably hit work that is physically outside its reach: console-only API key issuance, vendor/legal approvals, DNS at the registrar, payments, security-incident judgment calls, library or platform limitations the agent cannot bypass.
You want future iterations (or future maintainers) to know why a workaround exists and when it would be safe to retry the automated path.

When not to use

One-shot or short-lived agents — there is no later iteration that will read the log.
Copilot-style interactive harnesses where human turns are the design, not the exception. Every action would qualify and the log degenerates into a transcript.
Cases where the limit is genuinely permanent and uninteresting (e.g. "only the CEO can sign this contract"). A single static note in a runbook is enough; you do not need a log entry per occurrence.

Context

An autonomous loop hits two kinds of failure: ones it can retry, and ones it physically cannot. If the human silently absorbs the second kind, three things are lost:

The audit trail — six months later nobody remembers why the metric was switched from successRate to medianDurationMs.
The retry trigger — the condition that would let the loop reclaim this task is in someone's head, not in the repo.
Self-knowledge — the harness has no list of its own ceilings, so it keeps re-attempting impossible work and burning tokens.

The pattern is to make every manual override an explicit, structured entry instead of an undocumented save.

Pattern

Maintain one append-only markdown file in the repo (e.g. docs/human-intervention.md). Each intervention gets one section with four required fields:

## YYYY-MM-DD — short title

- Context: why the harness could not handle it
- Actor: who intervened
- Action: what they actually did
- Re-automatable: yes/no — <trigger condition, with a concrete check>

The four fields are the minimum needed for retrospection, retry, and audit. Drop one and the entry stops being useful:

Context answers "why was this not automated."
Actor answers "who owns the follow-up."
Action answers "what is the current state of the system."
Re-automatable answers "when, if ever, should the loop try again?" The trigger should be checkable without judgment — a SQL query, a feature-flag probe, a vendor-changelog URL, an issue link. A vague "someday when the API improves" is not a trigger.

The harness can optionally read the file (or an index of it) at boot and treat listed items as known off-limits until their trigger fires.

Trade-offs

Discipline tax. The log is only as good as the operator's habit of writing entries. Half-logged interventions are worse than none — they imply completeness that is not there.
Drift. Triggers go stale (vendors ship APIs, internal limits change). The file needs periodic pruning, otherwise old entries fossilize and the loop never retries things it could now handle.
Not a fix. The log makes the autonomy ceiling visible, it does not raise it. A monotonically growing file is a signal the harness is accumulating debt, not paying it down.

Example

A real entry from an autonomous harness: the loop generated a feature request to add a successRate column to its agent-run dashboard. Investigation revealed the underlying tool runner did not surface exit_code for built-in tools, so the data simply did not exist. The intervention recorded:

Context: success/failure data is unavailable for built-in tools at the data-source layer.
Actor: repo owner.
Action: replaced the proposed successRate column with medianDurationMs, which can be computed from existing fields.
Re-automatable: yes — when the tool runner ships exit-code support, OR when a sampled SQL query against the runs table starts returning non-null exit codes. The query is pinned in the entry so a future iteration can evaluate the trigger automatically.

Because the trigger is concrete, a future loop can run the query, see the data has appeared, and reopen the original feature request without a human re-deciding the question.

(none yet — see also runbooks, ADRs, and postmortems in general engineering practice; this pattern is the autonomous-loop-specific variant focused on retry conditions.)