CI for agent behavior

Every correction becomes a test your agent has to pass.

Today your agent can remember a correction and still repeat the mistake, so you re-teach the same things every session. otto makes each correction a check the agent has to clear before it can claim a task is done. Teach it once; the lesson holds; the gains don't slip back.

What otto is

Other tools test the output. otto tests the conduct.

CI gates code: a change gets tested, and if it fails it doesn't ship. otto points that gate at behavior. Braintrust and Adaline test output quality: is the answer good? otto tests conduct: does the agent prove its work, stop at the right doors, and avoid repeating what you already corrected? (Conduct under pressure is what culture means; the shorthand is Culture CI.)

correction -> proposal -> ratification -> Standard / Practice -> Behavior Check -> receipt
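
One way to picture the loop above is as a strictly ordered set of stages. The sketch below is illustrative only and is not otto's implementation: the names `Stage`, `next_stage`, and `advance` are hypothetical, and the assumption that no stage may be skipped is ours, read from the arrow notation.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Stages of the loop, in the order the page lists them (hypothetical names)."""
    CORRECTION = "correction"
    PROPOSAL = "proposal"
    RATIFICATION = "ratification"
    STANDARD_OR_PRACTICE = "Standard / Practice"
    BEHAVIOR_CHECK = "Behavior Check"
    RECEIPT = "receipt"


# Enum iteration preserves definition order, so this is the pipeline order.
PIPELINE = list(Stage)


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last stage."""
    index = PIPELINE.index(current)
    return PIPELINE[index + 1] if index + 1 < len(PIPELINE) else None


def advance(current: Stage, target: Stage) -> Stage:
    """Move forward exactly one stage; any other jump is rejected (our assumption)."""
    if next_stage(current) is not target:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

Under this reading, a proposal cannot become a Standard without passing through ratification: `advance(Stage.PROPOSAL, Stage.STANDARD_OR_PRACTICE)` raises.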

What it looks like

Your agent claims a task is done without proof. You correct it. "Stop calling it done without showing me the test output."

Step 01

A correction arrives

It enters Curation before any behavior can compound.

Step 02

otto proposes a check

"Done" now requires proof mapped to acceptance criteria.

Step 03

You ratify it

The check becomes a Standard the agent runs against itself. A suggestion is never canon.
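
As a rough illustration of what such a check could look like, the sketch below blocks a completion claim until every acceptance criterion has proof attached. It is not otto's code: `CompletionClaim`, `unproven_criteria`, and `may_claim_done` are hypothetical names, and treating a claim with no criteria as unprovable is our assumption.

```python
from dataclasses import dataclass, field


@dataclass
class CompletionClaim:
    """A hypothetical 'this task is done' claim made by an agent."""
    task: str
    acceptance_criteria: list[str]
    # Maps a criterion to the proof offered for it (test output, log, artifact path).
    proof: dict[str, str] = field(default_factory=dict)


def unproven_criteria(claim: CompletionClaim) -> list[str]:
    """Return the criteria that have no non-empty proof attached."""
    return [c for c in claim.acceptance_criteria if not claim.proof.get(c, "").strip()]


def may_claim_done(claim: CompletionClaim) -> bool:
    """Allow 'done' only when criteria exist and each one has proof.

    Assumption: a claim with no acceptance criteria cannot be proven, so it is blocked.
    """
    return bool(claim.acceptance_criteria) and not unproven_criteria(claim)


claim = CompletionClaim(
    task="fix login redirect",
    acceptance_criteria=["tests pass", "no lint errors"],
    proof={"tests pass": "test run log attached"},
)
blocked_on = unproven_criteria(claim)  # one criterion still lacks proof
```

Here the claim stays blocked until proof for "no lint errors" is added to `claim.proof`.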

The ratification moment

The human ratifies. otto records the proof.

RECEIPT STD-04

Change: "done" now requires proof mapped to acceptance criteria

Effect: future completion claims blocked until a test, log, or artifact exists

Ratified by: you · 2026-06-14 · reversible, revocable
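
A receipt like this one can be thought of as a small immutable record. The sketch below assumes a shape for it. The field names, the `Receipt` class, and the `revoke` helper are all hypothetical and do not reflect otto's actual file format. Revocation is modeled as producing a new record, so the original receipt is never edited.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class Receipt:
    """Hypothetical record of one ratified Standard."""
    standard_id: str
    change: str
    effect: str
    ratified_by: str
    ratified_on: date
    revoked: bool = False


def revoke(receipt: Receipt) -> Receipt:
    """Return a revoked copy; the frozen original is left untouched."""
    return replace(receipt, revoked=True)


def is_enforced(receipt: Receipt) -> bool:
    """A Standard is enforced until its receipt is revoked."""
    return not receipt.revoked


std_04 = Receipt(
    standard_id="STD-04",
    change='"done" now requires proof mapped to acceptance criteria',
    effect="future completion claims blocked until a test, log, or artifact exists",
    ratified_by="you",
    ratified_on=date(2026, 6, 14),
)
```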

The boundary

The agent owns the steps. The human owns the doors.

otto lets you delegate reversible work without surrendering consequential judgment. The goal is to ask the right questions, not more of them.

Two-way doors

The agent acts.

Reversible work runs without interruption. The reversible surface widens as the agent earns trust.

One-way doors

The human ratifies.

Consequential, irreversible work stops for a person. That door never relaxes.
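
The two-door rule can be sketched as a single gate function. This is a minimal illustration under our own assumptions, not otto's API: `Door`, `Action`, and `may_proceed` are hypothetical names, and the sketch does not model how the reversible surface widens with trust.

```python
from dataclasses import dataclass
from enum import Enum


class Door(Enum):
    TWO_WAY = "two-way"  # reversible work
    ONE_WAY = "one-way"  # consequential, irreversible work


@dataclass(frozen=True)
class Action:
    """A hypothetical unit of work the agent wants to perform."""
    name: str
    door: Door


def may_proceed(action: Action, human_ratified: bool = False) -> bool:
    """Two-way doors pass on their own; one-way doors need a human's ratification."""
    if action.door is Door.ONE_WAY:
        return human_ratified
    return True


edit_draft = Action("edit a draft file", Door.TWO_WAY)
delete_data = Action("delete customer data", Door.ONE_WAY)
```

With these definitions, `may_proceed(edit_draft)` is true, while `may_proceed(delete_data)` stays false until it is called with `human_ratified=True`.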

What otto is not

Not output scoring.

Braintrust and Adaline measure whether the answer is good. otto enforces how the agent behaves and gates what it can't undo. Enforcement, not measurement.

Not a memory engine.

Letta is what the agent knows. otto is what it's trusted to do with it.

Not conversation design.

Parlant keeps a conversation on rails. otto holds conduct across whole runs and tool calls.

Not a work manager.

Paperclip decides who does what work. otto decides what's safe to run unsupervised.

A value that cannot refuse you is decoration.

Install

Built to be installed by an agent.

01 Have your agent install it

Paste this into Claude Code, Codex, Cursor, or any coding agent. It wires otto over your Letta runtime and verifies the install end to end.

# paste into your coding agent
Retrieve and follow the instructions at:
https://raw.githubusercontent.com/otto-haus/otto/main/INSTALL_FOR_AGENTS.md

02 Set it up yourself

brew install go-task
git clone https://github.com/otto-haus/otto.git
cd otto
bun install
bun run install-extension   # then run /reload in Letta Code

otto runs as a behavior layer over Letta in local mode. The only hard dependency is Letta Code: no Docker, no database, no server. If you set it up yourself, you also need Bun and go-task for the task shortcuts.

Status

otto is early, and honest about it.

v0.1 is a local-first, file-backed release. Progress is one thing: whether a real correction can travel the whole loop - proposed, ratified, recorded, and reflected in the next run. When one rejection changes what the agent does next, otto is real. Everything before that is a drawing of the machine.

Practices · shipped
Desktop · shipped
Standards · proposed
Curation · next