AgentDevel: Framing Agent Improvement as a Software Release Engineering Pipeline

Moving beyond "self-reflection" to auditable, regression-aware release candidates (RCs) and flip-centered gating.

Read Paper (arXiv:2601.04620)

A. RUN & OBSERVE

Implementation-Blind Surface Signals

D_train: 10k cases
Current Agent (Ab_t)
Trace (τ_t)
Critic
Symptom: Missing Step
Symptom: Invalid Arg

B. DIAGNOSE & SYNTHESIZE

Executable Engineering

Analysis Engine
Executable Diagnostic Scripts
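
The page does not show what a diagnostic script looks like. As a minimal sketch of the idea, assume the critic emits (case id, symptom) pairs using labels such as "Missing Step" and "Invalid Arg"; a script can then rank symptoms by frequency so the most common failure pattern is addressed first. The function name `summarize_symptoms` and the record format are hypothetical, not taken from the paper.

```python
from collections import Counter


def summarize_symptoms(records):
    """Rank symptom labels by how many cases exhibit them.

    records: iterable of (case_id, symptom) pairs (hypothetical format).
    Returns a list of (symptom, count), most frequent first.
    """
    by_symptom = Counter(symptom for _, symptom in records)
    return by_symptom.most_common()


# Illustrative records only; real inputs would come from the critic.
records = [
    ("case-1", "Missing Step"),
    ("case-2", "Invalid Arg"),
    ("case-3", "Missing Step"),
]
summary = summarize_symptoms(records)
```

Because the script is plain code over recorded symptoms, the same analysis can be rerun on a later version's traces and the two summaries compared.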


Release Candidate v.RC-1 #a7f2 (b_t^RC)

C. FLIP-CENTERED GATING

The Regression Firewall

RELEASE GATE CONTROL
P→P  Stable Pass      9,450 cases
P→F  Regression       3 cases
F→P  Fix              142 cases
F→F  Persistent Fail  405 cases
REJECT if: Unacceptable P→F Regressions
Outcome: DISCARD RC
ACCEPT if: High Fixes (F→P) & Low Regressions
Outcome: NEXT OFFICIAL VERSION (b_{t+1})
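
The gate above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: each case has a boolean pass/fail outcome under the current version and under the release candidate, and the thresholds `max_regressions` and `min_fixes` are hypothetical stand-ins for what the page calls "unacceptable", "high", and "low". Dictionary keys use ASCII `->` in place of `→`.

```python
from collections import Counter


def flip_counts(before, after):
    """Count pass/fail transitions between two runs over the same cases.

    before, after: dict mapping case id -> bool (True means pass).
    """
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same cases")
    label = {True: "P", False: "F"}
    counts = Counter({"P->P": 0, "P->F": 0, "F->P": 0, "F->F": 0})
    for case_id, passed_before in before.items():
        counts[f"{label[passed_before]}->{label[after[case_id]]}"] += 1
    return dict(counts)


def gate(counts, max_regressions, min_fixes):
    """Return "ACCEPT" or "REJECT" for a release candidate.

    Both thresholds are assumptions chosen by the caller.
    """
    if counts["P->F"] > max_regressions:
        return "REJECT"  # too many previously passing cases now fail
    if counts["F->P"] >= min_fixes:
        return "ACCEPT"  # enough fixes, regressions within budget
    return "REJECT"      # nothing regressed, but too little was fixed


# Synthetic outcomes that reproduce the counts shown in the panel.
before, after = {}, {}
groups = [(True, True, 9450), (True, False, 3),
          (False, True, 142), (False, False, 405)]
case_id = 0
for was_pass, is_pass, n in groups:
    for _ in range(n):
        before[case_id], after[case_id] = was_pass, is_pass
        case_id += 1

counts = flip_counts(before, after)
```

Whether the three P→F regressions in the panel are acceptable depends entirely on the chosen threshold: `gate(counts, max_regressions=0, min_fixes=1)` rejects the candidate, while a looser regression budget accepts it.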