Fully Automated Prompt Optimization

Getting prompts right is the hardest part of building reliable LLM applications. Small wording changes can swing accuracy by 20%, what works on a few examples often fails at scale, and when a multi-step pipeline produces wrong answers, figuring out which step failed and why requires manually inspecting intermediate outputs.

Manual prompt engineering doesn't scale. You run evaluations one at a time, analyze failure cases by hand, tweak prompts based on intuition, and hope your changes don't break what was already working.

This post introduces FAPO (Fully Automated Prompt Optimization): a Claude Code-driven system that autonomously optimizes LLM pipelines from baseline prompts to production-ready accuracy. You provide a dataset and an initial prompt. FAPO evaluates, classifies failure patterns, generates improved variants, validates them through an independent reviewer, and iterates until your application reaches target accuracy, all orchestrated by Claude Code agents.

The Problem with Manual Optimization

Consider a multi-hop question answering pipeline that retrieves documents, extracts supporting facts, reasons over evidence, and formats a final answer. When this chain achieves 40% accuracy on your test set, where do you start?

  • No step-level visibility: Existing evaluation frameworks score final outputs but don't tell you which step in the chain caused the failure
  • Slow iteration: Each prompt change requires running a full evaluation, waiting for results, and manually analyzing what improved or regressed
  • Narrow optimization: Most prompt engineering tools optimize single prompts, not multi-step chains where failures cascade
  • No systematic diagnosis: Without tooling to classify failure patterns, you're guessing which prompt changes will help

FAPO addresses all of these. It provides pipeline-aware evaluation with step attribution, autonomous optimization orchestrated by Claude Code, and a closed-loop system that iterates from prompt-level changes to parameter tuning to structural chain redesigns when needed.

How FAPO Works

FAPO is a multi-tenant evaluation and optimization framework. A tenant is a self-contained optimization project: a single directory holding everything specific to one task, including its prompts, dataset, chain definition, scorer, and configuration. Tenants are isolated from one another, so you can optimize many unrelated tasks side by side without interference.

The one input you must bring is a dataset: paired inputs and expected outputs that define what success looks like for your task. It is the ground truth the entire optimization loop measures against, so its quality and coverage directly determine how good the optimized prompts can get. The dataset is split into a validation set, which the optimizer iterates against to drive prompt improvements, and a held-out test set, used only for a final one-shot evaluation. The remaining pieces (the initial prompt, the LangGraph chain, and the scorer function) can all be generated by Claude: just describe your application and the evaluation criteria, and FAPO lets Claude scaffold the rest of the tenant for you.
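
As a rough illustration of the split described above, the following sketch partitions a list of cases reproducibly into a validation set and a held-out test set. The `split_dataset` helper, the 80/20 ratio, and the fixed seed are assumptions made for this example; they are not taken from FAPO itself.

```python
import random

def split_dataset(cases, test_fraction=0.2, seed=0):
    """Shuffle cases reproducibly and hold out a test set.

    Hypothetical helper: the ratio and seed are illustrative assumptions.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(cases)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded RNG keeps the split stable across runs
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (validation, test)

cases = [{"case_id": str(i)} for i in range(10)]
validation, test = split_dataset(cases)
print(len(validation), len(test))
```

Fixing the seed matters here: if the split changed between runs, test cases could leak into the set the optimizer iterates against.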

Once these pieces are in place, FAPO runs them through a closed optimization loop, iterating autonomously until your target accuracy is reached:

Claude-Driven Optimization Loop

  1. Evaluate: Run the chain on your dataset, collecting per-case scores and step-level outputs
  2. Attribute: Classify failures by root cause using rule-based heuristics and LLM analysis
  3. Propose: Generate a new variant targeting the dominant failure cluster (prompt, parameter, or structure)
  4. Review: An independent agent validates the proposal for scope compliance, data leakage, and scorer compatibility
  5. Compare: Accept the variant if it improves on the previous best, reject it otherwise
  6. Iterate: Continue until the target accuracy is reached or the optimization budget is exhausted
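
The control flow of this loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not FAPO's implementation: the `Variant` class, the `optimize` function, and the callables passed into it are hypothetical names, and the toy lambdas stand in for steps that would really run the chain and call an LLM.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Variant:
    name: str
    prompt: str

def optimize(
    baseline: Variant,
    evaluate: Callable[[Variant], float],             # step 1: composite score
    attribute: Callable[[Variant], str],              # step 2: dominant failure cluster
    propose: Callable[[Variant, str, int], Variant],  # step 3: new variant
    review: Callable[[Variant], bool],                # step 4: independent check
    target: float,
    budget: int,
) -> tuple[Variant, float]:
    best, best_score = baseline, evaluate(baseline)
    for i in range(1, budget + 1):
        if best_score >= target:       # step 6: stop once the target is reached
            break
        cluster = attribute(best)
        candidate = propose(best, cluster, i)
        if not review(candidate):      # rejected proposals are never evaluated
            continue
        score = evaluate(candidate)
        if score > best_score:         # step 5: keep only strict improvements
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins so the sketch runs end to end (all values are made up).
scores = {"baseline": 40.0, "variant-001": 35.0, "variant-002": 70.0, "variant-003": 96.0}
best, best_score = optimize(
    baseline=Variant("baseline", "Answer the question."),
    evaluate=lambda v: scores[v.name],
    attribute=lambda v: "format_error",
    propose=lambda v, cluster, i: Variant(f"variant-{i:03d}", f"{v.prompt} Fix: {cluster}."),
    review=lambda v: True,
    target=95.0,
    budget=5,
)
print(best.name, best_score)
```

In this toy run the first proposal regresses and is discarded, so the next proposal builds on the baseline rather than on the failed variant.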

Several design choices underpin this loop:

  • Two-layer architecture: A domain-agnostic core engine handles evaluation, chain execution, and scoring, while isolated tenants hold each task's prompts, datasets, and scorers, so unrelated projects optimize concurrently without interference.
  • Pipeline-aware scoring: FAPO captures the output of every chain step, not just the final answer, so failures can be attributed to the step that caused them: prompt-addressable (verbose answers, format errors) versus structural (retrieval gaps, missing nodes).
  • Claude Code orchestration: An optimization agent drives the loop while specialized subagents handle failure attribution and independent variant review. The agent works at three levels (prompt text, chain parameters, then chain structure), exhausting one before escalating to the next.
  • Guardrails against overfitting: The optimizer inspects only training-split cases (the validation and test splits expose aggregate scores only), tenant playbooks define what may change, every variant is a new immutable file, and an independent reviewer checks each proposal for scope compliance and data leakage before it runs.
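
To make the idea of step attribution concrete, here is a small sketch that records the output of each chain step and blames the first step whose output fails a check. The step names, the `STEP_CLASS` mapping, and the check functions are all hypothetical; as described above, FAPO's real attribution combines rule-based heuristics with LLM analysis.

```python
from typing import Callable, Optional

# Hypothetical mapping from a failing step to a failure class.
STEP_CLASS = {
    "retrieve": "structural",
    "extract": "prompt-addressable",
    "answer": "prompt-addressable",
}

def attribute_failure(
    trace: dict[str, str],
    checks: dict[str, Callable[[str], bool]],
) -> Optional[tuple[str, str]]:
    """Return (step, failure_class) for the first step whose output fails its check."""
    for step, output in trace.items():  # dicts preserve insertion (execution) order
        check = checks.get(step)
        if check is not None and not check(output):
            return step, STEP_CLASS.get(step, "unknown")
    return None  # every recorded step passed its check

# Made-up trace of one failing case: retrieval and extraction look fine,
# but the final answer is too verbose for an exact-match scorer.
trace = {
    "retrieve": "doc about Paris",
    "extract": "Paris is the capital of France.",
    "answer": "The capital of France is, as far as the evidence shows, Paris.",
}
checks = {
    "retrieve": lambda out: "Paris" in out,
    "extract": lambda out: "capital" in out,
    "answer": lambda out: len(out.split()) <= 3,
}
result = attribute_failure(trace, checks)
print(result)
```

Scoring only the final answer would mark this case as a failure without saying why; the per-step trace shows that a prompt edit on the last step is the place to start.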

Research Evaluation: FAPO vs. GEPA

To validate FAPO's effectiveness, we evaluate it against GEPA (Genetic-Pareto), a state-of-the-art prompt optimization method that evolves prompts through natural-language reflection on execution traces combined with Pareto-based candidate selection. GEPA represents the current benchmark for automated prompt optimization in complex chains.

We compare FAPO and GEPA across six benchmarks and three task models (GPT-4.1-mini, GPT-5.4-mini, Gemma 3-12B), using Claude Opus 4.6 as both FAPO's orchestrator and GEPA's reflector (prompt optimizer) model. Both systems start from identical baseline pipelines and prompts. FAPO begins with prompt-level edits but can escalate to structural changes when attribution identifies bottlenecks that prompts alone cannot resolve. GEPA is limited to prompt-level optimization.

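FAPO vs. GEPA Across Six Benchmarks (Test Accuracy %)

| Benchmark      | Baseline | GEPA | FAPO | Gain vs. GEPA |
|----------------|----------|------|------|---------------|
| HoVer*         | 35.9     | 48.5 | 83.8 | +35.3pp       |
| IFBench*       | 35.7     | 48.5 | 80.7 | +32.2pp       |
| LiveBench-Math | 51.0     | 52.6 | 62.0 | +9.4pp        |
| HotpotQA       | 50.9     | 61.8 | 68.3 | +6.5pp        |
| Papillon       | 73.6     | 90.7 | 94.9 | +4.2pp        |
| AIME           | 16.7     | 16.0 | 12.9 | -3.1pp        |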

FAPO wins 15 of 18 model-benchmark comparisons with a mean gain of +14.1pp over GEPA (scores averaged across three task models). *On HoVer and IFBench, where FAPO escalated to pipeline changes, it wins all 6 model-benchmark pairs with a mean gain of +33.8pp; AIME is the only benchmark where GEPA leads.
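
The headline means can be recomputed directly from the table. The short script below uses only the model-averaged scores listed there; the variable names are ours, and because the inputs are already rounded to one decimal place, the results agree with the reported figures only up to rounding.

```python
# (baseline, GEPA, FAPO) test accuracy per benchmark, copied from the table above.
results = {
    "HoVer": (35.9, 48.5, 83.8),
    "IFBench": (35.7, 48.5, 80.7),
    "LiveBench-Math": (51.0, 52.6, 62.0),
    "HotpotQA": (50.9, 61.8, 68.3),
    "Papillon": (73.6, 90.7, 94.9),
    "AIME": (16.7, 16.0, 12.9),
}

# Gain of FAPO over GEPA in percentage points, per benchmark.
gains = {name: fapo - gepa for name, (_, gepa, fapo) in results.items()}

mean_gain = sum(gains.values()) / len(gains)

# Benchmarks where FAPO escalated to pipeline changes.
structural = ["HoVer", "IFBench"]
mean_structural = sum(gains[name] for name in structural) / len(structural)

print(f"mean gain over GEPA: {mean_gain:+.2f}pp")
print(f"mean gain on HoVer and IFBench: {mean_structural:+.2f}pp")
```

The six gains sum to 84.5pp, giving a mean of roughly 14.08pp, and the two structural benchmarks average 33.75pp, both consistent with the reported +14.1pp and +33.8pp.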

Key Findings

  • Structural optimization matters: The largest gains (+35.3pp on HoVer, +32.2pp on IFBench) came from escalating beyond prompts to pipeline changes when attribution identified retrieval bottlenecks or format constraint failures.
  • Prompt-only still effective: On the 4 benchmarks without structural changes, FAPO wins 9 of 12 comparisons through prompt optimization alone.
  • Model-specific strategies: Optimal prompting approaches vary qualitatively across models even for identical tasks; the optimization agent discovers different strategies for each.
  • AIME within noise: FAPO's lone underperformance, on AIME, falls within the sampling noise of the benchmark; the gap is smaller than the standard deviation across stochastic trials, so it is not a statistically meaningful regression.

Getting Started with FAPO

Those benchmark results come from the same workflow you can run on your own task. The fastest way to start is to let Claude Code create all the tenant files for you. From the FAPO repository, start a Claude Code session and describe your task:

Prompt
Create a tenant named my_classification to categorize software names to a category.
Create all the relevant tenant files based on the template in docs/templates/tenant-docs
and other existing tenant examples. Evaluate with OpenAI gpt-4o model. The evaluation
criteria is exact match. I will add the dataset after you create all the tenant files.

Claude Code will generate:

  • tenants/my_classification/prompts/prompt.md — Initial prompt template with placeholders
  • tenants/my_classification/chains/classify_chain.py — LangGraph chain definition using built-in utilities
  • tenants/my_classification/code/my_scorer.py — Scorer function implementing validate_case and score_case
  • tenants/my_classification/configs/my_config.json — Eval configuration tying everything together
  • tenants/my_classification/docs/ — Tenant profile, data contract, prompt contract, optimization playbook
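
For orientation, an exact-match scorer might look roughly like the sketch below. Only the function names `validate_case` and `score_case` come from the list above; the argument lists, the returned dictionary, and the 0 to 100 scale are assumptions made for illustration, so treat the generated `my_scorer.py` as the authority on the real interface.

```python
def validate_case(case: dict) -> None:
    """Raise ValueError if the case lacks the field an exact-match scorer needs.

    Assumed signature: the real scorer interface may differ.
    """
    if "category" not in case.get("expected", {}):
        raise ValueError(f"case {case.get('case_id')!r} has no expected.category")

def score_case(case: dict, output: str) -> dict:
    """Score 100 for an exact match on the expected category, otherwise 0.

    Assumed return shape: a dict of named scores on a 0 to 100 scale.
    """
    expected = case["expected"]["category"].strip()
    exact = 100.0 if output.strip() == expected else 0.0
    return {"exact_match": exact, "composite_score": exact}

case = {
    "case_id": "1",
    "context": {"software_name": "Windows 11"},
    "expected": {"category": "CAT_001 (Operating System)"},
}
validate_case(case)
print(score_case(case, "CAT_001 (Operating System)"))
```

Stripping surrounding whitespace before comparing is a design choice in this sketch: it keeps a trailing newline in the model output from counting as a miss while still requiring the label itself to match exactly.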

Add your dataset as tenants/my_classification/datasets/my_dataset.jsonl in the required format:

JSONL
{"case_id": "1", "task_type": "classification", "context": {"software_name": "Windows 11"}, "expected": {"category": "CAT_001 (Operating System)"}, "metadata": {}}
{"case_id": "2", "task_type": "classification", "context": {"software_name": "Google Chrome"}, "expected": {"category": "CAT_002 (Web Browser)"}, "metadata": {}}
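
Before running an evaluation, it can help to check the file for malformed records. The sketch below parses JSONL lines and verifies that each record carries the five top-level keys that appear in the two sample records above. Treating all five as required, and rejecting duplicate `case_id` values, are assumptions of this example rather than documented FAPO rules; `load_cases` is a hypothetical helper.

```python
import json

# Keys present in the sample records; assumed here to be required.
REQUIRED_KEYS = {"case_id", "task_type", "context", "expected", "metadata"}

def load_cases(lines):
    """Parse JSONL lines and check each record for the expected keys."""
    cases, seen_ids = [], set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        if record["case_id"] in seen_ids:
            raise ValueError(f"line {lineno}: duplicate case_id {record['case_id']!r}")
        seen_ids.add(record["case_id"])
        cases.append(record)
    return cases

sample = [
    '{"case_id": "1", "task_type": "classification", "context": {"software_name": "Windows 11"}, "expected": {"category": "CAT_001 (Operating System)"}, "metadata": {}}',
    '{"case_id": "2", "task_type": "classification", "context": {"software_name": "Google Chrome"}, "expected": {"category": "CAT_002 (Web Browser)"}, "metadata": {}}',
]
cases = load_cases(sample)
print(len(cases), "cases loaded")
```

Because `load_cases` accepts any iterable of lines, an open file handle can be passed in directly, for example `with open(path, encoding="utf-8") as f: load_cases(f)`.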

Verify the setup with a test evaluation:

Bash
export OPENAI_API_KEY="sk-..."
python -m hephaestus.cli eval --config tenants/my_classification/configs/my_config.json

Then invoke the optimization agent:

Prompt
Run the optimization agent with the following parameters:
Tenant: my_classification
Config: configs/my_config.json
Success Criteria: composite_score >= 95

Claude Code will recognize the optimization agent and skills, produce a scope contract defining what's allowed to change, then autonomously iterate until the target is reached or the variant budget is exhausted. A summary of each evaluation can be viewed in evals/variant-XXX/summary.md:

Markdown
# Evaluation Summary

Total cases: 50

## Composite Score
- average: 100.00

## Score Breakdown
- exact_match: 100.00

## Step Timings
| Step     | Avg (s) | P50 (s) | P95 (s) |
|----------|---------|---------|---------|
| classify | 0.635   | 0.548   | 1.220   |
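
If you want to check the success criterion from a script, the composite score can be read back out of a summary like the one above. The sketch below assumes the summary always follows the layout shown here, with an `- average:` line directly under the `## Composite Score` heading; `composite_average` is a hypothetical helper, not part of FAPO.

```python
import re

def composite_average(summary_md: str) -> float:
    """Extract the average listed under the '## Composite Score' heading."""
    match = re.search(
        r"^## Composite Score\s+- average:\s*([0-9.]+)",
        summary_md,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no composite score found in summary")
    return float(match.group(1))

summary = """# Evaluation Summary

Total cases: 50

## Composite Score
- average: 100.00

## Score Breakdown
- exact_match: 100.00
"""

score = composite_average(summary)
target_met = score >= 95  # the success criterion given to the optimization agent
print(score, target_met)
```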

Claude updates three folders so every run is auditable:

  • prompts/variants/ — each improved prompt template, never edited in place
  • configs/ — a runtime config for every iteration
  • docs/ — per-variant scores, failure analysis, and key learnings from the optimization
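
The "never edited in place" rule can be enforced mechanically. The following sketch writes each prompt to a new numbered file and refuses to overwrite an existing one. The `write_variant` helper and the `variant-NNN.md` file naming are assumptions for illustration, loosely modeled on the `evals/variant-XXX` paths mentioned earlier; FAPO's actual naming may differ.

```python
from pathlib import Path
import tempfile

def write_variant(variants_dir: Path, prompt_text: str) -> Path:
    """Save prompt_text as the next numbered variant without overwriting any file."""
    variants_dir.mkdir(parents=True, exist_ok=True)
    next_id = len(list(variants_dir.glob("variant-*.md"))) + 1
    path = variants_dir / f"variant-{next_id:03d}.md"
    # Mode "x" creates the file exclusively and raises FileExistsError if it
    # already exists, so an earlier variant can never be silently replaced.
    with path.open("x", encoding="utf-8") as f:
        f.write(prompt_text)
    return path

# Demonstrate in a throwaway directory rather than a real tenant.
with tempfile.TemporaryDirectory() as tmp:
    variants = Path(tmp) / "prompts" / "variants"
    first = write_variant(variants, "Classify the software name.")
    second = write_variant(variants, "Classify the software name into one category.")
    names = [first.name, second.name]
    first_text = first.read_text(encoding="utf-8")

print(names)
```

Keeping every variant as its own file is what makes the audit trail useful: any score in the docs folder can be traced back to the exact prompt text that produced it.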

To go deeper, see the GitHub repository for the full system overview, the tenant template docs for creating optimization targets manually, and the example tenants (HotpotQA, CTIBench-RCM, and more) for reference implementations.

FAPO also extends to optimizing ReAct agents. An MCP workflow evaluation extension adds an example tenant, mcp_example, that optimizes a tool-calling ReAct agent with trajectory scoring and LLM-as-Judge scoring examples. And while the loop is orchestrated by Claude Code by default, FAPO also supports Codex as the optimization agent.

Why This Matters

Prompt engineering is the bottleneck for LLM application quality, and existing tools treat prompts as atomic units rather than components of multi-step pipelines. FAPO closes that gap with pipeline-aware evaluation, autonomous optimization driven by Claude Code, and guardrails against overfitting, turning prompt tuning from manual debugging into an automated optimization problem.

FAPO is open source. If you're using it, we'd like your feedback, and if you've built optimization strategies that generalize beyond your domain, consider contributing them back.