Aubrey Clark / Writing

🤖 AI-assisted draft — February 20, 2026


Sub-Agent Trees: Hierarchical Task Decomposition for AI

I had a 56,000-word document that needed a complete redesign. Doing it in one shot would produce mush. Doing it sequentially would take days. So I broke it into a tree, built an orchestrator, and ran the whole thing across three AI providers in parallel. The result: 227,000 characters of production-ready output in 2 minutes 14 seconds, reviewed by three independent models in 86 seconds, and all fixes verified in 107 seconds.

This post explains the idea, shows the code, and walks through the actual run.

The Idea

Treat complex AI tasks like a compiler treats code: break the problem into independent pieces, process them in parallel, then combine the results bottom-up.

Tasks form a DAG (directed acyclic graph). Some depend on others; most don't. The orchestrator figures out which can run simultaneously, groups them into layers, and fires each layer in parallel. Outputs from one layer feed into the next.

          [Merge]              Layer 2: synthesize everything
          /      \
   [Synthesis A] [Synthesis B]  Layer 1: combine leaf outputs
    /    |    \    /    |    \
 [T1]  [T2]  [T3] [T4] [T5] [T6]   Layer 0: independent leaves

Layer 0 tasks run simultaneously. Layer 1 waits for Layer 0 to finish. A merge agent optionally reads everything and produces the final output.
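The layer computation is standard topological layering: a task joins the first layer in which all of its dependencies have already run. A minimal sketch (function and variable names are mine, not the repo's):

```python
def layers(tasks):
    """Group tasks into layers (Kahn-style topological layering).
    `tasks` maps a task id to the set of ids it depends on."""
    placed, result = set(), []
    remaining = dict(tasks)
    while remaining:
        # Everything whose dependencies are already placed can run now.
        layer = [t for t, deps in remaining.items() if deps <= placed]
        if not layer:
            raise ValueError("cycle in task graph")
        result.append(layer)
        placed.update(layer)
        for t in layer:
            del remaining[t]
    return result

# The tree from the diagram above:
tree = {
    "T1": set(), "T2": set(), "T3": set(),
    "T4": set(), "T5": set(), "T6": set(),
    "synth_a": {"T1", "T2", "T3"},
    "synth_b": {"T4", "T5", "T6"},
    "merge": {"synth_a", "synth_b"},
}
print(layers(tree))
# Layer 0: the six leaves; Layer 1: the two syntheses; Layer 2: the merge
```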

The Code

~300 lines of Python. No framework, no LangChain, no abstractions-on-abstractions. Just async Python and YAML. Four parts:

1. Providers (~80 lines). Each AI provider is a class with one method: complete(system, prompt, model) → str. Four providers ship out of the box: Anthropic, OpenAI, Google, xAI. Adding a provider is five lines. xAI uses the OpenAI client with a different base_url because Grok's API is OpenAI-compatible.
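A sketch of what that one-method interface might look like, with a fake provider standing in for a real SDK client so it runs offline. The real classes wrap the Anthropic, OpenAI, Google, and xAI SDKs; these names are illustrative, not the repo's actual code:

```python
import asyncio
from typing import Protocol

class Provider(Protocol):
    # The entire provider contract: one async method, string in, string out.
    async def complete(self, system: str, prompt: str, model: str) -> str: ...

# "Adding a provider is five lines" -- a stub that satisfies the contract:
class EchoProvider:
    async def complete(self, system: str, prompt: str, model: str) -> str:
        return f"[{model}] {prompt}"

out = asyncio.run(EchoProvider().complete("sys", "hello", "echo-1"))
print(out)  # [echo-1] hello
```

A real provider replaces the body of `complete` with one SDK call; the xAI one, as noted, is just the OpenAI client constructed with a different `base_url`.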

2. Data structures (~30 lines). Two dataclasses. A Task has an id, prompt, provider, model, dependencies, files to read, and a place to put its output. A Plan is a list of tasks plus optional merge config. No inheritance hierarchy, no plugin system, no registry pattern.
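Hypothetically, the two dataclasses could look like this (field names are my guesses at the schema described above, not the repo's actual code):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    id: str
    prompt: str
    provider: str            # which provider class handles this task
    model: str               # which model that provider should call
    depends_on: list = field(default_factory=list)  # ids of upstream tasks
    files: list = field(default_factory=list)       # files to read into the prompt
    output: Optional[str] = None                    # filled in after the run

@dataclass
class Plan:
    tasks: list
    merge: Optional[dict] = None  # optional merge-agent config

t = Task(id="T1", prompt="Summarize chapter 1", provider="openai", model="gpt-5.2")
plan = Plan(tasks=[t])
print(plan.tasks[0].id)  # T1
```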

3. The orchestrator (~150 lines). Does five things: sorts tasks into dependency layers, runs each layer concurrently with asyncio.gather, feeds upstream outputs into downstream prompts, writes every result to a file, and runs the optional merge at the end.

4. YAML plan loader (~30 lines). Plans are YAML files. Each task specifies what to do, which model to use, what files to read, and what to depend on. The loader turns this into dataclasses. That's it.
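An illustrative plan file and the gist of the loader, assuming PyYAML; the task ids are taken from the build phase below, but the exact schema and model strings are my assumptions:

```python
import yaml  # PyYAML

PLAN = """
merge:
  provider: openai
  model: gpt-5.2
  prompt: Reconcile all outputs into one document.
tasks:
  - id: income
    provider: google
    model: gemini
    prompt: Draft the income module.
  - id: integration-check
    provider: openai
    model: gpt-5.2
    depends_on: [income]
    prompt: Check the modules against each other.
"""

doc = yaml.safe_load(PLAN)
print([t["id"] for t in doc["tasks"]])  # ['income', 'integration-check']
print(doc["merge"]["provider"])         # openai
```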

Design Decisions

No framework. The entire thing is stdlib Python plus four provider SDKs. When you read the code, you see the code.

Async all the way down. asyncio.gather is the only concurrency primitive. No threads, no processes, no queues.
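Running a layer with nothing but asyncio.gather looks roughly like this; the fake provider below simulates network latency so the sketch runs offline (names are mine, not the repo's):

```python
import asyncio

class FakeProvider:
    # Stand-in for a real SDK client; the sleep simulates an API call.
    async def complete(self, system, prompt, model):
        await asyncio.sleep(0.05)
        return f"done: {prompt}"

async def run_layer(tasks, providers):
    async def run(task):
        p = providers[task["provider"]]
        return await p.complete("", task["prompt"], task["model"])
    # gather is the only concurrency primitive: fire all tasks, await all.
    return await asyncio.gather(*(run(t) for t in tasks))

tasks = [{"provider": "fake", "prompt": f"task {i}", "model": "m"} for i in range(6)]
out = asyncio.run(run_layer(tasks, {"fake": FakeProvider()}))
print(out[0])  # done: task 0
```

Six 50 ms tasks finish in roughly 50 ms total, which is the whole point: a layer's wall-clock time is its slowest task, not the sum.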

Files as interface. Each task writes its output to a file. You can inspect intermediate results during execution, re-run individual tasks, or hand-edit outputs and re-run only the merge.

Provider as parameter. Each task declares its provider and model. The orchestrator doesn't care which model runs which task. Put GPT on the hard reasoning tasks, Grok on the simple ones, Gemini on the synthesis.

Phase 1: Design (the tree from the original post)

The first use was redesigning a 56,000-word financial planning document. Nine agents across three layers produced a detailed blueprint: the file tree, module specifications, routing logic, state schema, and build order.

            [Final Blueprint]              Layer 2 (Opus)
               /            \
      [System Arch]      [Content Plan]    Layer 1 (Gemini, Opus)
       /     |    \       /     |     \
    Router Module State  Data  Audit  AI    Layer 0 (Opus, GPT-5.2)

Nine agents. Three layers. Three providers. The blueprint said: 22 files, 4-week critical path if done manually.

Phase 2: Build

Then I used the blueprint to build the actual system. 21 modules in parallel (Layer 0), plus an integration check (Layer 1), plus a merge.

  Layer 0 (21 tasks in parallel):
    OpenAI GPT-5.2:  router, quick-start, spending, tax-leaks,
                     retirement-accounts, goals-and-budget, savings-waterfall
    xAI Grok:        state-schema, module-template, cash-and-debt, benefits,
                     equity-comp, action-plan, philosophy, principles,
                     glossary, career, readme
    Google Gemini:   income, net-worth, portfolio

  Layer 1 (1 task):
    GPT-5.2:         integration-check (reads all 21 outputs)

  Merge:
    GPT-5.2:         final reconciliation

Results: 22 tasks in 2 minutes 14 seconds, producing 227,000 characters across 22 files.

The same work done sequentially on a single model would take 20-30 minutes. Parallelism across providers collapsed it to just over two minutes.

Phase 3: Multi-Model Review

Here's where it gets interesting. Different models have different blind spots. So I ran the same review task on three providers in parallel, then reconciled the results:

  Layer 0 (3 reviewers, same prompt, same input):
    GPT-5.2:   "Review all 24 files for financial accuracy,
                routing logic, consistency, UX, gaps..."
    Grok:      (same prompt)
    Gemini:    (same prompt)

  Layer 1:
    GPT-5.2:   Reconciliation — read all 3 reviews,
               apply consensus filter

  Merge:
    GPT-5.2:   Top 5 fixes

The consensus filter: only include issues flagged by 2 or more reviewers. Single-reviewer issues go in a "disputed" section. This reduces false positives and surfaces the real problems.
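The consensus filter is a few lines. A sketch using exact string matching (real reviews would need fuzzier issue matching than this):

```python
from collections import Counter

def consensus(reviews, threshold=2):
    """Split issues into agreed (flagged by >= threshold reviewers)
    and disputed (flagged by fewer). Each review is a list of issues."""
    # dict.fromkeys dedupes within one review while preserving order.
    counts = Counter(issue for review in reviews
                     for issue in dict.fromkeys(review))
    agreed = [i for i, n in counts.items() if n >= threshold]
    disputed = [i for i, n in counts.items() if n < threshold]
    return agreed, disputed

# Issue strings paraphrased from the findings described below:
reviews = [
    ["hardcoded tax limits", "schema mismatch"],
    ["hardcoded tax limits", "placeholder modules routed"],
    ["schema mismatch", "prerequisite gaps"],
]
agreed, disputed = consensus(reviews)
print(agreed)    # issues flagged by 2+ reviewers
print(disputed)  # single-reviewer issues, for the "disputed" section
```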

Results: 4 tasks in 86 seconds, producing three independent reviews plus a reconciliation.

The three models caught different things. GPT flagged hardcoded tax limits and schema inconsistencies. Grok caught placeholder modules being routed as if they were real. Gemini found prerequisite enforcement gaps. No single model found everything. Together they produced a comprehensive review.

Phase 4: Apply Fixes

The final pipeline applied all review fixes:

  Layer 0 (3 parallel):
    Create limits.json (GPT-5.2)
    Fix router (GPT-5.2)
    Fix placeholder modules (Grok)

  Layer 1 (5 parallel):
    Update retirement module (GPT-5.2)
    Update benefits module (Grok)
    Update savings waterfall (Grok)
    Update income module (Gemini)
    Fix tax-leaks safe harbor (GPT-5.2)

  Layer 2:
    Verification (Gemini)

  Merge:
    Ship decision (GPT-5.2)

Results: 9/9 ✅ in 107 seconds. All fixes verified. "V1 READY TO SHIP."

The Full Pipeline

  Phase           Tasks   Time       Output
  Design (tree)       9   ~3 min     Blueprint
  Build              22   2m 14s     227K chars, 22 files
  Review              4   1m 26s     3 reviews + reconciliation
  Fix                 9   1m 47s     All fixes applied + verified
  Total              44   ~8.5 min   Complete v1 system

Estimated API cost for the entire pipeline: $1-3.

Limitations

No streaming. Tasks complete atomically. You see nothing until a task finishes.

No retry logic. If a task fails (auth error, rate limit, timeout), it writes the error to the output file and moves on.

Context truncation. Outputs shared as general context are cut to 2,000 characters per task. Direct dependency outputs are passed in full, but shared context is lossy.

8,192 token output limit. Hardcoded across all providers. Some tasks want more.

No cost tracking. The manifest records timing but not token counts. You check your provider dashboards after the run.

When to Use This

Sub-agent trees work when the task is decomposable, you need multiple perspectives, sequential processing would be too slow, and the synthesis step is well-defined.

They don't work when every step depends on the previous one, the task requires a single coherent voice, or context that emerges mid-task would change everything.

Source

The orchestrator is open source: github.com/abclark/sub-agent-trees. MIT license. ~300 lines. No dependencies beyond the provider SDKs.


The system built in this post is Open Plan, an open-source financial planning system.
