A cutting-edge framework for adaptive multi-agent LLM collaboration with reinforcement learning-based orchestration, based on the "Puppeteer" model from the paper *Puppeteer: Adaptive Multi-Agent Orchestration with Reinforcement Learning*.
The DOA Framework implements a learnable orchestrator that dynamically selects which agents to activate based on the current task state. Unlike static multi-agent systems, the orchestrator improves continuously through reinforcement learning and learns to:
- Optimize agent selection for better task performance
- Minimize computational costs through efficient orchestration
- Adapt to complex reasoning patterns, including cycles and hubs
- Self-improve via REINFORCE-based policy optimization
```
┌──────────────────┐     ┌───────────────────┐     ┌────────────────────┐
│    Task Input    │────▶│   Orchestrator    │────▶│     Agent Pool     │
└──────────────────┘     │                   │     │                    │
┌──────────────────┐     │  Policy Network   │     │  • EchoAgent       │
│  Reward Signal   │◀────│   (Neural Net)    │     │  • TerminatorAgent │
└──────────────────┘     └───────────────────┘     │  • CustomAgents    │
         ▲                         │               └────────────────────┘
         │               ┌─────────▼─────────┐
         └───────────────│     REINFORCE     │
                         │      Trainer      │
                         └───────────────────┘
```
```bash
# Clone the repository
git clone https://github.com/your-org/dynamic-orchestrator-agent.git
cd dynamic-orchestrator-agent

# Install dependencies
pip install torch numpy dataclasses-json typing-extensions

# Or use Poetry
poetry install
```
```bash
python examples/run_mvp_training.py
```
This will start training the orchestrator to learn optimal agent selection patterns!
Expected output:

```
Starting DOA Framework MVP Training
Epochs: 50, Episodes per epoch: 10
State embedding dim: 64, Hidden dim: 128
Learning rate: 0.001, Max steps: 4
------------------------------------------------------------
Initialized 2 agents: ['EchoAgent', 'TerminatorAgent']
Reward config: λ=0.1, γ=0.99
Policy network: 17154 parameters
All components initialized successfully!
============================================================
Epoch 1/50 | Avg Reward: -0.234 | Success Rate: 20.0% | Loss: 0.45123
Epoch 2/50 | Avg Reward: -0.156 | Success Rate: 30.0% | Loss: 0.38901
...
Epoch 50/50 | Avg Reward: 0.823 | Success Rate: 90.0% | Loss: 0.12456
```
- Orchestrator: Central coordinator that uses a neural policy to select agents dynamically.
- Policy Network: Neural network that learns to map system states to agent selection probabilities.
- AgentInterface: Standardized interface for implementing custom agents.
- REINFORCE Trainer: Policy gradient trainer that optimizes the orchestrator's decision-making.
- RewardConfig: Configurable reward function balancing task success and computational efficiency.
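
A minimal sketch of how these components might be wired into a training loop. The class names `Orchestrator`, `PolicyNetwork`, and `REINFORCETrainer`, their constructor arguments, and the `run_episode`/`update` calls are illustrative assumptions rather than the framework's exact API; only `RewardConfig` and the two built-in agents appear elsewhere in this README.

```python
import random

# Hypothetical wiring, loosely mirroring the architecture diagram above.
# All class names, constructor signatures, and method names here are assumptions.
agents = [EchoAgent(), TerminatorAgent()]                        # agent pool
policy = PolicyNetwork(state_dim=64, hidden_dim=128,
                       num_agents=len(agents))                   # assumed policy class
orchestrator = Orchestrator(agents=agents, policy=policy, max_steps=4)
reward_cfg = RewardConfig(lambda_cost_penalty=0.1, task_success_bonus=1.0)
trainer = REINFORCETrainer(policy=policy, lr=1e-3, gamma=0.99)   # assumed trainer class

training_tasks = ["Summarize the input text", "Echo the user request"]  # toy tasks

for epoch in range(50):                     # epochs, as in the MVP training script
    trajectories = []
    for _ in range(10):                     # episodes per epoch
        task = random.choice(training_tasks)
        trajectories.append(orchestrator.run_episode(task, reward_cfg))
    loss = trainer.update(trajectories)     # one REINFORCE policy update per epoch
```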
- Learnable Orchestration: Neural policy learns optimal agent selection
- Cost-Performance Balance: Configurable λ parameter for cost vs. accuracy trade-offs
- Dynamic Topologies: Supports complex reasoning patterns, including cycles
- Continuous Improvement: REINFORCE-based learning from experience
- Modular Design: Easy to add new agents and tools
- Rich Observability: Comprehensive trajectory logging and metrics
- Multi-Agent AI Systems: Coordinate specialized AI agents for complex tasks
- Business Process Automation: Optimize workflows with multiple AI components
- Research & Development: Experiment with adaptive multi-agent architectures
- Educational: Learn about RL-based coordination and multi-agent systems
The framework tracks several key metrics:
- Task Success Rate: Percentage of successfully completed tasks
- Average Reward: Balances success and computational cost
- Agent Utilization: How frequently each agent is selected
- Convergence Speed: How quickly the policy learns optimal patterns
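
As a sketch of how such metrics could be aggregated from logged episodes (the `EpisodeResult` record below is a hypothetical illustration, not a framework type):

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeResult:
    """Hypothetical per-episode log record used only for this illustration."""
    total_reward: float
    succeeded: bool
    agents_used: list[str] = field(default_factory=list)

def summarize(results: list[EpisodeResult]) -> dict:
    """Compute success rate, average reward, and per-agent utilization counts."""
    if not results:
        return {"success_rate": 0.0, "avg_reward": 0.0, "agent_utilization": {}}
    utilization: dict[str, int] = {}
    for episode in results:
        for name in episode.agents_used:
            utilization[name] = utilization.get(name, 0) + 1
    n = len(results)
    return {
        "success_rate": sum(e.succeeded for e in results) / n,
        "avg_reward": sum(e.total_reward for e in results) / n,
        "agent_utilization": utilization,
    }
```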
```python
from doa_framework.agents.base import AgentInterface
from doa_framework.structs import SystemState, AgentOutput

class MyCustomAgent(AgentInterface):
    def __init__(self, name: str = "MyCustomAgent"):
        super().__init__(name)

    def execute(self, state: SystemState) -> AgentOutput:
        # Your agent logic here
        result = self.process_task(state.task_specification)
        return AgentOutput(
            content=result,
            cost=1.5,  # Computational cost
            metadata={"agent_type": "custom"}
        )
```
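
Once defined, the custom agent would be added to the orchestrator's agent pool. The registration below is an assumed usage pattern, since the exact orchestrator constructor is not documented here:

```python
# Hypothetical registration of the custom agent alongside the built-in ones.
agents = [EchoAgent(), TerminatorAgent(), MyCustomAgent()]
orchestrator = Orchestrator(agents=agents, policy=policy)  # assumed constructor
```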
```python
from doa_framework import RewardConfig

# Emphasize cost efficiency
cost_focused_config = RewardConfig(
    lambda_cost_penalty=0.5,   # Higher cost penalty
    task_success_bonus=1.0,
    task_failure_penalty=-2.0
)

# Emphasize task success
performance_focused_config = RewardConfig(
    lambda_cost_penalty=0.05,  # Lower cost penalty
    task_success_bonus=2.0,
    task_failure_penalty=-1.0
)
```
The system state includes:
- Task Specification: The current task description
- Execution History: Sequence of (agent_name, agent_output) pairs
- Step Information: Current step and maximum allowed steps
- Custom Data: Extensible metadata storage
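
A sketch of what this state might look like as a dataclass. Only `task_specification` is confirmed by the custom-agent example above; the remaining field names are assumptions based on the list of contents:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SystemStateSketch:
    """Illustrative shape of the system state; not the framework's actual definition."""
    task_specification: str                                    # current task description
    execution_history: list[tuple[str, Any]] = field(default_factory=list)  # (agent_name, agent_output)
    current_step: int = 0                                      # step information
    max_steps: int = 4
    custom_data: dict[str, Any] = field(default_factory=dict)  # extensible metadata
```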
Based on the paper's formulation:
- Terminal step: R_T = r - λ * C_total
- Intermediate steps: R_t = -λ * c_t

Where:
- r: Task success reward (+1) or failure penalty (-1)
- λ: Cost penalty weight (configurable)
- C_total: Total computational cost
- c_t: Step-wise cost
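
A small sketch of this formulation in plain Python; the function names are illustrative, and γ = 0.99 matches the training log above:

```python
def intermediate_reward(step_cost: float, lam: float = 0.1) -> float:
    """R_t = -λ * c_t for non-terminal steps."""
    return -lam * step_cost

def terminal_reward(success: bool, total_cost: float, lam: float = 0.1) -> float:
    """R_T = r - λ * C_total, with r = +1 on success and -1 on failure."""
    r = 1.0 if success else -1.0
    return r - lam * total_cost

def discounted_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    """Discounted returns G_t that a REINFORCE-style trainer would feed back to the policy."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```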
- Input: State embedding (task + history features)
- Architecture: MLP with ReLU activations
- Output: Probability distribution over available agents
- Training: REINFORCE with gradient clipping
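
A hedged PyTorch sketch of a policy with this shape and of a REINFORCE update with gradient clipping. The layer sizes reuse the dimensions from the training log, but the framework's actual network (and its 17,154 parameters) may be structured differently:

```python
import torch
import torch.nn as nn

class PolicyMLP(nn.Module):
    """State embedding -> hidden ReLU layer -> probability distribution over agents."""
    def __init__(self, state_dim: int = 64, hidden_dim: int = 128, num_agents: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_agents),
        )

    def forward(self, state_embedding: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state_embedding), dim=-1)

def reinforce_step(policy: PolicyMLP, optimizer: torch.optim.Optimizer,
                   log_probs: list[torch.Tensor], returns: list[float]) -> float:
    """One REINFORCE update: loss = -Σ log π(a_t|s_t) * G_t, with gradient clipping."""
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```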
We welcome contributions! Please see our Contributing Guidelines for details.
```bash
# Install development dependencies
poetry install --with dev

# Run tests
pytest tests/

# Format code
black doa_framework/ examples/ tests/

# Type checking
mypy doa_framework/
```
This project is licensed under the MIT License - see the LICENSE file for details.
This framework is inspired by the "Puppeteer" model from:
Dang et al. (2025). "Dynamic Multi-Agent Orchestration with Reinforcement Learning"
- Email: [email protected]
- Discord: Join our community

Built with ❤️ by the DOA Team