Building a Multi-Agent AI Voice Assistant Stack

Most AI assistants today are lonely creatures — single agents trying to do everything. But the real magic happens when you build a team of specialized agents, each with their own personality, memory, and voice. In this article, we’ll architect a complete multi-agent voice assistant stack that runs entirely on your own hardware.

Why Multi-Agent?

The single-agent approach hits a wall fast. One agent can’t simultaneously be a coding expert, a research assistant, and a creative writer without losing coherence. Worse, when that one agent hallucinates or gets stuck, your entire system fails.

Multi-agent systems distribute responsibility. Each agent specializes. They collaborate. They fail independently. According to research from Cemri et al., 36.9% of multi-agent failures stem from inter-agent misalignment — agents stepping on each other’s work or contradicting each other. The solution isn’t better models — it’s better architecture.

Architecture Overview

Our stack combines:

Ollama for local LLM inference with per-agent isolation
Qdrant for vector memory with collection-based separation
VibeVoice for TTS with voice cloning per agent
Discord as our multi-channel interface
OpenClaw for agent orchestration

                    ┌─────────────────┐
                    │  Discord Users  │
                    └────────┬────────┘
                             │
                             v
┌─────────────────────────────────────────────────────┐
│                 Discord Gateway                      │
│         (Routes messages by channel/agent)           │
└─────────────────────────────────────────────────────┘
                             │
        ┌────────────────────┼────────────────────┐
        v                    v                    v
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Agent One   │    │  Agent Two   │    │  Agent Three │
│ (Researcher) │    │  (Creative)  │    │   (Coder)    │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       v                   v                   v
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Ollama-1   │    │   Ollama-2   │    │   Ollama-3   │
│  (Isolated)  │    │  (Isolated)  │    │  (Isolated)  │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           v
                  ┌─────────────────┐
                  │     Qdrant      │
                  │ (Per-Collection │
                  │  Memory Store)  │
                  └─────────────────┘

Multi-Agent Architecture Diagram System architecture showing Discord gateway, agent containers, LLM proxy, and memory layers

Per-Agent Memory Isolation with Qdrant

Memory is where most multi-agent systems fall apart. If all agents share the same memory pool, they overwrite each other’s context. A research agent’s notes about quantum computing shouldn’t pollute a creative agent’s story ideas.

Qdrant provides two approaches to isolation:

Option 1: Separate Collections (Recommended)

Each agent gets its own dedicated collection:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("http://192.168.1.100:6333")

# Create isolated collections per agent
agents = ["agent-researcher", "agent-creative", "agent-coder"]

for agent_id in agents:
    client.create_collection(
        collection_name=f"memories_{agent_id}",
        vectors_config=VectorParams(
            size=1536,  # embedding dimensions
            distance=Distance.COSINE
        )
    )

Pros: True isolation, different schemas per agent, simple permissions Cons: ~30MB overhead per collection (acceptable for small teams)

Option 2: Payload-Based Multitenancy

Single collection with agent ID filtering:

# Store with agent metadata
client.upsert(
    collection_name="memories",
    points=[{
        "id": memory_id,
        "vector": embedding,
        "payload": {
            "agent_id": "agent-researcher",
            "user_id": "user-123",
            "text": "User prefers detailed explanations",
            "timestamp": "2026-03-28T10:30:00Z"
        }
    }]
)

# Retrieve with filter
results = client.search(
    collection_name="memories",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="agent_id",
                match=MatchValue(value="agent-researcher")
            )
        ]
    ),
    limit=5
)

Pros: Efficient, single schema, good for 100+ agents Cons: Less isolation, filter complexity

Memory Retrieval and Context Injection

When a user messages an agent, we retrieve relevant memories and inject them into the system prompt:

async def build_agent_context(agent_id: str, user_message: str) -> str:
    # Embed the user message
    embedding = await embed_text(user_message)
    
    # Retrieve from agent's private collection
    memories = qdrant_client.search(
        collection_name=f"memories_{agent_id}",
        query_vector=embedding,
        limit=5
    )
    
    # Format for context
    memory_context = "\n".join([
        f"- {m.payload['text']}" for m in memories
    ])
    
    # Build system prompt with injected memories
    return f"""You are {agent_id}, a specialized AI assistant.

Relevant context from previous conversations:
{memory_context}

Respond naturally, incorporating this context where appropriate."""

LLM Isolation with Ollama

Running multiple agents on one Ollama instance creates contention. Agent one’s large context window blocks agent two’s request. Worse, if one agent crashes the model, everyone goes down.

We have three isolation strategies:

Strategy A: Port-Per-Agent (Process Isolation)

# docker-compose.yml
services:
  ollama-researcher:
    image: ollama/ollama
    ports:
      - "11434:11434"  # Agent One
    volumes:
      - ollama-researcher:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  ollama-creative:
    image: ollama/ollama
    ports:
      - "11435:11434"  # Agent Two
    volumes:
      - ollama-creative:/root/.ollama
    # Shares same GPU with time-slicing

  ollama-coder:
    image: ollama/ollama
    ports:
      - "11436:11434"  # Agent Three
    volumes:
      - ollama-coder:/root/.ollama

Each agent connects to its own Ollama endpoint:

OLLAMA_ENDPOINTS = {
    "agent-researcher": "http://192.168.1.100:11434",
    "agent-creative": "http://192.168.1.100:11435",
    "agent-coder": "http://192.168.1.100:11436"
}

async def generate(agent_id: str, prompt: str):
    endpoint = OLLAMA_ENDPOINTS[agent_id]
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{endpoint}/api/generate",
            json={
                "model": "llama3.1:8b",
                "prompt": prompt,
                "stream": False
            }
        ) as resp:
            return await resp.json()

Strategy B: Kubernetes + vCluster (Enterprise)

For production scale, use vCluster to give each agent its own virtual Kubernetes cluster:

# vcluster.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vcluster-config
data:
  values.yaml: |
    isolation:
      enabled: true
      networkPolicy:
        enabled: true
    sync:
      pods:
        enabled: true

This provides hard multi-tenancy — each agent’s Ollama runs in its own virtual cluster with network isolation, but shares the underlying GPU nodes.

Multi-Channel Discord Architecture

Discord serves as our user-facing interface. Different channels route to different agents:

AGENT_ROUTES = {
    "#research": "agent-researcher",
    "#creative": "agent-creative",
    "#coding": "agent-coder",
    "#general": "agent-default"
}

@bot.event
async def on_message(message):
    # Skip bot messages
    if message.author.bot:
        return
    
    # Determine which agent handles this channel
    channel_name = message.channel.name
    agent_id = AGENT_ROUTES.get(f"#{channel_name}", "agent-default")
    
    # Build context with agent's private memories
    context = await build_agent_context(agent_id, message.content)
    
    # Generate response
    response = await generate(agent_id, context + "\n\nUser: " + message.content)
    
    # Convert to voice if in voice channel
    if message.author.voice:
        audio = await generate_tts(agent_id, response)
        await play_in_voice_channel(message.guild, audio)
    else:
        await message.channel.send(response)

Per-Agent Voice Cloning with VibeVoice

Each agent gets its own voice. Users hear the difference immediately — the researcher speaks with measured precision, the creative with expressive warmth.

Message Flow Diagram Flow: Discord message → Gateway → Agent → Memory retrieval → LLM → TTS → Response

Voice Assignment

VOICE_CONFIGS = {
    "agent-researcher": {
        "voice_id": "voice-analytical-1",
        "speed": 0.95,
        "pitch": -0.1,
        "personality": "precise and thorough"
    },
    "agent-creative": {
        "voice_id": "voice-expressive-1",
        "speed": 1.05,
        "pitch": 0.1,
        "personality": "warm and imaginative"
    },
    "agent-coder": {
        "voice_id": "voice-technical-1",
        "speed": 1.0,
        "pitch": 0.0,
        "personality": "concise and direct"
    }
}

Voice Generation

import requests

async def generate_tts(agent_id: str, text: str) -> bytes:
    config = VOICE_CONFIGS[agent_id]
    
    response = requests.post(
        "http://192.168.1.100:7860/generate",
        json={
            "text": text,
            "voice": config["voice_id"],
            "speed": config["speed"],
            "pitch": config["pitch"]
        }
    )
    
    return response.content

Training Custom Voices

Record 30-60 seconds of clean speech per voice persona
Upload to your TTS service (VibeVoice, ElevenLabs, etc.)
Note the returned voice_id
Map to agent configuration

The result: three agents, three distinct voices, one seamless experience.

Complete Docker Compose Stack

Here’s a production-ready configuration:

version: "3.8"

services:
  # Vector Database - All agents share, collections isolate
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant-storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
    restart: unless-stopped

  # LLM Backends - One per agent for isolation
  ollama-researcher:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-researcher:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  ollama-creative:
    image: ollama/ollama:latest
    ports:
      - "11435:11434"
    volumes:
      - ollama-creative:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  ollama-coder:
    image: ollama/ollama:latest
    ports:
      - "11436:11434"
    volumes:
      - ollama-coder:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  # TTS Service - Shared, voice_id isolates
  vibevoice:
    image: vibevoice/tts:latest
    ports:
      - "7860:7860"
    volumes:
      - vibevoice-voices:/app/voices
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  # Agent Orchestrator - Routes and manages agents
  orchestrator:
    build: ./orchestrator
    environment:
      - QDRANT_URL=http://qdrant:6333
      - RESEARCHER_OLLAMA=http://ollama-researcher:11434
      - CREATIVE_OLLAMA=http://ollama-creative:11435
      - CODER_OLLAMA=http://ollama-coder:11436
      - TTS_URL=http://vibevoice:7860
      - DISCORD_TOKEN=${DISCORD_TOKEN}
    depends_on:
      - qdrant
      - ollama-researcher
      - ollama-creative
      - ollama-coder
      - vibevoice
    restart: unless-stopped

volumes:
  qdrant-storage:
  ollama-researcher:
  ollama-creative:
  ollama-coder:
  vibevoice-voices:

Deployment Considerations

Hardware Requirements

Component	Minimum	Recommended
GPU VRAM	12GB	24GB+ (for concurrent agents)
RAM	16GB	32GB
Storage	100GB SSD	500GB NVMe
Network	1Gbps	10Gbps (for remote workers)

Security Best Practices

Network Isolation: Each Ollama instance should only accept connections from the orchestrator
API Key Rotation: Per-agent Discord tokens, rotate monthly
Memory Encryption: Enable Qdrant TLS for memory at rest
Audit Logging: Log all agent-to-agent communication for debugging

Performance Optimization

GPU Time-Slicing: NVIDIA MPS allows multiple Ollama instances to share one GPU
Embedding Cache: Cache common embeddings to reduce compute
Lazy Loading: Load models only when first requested
Connection Pooling: Reuse Qdrant connections across requests

Monitoring and Debugging

Key Metrics

# Track per-agent performance
METRICS = {
    "agent-researcher": {
        "requests_total": Counter(),
        "latency_seconds": Histogram(),
        "memory_hits": Counter(),
        "tokens_generated": Counter()
    },
    # ... per agent
}

Debugging Multi-Agent Systems

When agents conflict:

Check memory isolation: GET /collections in Qdrant
Verify Ollama endpoints respond independently
Review agent routing logs
Inspect voice ID mapping

Conclusion

Building a multi-agent voice stack isn’t about adding complexity — it’s about building resilience. When your research agent goes down, your creative agent keeps working. When your coder hallucinates, your researcher provides grounding.

The architecture we’ve built here gives you:

Isolation: Per-agent memory and LLM contexts
Voice: Distinct personalities through TTS cloning
Scale: Add agents without rearchitecting
Control: Full ownership of your data and models

Start with two agents. Add a third when you’re comfortable. The stack grows with you.

Want to discuss multi-agent architectures? Join the conversation on Discord or check out the source code.

Building a Multi-Agent AI Voice Assistant Stack: Per-Agent Memory, Multi-Channel Voice, and Isolated LLM Contexts

Building a Multi-Agent AI Voice Assistant Stack

Why Multi-Agent?

Architecture Overview

Per-Agent Memory Isolation with Qdrant

Option 1: Separate Collections (Recommended)

Option 2: Payload-Based Multitenancy

Memory Retrieval and Context Injection

LLM Isolation with Ollama

Strategy A: Port-Per-Agent (Process Isolation)

Strategy B: Kubernetes + vCluster (Enterprise)

Multi-Channel Discord Architecture

Per-Agent Voice Cloning with VibeVoice

Voice Assignment

Voice Generation

Training Custom Voices

Complete Docker Compose Stack

Deployment Considerations

Hardware Requirements

Security Best Practices

Performance Optimization

Monitoring and Debugging

Key Metrics

Debugging Multi-Agent Systems

Conclusion

Anthony Lattanzio

Comments

Building a Multi-Agent AI Voice Assistant Stack

Why Multi-Agent?

Architecture Overview

Per-Agent Memory Isolation with Qdrant

Option 1: Separate Collections (Recommended)

Option 2: Payload-Based Multitenancy

Memory Retrieval and Context Injection

LLM Isolation with Ollama

Strategy A: Port-Per-Agent (Process Isolation)

Strategy B: Kubernetes + vCluster (Enterprise)

Multi-Channel Discord Architecture

Per-Agent Voice Cloning with VibeVoice

Voice Assignment

Voice Generation

Training Custom Voices

Complete Docker Compose Stack

Deployment Considerations

Hardware Requirements

Security Best Practices

Performance Optimization

Monitoring and Debugging

Key Metrics

Debugging Multi-Agent Systems

Conclusion

Get Early Access

Anthony Lattanzio

If you liked this, check out...

Building a Budget Intel N100 Homelab: The Ultimate 2024 Guide

Comments