Building a Multi-Agent AI Voice Assistant Stack: Per-Agent Memory, Multi-Channel Voice, and Isolated LLM Contexts

Learn how to build a production-ready multi-agent AI voice system with isolated memory, per-agent voice cloning, and multi-channel Discord integration using Ollama, Qdrant, and OpenClaw.

• 7 min read
ai · multi-agent · voice · ollama · qdrant · discord · tts · homelab

Building a Multi-Agent AI Voice Assistant Stack

Most AI assistants today are lonely creatures — single agents trying to do everything. But the real magic happens when you build a team of specialized agents, each with their own personality, memory, and voice. In this article, we’ll architect a complete multi-agent voice assistant stack that runs entirely on your own hardware.

Why Multi-Agent?

The single-agent approach hits a wall fast. One agent can’t simultaneously be a coding expert, a research assistant, and a creative writer without losing coherence. Worse, when that one agent hallucinates or gets stuck, your entire system fails.

Multi-agent systems distribute responsibility. Each agent specializes. They collaborate. They fail independently. According to research from Cemri et al., 36.9% of multi-agent failures stem from inter-agent misalignment — agents stepping on each other’s work or contradicting each other. The solution isn’t better models — it’s better architecture.

Architecture Overview

Our stack combines:

  • Ollama for local LLM inference with per-agent isolation
  • Qdrant for vector memory with collection-based separation
  • VibeVoice for TTS with voice cloning per agent
  • Discord as our multi-channel interface
  • OpenClaw for agent orchestration
                    ┌─────────────────┐
                    │  Discord Users  │
                    └────────┬────────┘

                             v
┌─────────────────────────────────────────────────────┐
│                 Discord Gateway                      │
│         (Routes messages by channel/agent)           │
└─────────────────────────────────────────────────────┘

        ┌────────────────────┼────────────────────┐
        v                    v                    v
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Agent One   │    │  Agent Two   │    │  Agent Three │
│ (Researcher) │    │  (Creative)  │    │   (Coder)    │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       v                   v                   v
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Ollama-1   │    │   Ollama-2   │    │   Ollama-3   │
│  (Isolated)  │    │  (Isolated)  │    │  (Isolated)  │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           v
                  ┌─────────────────┐
                  │     Qdrant      │
                  │ (Per-Collection │
                  │  Memory Store)  │
                  └─────────────────┘

System architecture: the Discord gateway routes messages to agent containers, each backed by an isolated LLM backend, with a shared memory layer partitioned per agent.

Per-Agent Memory Isolation with Qdrant

Memory is where most multi-agent systems fall apart. If all agents share the same memory pool, they overwrite each other’s context. A research agent’s notes about quantum computing shouldn’t pollute a creative agent’s story ideas.

Qdrant provides two approaches to isolation:

Option 1: Collection-Per-Agent

Each agent gets its own dedicated collection:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("http://192.168.1.100:6333")

# Create isolated collections per agent
agents = ["agent-researcher", "agent-creative", "agent-coder"]

for agent_id in agents:
    client.create_collection(
        collection_name=f"memories_{agent_id}",
        vectors_config=VectorParams(
            size=1536,  # embedding dimensions
            distance=Distance.COSINE
        )
    )

Pros: True isolation, different schemas per agent, simple permissions
Cons: ~30MB overhead per collection (acceptable for small teams)

Option 2: Payload-Based Multitenancy

Single collection with agent ID filtering:

from qdrant_client.models import (
    FieldCondition, Filter, MatchValue, PointStruct
)

# Store with agent metadata
client.upsert(
    collection_name="memories",
    points=[PointStruct(
        id=memory_id,
        vector=embedding,
        payload={
            "agent_id": "agent-researcher",
            "user_id": "user-123",
            "text": "User prefers detailed explanations",
            "timestamp": "2026-03-28T10:30:00Z"
        }
    )]
)

# Retrieve with filter
results = client.search(
    collection_name="memories",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="agent_id",
                match=MatchValue(value="agent-researcher")
            )
        ]
    ),
    limit=5
)

Pros: Efficient, single schema, good for 100+ agents
Cons: Less isolation, filter complexity
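If you talk to Qdrant over its REST API instead of the Python client, the same agent scoping can be expressed as plain JSON. A minimal sketch of a filter builder (the helper name `build_agent_filter` is ours, not part of Qdrant):

```python
from typing import Optional

def build_agent_filter(agent_id: str, user_id: Optional[str] = None) -> dict:
    """Build a Qdrant JSON filter scoping a search to one agent,
    and optionally one user. Matches the REST search request body."""
    must = [{"key": "agent_id", "match": {"value": agent_id}}]
    if user_id is not None:
        must.append({"key": "user_id", "match": {"value": user_id}})
    return {"must": must}
```

The returned dict goes under the `"filter"` key of a search request; centralizing it in one helper keeps every query path scoped the same way.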

Memory Retrieval and Context Injection

When a user messages an agent, we retrieve relevant memories and inject them into the system prompt:

async def build_agent_context(agent_id: str, user_message: str) -> str:
    # Embed the user message
    embedding = await embed_text(user_message)
    
    # Retrieve from agent's private collection
    memories = qdrant_client.search(
        collection_name=f"memories_{agent_id}",
        query_vector=embedding,
        limit=5
    )
    
    # Format for context
    memory_context = "\n".join([
        f"- {m.payload['text']}" for m in memories
    ])
    
    # Build system prompt with injected memories
    return f"""You are {agent_id}, a specialized AI assistant.

Relevant context from previous conversations:
{memory_context}

Respond naturally, incorporating this context where appropriate."""
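Retrieval only pays off if memories get written back after each exchange. A minimal write-path sketch over Qdrant's REST API, using the same payload schema as above (the Qdrant address is an assumption carried over from earlier snippets, and `store_memory` needs a live server to actually run):

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

QDRANT_URL = "http://192.168.1.100:6333"  # assumed, same host as above

def build_memory_point(agent_id: str, user_id: str, text: str,
                       embedding: list) -> dict:
    """Shape one memory as a Qdrant point, matching the payload
    schema used elsewhere in this article."""
    return {
        "id": str(uuid.uuid4()),
        "vector": embedding,
        "payload": {
            "agent_id": agent_id,
            "user_id": user_id,
            "text": text,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }

def store_memory(agent_id: str, point: dict) -> None:
    """Upsert the point into the agent's private collection
    via Qdrant's REST upsert endpoint."""
    req = urllib.request.Request(
        f"{QDRANT_URL}/collections/memories_{agent_id}/points",
        data=json.dumps({"points": [point]}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)
```

In practice you would call this after each turn, embedding a short summary of the exchange rather than the raw transcript.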

LLM Isolation with Ollama

Running multiple agents on one Ollama instance creates contention. Agent one’s large context window blocks agent two’s request. Worse, if one agent crashes the model, everyone goes down.

We have three isolation strategies:

Strategy A: Port-Per-Agent (Process Isolation)

# docker-compose.yml
services:
  ollama-researcher:
    image: ollama/ollama
    ports:
      - "11434:11434"  # Agent One
    volumes:
      - ollama-researcher:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  ollama-creative:
    image: ollama/ollama
    ports:
      - "11435:11434"  # Agent Two
    volumes:
      - ollama-creative:/root/.ollama
    # Shares same GPU with time-slicing

  ollama-coder:
    image: ollama/ollama
    ports:
      - "11436:11434"  # Agent Three
    volumes:
      - ollama-coder:/root/.ollama

Each agent connects to its own Ollama endpoint:

import aiohttp

OLLAMA_ENDPOINTS = {
    "agent-researcher": "http://192.168.1.100:11434",
    "agent-creative": "http://192.168.1.100:11435",
    "agent-coder": "http://192.168.1.100:11436"
}

async def generate(agent_id: str, prompt: str):
    endpoint = OLLAMA_ENDPOINTS[agent_id]
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{endpoint}/api/generate",
            json={
                "model": "llama3.1:8b",
                "prompt": prompt,
                "stream": False
            }
        ) as resp:
            return await resp.json()
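Because each agent has its own endpoint, a crashed backend only takes out one agent, but the orchestrator still needs to notice and stop hammering it. A small health-tracking sketch (pure bookkeeping; the actual probe against a live Ollama server is left as a comment):

```python
import time

class EndpointHealth:
    """Track per-agent endpoint health, with a cooldown before
    retrying an endpoint that recently failed."""

    def __init__(self, endpoints: dict, cooldown: float = 30.0):
        self.endpoints = endpoints
        self.cooldown = cooldown
        self.failed_at: dict = {}  # agent_id -> last failure timestamp

    def mark_failed(self, agent_id: str) -> None:
        """Record a failed request, starting the cooldown window."""
        self.failed_at[agent_id] = time.monotonic()

    def is_available(self, agent_id: str) -> bool:
        last = self.failed_at.get(agent_id)
        if last is None:
            return True
        return time.monotonic() - last >= self.cooldown

    def endpoint_for(self, agent_id: str):
        """Return the agent's endpoint, or None while it is cooling
        down (caller can queue the request or route to another agent).
        A real check would also probe GET {endpoint}/api/tags."""
        if self.is_available(agent_id):
            return self.endpoints.get(agent_id)
        return None
```

Wrap `generate()` so that a connection error calls `mark_failed()`, and the other two agents keep answering while the third recovers.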

Strategy B: Kubernetes + vCluster (Enterprise)

For production scale, use vCluster to give each agent its own virtual Kubernetes cluster:

# vcluster.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vcluster-config
data:
  values.yaml: |
    isolation:
      enabled: true
      networkPolicy:
        enabled: true
    sync:
      pods:
        enabled: true

This provides hard multi-tenancy — each agent’s Ollama runs in its own virtual cluster with network isolation, but shares the underlying GPU nodes.

Multi-Channel Discord Architecture

Discord serves as our user-facing interface. Different channels route to different agents:

AGENT_ROUTES = {
    "#research": "agent-researcher",
    "#creative": "agent-creative",
    "#coding": "agent-coder",
    "#general": "agent-default"
}

@bot.event
async def on_message(message):
    # Skip bot messages
    if message.author.bot:
        return
    
    # Determine which agent handles this channel
    channel_name = message.channel.name
    agent_id = AGENT_ROUTES.get(f"#{channel_name}", "agent-default")
    
    # Build context with agent's private memories
    context = await build_agent_context(agent_id, message.content)
    
    # Generate response
    response = await generate(agent_id, context + "\n\nUser: " + message.content)
    
    # Convert to voice if in voice channel
    if message.author.voice:
        audio = await generate_tts(agent_id, response)
        await play_in_voice_channel(message.guild, audio)
    else:
        await message.channel.send(response)
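One practical wrinkle: Discord rejects messages over 2,000 characters, and LLM answers routinely exceed that. A small splitter that breaks on paragraph boundaries where possible (the helper name is ours):

```python
def split_for_discord(text: str, limit: int = 2000) -> list:
    """Split a long response into Discord-sized chunks, preferring
    paragraph breaks, then line breaks, then a hard cut."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind("\n\n", 0, limit)   # last paragraph break
        if cut <= 0:
            cut = text.rfind("\n", 0, limit)  # last line break
        if cut <= 0:
            cut = limit                       # no break found: hard cut
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Then the final `else` branch becomes a loop: send each chunk with `await message.channel.send(chunk)`.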

Per-Agent Voice Cloning with VibeVoice

Each agent gets its own voice. Users hear the difference immediately — the researcher speaks with measured precision, the creative with expressive warmth.

Message flow: Discord message → Gateway → Agent → Memory retrieval → LLM → TTS → Response

Voice Assignment

VOICE_CONFIGS = {
    "agent-researcher": {
        "voice_id": "voice-analytical-1",
        "speed": 0.95,
        "pitch": -0.1,
        "personality": "precise and thorough"
    },
    "agent-creative": {
        "voice_id": "voice-expressive-1",
        "speed": 1.05,
        "pitch": 0.1,
        "personality": "warm and imaginative"
    },
    "agent-coder": {
        "voice_id": "voice-technical-1",
        "speed": 1.0,
        "pitch": 0.0,
        "personality": "concise and direct"
    }
}

Voice Generation

import aiohttp

async def generate_tts(agent_id: str, text: str) -> bytes:
    config = VOICE_CONFIGS[agent_id]
    
    # Use aiohttp rather than requests so the TTS call
    # doesn't block the bot's event loop
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "http://192.168.1.100:7860/generate",
            json={
                "text": text,
                "voice": config["voice_id"],
                "speed": config["speed"],
                "pitch": config["pitch"]
            }
        ) as resp:
            return await resp.read()
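Long responses also hurt voice latency: users wait for the whole clip before hearing anything. Splitting the text into sentences and synthesizing them as a pipeline gets audio playing sooner. A minimal regex-based sentence splitter (a real system might use a proper tokenizer):

```python
import re

def split_sentences(text: str) -> list:
    """Split text at sentence-ending punctuation so each piece can be
    sent to the TTS service while the previous one is already playing."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each sentence can then be posted to the TTS endpoint in turn and queued for playback, so time-to-first-audio is one sentence instead of one paragraph.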

Training Custom Voices

  1. Record 30-60 seconds of clean speech per voice persona
  2. Upload to your TTS service (VibeVoice, ElevenLabs, etc.)
  3. Note the returned voice_id
  4. Map to agent configuration

The result: three agents, three distinct voices, one seamless experience.

Complete Docker Compose Stack

Here’s a production-ready configuration:

version: "3.8"

services:
  # Vector Database - All agents share, collections isolate
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant-storage:/qdrant/storage
    environment:
      - QDRANT__SERVICE__HTTP_PORT=6333
    restart: unless-stopped

  # LLM Backends - One per agent for isolation
  ollama-researcher:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-researcher:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  ollama-creative:
    image: ollama/ollama:latest
    ports:
      - "11435:11434"
    volumes:
      - ollama-creative:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  ollama-coder:
    image: ollama/ollama:latest
    ports:
      - "11436:11434"
    volumes:
      - ollama-coder:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  # TTS Service - Shared, voice_id isolates
  vibevoice:
    image: vibevoice/tts:latest
    ports:
      - "7860:7860"
    volumes:
      - vibevoice-voices:/app/voices
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

  # Agent Orchestrator - Routes and manages agents
  orchestrator:
    build: ./orchestrator
    environment:
      - QDRANT_URL=http://qdrant:6333
      - RESEARCHER_OLLAMA=http://ollama-researcher:11434
      - CREATIVE_OLLAMA=http://ollama-creative:11435
      - CODER_OLLAMA=http://ollama-coder:11436
      - TTS_URL=http://vibevoice:7860
      - DISCORD_TOKEN=${DISCORD_TOKEN}
    depends_on:
      - qdrant
      - ollama-researcher
      - ollama-creative
      - ollama-coder
      - vibevoice
    restart: unless-stopped

volumes:
  qdrant-storage:
  ollama-researcher:
  ollama-creative:
  ollama-coder:
  vibevoice-voices:

Deployment Considerations

Hardware Requirements

| Component | Minimum   | Recommended                    |
|-----------|-----------|--------------------------------|
| GPU VRAM  | 12GB      | 24GB+ (for concurrent agents)  |
| RAM       | 16GB      | 32GB                           |
| Storage   | 100GB SSD | 500GB NVMe                     |
| Network   | 1Gbps     | 10Gbps (for remote workers)    |

Security Best Practices

  1. Network Isolation: Each Ollama instance should only accept connections from the orchestrator
  2. API Key Rotation: Per-agent Discord tokens, rotate monthly
  3. Memory Encryption: Enable TLS on Qdrant connections for data in transit, and encrypt the storage volume for memory at rest
  4. Audit Logging: Log all agent-to-agent communication for debugging

Performance Optimization

  • GPU Time-Slicing: NVIDIA MPS allows multiple Ollama instances to share one GPU
  • Embedding Cache: Cache common embeddings to reduce compute
  • Lazy Loading: Load models only when first requested
  • Connection Pooling: Reuse Qdrant connections across requests
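The embedding cache from the list above can be as simple as a dict keyed by a hash of the text. A sketch (the `embed_fn` argument stands in for whatever embedding backend you use):

```python
import hashlib

class EmbeddingCache:
    """Memoize embeddings by text hash so repeated phrases
    (greetings, common queries) skip the embedding model."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache: dict = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text: str) -> list:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        vec = self.embed_fn(text)
        self.cache[key] = vec
        return vec
```

The hit/miss counters double as a cheap metric: if the hit rate stays near zero, the cache isn't earning its memory.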

Monitoring and Debugging

Key Metrics

from prometheus_client import Counter, Histogram

# Track per-agent performance with labeled metrics
REQUESTS_TOTAL = Counter(
    "agent_requests_total", "Requests handled", ["agent_id"]
)
LATENCY_SECONDS = Histogram(
    "agent_latency_seconds", "Response latency", ["agent_id"]
)
MEMORY_HITS = Counter(
    "agent_memory_hits_total", "Memory retrievals served", ["agent_id"]
)
TOKENS_GENERATED = Counter(
    "agent_tokens_generated_total", "Tokens generated", ["agent_id"]
)

# Usage: REQUESTS_TOTAL.labels(agent_id="agent-researcher").inc()

Debugging Multi-Agent Systems

When agents conflict:

  1. Check memory isolation: GET /collections in Qdrant
  2. Verify Ollama endpoints respond independently
  3. Review agent routing logs
  4. Inspect voice ID mapping

Conclusion

Building a multi-agent voice stack isn’t about adding complexity — it’s about building resilience. When your research agent goes down, your creative agent keeps working. When your coder hallucinates, your researcher provides grounding.

The architecture we’ve built here gives you:

  • Isolation: Per-agent memory and LLM contexts
  • Voice: Distinct personalities through TTS cloning
  • Scale: Add agents without rearchitecting
  • Control: Full ownership of your data and models

Start with two agents. Add a third when you’re comfortable. The stack grows with you.


Want to discuss multi-agent architectures? Join the conversation on Discord or check out the source code.

Anthony Lattanzio

Tech Enthusiast & Builder

I'm a tech enthusiast who loves building things with hardware and software. By night, I run a homelab that's grown way beyond what any reasonable person needs. Check out about me for more.
