Building a Self-Hosted AI Memory Stack

Every conversation with ChatGPT starts blank. Ask about your project, and it forgets by the next chat. This isn’t a bug—it’s a fundamental limitation of stateless AI systems.

But what if your AI assistant actually remembered? What if it knew your preferences, recalled previous conversations, and built knowledge over time—all without sending your data to third-party servers?

That’s the promise of a self-hosted AI memory stack. In this guide, we’ll build one from scratch.

Why AI Memory Matters

Modern AI assistants suffer from amnesia. Every session is a fresh start because:

Context windows are limited: Even large models forget after a few hours of conversation
Stateless by design: APIs don’t persist anything between calls
Privacy vs. convenience tradeoff: Cloud memory services require trusting providers with your data

A self-hosted memory stack solves these problems by giving your AI:

Long-term memory: Store and retrieve information across sessions
Semantic understanding: Find relevant memories even with different wording
Privacy control: Your data never leaves your infrastructure
Customization: Tune memory retrieval for your specific use case

The Building Blocks

A complete AI memory stack consists of three core components:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Documents  │ ──> │   Embeddings  │ ──> │  Vector DB   │
│   (Memory)   │     │   (Meaning)   │     │  (Storage)   │
└──────────────┘     └──────────────┘     └──────────────┘
                                                  │
                                                  ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Response   │ &lt;── │     LLM      │ &lt;── │   Retrieval  │
│              │     │              │     │   (Search)   │
└──────────────┘     └──────────────┘     └──────────────┘

1. Embedding Models

Embeddings convert text into numerical vectors that capture semantic meaning. “I love pizza” and “Pizza is my favorite food” produce similar vectors because they mean similar things.

Popular Self-Hosted Options:

Model	Dimensions	Size	Best For
all-MiniLM-L6-v2	384	80MB	Real-time apps, speed-critical
BGE-large-en	1024	1.3GB	Quality semantic search
Nomic-embed-text	768	274MB	Long documents (8K tokens)

2. Vector Databases

Vector databases store and search embeddings efficiently. They power semantic search by finding the “nearest neighbors” to a query embedding.

Top Self-Hosted Options:

Qdrant: Rust-based, memory-efficient, excellent for homelabs
Milvus: Cloud-native, scales to billions of vectors
Weaviate: Developer-friendly, built-in vectorization

We’ll use Qdrant—its single-binary deployment and efficient resource usage make it ideal for self-hosting.

3. Memory Management Framework

This layer handles:

Extracting important information from conversations
Storing memories with appropriate metadata
Retrieving relevant context for new queries

Options include LangChain’s memory modules, LlamaIndex’s indices, or purpose-built tools like Mem0.

Choosing a Vector Database

Comparison of popular vector database architectures

Qdrant: The Homelab Champion

Why Qdrant wins for self-hosting:

# docker-compose.yml - That's literally it
version: '3.8'
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - ./data:/qdrant/storage

Single binary: No complex dependencies
Memory efficient: Rust’s memory safety and efficiency
Rich filtering: Combine semantic search with metadata filters
Built-in UI: Visualize your data at http://localhost:6333/dashboard

Milvus vs Qdrant vs Weaviate Comparison

Feature	Qdrant	Milvus	Weaviate
Best For	Self-hosted, efficiency	Enterprise scale	Developer experience
Complexity	Low	Medium-High	Low-Medium
Memory Use	Very efficient	Higher (distributed)	Moderate
Hybrid Search	Via filtering patterns	Built-in (v2.5+)	Native (best-in-class)
Scale	Millions efficiently	Billions	~50M efficiently

Recommendation: Start with Qdrant. It handles millions of vectors on modest hardware and requires minimal DevOps knowledge.

Setting Up Embeddings

Running Local Embedding Models

The easiest way to run embeddings locally is with Ollama:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull an embedding model
ollama pull nomic-embed-text

# Generate embeddings
ollama run nomic-embed-text "Your text here"

Python Integration

from langchain_community.embeddings import OllamaEmbeddings

# Initialize local embeddings
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)

# Generate embedding for text
text = "Building a self-hosted AI memory stack"
vector = embeddings.embed_query(text)

print(f"Embedding dimension: {len(vector)}")
# Output: Embedding dimension: 768

Building the Memory System

Architecture Overview

Memory system architecture: how components interact

Our memory system needs to handle:

Storing memories: Save conversation facts, user preferences, learned information
Semantic retrieval: Find relevant memories regardless of exact wording
Context assembly: Build a relevant context window for the AI
Memory decay: Forget irrelevant or outdated information

Implementation with LangChain + Qdrant

from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import OllamaEmbeddings
from qdrant_client import QdrantClient
from datetime import datetime

class MemorySystem:
    def __init__(self, qdrant_url="http://localhost:6333"):
        self.client = QdrantClient(url=qdrant_url)
        self.embeddings = OllamaEmbeddings(model="nomic-embed-text")
        self.vectorstore = Qdrant(
            client=self.client,
            collection_name="memories",
            embeddings=self.embeddings,
        )
    
    def remember(self, content: str, speaker: str = "system"):
        """Store a memory with metadata."""
        metadata = {
            "timestamp": datetime.now().isoformat(),
            "speaker": speaker,
            "importance": 0.5,  # Can be adjusted based on content
        }
        self.vectorstore.add_texts(
            texts=[content],
            metadatas=[metadata]
        )
    
    def recall(self, query: str, k: int = 5):
        """Retrieve relevant memories."""
        results = self.vectorstore.similarity_search(
            query=query,
            k=k
        )
        return results
    
    def build_context(self, query: str, max_tokens: int = 2000):
        """Build a context string from relevant memories."""
        memories = self.recall(query)
        context = "Relevant memories:\n\n"
        
        for mem in memories:
            context += f"- {mem.page_content}\n"
            if len(context.split()) > max_tokens:
                break
        
        return context

# Usage
memory = MemorySystem()
memory.remember("User prefers dark mode in all applications")
memory.remember("User's timezone is EST (America/New_York)")
memory.remember("User is working on a homelab with Proxmox")

# Later, when asked about preferences...
context = memory.build_context("What settings should I use?")
print(context)
# Output:
# Relevant memories:
# - User prefers dark mode in all applications
# - User's timezone is EST (America/New_York)
# - User is working on a homelab with Proxmox

Memory Types Implementation

Different memories serve different purposes:

from enum import Enum
from typing import Optional

class MemoryType(Enum):
    SEMANTIC = "semantic"      # Facts, concepts (persistent)
    EPISODIC = "episodic"      # Events, conversations (time-decay)
    WORKING = "working"        # Current context (temporary)

class TypedMemorySystem(MemorySystem):
    def remember(
        self, 
        content: str, 
        memory_type: MemoryType = MemoryType.SEMANTIC,
        speaker: str = "user",
        importance: float = 0.5
    ):
        """Store a typed memory."""
        metadata = {
            "timestamp": datetime.now().isoformat(),
            "speaker": speaker,
            "memory_type": memory_type.value,
            "importance": importance,
        }
        
        self.vectorstore.add_texts(
            texts=[content],
            metadatas=[metadata]
        )
    
    def recall_by_type(
        self, 
        query: str, 
        memory_type: MemoryType,
        k: int = 5
    ):
        """Retrieve memories of a specific type."""
        from qdrant_client.http.models import Filter, FieldCondition, MatchValue
        
        results = self.vectorstore.similarity_search(
            query=query,
            k=k,
            filter=Filter(
                must=[
                    FieldCondition(
                        key="memory_type",
                        match=MatchValue(value=memory_type.value)
                    )
                ]
            )
        )
        return results

# Usage
memory = TypedMemorySystem()

# Store semantic memory (facts)
memory.remember(
    "User's favorite programming language is Python",
    memory_type=MemoryType.SEMANTIC,
    importance=0.8
)

# Store episodic memory (events)
memory.remember(
    "User deployed their first Kubernetes cluster yesterday",
    memory_type=MemoryType.EPISODIC,
    importance=0.6
)

# Recall only facts
facts = memory.recall_by_type(
    "What do I know about this user?",
    memory_type=MemoryType.SEMANTIC
)

RAG Architecture for Memory

Retrieval-Augmented Generation (RAG) is the foundation of AI memory. Here’s how to build an effective RAG pipeline.

Chunking Strategies

The way you split documents directly impacts retrieval quality:

Rimplified RAG pipeline flowchart

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)

# Strategy 1: Recursive chunking (best for structured docs)
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,  # 10% overlap for context continuity
)

# Strategy 2: Markdown-aware chunking
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
)

# For conversation memory, use larger chunks to preserve context
conversation_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,  # Larger for conversation segments
    chunk_overlap=100,
)

Hybrid Search Implementation

Combine semantic + keyword search for better retrieval:

from qdrant_client.http.models import (
    SparseVector,
    SearchRequest,
    Query,
    FusionQuery,
)

class HybridSearchMemory(MemorySystem):
    def hybrid_search(
        self,
        query: str,
        k: int = 5,
        semantic_weight: float = 0.7
    ):
        """Combine vector similarity with keyword matching."""
        
        # Get embedding for query
        query_vector = self.embeddings.embed_query(query)
        
        # Hybrid search using Qdrant's fusion
        results = self.client.search(
            collection_name="memories",
            query_vector=query_vector,
            query=Query(
                fusion=FusionQuery(
                    queries=[
                        Query.NearestVector(vector=query_vector),
                        Query.Keyword(keyword=query),
                    ],
                    weights=[semantic_weight, 1 - semantic_weight]
                )
            ),
            limit=k
        )
        
        return results

Reranking for Better Results

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

# Add reranking to improve relevance
reranker = FlashrankRerank(top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=self.vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Rerank the top candidates
relevant_docs = compression_retriever.get_relevant_documents(query)

Production Deployment

Complete Docker Stack

# docker-compose.yml
version: '3.8'

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT__LOG_LEVEL=INFO
    restart: unless-stopped

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    restart: unless-stopped

volumes:
  qdrant_data:
  ollama_data:
  redis_data:

Hardware Requirements

Scale	CPU	RAM	Storage	Use Case
Development	4 cores	16GB	100GB SSD	Testing, prototypes
Production (Small)	8 cores	32GB	500GB NVMe	<1M vectors, personal use
Production (Medium)	16 cores	64GB	1TB NVMe	1-10M vectors, team use
Enterprise	32+ cores	128GB+	Multiple TB	10M+ vectors, organization

Monitoring Setup

# Prometheus scrape config
scrape_configs:
  - job_name: 'qdrant'
    static_configs:
      - targets: ['qdrant:6333']
        
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama:11434']

Key metrics to monitor:

Query latency (p50, p95, p99)
Index size and memory usage
Embedding generation throughput
Retrieval recall accuracy

Real-World Example: Personal AI Assistant

Let’s build a practical memory system for a personal assistant:

from datetime import datetime
from typing import List, Dict, Any

class PersonalAssistantMemory:
    """
    A practical AI memory system for a personal assistant.
    
    Handles:
    - User preferences (semantic memory)
    - Conversation history (episodic memory)
    - Tasks and reminders
    - Learned facts about the user
    """
    
    def __init__(self, qdrant_url: str = "http://localhost:6333"):
        from qdrant_client import QdrantClient
        from langchain_community.embeddings import OllamaEmbeddings
        from langchain_community.vectorstores import Qdrant
        
        self.client = QdrantClient(url=qdrant_url)
        self.embeddings = OllamaEmbeddings(model="nomic-embed-text")
        
        # Create collections for different memory types
        self._init_collections()
        
        self.preferences = Qdrant(
            client=self.client,
            collection_name="preferences",
            embeddings=self.embeddings,
        )
        
        self.conversations = Qdrant(
            client=self.client,
            collection_name="conversations",
            embeddings=self.embeddings,
        )
        
        self.knowledge = Qdrant(
            client=self.client,
            collection_name="knowledge",
            embeddings=self.embeddings,
        )
    
    def _init_collections(self):
        """Initialize Qdrant collections."""
        from qdrant_client.http.models import Distance, VectorParams
        
        collections = ["preferences", "conversations", "knowledge"]
        for name in collections:
            if not self.client.collection_exists(name):
                self.client.create_collection(
                    collection_name=name,
                    vectors_config=VectorParams(
                        size=768,  # nomic-embed-text dimension
                        distance=Distance.COSINE
                    )
                )
    
    def learn_preference(self, preference: str, category: str = "general"):
        """Store a user preference."""
        self.preferences.add_texts(
            texts=[preference],
            metadatas=[{
                "type": "preference",
                "category": category,
                "learned_at": datetime.now().isoformat()
            }]
        )
    
    def store_conversation(self, exchange: List[Dict[str, str]]):
        """Store a conversation exchange."""
        for turn in exchange:
            self.conversations.add_texts(
                texts=[f"{turn['speaker']}: {turn['content']}"],
                metadatas=[{
                    "speaker": turn['speaker'],
                    "timestamp": datetime.now().isoformat()
                }]
            )
    
    def add_knowledge(self, fact: str, source: str = "user"):
        """Store a learned fact."""
        self.knowledge.add_texts(
            texts=[fact],
            metadatas=[{
                "type": "fact",
                "source": source,
                "learned_at": datetime.now().isoformat()
            }]
        )
    
    def get_context_for_query(self, query: str) -> str:
        """Build a comprehensive context for answering a query."""
        # Retrieve from each collection
        prefs = self.preferences.similarity_search(query, k=3)
        convs = self.conversations.similarity_search(query, k=5)
        facts = self.knowledge.similarity_search(query, k=3)
        
        # Build context
        context = []
        
        if prefs:
            context.append("User preferences:")
            for p in prefs:
                context.append(f"  - {p.page_content}")
        
        if facts:
            context.append("\nKnown facts:")
            for f in facts:
                context.append(f"  - {f.page_content}")
        
        if convs:
            context.append("\nRelevant conversation history:")
            for c in convs[:2]:  # Limit to 2 most relevant
                context.append(f"  {c.page_content}")
        
        return "\n".join(context)
    
    def build_prompt_with_memory(
        self, 
        query: str, 
        system_prompt: str = ""
    ) -> str:
        """Build a complete prompt with memory context."""
        memory_context = self.get_context_for_query(query)
        
        return f"""{system_prompt}

{memory_context}

User query: {query}

Provide a response that takes into account the user's preferences, past conversations, and known facts about them.
"""
    
    def summarize_and_forget(self, days_old: int = 30):
        """Summarize old conversations and clean up."""
        # In production, implement a summarization pipeline
        # that consolidates old conversations into key facts
        pass


# Example usage
def main():
    memory = PersonalAssistantMemory()
    
    # Learn about the user
    memory.learn_preference("User prefers dark mode interfaces", "ui")
    memory.learn_preference("User's timezone is EST (America/New_York)", "settings")
    memory.learn_preference("User works as a software engineer", "profile")
    
    # Store conversation
    memory.store_conversation([
        {"speaker": "user", "content": "I'm setting up a new homelab server"},
        {"speaker": "assistant", "content": "What kind of workloads do you plan to run?"},
        {"speaker": "user", "content": "Mostly Docker containers and maybe some VMs for testing"},
    ])
    
    # Add learned knowledge
    memory.add_knowledge("User has a Proxmox homelab with 3 nodes", "conversation")
    memory.add_knowledge("User prefers Python for scripting", "observation")
    
    # Later, when answering a new question...
    query = "What's the best way to automate my server management?"
    prompt = memory.build_prompt_with_memory(
        query,
        system_prompt="You are a helpful assistant with knowledge of the user."
    )
    
    print(prompt)
    # Output includes user preferences, relevant conversation history, and known facts


if __name__ == "__main__":
    main()

Testing Your Memory Stack

Verification Checklist

Embedding Generation Works

curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "test"}'

Qdrant is Running
```
curl http://localhost:6333/collections
```

Memory Storage Test

memory.remember("Test memory storage")
results = memory.recall("Test")
assert len(results) > 0

End-to-End Retrieval
- Store multiple related memories
- Query with slightly different wording
- Verify semantic retrieval works

Common Issues and Fixes

Issue	Cause	Solution
Slow retrieval	Large index size	Add quantization, increase HNSW ef
Poor recall	Bad embeddings	Try larger model (BGE-large)
Memory leaks	Connection pooling	Implement connection manager
Out of memory	Index too large	Add disk-based index or sharding

Conclusion

Building a self-hosted AI memory stack transforms your AI assistant from a forgetful chatbot into a genuinely helpful companion that remembers your preferences, learns from conversations, and respects your privacy.

The key components—Qdrant for vector storage, Ollama for local embeddings, and LangChain for orchestration—are all production-ready and mature. Start simple with a basic implementation, then add advanced features like hybrid search, reranking, and memory decay as your needs grow.

Your AI should remember. Now it can.

Building a Self-Hosted AI Memory Stack

Building a Self-Hosted AI Memory Stack

Why AI Memory Matters

The Building Blocks

1. Embedding Models

2. Vector Databases

3. Memory Management Framework

Choosing a Vector Database

Qdrant: The Homelab Champion

Milvus vs Qdrant vs Weaviate Comparison

Setting Up Embeddings

Running Local Embedding Models

Python Integration

Building the Memory System

Architecture Overview

Implementation with LangChain + Qdrant

Memory Types Implementation

RAG Architecture for Memory

Chunking Strategies

Hybrid Search Implementation

Reranking for Better Results

Production Deployment

Complete Docker Stack

Hardware Requirements

Monitoring Setup

Real-World Example: Personal AI Assistant

Testing Your Memory Stack

Verification Checklist

Common Issues and Fixes

Conclusion

Resources

Anthony Lattanzio

Comments

Building a Self-Hosted AI Memory Stack

Why AI Memory Matters

The Building Blocks

1. Embedding Models

2. Vector Databases

3. Memory Management Framework

Choosing a Vector Database

Qdrant: The Homelab Champion

Milvus vs Qdrant vs Weaviate Comparison

Setting Up Embeddings

Running Local Embedding Models

Python Integration

Building the Memory System

Architecture Overview

Implementation with LangChain + Qdrant

Memory Types Implementation

RAG Architecture for Memory

Chunking Strategies

Hybrid Search Implementation

Reranking for Better Results

Production Deployment

Complete Docker Stack

Hardware Requirements

Monitoring Setup

Real-World Example: Personal AI Assistant

Testing Your Memory Stack

Verification Checklist

Common Issues and Fixes

Conclusion

Resources

Get Early Access

Anthony Lattanzio

Comments