
Enterprise LLM Chatbot: RAG Architecture and Implementation

Large Language Models have transformed how enterprises build conversational AI systems. However, deploying LLMs in production requires addressing challenges around accuracy, data privacy, and integration with existing knowledge bases. Retrieval-Augmented Generation (RAG) has emerged as the leading architecture pattern for addressing these challenges.

This guide covers the design and implementation of an enterprise-grade LLM chatbot using RAG, from architecture decisions to production deployment.

Why RAG for Enterprise Chatbots?

Standard LLMs have fundamental limitations for enterprise use:

  • Knowledge cutoff: Training data becomes stale
  • Hallucinations: Models confidently generate incorrect information
  • No private data: Cannot access proprietary enterprise knowledge
  • Context limits: Cannot process entire document repositories

RAG addresses these by retrieving relevant context before generation:

RAG Architecture Overview

Approach    | Knowledge Source    | Accuracy | Privacy
Pure LLM    | Training data only  | Variable | Data exposed in training
Fine-tuning | Custom training     | Good     | Requires data sharing
RAG         | Retrieved documents | High     | Data stays on-premise

Architecture Overview

The enterprise chatbot architecture consists of several layers:

Technical Architecture

Component Breakdown

Frontend Layer

  • Client SPA (React/Angular)
  • Admin interface for knowledge management
  • Chat interface with conversation history

Backend Services

  • Chat Service: Orchestrates RAG pipeline (Java/Spring Boot)
  • Translation Service: Multi-language support (Python/Flask)
  • Configuration Service: Dynamic LLM and chat settings
  • File Service: Document ingestion and processing

AI/ML Layer

  • Embedding Service: Vector generation (Hugging Face)
  • LangChain: RAG orchestration and prompt management
  • LLM Provider: Azure OpenAI / On-premise models

Data Layer

  • Vector Database: Semantic search (Milvus/Qdrant/Weaviate)
  • PostgreSQL: Conversations, configurations, user data
  • Object Storage: Original documents

Infrastructure

  • Kubernetes for orchestration
  • Keycloak for authentication
  • API Gateway for routing
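
Tying these layers together is a small chat API between the UI and the Chat Service. As a rough illustration only (not the project's actual schema), the request and response payloads implied by the Chat Service implementation later in this post might look like this:

# Rough sketch of the chat API payloads; field names mirror the Chat Service
# example further below, everything else here is illustrative.
from typing import List, Optional
from pydantic import BaseModel, Field


class ChatRequest(BaseModel):
    message: str                                           # the user's question
    conversation_id: Optional[str] = None                  # omitted on the first turn
    collections: List[str] = Field(default_factory=list)   # knowledge collections to search


class ChatResponse(BaseModel):
    message: str           # generated answer
    sources: List[str]     # document citations backing the answer
    conversation_id: str   # returned so the client can continue the thread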

RAG Pipeline Deep Dive

Document Ingestion

The ingestion pipeline prepares documents for semantic search:

Ingestion Pipeline

from langchain.document_loaders import PyPDFLoader, UnstructuredWordDocumentLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings

# 1. Load documents
def load_document(file_path: str):
    if file_path.endswith('.pdf'):
        loader = PyPDFLoader(file_path)
    elif file_path.endswith('.docx'):
        loader = UnstructuredWordDocumentLoader(file_path)
    else:
        raise ValueError(f"Unsupported file type: {file_path}")
    return loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " "]
)

# 3. Generate embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# 4. Store in vector database
def ingest_document(file_path: str, collection_name: str):
    documents = load_document(file_path)
    chunks = text_splitter.split_documents(documents)

    vectors = embeddings.embed_documents([c.page_content for c in chunks])

    # Store vectors with metadata
    vector_store.add(
        collection=collection_name,
        vectors=vectors,
        documents=[c.page_content for c in chunks],
        metadata=[{"source": file_path, "page": c.metadata.get("page")}
                  for c in chunks]
    )

Chunking Strategies

Chunk size significantly impacts retrieval quality:

Chunk Size        | Pros              | Cons                         | Best For
Small (256)       | Precise retrieval | Loses context                | FAQ, definitions
Medium (512-1024) | Balanced          | General purpose              | Most use cases
Large (2048+)     | Full context      | Retrieves irrelevant content | Long-form documents

Overlap ensures context isn’t lost at boundaries:

# Without overlap: sentences cut mid-thought
# With 200 char overlap: context preserved

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200  # 20% overlap recommended
)

Query Processing

When a user asks a question:

from langchain.llms import AzureOpenAI

llm = AzureOpenAI(deployment_name="gpt-4", temperature=0.1)

# 1. Embed the query
query = "What is the return policy for electronics?"
query_embedding = embeddings.embed_query(query)

# 2. Semantic search
relevant_chunks = vector_store.similarity_search(
    query_embedding,
    collection="product_policies",
    top_k=5,
    threshold=0.7
)

# 3. Build context
context = "\n\n".join([chunk.page_content for chunk in relevant_chunks])

# 4. Generate response with LLM
prompt = f"""Based on the following context, answer the question.
If the answer cannot be found in the context, say "I don't have information about that."

Context:
{context}

Question: {query}

Answer:"""

response = llm(prompt)  # LangChain LLMs are callable with a single prompt string

Retrieval Strategies

Different retrieval approaches for different needs:

Semantic Search (Default)

# Cosine similarity between query and document embeddings
results = vector_store.similarity_search(query_embedding, top_k=5)

Hybrid Search (Semantic + Keyword)

# Combine vector similarity with BM25 keyword matching
semantic_results = vector_store.similarity_search(query_embedding)
keyword_results = bm25_search(query_text)
results = reciprocal_rank_fusion(semantic_results, keyword_results)
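
The reciprocal_rank_fusion step merges the two ranked lists into one. A minimal sketch of how it could be implemented, assuming each result object exposes a stable id attribute:

# Reciprocal rank fusion: score each document by summing 1 / (k + rank) over
# every ranked list it appears in; k dampens the weight of the top ranks.
from collections import defaultdict

def reciprocal_rank_fusion(*ranked_lists, k: int = 60):
    scores = defaultdict(float)
    docs = {}
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] += 1.0 / (k + rank)
            docs[doc.id] = doc
    # Highest fused score first
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]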

Multi-Query Retrieval

# Generate multiple query variations for better coverage
queries = llm.generate(f"Generate 3 variations of: {query}")
all_results = [vector_store.search(q) for q in queries]
results = deduplicate_and_rank(all_results)

LangChain Integration

LangChain provides the orchestration layer:

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts import PromptTemplate

# Custom prompt template
prompt_template = PromptTemplate(
    input_variables=["context", "question", "chat_history"],
    template="""You are a helpful assistant for enterprise questions.
Use the following context to answer the question.
If you don't know the answer, say so - don't make up information.

Previous conversation:
{chat_history}

Context from knowledge base:
{context}

Question: {question}

Helpful answer:"""
)

# Conversation memory (last 5 exchanges)
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    return_messages=True,
    k=5
)

# RAG chain
chain = ConversationalRetrievalChain.from_llm(
    llm=AzureOpenAI(deployment_name="gpt-4"),
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    memory=memory,
    combine_docs_chain_kwargs={"prompt": prompt_template},
    return_source_documents=True
)

# Query
result = chain({"question": "What are the working hours?"})
print(result["answer"])
print(result["source_documents"])  # Citations

Embedding Models

Choosing the right embedding model is critical:

Model             | Dimensions | Speed  | Quality   | License
all-MiniLM-L6-v2  | 384        | Fast   | Good      | Apache 2.0
all-mpnet-base-v2 | 768        | Medium | Better    | Apache 2.0
e5-large-v2       | 1024       | Slow   | Best      | MIT
OpenAI ada-002    | 1536       | API    | Excellent | Commercial

For on-premise deployment, Hugging Face models provide excellent quality without data leaving your infrastructure:

from sentence_transformers import SentenceTransformer

# Load model locally
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Generate embeddings
embeddings = model.encode(documents, show_progress_bar=True)

Scaling Embeddings

For high-throughput scenarios, deploy embedding models as a service:

# Kubernetes deployment for embedding service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: embeddings
          image: huggingface/text-embeddings-inference:latest
          args:
            - --model-id=sentence-transformers/all-mpnet-base-v2
            - --max-batch-requests=32
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8080
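
Application code then calls the service over HTTP instead of loading the model in-process. A rough sketch of such a call, assuming the deployment above is reachable in-cluster as embedding-service on port 8080 and exposes text-embeddings-inference's /embed endpoint:

# Call the shared embedding service over HTTP; the service name and port are
# assumptions matching the deployment sketch above.
from typing import List

import requests

EMBEDDING_URL = "http://embedding-service:8080/embed"

def embed_remote(texts: List[str]) -> List[List[float]]:
    response = requests.post(EMBEDDING_URL, json={"inputs": texts}, timeout=30)
    response.raise_for_status()
    return response.json()  # one embedding vector per input text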

Vector Database Selection

Common options for production:

Database | Strengths              | Considerations
Milvus   | Scalable, feature-rich | Complex setup
Qdrant   | Fast, easy API         | Newer ecosystem
Weaviate | GraphQL, modules       | Resource intensive
Chroma   | Simple, embedded       | Limited scale
pgvector | PostgreSQL native      | Basic features

Collection Design

Organize vectors by domain for better retrieval:

# Separate collections by document type
collections = {
    "hr_policies": {
        "description": "HR policies and procedures",
        "chunk_size": 512,
        "embedding_model": "all-mpnet-base-v2"
    },
    "product_docs": {
        "description": "Product documentation",
        "chunk_size": 1024,
        "embedding_model": "all-mpnet-base-v2"
    },
    "faq": {
        "description": "Frequently asked questions",
        "chunk_size": 256,
        "embedding_model": "all-MiniLM-L6-v2"  # Faster for short content
    }
}
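
This configuration can then drive collection creation in whichever vector database is chosen. A sketch using the Qdrant client as one example (the URL is a placeholder; vector sizes follow the embedding-model table earlier):

# Illustrative only: create one Qdrant collection per domain from the
# configuration above. all-mpnet-base-v2 produces 768-dim vectors,
# all-MiniLM-L6-v2 produces 384-dim vectors.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

EMBEDDING_DIMS = {
    "all-mpnet-base-v2": 768,
    "all-MiniLM-L6-v2": 384,
}

client = QdrantClient(url="http://qdrant:6333")

for name, config in collections.items():
    client.create_collection(
        collection_name=name,
        vectors_config=VectorParams(
            size=EMBEDDING_DIMS[config["embedding_model"]],
            distance=Distance.COSINE,
        ),
    )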

LLM Provider Options

Cloud Providers

Azure OpenAI

from langchain.llms import AzureOpenAI

llm = AzureOpenAI(
    deployment_name="gpt-4",
    api_version="2024-02-15-preview",
    temperature=0.1,
    max_tokens=1000
)

AWS Bedrock

from langchain.llms import Bedrock

llm = Bedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    model_kwargs={"temperature": 0.1}
)

On-Premise Options

For data privacy requirements, run models locally:

Ollama

from langchain.llms import Ollama

llm = Ollama(
    model="llama3:70b",
    base_url="http://ollama-service:11434"
)

vLLM (High Performance)

# Kubernetes deployment for vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=meta-llama/Llama-3-70b-chat-hf
            - --tensor-parallel-size=4
          resources:
            limits:
              nvidia.com/gpu: 4

Chat Service Implementation

The Chat Service orchestrates the entire flow:

@Service
public class ChatService {

    private final VectorStoreClient vectorStore;
    private final LlmClient llmClient;
    private final ConversationRepository conversationRepo;

    public ChatResponse processMessage(ChatRequest request) {
        // 1. Load conversation history
        Conversation conversation = conversationRepo
            .findById(request.getConversationId())
            .orElseGet(Conversation::new);

        // 2. Retrieve relevant context
        List<Document> relevantDocs = vectorStore.search(
            request.getMessage(),
            request.getCollections(),
            5  // top_k
        );

        // 3. Build prompt with context
        String prompt = buildPrompt(
            request.getMessage(),
            relevantDocs,
            conversation.getHistory()
        );

        // 4. Generate response
        String response = llmClient.generate(prompt);

        // 5. Save to conversation history
        conversation.addMessage(request.getMessage(), response);
        conversationRepo.save(conversation);

        return ChatResponse.builder()
            .message(response)
            .sources(relevantDocs.stream()
                .map(Document::getSource)
                .collect(toList()))
            .conversationId(conversation.getId())
            .build();
    }
}

Security Considerations

Authentication Flow

Integrate with enterprise identity:

User → Chat UI → Keycloak → JWT → Chat Service → LLM
                    ↓
              Validate token
              Check permissions
              Log access
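
The Spring services validate the token with their own security filters; the Python services (the Flask Translation Service, for instance) can verify the same Keycloak-issued JWT against the realm's JWKS endpoint. A minimal sketch using PyJWT, with placeholder realm and audience values:

# Minimal JWT validation sketch for the Python services; the realm URL and
# audience are placeholders, not actual deployment values.
import jwt

JWKS_URL = "https://keycloak.example.com/realms/enterprise/protocol/openid-connect/certs"
jwks_client = jwt.PyJWKClient(JWKS_URL)

def validate_token(token: str) -> dict:
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience="chat-service",  # placeholder client audience
    )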

Data Protection

  1. Input Sanitization: Prevent prompt injection

def sanitize_input(user_input: str) -> str:
    # Remove potential injection patterns
    dangerous_patterns = [
        "ignore previous instructions",
        "disregard above",
        "system prompt"
    ]
    for pattern in dangerous_patterns:
        user_input = user_input.replace(pattern, "[FILTERED]")
    return user_input

  2. Output Filtering: Prevent data leakage

import re

def filter_response(response: str, user_role: str) -> str:
    if user_role != "admin":
        # Redact sensitive patterns
        response = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[REDACTED]', response)
    return response

  3. Audit Logging

@Aspect
@Component
public class ChatAuditAspect {

    @Around("@annotation(Audited)")
    public Object auditChat(ProceedingJoinPoint joinPoint) throws Throwable {
        ChatRequest request = (ChatRequest) joinPoint.getArgs()[0];

        auditLog.info("Chat request: user={}, message_hash={}, collections={}",
            SecurityContextHolder.getContext().getAuthentication().getName(),
            hashMessage(request.getMessage()),
            request.getCollections()
        );

        return joinPoint.proceed();
    }
}

Kubernetes Deployment

High Availability Architecture

Kubernetes Architecture

Deploy across multiple availability zones:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: chat-service
              topologyKey: topology.kubernetes.io/zone
      containers:
        - name: chat-service
          image: chat-service:1.0.0
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080

GPU Scheduling for LLM

apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 4
      volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
  nodeSelector:
    accelerator: nvidia-a100
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule

Performance Optimization

Caching Strategies

from functools import lru_cache
from typing import Optional

import redis

redis_client = redis.Redis(host='redis', port=6379, decode_responses=True)

# Cache embeddings for repeated queries
@lru_cache(maxsize=10000)
def get_embedding_cached(text: str) -> list:
    return embedding_model.encode(text).tolist()

# Cache LLM responses for identical queries + context
def get_cached_response(query_hash: str) -> Optional[str]:
    return redis_client.get(f"llm:response:{query_hash}")

def cache_response(query_hash: str, response: str, ttl: int = 3600):
    redis_client.setex(f"llm:response:{query_hash}", ttl, response)
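
One possible way to wire these helpers together is to key the cache on a hash of the query plus the retrieved context, so a cached answer is only reused when both match:

# Illustrative wiring of the cache helpers above; llm and build_prompt are the
# same abstract helpers used elsewhere in this post.
import hashlib

def answer_with_cache(query: str, context: str) -> str:
    query_hash = hashlib.sha256(f"{query}|{context}".encode()).hexdigest()

    cached = get_cached_response(query_hash)
    if cached is not None:
        return cached

    response = llm.generate(build_prompt(query, context))
    cache_response(query_hash, response)
    return response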

Batching Requests

# Batch multiple embedding requests
from typing import List

async def batch_embed(texts: List[str], batch_size: int = 32):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = await embedding_service.embed_batch(batch)
        embeddings.extend(batch_embeddings)
    return embeddings

Monitoring and Observability

Key Metrics

Metric                 | Description        | Alert Threshold
Response latency (p95) | End-to-end time    | > 5s
Retrieval quality      | Relevance score    | < 0.6
LLM token usage        | Tokens per request | Budget based
Cache hit rate         | Embedding cache    | < 50%
Error rate             | Failed requests    | > 1%
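
One way to expose these metrics from the Python services is the prometheus_client library; the metric names below are illustrative, not an established convention:

# Sketch of exposing the key metrics with prometheus_client.
from prometheus_client import Counter, Histogram, start_http_server

CHAT_LATENCY = Histogram(
    "chat_response_seconds", "End-to-end chat response latency",
    buckets=(0.5, 1, 2, 5, 10),
)
CHAT_ERRORS = Counter("chat_errors_total", "Failed chat requests")
# Incremented from the caching helpers shown earlier
CACHE_HITS = Counter("embedding_cache_hits_total", "Embedding cache hits")
CACHE_MISSES = Counter("embedding_cache_misses_total", "Embedding cache misses")

start_http_server(9090)  # /metrics endpoint for Prometheus to scrape

def handle_chat(message: str) -> str:
    with CHAT_LATENCY.time():
        try:
            return process_chat(message)  # see the tracing example below
        except Exception:
            CHAT_ERRORS.inc()
            raise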

Tracing with OpenTelemetry

from opentelemetry import trace
from opentelemetry.instrumentation.langchain import LangchainInstrumentor

# Instrument LangChain
LangchainInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_chat")
def process_chat(message: str):
    with tracer.start_as_current_span("retrieve_context"):
        context = retrieve_documents(message)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(build_prompt(message, context))

    return response

Evaluation and Testing

RAG Quality Metrics

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

# Evaluate the RAG pipeline on a small labelled sample
evaluation_data = Dataset.from_dict({
    "question": ["What is the return policy?"],
    "answer": ["Items can be returned within 30 days..."],
    "contexts": [["Policy document excerpt..."]],
    "ground_truth": ["30-day return policy for all items"]
})

results = evaluate(
    evaluation_data,
    metrics=[faithfulness, answer_relevancy, context_recall]
)
print(results)
# faithfulness: 0.92
# answer_relevancy: 0.88
# context_recall: 0.95

Load Testing

# k6 load test script
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 },
    { duration: '5m', target: 100 },
    { duration: '2m', target: 0 },
  ],
};

export default function () {
  const payload = JSON.stringify({
    message: 'What are the working hours?',
    conversationId: 'test-conversation',
  });

  const response = http.post('http://chat-service/api/chat', payload, {
    headers: { 'Content-Type': 'application/json' },
  });

  check(response, {
    'status is 200': (r) => r.status === 200,
    'response time < 5s': (r) => r.timings.duration < 5000,
  });

  sleep(1);
}

Conclusion

Building an enterprise LLM chatbot with RAG requires careful attention to:

  1. Architecture: Separate concerns—ingestion, retrieval, generation, and serving
  2. Data quality: Chunking strategy and embedding model selection significantly impact accuracy
  3. Security: Implement authentication, input sanitization, and audit logging
  4. Performance: Cache aggressively, batch requests, and optimize retrieval
  5. Observability: Monitor latency, quality metrics, and costs

The RAG pattern enables enterprises to leverage LLM capabilities while maintaining control over their data and ensuring accurate, grounded responses.



Achraf SOLTANI — June 15, 2024