RAGtronic: Building a Production AI Platform in Rust, Multi-Model Orchestration, Zero-Trust Auth, and Making LLMs Speak Creole

20 min read
rustactix-webragory-kratosory-oathkeeperauthentikqdrantlitellmnemo-guardrailsopenai-compatiblemulti-tenantsplunkmauritian-creolevoice-to-rag

RAGtronic started as a question: what would an AI backend look like if you built it the way enterprise infrastructure demands, observable, multi-tenant, zero-trust, and designed to degrade gracefully when individual components fail?

Most AI backends are Python scripts wrapped around an API call. RAGtronic is something different: ~28,000 lines of Rust powering a multi-model orchestration layer, a content safety pipeline with automatic degradation, per-tenant RAG isolation, and an OpenAI-compatible gateway that lets any tool in the ecosystem talk to it natively. It runs as part of a broader AI infrastructure stack, one of many services across 70+ containers on a Proxmox virtualisation cluster.

An obvious question: why Docker Compose and not Kubernetes? The answer is that this is a deliberate architectural choice, not a limitation. I run Kubernetes elsewhere in the lab for other workloads (and have written about it), but the AI stack has a specific constraint that makes Compose the better fit right now. The entire stack runs on a dedicated Proxmox VM with 64GB of RAM, 16 vCPUs across dual Xeon Gold 6138 processors, and an NVIDIA Quadro M6000 24GB passed through via PCI-e. That single GPU handles local model inference, embedding generation, and GPU-accelerated document processing. With one GPU dedicated to one workload, the scheduling and device-sharing complexity that Kubernetes brings (GPU operator, time-slicing, MIG partitioning) adds overhead without adding value. Compose gives me faster iteration cycles, instant hot-reload during development, and a deployment model where every container is already defined with the right GPU reservations, volume mounts, and network policies. The stack is fully containerised with no host dependencies, so the path to Kubernetes is there whenever the workload demands it, a second GPU, multi-node scaling, or automated failover across hosts. Until then, Compose keeps the complexity proportional to the problem.

This post covers the architecture, the decisions behind it, and some of the more interesting engineering challenges, like making a model that was never trained on Creole actually hold a conversation in it.

Oh, and if you want to skip straight to testing: that AI assistant button on this site? That's RAGtronic. You're looking at the production system right now. Go ahead, ask it something.


The Migration: Python to Rust

RAGtronic didn't start as Rust. The first version was Python, FastAPI, the usual stack. It worked, but as the feature set grew, the problems compounded: memory usage climbing with each concurrent stream, GC pauses interrupting SSE responses mid-token, and a general unease about deploying something that consumed 500MB+ at idle for what was essentially a proxy with business logic.

The rewrite to Rust with Actix-web was motivated by three things:

  1. Streaming fidelity: When you're proxying Server-Sent Events from an LLM, every pause matters. Garbage collection pauses during streaming create visible stutters in the frontend. Rust's zero-cost abstractions and lack of GC eliminate this entirely.

  2. Resource predictability: The Rust backend idles at ~50MB. On shared infrastructure where every container competes for memory, this matters. The Python version consumed 10x that before handling a single request.

  3. Compile-time guarantees: Every API contract, every database query shape, every serialisation boundary is checked before deployment. When you're running a platform where a malformed response can cascade into a broken UI, the compiler is the most cost-effective QA engineer you'll ever hire.

The Python version still exists as a python-legacy branch. The NeMo Guardrails service, which couldn't be ported to Rust because it's tightly coupled to NVIDIA's Python runtime, became a sidecar instead. That constraint turned into one of the better architectural decisions in the project.


Architecture Overview

RAGtronic runs as a containerised stack orchestrated with Docker Compose. The core principle is separation of concerns: authentication lives outside the application, content safety runs as a sidecar, model routing goes through a dedicated proxy, and the Rust backend focuses purely on business logic.

RAGtronic Architecture Overview

Every component is independently deployable and replaceable. The backend doesn't know how authentication works, it receives pre-validated headers. It doesn't know which LLM provider is active, it talks to a unified proxy. It doesn't know the specifics of content safety rules, it calls a sidecar and respects the verdict. This separation has paid dividends: I've swapped LLM providers, updated guardrail rules, and rotated auth configurations without touching backend code.


Dual Access Profiles

RAGtronic exposes two distinct access profiles for the backend, a design pattern borrowed from how platforms like Splunk separate management ports from data ingestion ports:

PortProfilePurpose
8888External (Public)Customer-facing API, routed through Cloudflare Workers for edge security, rate limiting, and DDoS protection before hitting Traefik. Guardrails are enforced on every request.
8889Internal (Admin)Direct LAN access for internal integrations, Open WebUI, admin tooling, debugging. Serves an OpenAI-compatible endpoint that any tool in the ecosystem can connect to.
4459GatewayOry Oathkeeper proxy, all authenticated frontend traffic flows through here for session validation and header injection.

Both ports map to the same Rust process on port 8080 inside the container. The difference is what sits in front of each.

Why two profiles? Because the same RAG pipeline serves two fundamentally different use cases. The external profile powers a public-facing AI assistant on a website, rate-limited, guardrailed, cost-tracked. The internal profile reuses that same RAG context (the same vector collections, the same prompt configurations) but exposes it as an OpenAI-compatible endpoint for internal consumption. Point Open WebUI at it, connect Cursor or Continue.dev, wire up a LangChain pipeline, it speaks the same protocol. One RAG investment, two access patterns.


Multi-Model Orchestration: 84 Models, 10 Providers

RAGtronic doesn't call LLM providers directly. All model traffic flows through a LiteLLM proxy which provides a unified interface across providers.

The current deployment has 84 unique models available across 10 providers:

ProviderModelsExamples
Anthropic14Claude Opus 4-6, Sonnet 4.5, Haiku 4.5
OpenAI17GPT-5.4, GPT-5.2 Codex, GPT-5
Google6Gemini 3 Pro, Gemini 2.5 Pro, Flash
DeepSeek7DeepSeek V3.2, R1, V3.1
Qwen13Qwen3 Max, Qwen3 Coder Plus, Qwen3-235B
Kimi12Kimi K2, K2 Turbo, Moonshot v1
Cloudflare AI4Workers AI-hosted models
Ollama8Gemma 3, Llama 3.1, Phi-4
GLM2GLM-4.6, GLM-4.5
MiniMax2MiniMax M2

LiteLLM handles the complexity that would otherwise pollute the backend:

  • Unified API surface: One endpoint, any model. The backend sends a standard chat completion request; LiteLLM routes it to the appropriate provider with the correct auth headers, API format, and token counting.
  • Automatic failover: Models can be configured with fallback chains. If a primary model's API returns errors, LiteLLM automatically routes to the next in the chain. Some models have duplicate entries specifically for this, 100 total rows in the routing table for 84 unique models.
  • Cost tracking: Per-request token counting and cost calculation across all providers. The admin dashboard surfaces this in real time.
  • Key rotation and rate limit handling: Multiple API keys per provider with automatic retry and exponential backoff.

The admin UI lets operators configure providers, set default models per profile, and monitor spend in real time. Model selection happens at the profile level, so different AI profiles can target different models. A cost-conscious external-facing profile might use a smaller model while the internal research profile has access to the full catalogue.


Content Safety: Dual-Layer Guardrails

Content safety is where RAGtronic's engineering gets interesting. Rather than bolting on a single safety check, the platform implements a dual-layer guardrails architecture with automatic degradation.

Layer 1: NVIDIA NeMo Guardrails (Primary)

The primary defence is an NVIDIA NeMo Guardrails sidecar, a Python service running alongside the Rust backend. NeMo Guardrails provides:

  • Prompt injection detection: LLM-powered classification of whether an input is attempting to manipulate system behaviour
  • Jailbreak attempt blocking: Pattern recognition and semantic analysis of adversarial prompts
  • Output filtering: Post-generation validation to ensure responses don't contain sensitive data or harmful content
  • Configurable Colang rules: Domain-specific safety policies defined in NVIDIA's Colang language, with custom actions for filtering

The guardrails sidecar calls LiteLLM for its own classification tasks, creating a defence-in-depth pattern: a separate LLM evaluates whether the user's prompt is safe before the main LLM processes it. This means the classification model and the generation model are independent, so compromising one doesn't compromise the other.

Layer 2: Compiled Pattern Engine (Fallback)

Here's the engineering problem: the NeMo sidecar is a separate process. It can go down. Network calls can time out. If your only safety layer is an external service, a sidecar restart means your platform is temporarily unprotected.

The Rust backend includes a compiled-in pattern matching engine as an automatic fallback. When the NeMo sidecar is unreachable (5-second timeout), the backend seamlessly switches to this engine, with no downtime and no unprotected requests.

The fallback engine includes pattern categories for injection attempts, adversarial token sequences, and unsafe content generation. Every input and output passes through validation, validate_input_for_chat() and validate_output_for_chat(), which try the NeMo sidecar first and fall back to pattern matching with a warning logged.

The health endpoint reports the current state:

json
{
  "status": "protected",
  "source": "nemo_guardrails",
  "sidecar_available": true
}

Or in degraded mode:

json
{
  "status": "degraded",
  "source": "pattern_fallback",
  "sidecar_available": false,
  "message": "NeMo sidecar unreachable — pattern-based validation active"
}

Trusted Source Bypass

Not every request should hit guardrails. When the frontend sends a code block for explanation, a legitimate operation where the content might contain code patterns that look like injection attempts, it includes a trusted_source flag. This tells the guardrails layer that the content is user-initiated from a verified frontend session, not an external API call. The flag is only honoured for specific operations and is validated against the session context. This prevents false positives on legitimate code analysis without opening a bypass vector for external callers.

Why a Sidecar?

The guardrails sidecar wasn't a choice, it was a constraint that became an advantage. NeMo Guardrails is deeply embedded in the Python/NVIDIA ecosystem and can't be compiled to Rust. Running it as a sidecar means:

  1. Independent scaling: Guardrails can be horizontally scaled without touching the backend
  2. Hot-reload rules: Update Colang safety policies without redeploying the backend
  3. Failure isolation: A sidecar crash doesn't take down the backend (the fallback engine activates)
  4. Language flexibility: The right tool in the right runtime: Python for ML classification, Rust for request handling

Teaching LLMs Mauritian Creole

This is probably the most personally meaningful feature in RAGtronic.

Mauritian Creole (Kreol Morisien) is my native language. It's spoken by roughly 1.3 million people. Most large language models have never been trained on it. They'll either refuse to respond in Creole, confuse it with Haitian Creole, or produce grammatically incorrect output that sounds like a translation phrasebook.

RAGtronic solves this through RAG, specifically by ingesting a comprehensive linguistic corpus and making it available as retrieval context:

The Corpus

  • 33,774 dictionary entries from the Lalit Mauritian Creole dictionary, the most comprehensive Kreol Morisien lexicon available
  • 21 chunked index files organised alphabetically for efficient vector retrieval
  • A critical differentiation document that explicitly maps the grammatical differences between Haitian Creole and Mauritian Creole, covering pronoun systems, verb conjugation patterns, key expressions, and structural divergences

That last document is particularly important. Without it, models default to Haitian Creole patterns when they detect "Creole" in the context, because Haitian Creole has significantly more training data representation. The differentiation document acts as a linguistic anchor that keeps the model grounded in the correct variant.

Language Detection and Prompt Injection

The profile middleware includes a Kreol language detector, a keyword-based system that recognises Mauritian Creole markers in user input (words like mo, to, li, nou, zot, koze, bonzour, ki manyer). When Creole is detected and the active profile has Creole mode enabled, the system injects a specialised prompt:

"The user is communicating in Mauritian Creole (Kreol Morisien). Respond in Mauritian Creole..."

This prompt, combined with the RAG context from the dictionary corpus, gives a model enough linguistic scaffolding to construct grammatically correct Kreol Morisien responses, even when the model was never trained on the language.

The result: you can have a conversation in Mauritian Creole with Claude, GPT-4, or even a Cloudflare Workers AI model that definitely doesn't have Kreol in its training data. The RAG context bridges the gap. It's not perfect, and it's never going to pass for a native speaker in long-form prose, but for conversational exchanges, it works remarkably well.

Demo Mode Interaction

Creole mode is disabled when Demo Mode is active (is_creole_enabled() checks enable_creole && !demo_mode). This ensures public demonstrations produce predictable, portfolio-safe output without unexpected language switching.


AI Behaviour Profiles

RAGtronic doesn't have a single personality, it has a profile system that governs how the AI behaves per-request. Each profile is a configuration object stored in PostgreSQL with toggles for:

FeatureDescription
System PromptsEnable/disable custom system prompt injection from the prompt library
Creole ModeActivate Mauritian Creole language detection and response generation
Joke InjectionAppend programming jokes to code explanations (sourced from JokeAPI v2, safe-mode enforced)
Demo ModeLock the platform to portfolio-safe behaviour, disabling Creole, jokes, and experimental features
RAG ModeControls vector search context injection: auto (detect when relevant), always, never, or on_demand
Input/Output RailsPer-profile guardrails enforcement toggles
Jailbreak DetectionEnable/disable the jailbreak classification layer

Profiles also carry access method configuration, controlling how the profile is activated. A profile can be bound to a specific API key, a port, a subdomain, or a custom header. This enables multi-tenant scenarios where different API consumers get different AI behaviours without any code changes.

The active profile is resolved per-request via middleware. The profile detector inspects the incoming request, matches it against configured access patterns, and loads the corresponding profile settings. Every downstream operation, including prompt construction, RAG injection, guardrails enforcement, and model selection, consults the active profile.

Profiles track their own usage: total requests, success/failure counts, token consumption, and estimated cost. This data surfaces in the admin dashboard for per-profile analytics.


Zero-Trust Authentication: Ory + Authentik SSO

Authentication is one of the areas where RAGtronic diverges most from typical AI projects. Instead of a JWT middleware or basic auth, the platform implements a full zero-trust architecture where the backend never handles authentication logic directly.

To be clear about the role of SSO here: RAGtronic is not a multi-user chat application where individual users each get their own workspace. It's a documentation-stack RAG platform, an AI orchestration layer that connects to documentation sites and knowledge bases. The SSO layer exists for platform operators, the people who administer the system: configuring AI profiles, managing model routing, tuning guardrail policies, monitoring spend, and operating the RAG pipeline. End users interact with RAGtronic through its API or embedded interfaces, authenticated via API keys or session tokens depending on the integration. The admin UI, where you configure everything, is what sits behind enterprise SSO.

Ory Kratos (Identity)

Kratos is the identity server. It handles registration, login, password recovery, and email verification flows. Key features in this deployment:

  • Authentik OIDC integration: Operators authenticate via enterprise SSO through Authentik, which acts as the OIDC provider. This means RAGtronic inherits whatever identity infrastructure the organisation already runs.
  • Email allowlist enforcement: A webhook validates registration attempts against a pre-approved email list. Unapproved emails are rejected before account creation.
  • Session lifecycle management: Cookie-based sessions with configurable TTLs, CSRF protection, and secure session storage in a dedicated PostgreSQL instance.

Ory Oathkeeper (Zero-Trust Gateway)

Oathkeeper sits in front of all traffic as a reverse proxy that validates every request:

  • Public paths (static assets, login page) pass through unauthenticated
  • API endpoints with session cookies are validated against Kratos. If the session is valid, Oathkeeper injects X-User-Id and X-User-Email headers and forwards the request
  • API endpoints with bearer tokens go through API key validation
  • Everything else is denied

RAGtronic Zero-Trust Auth Flow

The Rust backend receives requests with identity already resolved in headers. It never parses cookies, validates tokens, or manages sessions. This separation means authentication bugs don't create backend vulnerabilities, and auth infrastructure can be updated independently.

Why Both Ory and Authentik?

Authentik is the SSO provider. It's the central identity platform that handles federation, MFA, and enterprise directory integration. Ory handles the application-level concerns: per-request session validation, zero-trust header injection, and the stateless gateway pattern. They complement each other rather than overlap.


Observability: Built from Day One

Observability wasn't bolted on after deployment, it was a day-one design requirement. Working in enterprise observability (I work at Splunk), I've seen what happens when platforms treat logging as an afterthought. RAGtronic integrates with multiple backends:

Splunk Integration

The Splunk integration is first-class, with full CRUD management of Splunk HTTP Event Collector (HEC) configurations through the admin UI:

  • Configure HEC endpoints, tokens, source types, and indices per deployment
  • Per-tenant Splunk configurations: each tenant can have their own Splunk destination, enabling data sovereignty and compliance with per-customer log forwarding requirements
  • Connection testing: the admin UI includes a live connection test that sends a real HEC event, captures the response time, and extracts the Splunk version from response headers
  • Stats tracking: per-configuration metrics including total events sent, failures, bytes transmitted, and success rates

The data flowing into Splunk covers the full request lifecycle: who called what, which model responded, how many tokens were consumed, what the latency was, and whether guardrails flagged anything. This creates an audit trail that satisfies enterprise compliance requirements while giving operators the visibility they need for cost management and capacity planning.

LangFuse Tracing

Every chat completion generates a LangFuse trace with:

  • Session and conversation IDs for multi-turn thread tracking
  • Token counts and cost calculation per request
  • Latency metrics broken down by phase (prompt construction, model call, post-processing)
  • Guardrails classification results
  • RAG retrieval metadata (which chunks were used, relevance scores)

Structured Logging

The Rust backend uses the tracing crate with JSON-structured output. Every log entry includes request IDs, user identity, active profile, and timing data. This makes logs parseable by any aggregation platform without custom parsing rules.

Conversation Analytics

The admin dashboard includes a Conversations page that surfaces:

  • All sessions with token counts, costs, and latency
  • Client IP tracking for abuse detection
  • LangFuse trace IDs linked directly to the tracing UI
  • Model and provider breakdown per conversation
  • Filterable by time range, user, model, and cost threshold

Voice-to-RAG: LLM Studio

The admin dashboard includes an LLM Studio: an interactive chat interface that goes beyond basic text input:

  • Voice input: Direct microphone capture that transcribes speech and feeds it into the RAG pipeline. In the GPU deployment profile, this runs through Faster Whisper locally; in cloud mode, it routes to cloud STT providers.
  • Text-to-speech output: Responses can be read aloud with configurable voice selection. GPU mode supports Coqui XTTS (with voice cloning via KNN-VC), Dia2, NeuTTS-Air, and Piper; cloud mode uses ElevenLabs or OpenAI TTS APIs.
  • Model selection: Switch between any of the 84 available models mid-conversation
  • Memory controls: Toggle persistent conversation memory, clear context, or start fresh sessions
  • Streaming: Real-time token-by-token response rendering via Server-Sent Events

The voice pipeline creates a full loop: speak a question, have it transcribed, run it through RAG with the active profile's configuration (including Creole detection if enabled), generate a response, and optionally speak it back. In GPU mode with local models, this entire pipeline runs without any external API calls.


Code Enrichment Pipeline

When a user asks RAGtronic to explain a code block, the response isn't just the model's native knowledge, it's enriched through a multi-source pipeline:

  1. Guardrails check: The code block is validated. When the request originates from the frontend's code explanation feature, it carries a trusted_source marker to prevent false positive injection detection on legitimate code patterns.
  2. External context gathering: The platform fetches additional context from external developer knowledge sources to supplement the model's understanding. This adds real-world usage patterns, known issues, and community knowledge to the explanation.
  3. Response enrichment: For qualifying requests, a programming joke is appended (when the active profile has joke injection enabled). Jokes are sourced from JokeAPI v2 with the Programming category and safe-mode filter, no offensive content.
  4. Markdown post-processing: The response goes through a cleanup pass that fixes unclosed code fences, adds language tags to bare code blocks, and normalises formatting inconsistencies that models sometimes produce.

This pipeline runs transparently, and the user sees a well-formatted, context-rich explanation without knowing how many sources contributed to it.


Document Indexing Pipeline

RAG is only as good as the data behind it. RAGtronic's indexing pipeline handles the full lifecycle from raw document to searchable vector, and it's designed around S3-compatible object storage so the document source is provider-agnostic.

Ingestion from Object Storage

Documents land in S3-compatible object storage: the same API works whether the backend is MinIO (self-hosted), AWS S3, or Cloudflare R2. The platform manages multiple S3 configurations through the admin UI, each with its own endpoint, bucket, prefix, and credentials. Connection testing, usage stats (uploads, downloads, bytes transferred), and default selection are all handled through the API. This decoupling means documentation sources can live anywhere that speaks the S3 protocol without touching backend code.

Document Processing with Docling

Raw documents (PDFs, DOCX, scanned images) pass through Docling, IBM's document AI service running as a sidecar. Docling handles:

  • Format conversion: PDF, DOCX, and image files are converted to clean Markdown while preserving structure (headings, tables, lists, code blocks)
  • OCR: Optical character recognition for scanned documents and images, with configurable batch sizes for OCR, layout analysis, and table extraction
  • VLM-enhanced processing: Optional Vision-Language Model integration for complex document understanding, where a VLM endpoint can be configured to assist with layout interpretation
  • Deployment presets: Four preset configurations (cpu_small, cpu_medium, gpu_small, gpu_large) that tune batch sizes from 4 (conservative CPU) to 64 (high-VRAM GPU), with device mode and batch parameters adjustable through the admin UI or API

The Docling sidecar is proxied through the Rust backend, so the admin UI's Documents page provides a unified interface for uploading, converting, and managing documents without directly exposing the Docling service.

Chunking and Embedding

Once converted to Markdown, documents are split into overlapping chunks for vector storage. The chunking strategy uses character-based windowing: 1,000-character chunks with 200-character overlap, ensuring context is preserved at chunk boundaries. Chunks smaller than 50 characters are discarded to avoid noise.

Each chunk is then embedded using mxbai-embed-large via Ollama for local inference, or through LiteLLM for cloud-hosted embedding models. The embedding vectors (1,024 dimensions for mxbai-embed-large) are stored with metadata: the source file path, chunk index, and total chunk count for the document.

Vector Storage and Per-Tenant Isolation

Vectors are stored in Qdrant using Cosine distance for similarity measurement. The critical design decision here is per-tenant collection isolation: each tenant (or documentation stack) gets its own Qdrant collection, named tenant_{slug}. When a tenant is provisioned, the backend automatically creates the corresponding Qdrant collection. When a tenant is deprovisioned, the collection is cleaned up. This provides hard isolation between document sets without any cross-contamination risk.

Hybrid Search and Reranking

At query time, RAGtronic doesn't rely on pure vector similarity. The search pipeline supports:

  1. Semantic search: Standard vector similarity against the query embedding
  2. Keyword search: Payload-based text matching with stop-word filtering, catching results that semantic search might miss (exact terms, product names, error codes)
  3. Reciprocal Rank Fusion (RRF): Results from both search methods are merged using RRF with a ranking constant of 60, producing a unified ranked list that benefits from both approaches
  4. Source deduplication: Multiple chunks from the same source document are collapsed to the highest-scoring result, preventing a single document from dominating the context window
  5. Cross-encoder reranking: When enabled, results are re-scored using a cross-encoder model through LiteLLM's rerank endpoint, improving precision by evaluating query-document relevance pairs rather than relying solely on embedding similarity

The search mode (semantic only or hybrid) and reranking are controlled via environment flags, so they can be tuned per deployment without code changes.


OpenAI-Compatible API Gateway

RAGtronic implements the OpenAI Chat Completions API specification. Any tool that speaks OpenAI's protocol can point at RAGtronic and use it as a drop-in backend:

  • Cursor / Continue.dev: IDE-integrated AI assistants using RAGtronic's model catalogue and RAG context
  • Open WebUI: Self-hosted chat interface connected to the internal profile
  • LangChain / LlamaIndex: Programmatic orchestration frameworks
  • Custom applications: Anything that can send a POST /v1/chat/completions request

The gateway:

  1. Accepts standard OpenAI-format requests (model, messages, temperature, stream)
  2. Routes through the active LLM provider via LiteLLM
  3. Optionally injects RAG context from Qdrant based on the active profile's RAG mode
  4. Supports full streaming via Server-Sent Events (stream: true)
  5. Returns OpenAI-format responses with usage statistics (prompt tokens, completion tokens, total cost)

This makes RAGtronic a centralised AI gateway: all AI traffic flows through one point with unified logging, cost tracking, content safety, and access control. The internal profile on port 8889 serves this exact purpose: one RAG investment powering every AI tool in the stack.


Documentation-First APIs

Every API endpoint in RAGtronic is documented at the source level using utoipa: Rust's OpenAPI specification generator. The #[openapi()] macro on every handler generates a live OpenAPI spec that powers:

  • Swagger UI at /swagger-ui/: Interactive API explorer where you can test endpoints directly from the browser
  • OpenAPI JSON at /api-docs/openapi.json: Machine-readable spec for code generation and client SDK creation
  • Security scheme documentation: Both API key (header-based) and session cookie authentication are documented in the spec with examples

The API design philosophy is that every endpoint should be self-describing. Request and response types derive ToSchema, which means the spec includes full type information, example values, and field descriptions. If you can read the Swagger UI, you can integrate with RAGtronic without reading a line of backend code.

This approach feeds directly into the admin dashboard's API Explorer page, an embedded Swagger UI that gives operators a live playground for testing any endpoint with their current session credentials.


Deployment Profiles: Cloud vs GPU

RAGtronic supports two deployment profiles managed through separate Docker Compose configurations and git branches:

Cloud Profile (Current Deployment)

The production deployment runs without GPU reservation. Voice, TTS, and document processing use cloud APIs:

CapabilityCloud Implementation
Speech-to-TextCloud STT providers
Text-to-SpeechElevenLabs / OpenAI TTS APIs
Voice CloningNot available
EmbeddingsOpenAI / Ollama
Document OCRDocling (CPU mode)
LLM Inference84 models via LiteLLM (cloud providers + Ollama local)

GPU Profile (Development Branch)

The GPU branch adds seven additional services for fully local AI processing:

ServicePurposeDetails
Faster WhisperSpeech-to-textLocal Whisper inference with GPU acceleration
Coqui XTTSText-to-speechVoice cloning and synthesis with custom speaker embeddings
KNN-VCVoice conversionAccent and voice characteristic preservation across TTS outputs
Dia2Streaming TTSNatural conversational speech synthesis
NeuTTS-AirHigh-quality TTSElevenLabs-quality open-source alternative with on-demand model loading
Liquid AudioAudio-to-audio LLMReal-time voice-in, voice-out language model interaction
Piper GPUFast TTS fallbackSub-second latency TTS for real-time applications

The GPU profile enables a full voice-to-voice AI pipeline: speak a question, transcribe it locally with Whisper, process it through the RAG pipeline, generate a response, synthesise speech with the user's cloned voice, and output audio, all without a single external API call. Every byte of data stays on-premises.


The Admin Dashboard

The frontend is a React + Vite single-page application (~14,500 lines) with a custom design system built on a dark theme with cyan accent colours, JetBrains Mono headings, and Geist body typography. It's designed to feel like a professional operations console rather than a generic admin template.

PagePurpose
OverviewReal-time metrics, including active sessions, token usage, cost tracking, system health status, guardrails state
LLM StudioInteractive chat with voice input, model switching, memory controls, streaming, and TTS output
AI ConfigTabbed configuration: prompts (Monaco editor), providers, integrations, rate limits, security, guardrails, RAG, and API keys
ConversationsSession analytics, including token counts, costs, latency, client IPs, LangFuse trace links
DocumentsDocling integration, upload, convert (PDF/DOCX to Markdown), chunk, embed, and manage documents
ProfilesAI behaviour profile management with per-feature toggles and usage analytics
API ExplorerLive Swagger UI for testing any endpoint
Dubstack StacksMulti-tenant infrastructure dashboard, including storage modes, S3 endpoints, Qdrant collection status, sync operations
SettingsPlatform configuration, Splunk HEC management, RAG model selection, endpoint configuration

Broader Infrastructure Context

RAGtronic doesn't run in isolation. It's one component of a larger AI infrastructure stack running on a Proxmox virtualisation cluster with 70+ containers across multiple services:

  • LiteLLM: Model routing and cost management (separate deployment from RAGtronic's internal proxy)
  • Ollama: Local model hosting for open-weight models
  • LangFuse: Tracing and prompt management
  • Open WebUI: Alternative chat interface connected to RAGtronic's internal API
  • SearxNG: Privacy-respecting meta search
  • Docling: Document AI for OCR and format conversion
  • Multiple MCP servers: Tool integrations for code generation, search, and automation
  • Qdrant: Dedicated vector database instance (isolated port range to avoid collisions with other Qdrant deployments on the network)

RAGtronic serves as the AI orchestration layer within this stack. It doesn't replace these services but unifies access to them through a single, authenticated, observable API surface.


Key Design Decisions

Why Rust for an AI Backend?

AI backends are overwhelmingly Python. Choosing Rust was deliberate:

  1. Streaming performance: SSE forwarding of LLM responses with zero-copy I/O. No GC pauses, no memory spikes during concurrent streams.
  2. Concurrency model: Actix-web's async runtime handles thousands of concurrent connections without thread pool exhaustion. Each streaming response is a lightweight future, not a thread.
  3. Compile-time API contracts: Request/response schemas are enforced by the type system. A mismatched field name or wrong type is caught at compile time, not in production.
  4. Operational footprint: 50MB at idle for ~28,000 lines of business logic. On shared infrastructure, resource efficiency translates directly to cost savings and deployment density.

Why Ory for Authentication Instead of Rolling Custom?

The Ory stack (Kratos + Oathkeeper) brings authentication logic outside the application:

  1. Zero-trust by default: Every request is validated at the gateway. The backend code contains zero authentication logic, and it trusts the identity headers injected by Oathkeeper after validation.
  2. Session management expertise: Cookies, CSRF, session rotation, and token lifecycle are handled by purpose-built software. This is not a solved problem you want to re-solve in application code.
  3. Webhook extensibility: Registration validation, email allowlists, and custom flows are implemented as webhooks without modifying the identity server.

Why Not Just Use Authentik for Everything?

Authentik is the SSO federation layer. It handles OIDC, enterprise directory integration, and multi-factor authentication. Ory handles the per-request, application-level concerns: stateless session validation at the gateway, identity header injection, and the zero-trust proxy pattern. They operate at different layers of the stack and complement each other.


What's Next

  • Tenant isolation hardening: Full namespace isolation per tenant in Qdrant with dedicated collections and access controls
  • GPU profile production deployment: Local TTS/STT with voice cloning for on-premises voice AI
  • Webhook notification channels: Slack, Teams, and Discord integrations for alerting and event forwarding
  • Immutable audit logging: Append-only audit trail for compliance requirements
  • Expanded Creole corpus: Additional linguistic data and idiom coverage to improve conversational naturalness

Screenshots

Screenshots of the RAGtronic admin dashboard, LLM Studio, AI Config, Conversations analytics, and Profiles management are available in the gallery below.


Stack: Rust (Actix-web) | React (Vite) | PostgreSQL 16 | Qdrant | LiteLLM | Ory Kratos | Ory Oathkeeper | Authentik SSO | NVIDIA NeMo Guardrails | Docling | Splunk HEC | LangFuse | Docker Compose | Traefik

Source: git.ozteklab.com/ozteklab/ragtronic

Assistant

Ozteklab Logo

Hi! How can I help you with Ozteklab documentation today?

Ask me anything about the documentation, code examples, or best practices.