Private RAG: Answer Over Your Own Documents
Retrieval-augmented generation lets an agent read your contracts, reports, wikis and case files and answer questions with citations — and a private RAG stack keeps every document on hardware you control.
Why it should be private
Your most valuable knowledge is also your most sensitive: contracts, financials, case files, internal wikis. Sending it to a public API to get answers is exactly the wrong trade. Private RAG pairs a local embedding model with a local chat model so retrieval and generation both stay in-house.
Recommended models
Open models that fit this job, computed from our catalog.
- DeepSeek-R1 671B (MoE)Details →DeepSeek · ~671B · runs on Supermicro 8x H100 SuperServer
- Llama 3.1 405BDetails →Llama · ~405B · runs on Supermicro 8x H100 SuperServer
- Qwen3 235B-A22B (MoE)Details →Qwen · ~235B · runs on NVIDIA B200 (placeholder)
- Qwen2.5 72BDetails →Qwen · ~72B · runs on NVIDIA B200 (placeholder)
- Llama 3.1 70BDetails →Llama · ~70B · runs on NVIDIA B200 (placeholder)
Recommended hardware
Machines that suit this deployment, strongest first.
The Legal / DocMatch pack
A confidential evidence and document agent for legal teams.
What it does
- ▸Evidence and exhibit search with cited passages
- ▸Contract and clause Q&A across matters
- ▸Discovery review and summarization
- ▸Privileged-material assistants that never leave the office
Connects to
Connectors are how the agent does real work — see why hardware alone isn’t enough.
Deployment options
Local appliance
A quiet box on-site running your agents. Lowest cost per request and full data residency for a single office or property.
Best for: SMBs, single sites, confidential data, predictable everyday workloads.
On-prem server
A workstation or server in your rack or closet, serving many agents and larger models to a whole team or department.
Best for: Departments, regulated data, high steady volume, multi-agent platforms.
Cloud GPU
Rented GPUs in your own cloud account for bursts, the largest models, or before you've validated volume — no hardware to own.
Best for: Spiky demand, frontier models, pilots, overflow capacity.
Hybrid
Everyday private agents run locally; heavy or occasional jobs burst to the cloud. The pragmatic default for most businesses.
Best for: Most real deployments — control and cost locally, elasticity in the cloud.
Frequently asked questions
What do I need to run private RAG?+
Two models: a small embedding model (e.g. nomic-embed-text) for retrieval and a capable chat model (e.g. Qwen2.5 14–32B) for answering. Both run on a single 16–24GB GPU for most document sets.
How is this different from a normal chatbot?+
RAG retrieves the most relevant passages from your documents and gives them to the model, so answers are grounded in your data with citations — not the model's training data.
Can everything stay on-premise?+
Yes. Embeddings, the vector index and the chat model all run on your hardware, so no document content leaves your infrastructure.
Run Private RAG: Answer Over Your Own Documents as a private AI Business OS
Run your own AI agents on hardware you control — private by design, no per-seat data leaving your premises. BrainOutput helps you pick the right machine and turn it into a working AI Business OS.
Get started