Private RAG: Answer Over Your Own Documents

Retrieval-augmented generation lets an agent read your contracts, reports, wikis and case files and answer questions with citations — and a private RAG stack keeps every document on hardware you control.

Why it should be private

Your most valuable knowledge is also your most sensitive: contracts, financials, case files, internal wikis. Sending it to a public API to get answers is exactly the wrong trade. Private RAG pairs a local embedding model with a local chat model so retrieval and generation both stay in-house.

Recommended models

Open models that fit this job, computed from our catalog.

DeepSeek-R1 671B (MoE)
DeepSeek · ~671B · runs on Supermicro 8x H100 SuperServer
Details →
Llama 3.1 405B
Llama · ~405B · runs on Supermicro 8x H100 SuperServer
Details →
Qwen3 235B-A22B (MoE)
Qwen · ~235B · runs on NVIDIA B200 (placeholder)
Details →
Qwen2.5 72B
Qwen · ~72B · runs on NVIDIA B200 (placeholder)
Details →
Llama 3.1 70B
Llama · ~70B · runs on NVIDIA B200 (placeholder)
Details →

Recommended hardware

Machines that suit this deployment, strongest first.

The Legal / DocMatch pack

A confidential evidence and document agent for legal teams.

What it does

▸Evidence and exhibit search with cited passages
▸Contract and clause Q&A across matters
▸Discovery review and summarization
▸Privileged-material assistants that never leave the office

Connects to

Document storesEmailGoogle WorkspaceCase management

Connectors are how the agent does real work — see why hardware alone isn’t enough.

Deployment options

Local appliance

A quiet box on-site running your agents. Lowest cost per request and full data residency for a single office or property.

Best for: SMBs, single sites, confidential data, predictable everyday workloads.

On-prem server

A workstation or server in your rack or closet, serving many agents and larger models to a whole team or department.

Best for: Departments, regulated data, high steady volume, multi-agent platforms.

Cloud GPU

Rented GPUs in your own cloud account for bursts, the largest models, or before you've validated volume — no hardware to own.

Best for: Spiky demand, frontier models, pilots, overflow capacity.

Hybrid

Everyday private agents run locally; heavy or occasional jobs burst to the cloud. The pragmatic default for most businesses.

Best for: Most real deployments — control and cost locally, elasticity in the cloud.

Frequently asked questions

What do I need to run private RAG?+

Two models: a small embedding model (e.g. nomic-embed-text) for retrieval and a capable chat model (e.g. Qwen2.5 14–32B) for answering. Both run on a single 16–24GB GPU for most document sets.

How is this different from a normal chatbot?+

RAG retrieves the most relevant passages from your documents and gives them to the model, so answers are grounded in your data with citations — not the model's training data.

Can everything stay on-premise?+

Yes. Embeddings, the vector index and the chat model all run on your hardware, so no document content leaves your infrastructure.

Run Private RAG: Answer Over Your Own Documents as a private AI Business OS

Run your own AI agents on hardware you control — private by design, no per-seat data leaving your premises. BrainOutput helps you pick the right machine and turn it into a working AI Business OS.

Get started