
Private AI · how it actually works

The shape of a real private AI engagement.

No marketing fluff. This page walks through the architecture, the stack, the timeline and the honest cost of deploying the same class of AI you tried on VP Lab — except running entirely on your own hardware.

01 · Architecture

Seven layers, your hardware, your network.

Each layer below sits entirely inside your infrastructure — there’s no point in the stack where data hops to a third-party API.

07 · Your team / integrations

Staff access the system via a web UI, chat interface, or direct API calls from existing tools (Outlook, Slack, your CRM).

06 · Access & identity

Zero-Trust access, SSO with your existing identity provider (Azure AD / Entra / Google Workspace / Okta). No inbound ports exposed to the internet.

05 · Application layer

Next.js / Node / Python apps tailored to your workflows. Document Q&A, contract review, meeting-notes assistant — whichever VP Lab demos map to your day-to-day.

04 · RAG & data layer

Vector store (Chroma, Qdrant or Postgres with pgvector). Ingests your documents, embeds them locally, retrieves relevant context at query time. No external embedding API.
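The retrieve-at-query-time step can be sketched in a few lines. This is a toy illustration only — it stands in a bag-of-words count for the real local embedding model, purely to show the shape of the pattern (embed → rank by similarity → return top-k context). The function names are hypothetical; a production Chroma, Qdrant or pgvector setup replaces all of it.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector.
    A real deployment uses a locally hosted embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """The 'retrieves relevant context at query time' step:
    rank every document against the query, return the top k."""
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Annual leave policy: staff accrue 25 days per year.",
    "Expense claims must be filed within 30 days.",
    "Server room access requires a keycard.",
]
print(retrieve("how many days of annual leave do I get?", docs, k=1))
```

The retrieved passages are then prepended to the prompt sent to the LLM runtime below — which is why no external embedding API is needed: both the embedding and the lookup happen on your hardware.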

03 · LLM runtime

Open-weight models (Llama 3.3, Mistral, Qwen, DeepSeek) served via Ollama, vLLM or llama.cpp. Quantised where sensible to fit your hardware.
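Talking to that runtime is plain HTTP on your own network. A minimal sketch, assuming Ollama's default local endpoint (`http://localhost:11434/api/generate`); the helper names are mine, not part of any particular stack, and the point is that the prompt never leaves the machine.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model: str, prompt: str) -> dict:
    # stream=False asks Ollama for one JSON response instead of a chunk stream
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the completion."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# ask("llama3.3", "Summarise our leave policy.")  # requires a running Ollama instance
print(build_payload("llama3.3", "ping"))
```

Swapping the model is just a different string in that payload — which is what makes the "model swap in a day" claim later on realistic.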

02 · Container layer

Docker + docker-compose for small deployments; Kubernetes for larger ones. Reproducible builds, version-pinned, easy rollback.
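For a small deployment, that layer can look like the sketch below. Service names, the image tag and paths are illustrative, not a drop-in file — pin whichever release you have actually tested.

```yaml
services:
  ollama:
    image: ollama/ollama:0.5.7        # version-pinned for reproducible builds
    volumes:
      - models:/root/.ollama          # model weights persist across upgrades
    # note: no 'ports:' mapping to the host — only the app container reaches it
  app:
    build: ./app                      # your Next.js / Python application
    depends_on:
      - ollama
    environment:
      OLLAMA_URL: http://ollama:11434 # internal Docker network only
volumes:
  models:
```

Rollback is then a one-line tag change plus a restart, and the model volume survives it.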

01 · Your hardware

Anywhere from a single GPU workstation in your server room to a rack of H100s. Spec is sized to the models you need, not picked from a catalogue.

02 · Timeline

4 to 8 weeks. Here’s where the time goes.

  1. Weeks 1–2
    Discovery & spec

    Workshop your use-cases, pick the model(s) that fit, spec the hardware accordingly. Written proposal, fixed fee, no surprises.

  2. Weeks 2–4
    Infrastructure

    Hardware install, OS & Docker, model runtime, security hardening, internal networking. Your team sees progress in your own environment.

  3. Weeks 4–6
    Integration

    Ingest your documents into the RAG layer, wire the app to your SSO, connect to the tools your team actually uses (Outlook, SharePoint, Slack, CRM).

  4. Weeks 6–8
    Handover

    Training workshops, runbooks, full documentation. You walk away able to run and swap models yourself — no lock-in.

Smaller deployments (a single team, one use-case) can land in 4 weeks. Larger ones (multi-department, complex integrations) lean toward 8+.

03 · Cost

Honest ballparks — no bait-and-switch quoting.

Exact pricing depends on your model size, data volume, and integrations. These ranges cover the typical London SMB engagement.

Hardware (one-off)
£3,000 – £15,000

A single GPU workstation for a 7B–13B model runs ~£3–5k. A small server with an H100 for 70B-class models climbs to £10–15k. Bring your own hardware to drop this to zero.
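Why those brackets? GPU memory is the binding constraint, and a rule of thumb gets you close: weight memory ≈ parameter count × bits per weight, plus headroom for the KV cache and activations. A back-of-envelope sketch (the 20% overhead figure is my assumption, not a spec):

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Rough GPU memory to serve a model: weights plus ~20% headroom
    for KV cache and activations (the overhead factor is a rule of thumb)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(round(vram_estimate_gb(7, 4), 1))    # 7B at 4-bit quantisation
print(round(vram_estimate_gb(70, 4), 1))   # 70B at 4-bit quantisation
```

A 7B model at 4-bit lands around 4 GB — comfortable on a £3–5k workstation's consumer GPU — while 70B at 4-bit needs roughly 42 GB, which is what pushes you into the H100 bracket and the top of the hardware range above.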

Setup & integration (project)
£5,000 – £20,000

Simple document-Q&A on your files: lower end. Multi-team deployment with SSO, custom app, and ingestion pipelines: higher end. Fixed fee — no hourly billing.

Ongoing retainer (optional)
£600 – £2,000 / month

Model updates, RAG re-ingestion, performance tuning, on-call. Optional — you can absolutely run it yourselves after handover.

Running cost (per month)
£0 in API fees

The point of private AI. No per-token billing, no usage caps, no surprise invoice if your team hammers the system. Electricity simply folds into your existing overheads.

04 · Deliverables

What you actually own at handover.

  • A running AI system on your hardware, your network, your control.
  • Source code and configuration in a Git repo you own (not a black box).
  • Written architecture & operations documentation — plain English, not jargon.
  • Runbooks for model swaps, updates, backup/restore and incident response.
  • Staff-training sessions so your team actually uses it (not shelfware).
  • An honest review of what this system can and can’t do for you.

05 · The usual questions

Straight answers to the things everyone asks.

Is this less capable than ChatGPT / Claude?

For general-knowledge trivia, slightly. For your own documents, procedures and terminology — which is what you actually care about — no. An open-weight 70B model fine-tuned on your corpus outperforms ChatGPT out of the box for internal tasks almost every time.

What happens when a better model is released next month?

You swap it. That’s a day’s work, not a migration. Hardware and integration stay the same — only the model artefact changes. Open weights mean you never chase a proprietary vendor’s roadmap.

What about scaling if the team grows?

Add more GPUs for parallel inference, or add nodes. The stack is built to scale horizontally from day one — you’re not re-architecting later.

We don’t have a server room. Can we still do this?

Yes. A private AI can live in a colocation facility, a UK-based sovereign cloud, or a secure server room we source. The “private” bit is about legal and data control, not literally about your building.

Can we just use Microsoft Copilot / Google’s version?

You can, and sometimes that’s the right call. The trade-off: their terms, their data-residency choices, their pricing roadmap. If compliance, confidentiality, or cost-predictability is load-bearing for you, private deployment is the cleaner answer.

Ready to scope yours?

Book a free 20-minute call. Bring a rough idea of your use-case; walk away with a sensible spec, a budget range, and a realistic timeline — whether or not you end up working with me.