Private AI · how it actually works
No marketing fluff. This page walks through the architecture, the stack, the timeline and the honest cost of deploying the same class of AI you tried on VP Lab — except running entirely on your own hardware.
01 · Architecture
Each layer below sits entirely inside your infrastructure — there’s no point in the stack where data hops to a third-party API.
Staff access the system via a web UI, chat interface, or direct API calls from existing tools (Outlook, Slack, your CRM).
Zero-trust access, with SSO via your existing identity provider (Microsoft Entra ID (formerly Azure AD), Google Workspace, or Okta). No inbound ports exposed to the internet.
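To make that concrete, here is a minimal sketch of the kind of token check that sits in front of the app layer, assuming Entra ID as the IdP. The issuer, audience, and route names are placeholders; in practice the reverse proxy or your IdP's own middleware usually does this job.

```python
# Sketch: validate an SSO-issued JWT before any request reaches the app.
# The issuer/audience values are placeholders for your IdP's real settings.
import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

ISSUER = "https://login.microsoftonline.com/<tenant-id>/v2.0"  # placeholder
AUDIENCE = "api://private-ai"                                  # placeholder
jwks = jwt.PyJWKClient(f"{ISSUER}/discovery/v2.0/keys")        # IdP signing keys

app = FastAPI()
bearer = HTTPBearer()

def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> dict:
    """Reject any request whose token wasn't issued by your own IdP."""
    try:
        key = jwks.get_signing_key_from_jwt(creds.credentials)
        return jwt.decode(creds.credentials, key.key,
                          algorithms=["RS256"], audience=AUDIENCE, issuer=ISSUER)
    except jwt.PyJWTError as exc:
        raise HTTPException(status_code=401, detail=str(exc))

@app.get("/whoami")
def whoami(user: dict = Depends(current_user)):
    return {"user": user.get("preferred_username")}
```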
Next.js / Node / Python apps tailored to your workflows. Document Q&A, contract review, meeting-notes assistant — whichever VP Lab demos map to your day-to-day.
Vector store (Chroma, Qdrant or Postgres with pgvector). Ingests your documents, embeds them locally, retrieves relevant context at query time. No external embedding API.
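A hedged sketch of that ingest-and-retrieve loop, assuming Ollama serving a local embedding model and Chroma as the store. The model tag, paths, and sample chunks are illustrative, not prescriptive:

```python
# Sketch: local ingestion + retrieval. Assumes Ollama is serving an embedding
# model on localhost; "nomic-embed-text" and the paths are illustrative.
import requests
import chromadb

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    """Embed text with a locally served model -- nothing leaves the machine."""
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

store = chromadb.PersistentClient(path="./rag-index")
docs = store.get_or_create_collection("docs")

# Ingest: one embedding per chunk (a real pipeline chunks PDFs, emails, etc.).
for i, chunk in enumerate(["Holiday policy: 25 days plus bank holidays.",
                           "Expenses over £50 need a line-manager sign-off."]):
    docs.add(ids=[f"chunk-{i}"], documents=[chunk], embeddings=[embed(chunk)])

# Query: retrieve the most relevant chunks to prepend to the model prompt.
hits = docs.query(query_embeddings=[embed("how many holiday days do I get?")],
                  n_results=2)
print(hits["documents"][0])
```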
Open-weight models (Llama 3.3, Mistral, Qwen, DeepSeek) served via Ollama, vLLM or llama.cpp. Quantised where sensible to fit your hardware.
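And the generation step, again assuming Ollama's HTTP API (a vLLM deployment would expose an OpenAI-compatible endpoint instead). The model tag and prompts are illustrative:

```python
# Sketch: query a locally served open-weight model via Ollama's HTTP API.
# The model tag is an assumption -- use whatever you've pulled onto the box.
import requests

context = "Expenses over £50 need a line-manager sign-off."  # from the RAG step
r = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.3",       # swap for mistral, qwen, deepseek, ...
    "stream": False,
    "messages": [
        {"role": "system",
         "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": "Do I need approval for an £80 expense?"},
    ],
})
r.raise_for_status()
print(r.json()["message"]["content"])
```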
Docker + docker-compose for small deployments; Kubernetes for larger ones. Reproducible builds, version-pinned, easy rollback.
Anywhere from a single GPU workstation in your server room to a rack of H100s. Spec is sized to the models you need, not picked from a catalogue.
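The sizing arithmetic is simple enough to show: model weights dominate VRAM at roughly parameter count × bytes per parameter, plus headroom for the KV cache and activations. A back-of-envelope sketch (the 1.2 overhead factor is a rough assumption):

```python
# Back-of-envelope VRAM sizing: weights ~= params x bytes/param, plus headroom
# for KV cache and activations (the 1.2 factor is a rough assumption).
def vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    return params_b * (bits / 8) * overhead

for model, bits in [("7B", 4), ("13B", 4), ("70B", 4), ("70B", 16)]:
    n = float(model.rstrip("B"))
    print(f"{model} @ {bits}-bit ~ {vram_gb(n, bits):.0f} GB VRAM")
# 7B at 4-bit fits comfortably on a single 24 GB workstation card; 70B at
# 4-bit wants ~42 GB, i.e. an H100 or a pair of large consumer GPUs.
```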
03 · Cost
Exact pricing depends on your model size, data volume, and integrations. These ranges cover the typical London SMB engagement.
A single GPU workstation for a 7B–13B model runs ~£3–5k. A small server with an H100 for 70B–class models climbs to £10–15k. Bring your own hardware to drop this to zero.
Simple document Q&A on your files: lower end. Multi-team deployment with SSO, custom app, and ingestion pipelines: higher end. Fixed fee; no hourly billing.
Model updates, RAG re-ingestion, performance tuning, on-call. Optional — you can absolutely run it yourselves after handover.
The point of private AI. No per-token billing, no usage caps, no surprise invoice if your team hammers the system. Electricity simply folds into your existing overheads.
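To put a rough number on the electricity line, a worked example (the wattage and tariff are assumptions; substitute your own):

```python
# Rough running-cost arithmetic. The 700 W draw and £0.30/kWh tariff are
# assumptions -- substitute your hardware's figures and your actual tariff.
gpu_watts = 700            # single H100-class card under load
tariff = 0.30              # GBP per kWh
hours_per_month = 24 * 30

kwh = gpu_watts / 1000 * hours_per_month
print(f"~£{kwh * tariff:.0f}/month at full tilt")   # ~£151/month
# Flat and predictable, and usually well below a per-token bill at
# comparable query volume.
```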
05 · The usual questions
Won't a local model be dumber than ChatGPT? For general-knowledge trivia, slightly. For your own documents, procedures and terminology — which is what you actually care about — no. An open-weight 70B model fine-tuned on your corpus outperforms out-of-the-box ChatGPT for internal tasks almost every time.
What happens when a better model comes out? You swap it. That's a day's work, not a migration. Hardware and integration stay the same — only the model artefact changes. Open weights mean you never chase a proprietary vendor's roadmap.
What if we need more capacity later? Add more GPUs for parallel inference, or add nodes. The stack is built to scale horizontally from day one — you're not re-architecting later.
Can we do this without our own server room? Yes. A private AI can live in a colocation facility, a UK-based sovereign cloud, or a secure server room we source. The "private" bit is about legal control of your data, not literally about your building.
Couldn't we just use a big cloud provider's private offering? You can, and sometimes that's the right call. The trade-off: their terms, their data-residency choices, their pricing roadmap. If compliance, confidentiality, or cost-predictability is load-bearing for you, private deployment is the cleaner answer.
Book a free 20-minute call. Bring a rough idea of your use-case; walk away with a sensible spec, a budget range, and a realistic timeline — whether or not you end up working with me.