How it works

How we build and run your private AI.

A proven open stack, assembled and hardened on your hardware, then managed for you. No vendor lock-in, nothing leaving the building.

See pricing Get early access

The stack

Open infrastructure, fully managed.

Inference

vLLM or SGLang, serving the model on your local GPUs.

Gateway

LiteLLM, one OpenAI-compatible endpoint with keys and routing.

Retrieval

Postgres with pgvector, or Qdrant, for RAG over your private documents.

Interface

OpenWebUI, the chat and workspace your team uses.

Monitoring

Langfuse, traces, token usage, and latency on a local dashboard.

Compact tier

Ollama, for the single-GPU Core build.

The process

From spec to running system.

Spec

Size the hardware and model to the use case.

Deploy

Install the stack on-premise, air-gapped if needed.

Train

Fine-tune on your own documents and data.

Secure

Lock it down. Least privilege, nothing phoning home.

Run

Monitor, patch, and update under contract.

Models

The right model per tier.

A best-in-class default for each tier, and a Western model for clients who will not run foreign weights. The gap is small today, and on most work it is invisible.

Core

Default: Qwen 3.6-27B, compact with strong coding and reasoning. Western option: gpt-oss-120b, OpenAI, US, Apache 2.0.

Enterprise

Default: Qwen 3 235B-A22B, top open generalist at FP8. Western option: Cohere Command A+, Canada, built for enterprise RAG.

Sovereign

Default: GLM-5.2, 744B, leads open weights with a 1M context. Western option: Mistral Large 3, France, Apache 2.0, 675B.

Why not frontier Get early access