Skip to content
· 9 min read

Self-Hosted AI — What It Actually Takes

By LumaVista Team

“Self-host your AI models” — three words that sound simple and aren’t. They conjure an image of running a quick install script, pointing your app at localhost, and calling it a day. The reality involves GPU procurement, VRAM math, inference engine tuning, and an operational burden that most teams underestimate by at least 3x.

That doesn’t mean self-hosting is wrong. For certain workloads — high-volume inference, sensitive data processing, or sovereignty requirements that APIs can’t satisfy — it’s the right call. But you should walk in with your eyes open. Here’s what it actually involves.

What “self-hosting” actually means

When people say “self-host AI,” they usually mean one of three things, and the differences matter:

Running inference on your own GPUs. You download an open-weight model, load it onto GPU hardware you control, and serve predictions through an API endpoint. This is the most common meaning and what we’ll focus on here.

Fine-tuning on your own hardware. You take a base model and train it further on your data. This requires significantly more compute than inference — typically 2-4x the VRAM — and adds a whole layer of ML engineering complexity on top.

Training from scratch. Unless you’re Meta or DeepSeek, this isn’t what you’re doing. Training a frontier model costs millions of dollars and requires thousands of GPUs running for months. We’ll skip this one.

Every billion parameters needs roughly 2 GB of VRAM at full precision. That single number drives every hardware decision in self-hosted AI.

For most organizations evaluating self-hosted AI, the question is straightforward: can we run inference on open-weight models cheaper, faster, or more privately than calling an API? The answer depends on your volume, your latency requirements, and how honest you are about operational costs.

The VRAM math you need to know

Here’s the single most important number in self-hosted AI: every billion parameters in a model needs roughly 2 GB of VRAM at FP16 precision. That’s the full-precision baseline. An 8-billion-parameter model like Llama 3.1 8B needs about 16 GB just for the weights, plus overhead for the KV cache and batch processing.

Quantization changes the equation. At INT4 precision — where each weight is compressed to 4 bits instead of 16 — that same model needs roughly 4 GB for weights. You lose some quality (typically 1-3% on standard benchmarks), but you gain the ability to run models on much cheaper hardware.

The formula is simple:

  • FP16 (full precision): parameters × 2 bytes = VRAM for weights
  • INT8 (half quantized): parameters × 1 byte = VRAM for weights
  • INT4 (aggressive quantization): parameters × 0.5 bytes = VRAM for weights

Then add 20-40% overhead for the KV cache, activations, and operating system needs. That overhead grows with your batch size — serving 32 concurrent requests needs more KV cache memory than serving one.

Here’s what this looks like in practice:

Model tierExample modelsGPU requirementVRAM neededApproximate monthly cost
7-14B parametersLlama 3.1 8B, Qwen 2.5 14B, Gemma 2 9B1× NVIDIA L4 24GB (INT4) or 1× L40S 48GB8-16 GB (INT4) / 16-32 GB (FP16)€200-600/mo (cloud GPU)
32-70B parametersLlama 3.1 70B, Qwen 2.5 72B, Qwen 2.5 32B2-4× A100 80GB or 1× H100 80GB35-70 GB (INT4) / 70-140 GB (FP16)€1,500-4,000/mo
400B+ MoE modelsLlama 4 Maverick, DeepSeek V3 (671B)4-8× H100 80GB (NVLink)200-350 GB (INT4) / 400-700 GB (FP16)€8,000-20,000/mo

The “approximate monthly cost” column is cloud GPU pricing from European providers like Scaleway and Hetzner as of March 2026 — we’ll get to specific vendors shortly. If you’re buying hardware outright, multiply by roughly 18-24 months to get the capital expenditure and then amortize.

VRAM usage breakdown across FP16, INT8, and INT4 quantization levels for model inference

Picking an inference engine

You’ve got the hardware sorted. Now you need software that actually loads the model and serves requests efficiently. The inference engine is what sits between your GPU and your application, and the choice matters more than most people expect.

vLLM is the current default for most production deployments. It implements PagedAttention — a memory management technique that dramatically improves throughput by avoiding VRAM waste from fragmented KV caches. Their team reported 2.7× throughput gains on Llama 70B with 4× H100s in recent optimizations. If you’re serving multiple concurrent users and want the best throughput-per-dollar, vLLM is probably where you start. It supports most popular model architectures and has an OpenAI-compatible API out of the box.

SGLang is the newer contender that’s been turning heads. It uses RadixAttention for prefix caching, which means repeated prompt prefixes (common in agentic workflows) are cached and reused rather than recomputed. SGLang is now deployed at scale by xAI, NVIDIA, AMD, Cursor, and others, generating trillions of tokens daily in production. In benchmarks with structured output and multi-turn conversations, SGLang often beats vLLM on latency. It’s worth evaluating if your workload involves a lot of back-and-forth with the same system prompt.

TensorRT-LLM is NVIDIA’s own offering. It compiles models into optimized CUDA kernels for maximum single-request performance. The downside: it’s tightly coupled to NVIDIA hardware, the build process is fiddly, and updates lag behind new model releases. If you’re running on NVIDIA GPUs (you probably are) and need the absolute lowest latency for a specific model you won’t change often, TensorRT-LLM can squeeze out extra performance — though the exact gain varies significantly by model and hardware configuration.

Ollama is the friendly option for getting started. It wraps llama.cpp in a Docker-like experience — ollama run llama3.1 and you’re serving inference. It’s great for development, prototyping, and small-scale personal use. It’s not what you want for production at scale. Ollama doesn’t support continuous batching, its throughput with concurrent users is significantly lower than vLLM or SGLang, and it lacks the observability hooks you need for serious operations.

The practical advice: start with vLLM. Benchmark SGLang against it with your actual workload. Only consider TensorRT-LLM if you’ve profiled a specific bottleneck that the other two can’t solve. Use Ollama for local development.

Four inference engines compared as vehicles suited for different workload types

Where to get GPUs in Europe

If you’re self-hosting for data sovereignty reasons, running on AWS or Azure defeats the purpose — even their “sovereign cloud” offerings don’t sever the CLOUD Act jurisdiction chain. You need European-owned providers.

Three stand out:

Scaleway (Paris-based, Iliad Group) offers NVIDIA H100, L40S, and L4 instances from their European data centers. As of March 2026, an L40S runs €1.40/hour (~€1,022/month) and an H100 PCIe starts at €2.73/hour (~€1,992/month). They have an AI-specific platform (Scaleway AI) with pre-configured inference stacks. French-owned, GDPR-native, no CLOUD Act exposure.

OVHcloud (Roubaix-based, publicly traded on Euronext Paris) is one of Europe’s largest cloud providers. They offer NVIDIA A100 and H100 GPU instances, plus a managed AI training platform. They’ve been investing heavily in AI infrastructure and their pricing is generally competitive with — and often below — equivalent hyperscaler offerings.

Hetzner (Gunzenhausen, Germany) is the budget option with serious hardware. Known for aggressive pricing on dedicated servers, their GPU-Line starts at €184/month for inference-optimized servers with NVIDIA RTX GPUs, going up to machines with 96 GB of VRAM for training workloads. The trade-off: less managed tooling, fewer high-level abstractions. If your team can handle the ops, Hetzner’s price-to-performance ratio is hard to beat.

All three are incorporated and headquartered in the EU, with no US parent company in the chain. That’s the line that matters.

The cost comparison everyone wants

Here’s where the self-hosting calculus gets real. Let’s compare API pricing against self-hosted costs at three usage levels, using a 70B parameter model (the sweet spot for most enterprise workloads) as the benchmark.

We’ll use current API pricing for models like Llama 3.1 70B on managed endpoints, versus the cost of running equivalent hardware on a European cloud provider. Self-hosted costs below are based on Scaleway H100 PCIe pricing as of March 2026 (~€2,000/month per GPU).

Usage levelTokens per dayAPI cost/month (hosted Llama 70B)Self-hosted cost/month (1-2× H100 80GB)Winner
Low (prototyping, internal tools)~1M tokens/day~€90/mo~€2,000/moAPI by a mile
Medium (production app, moderate traffic)~20M tokens/day~€1,800/mo~€2,000/moRoughly even — depends on ops cost
High (heavy production, multi-app)~200M tokens/day~€18,000/mo~€2,000-4,000/moSelf-hosted, if you can run it

Below 15 million tokens a day, you are paying a GPU premium to process air. Above it, you are saving real money — but only if you account for the hidden costs.

The breakeven point lands somewhere between 15-30 million tokens per day, depending on your model choice, quantization level, and how efficiently your inference engine batches requests. Below that, you’re paying a GPU premium to process air. Above it, you’re saving real money — but only if you account for the hidden costs.

Those hidden costs are significant: an ML engineer to maintain the stack (€80-120k/year), monitoring and alerting infrastructure, model update testing and rollout, failover and redundancy (double your GPU costs if you need high availability), and the opportunity cost of your engineering team not building product features.

The operational reality nobody talks about

Self-hosting AI isn’t a “set it and forget it” deployment. It’s a living system that needs care.

With an API, you handle a traffic spike by doing nothing. With self-hosted infrastructure, a traffic spike means either you have over-provisioned or your users are waiting in a queue.

Model updates are not automatic. When a better version of your model drops — and the pace of improvement is relentless — someone needs to download it, test it against your workloads, verify that quality hasn’t regressed on your specific use cases, update the quantization, roll it out with zero downtime, and keep the old version around for rollback. This is an ongoing process, not a one-time setup.

GPU failures happen. NVIDIA GPUs are reliable, but when you’re running multiple cards at high utilization 24/7, failures are a matter of when, not if. You need monitoring (GPU temperature, memory errors, inference latency), alerting, and either spare capacity or a fast failover plan. ECC memory errors in particular can cause silent quality degradation — your model starts giving subtly worse answers and nobody notices until a customer complains.

Scaling isn’t elastic. With an API, you handle a traffic spike by… doing nothing. The provider scales for you. With self-hosted infrastructure, a traffic spike means either you’ve over-provisioned (wasting money during normal load) or your users are waiting in a queue. Auto-scaling GPU instances takes minutes, not seconds, and you’re paying for those instances even while the model is loading.

Security is your problem. Model weights need to be stored securely. The inference API needs authentication, rate limiting, and input validation. The GPU servers need to be patched and hardened. None of this is rocket science, but it’s real work that someone on your team needs to own.

Self-hosted GPU infrastructure surrounded by operational burdens: updates, failures, scaling, security

The honest recommendation: hybrid

For most organizations, the right answer isn’t “self-host everything” or “use APIs for everything.” It’s a hybrid approach that matches the hosting model to the workload.

Self-host when: You’re processing more than 20-30 million tokens per day on a single model. You have hard sovereignty requirements that eliminate API providers. You need deterministic latency and can’t tolerate provider rate limits. You have ML engineering capacity on your team.

Use APIs when: You’re below the cost crossover point. You need access to multiple models and can’t afford to maintain separate GPU deployments for each. You need elastic scaling. Your team’s engineering capacity is better spent on product than infrastructure.

The hybrid sweet spot: Run your high-volume, sovereignty-sensitive workloads on self-hosted infrastructure. Route everything else through managed APIs. Use an abstraction layer that can switch between backends based on workload characteristics — cost, latency, data classification, model capability.

This is exactly the complexity that LumaVista handles for you. Rather than managing GPU servers, inference engines, model updates, and routing logic yourself, you define policies — “sensitive data stays on sovereign infrastructure, everything else optimizes for cost” — and the platform handles the rest. It’s the hybrid approach without the hybrid ops burden.

What to do now

  1. Audit your current AI usage. Count your tokens per day, per workload. If you don’t have this number, you can’t make the self-hosting decision rationally.

  2. Classify your data. Which workloads involve sensitive or regulated data? Those are your sovereignty candidates. Everything else can stay on APIs.

  3. Do the VRAM math. Pick the model you’d self-host, apply the formula above, and figure out what hardware you’d need. Don’t forget the 20-40% overhead.

  4. Price it out honestly. Include the GPU costs, the staffing, the redundancy, the monitoring tooling, and the engineering time. Compare against your actual API bill, not a hypothetical one.

  5. Run a pilot on a single workload. Don’t try to self-host everything at once. Pick your highest-volume, most predictable workload — usually something like document embedding or classification — and run it on a single GPU instance for a month. Measure everything.

  6. Evaluate managed hybrid platforms. If the pilot shows promise but the operational complexity gives you pause, look at platforms that abstract the infrastructure while preserving your sovereignty and cost advantages.

  7. Set a review cadence. The cost landscape changes quarterly. GPU prices drop, API prices drop, new models arrive. Whatever decision you make today should be re-evaluated every 90 days.