Why We Built Our Own LLM Gateway

Picture this: your AI agent is deep into a research task — synthesizing legal documents, cross-referencing case law, pulling together a report for a client. Every request it makes flows through a third-party inference proxy. That proxy is a Python service maintained by a startup you’ve never met, running code you haven’t audited, hosted on infrastructure subject to the US CLOUD Act. Every prompt, every piece of context your agent sends, passes through someone else’s code before it ever reaches a GPU.

That’s not a hypothetical. That was our setup six months ago. And if you’re running AI workloads through any managed proxy service — LiteLLM, OpenRouter, or similar — it’s probably yours too.

We’ve written before about why the CLOUD Act matters for AI research. The short version: US law lets authorities compel any provider subject to US jurisdiction to hand over customer data under a valid legal order, regardless of where that data is physically stored. Running your inference through a US proxy puts every prompt the proxy handles or stores within reach of US authorities. For European users handling sensitive data (legal, medical, financial), that’s not a compliance footnote. It’s a showstopper.

But jurisdiction was only one piece of the puzzle. The operational problems were just as real: no way to prioritize interactive requests over batch jobs, no auto-scaling when demand spiked, and a configuration model that required service restarts for routing changes. So we built Meridian.

What we tried first

Let’s give credit where it’s due. LiteLLM got several things right. It abstracts away the differences between model providers behind an OpenAI-compatible API. You don’t need to rewrite your code when you switch from one model to another. The community is active, the documentation is decent, and for simple setups it works out of the box.

Here’s where it fell short for us:

No priority queuing. When a user is waiting for a real-time chat response and a background batch job fires off fifty requests simultaneously, LiteLLM treats them all equally. First come, first served. Your interactive user stares at a spinner while batch jobs monopolize the GPU.

No auto-scaling. You provision a fixed number of instances. When traffic spikes, you’re either over-provisioned (wasting money) or under-provisioned (dropping requests). Scaling up means manual intervention: spinning up new instances, updating configs, restarting services.

Third-party Python in the data path. Every single request flows through Python code you don’t control. You’re trusting that the proxy doesn’t log prompts, doesn’t send telemetry, doesn’t have vulnerabilities. For privacy-sensitive workloads, “trust but don’t verify” isn’t a strategy.

Config requires restarts. Want to add a new model backend? Change a routing rule? Update rate limits? Restart the service. In a production environment serving real users, that’s a non-starter.

None of these are bugs — they’re design choices that make sense for LiteLLM’s target audience. But they didn’t work for ours.

Every prompt flowing through a third-party proxy you haven’t audited is a privacy liability you’ve chosen to accept by default.

Why a library, not a microservice

Here’s the decision that shaped everything else: Meridian is a Go module, not a standalone service.

Most inference gateways are deployed as separate processes. Your application talks to the gateway over HTTP, the gateway talks to GPUs over HTTP, and you’ve got two network hops, serialization overhead, and another service to monitor. For our primary use case — LVDR’s agent engine making hundreds of LLM calls per research task — that overhead adds up fast.

By building Meridian as an importable library, LVDR calls it in-process. No network hop between the agent and the gateway. No JSON serialization round-trip on that leg. No extra container to deploy. The gateway is just a function call; the only remaining hop is the one to the GPU backend. When we talk about how agentic AI systems work, this is the kind of tight integration that makes agents fast: the gateway doesn’t sit between the agent and the model, it’s part of the agent’s runtime.

But we didn’t want a single-purpose library. Meridian supports three deployment modes:

Embedded library — what LVDR uses. Import the Go module, call it in-process, zero overhead.

Standalone SaaS — multiple gateway instances behind a load balancer, multi-tenant, OpenAI-compatible API. For customers who want managed inference without running their own infrastructure.

B2B on-prem — a single Docker image containing the gateway binary, an embedded admin dashboard, and Prometheus metrics. Customers point it at their own GPUs.

The trade-off was real. We gave up LiteLLM’s 100+ provider integrations, its community plugins, and the entire Python ecosystem. What we gained: in-process calls with zero serialization overhead, compile-time type safety, a single binary deployment, and complete control over every byte that flows through the system.

Capability-based routing

Here’s the core idea: agents shouldn’t know which model they’re talking to. They should declare what they need, not which model they want.

Think of it like Kubernetes scheduling. A pod doesn’t say “schedule me on node-07.” It declares resource requests (“I need 4 CPUs and 16GB of RAM”) and label selectors, and the scheduler figures out where it fits. Meridian works the same way for inference.

An agent making a request specifies capabilities — things like reasoning, fast, long-context, or code — along with minimum context window requirements and a priority tier. Here’s what the actual request struct looks like:

req := meridian.Request{
    Messages:     messages,
    Capabilities: []string{"reasoning"}, // required
    Prefer:       []string{"fast"},      // nice to have
    MinContext:   16384,                 // minimum context window
    Tier:         meridian.TierNormal,
    TenantID:     "tenant-xyz",
}

The gateway takes it from there with a four-step backend selection algorithm:

Filter — eliminate backends that can’t serve the request. Wrong capabilities? Too small a context window? At capacity? Gone.
Score — rank remaining backends by capability match, preferred trait bonuses, current load, latency, locality, and cost.
Select — highest score wins, with random tie-breaking.
Fallback — if nothing matches, either stay in the queue (back-pressure) or reject with a clear error explaining what’s missing.
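
The four steps above can be sketched in a few dozen lines of Go. This is a minimal illustration, not Meridian’s actual router: the `Backend` type, the `MaxInflight` capacity field, and the scoring weights are all assumptions, and the score only covers preferred traits and load (latency, locality, and cost are left out).

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// Backend is a hypothetical view of one model backend as the router sees it.
type Backend struct {
	Name         string
	Capabilities map[string]bool
	MaxContext   int // largest context window the backend supports
	Inflight     int // requests currently being served
	MaxInflight  int // capacity limit (assumed to be configured per backend)
}

// Request carries only the routing-relevant fields from the article's struct.
type Request struct {
	Capabilities []string
	Prefer       []string
	MinContext   int
}

// ErrNoBackend is the fallback result: the caller can keep the request
// queued or reject it.
var ErrNoBackend = errors.New("no backend satisfies the request")

// eligible is the filter step: capabilities, context window, capacity.
func eligible(b Backend, r Request) bool {
	if b.MaxContext < r.MinContext || b.Inflight >= b.MaxInflight {
		return false
	}
	for _, c := range r.Capabilities {
		if !b.Capabilities[c] {
			return false
		}
	}
	return true
}

// score is the score step. The weights (10 per preferred trait, up to 5
// for free capacity) are made up for illustration.
func score(b Backend, r Request) float64 {
	s := 0.0
	for _, p := range r.Prefer {
		if b.Capabilities[p] {
			s += 10
		}
	}
	return s + 5*(1-float64(b.Inflight)/float64(b.MaxInflight))
}

// Select runs filter, score, and select, and returns ErrNoBackend as the fallback.
func Select(backends []Backend, r Request) (Backend, error) {
	var best []Backend
	bestScore := 0.0
	for _, b := range backends {
		if !eligible(b, r) {
			continue
		}
		s := score(b, r)
		if len(best) == 0 || s > bestScore {
			best, bestScore = []Backend{b}, s
		} else if s == bestScore {
			best = append(best, b)
		}
	}
	if len(best) == 0 {
		return Backend{}, ErrNoBackend
	}
	return best[rand.Intn(len(best))], nil // random tie-breaking
}

func main() {
	backends := []Backend{
		{Name: "gpu-a", Capabilities: map[string]bool{"reasoning": true}, MaxContext: 32768, MaxInflight: 4},
		{Name: "gpu-b", Capabilities: map[string]bool{"reasoning": true, "fast": true}, MaxContext: 32768, Inflight: 2, MaxInflight: 4},
	}
	req := Request{Capabilities: []string{"reasoning"}, Prefer: []string{"fast"}, MinContext: 16384}
	b, err := Select(backends, req)
	fmt.Println(b.Name, err)
}
```

In this sketch the preferred-trait bonus outweighs the load term, so a busier backend that has the preferred trait can still win. Whether that is the right balance depends on the real weights, which the article does not state.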

In LVDR, each agent type maps to a specific capability profile:

Agent              Capabilities                 Priority tier
Planner            reasoning, instruction       normal
Searcher           fast, instruction            normal
Aggregator         reasoning, long-context      normal
Report Writer      instruction, long-context    normal
RAG                reasoning, instruction       normal
Reranker           fast                         low
Interactive Chat   reasoning, instruction       critical

The beauty of this approach? When a faster model becomes available, or we switch from Llama to Qwen for a particular capability, no application code changes. The routing layer handles it. We can upgrade models, swap providers, or add new backends without touching a single agent.

Agents should declare what they need, not which model they want. Capability routing decouples application logic from model selection entirely.

Priority queuing with batch soaking

Not all requests are created equal. A user typing in a chat box needs a response in seconds. A background job reprocessing yesterday’s research can wait minutes. LiteLLM treats them identically. Meridian doesn’t.

Three priority tiers — critical, normal, and low — plus a special parking queue for batch work. Each tier gets a base weight, and lower-tier requests age upward over time to prevent starvation (where batch jobs starve because interactive requests keep cutting in line):

effective_priority = base_weight + (age_seconds × aging_factor)

Tier       base_weight    aging_factor
critical   1000           0             (highest base weight, no aging)
normal     100            2.0           (matches critical after ~450s)
low        10             1.0           (matches critical after ~990s)
parking    0              0.5           (very slow — best-effort)

The dispatcher checks all queue heads, scores them by effective priority, and routes the winner. Fresh critical requests go first. But a normal request that’s been waiting longer than 450 seconds (seven and a half minutes) scores above 1000 and jumps ahead of a critical one. That’s the fairness guarantee.
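
To make the aging arithmetic concrete, here is a small sketch that applies the formula to the tier table. The type and function names are ours, chosen for illustration; only the weights and factors come from the table. Solving base_weight + age × aging_factor = 1000 gives the crossover times: 450 s for normal, 990 s for low, and 2000 s for parking.

```go
package main

import "fmt"

// Tier holds the two parameters from the table above.
type Tier struct {
	Name        string
	BaseWeight  float64
	AgingFactor float64
}

var (
	Critical = Tier{Name: "critical", BaseWeight: 1000, AgingFactor: 0}
	Normal   = Tier{Name: "normal", BaseWeight: 100, AgingFactor: 2.0}
	Low      = Tier{Name: "low", BaseWeight: 10, AgingFactor: 1.0}
	Parking  = Tier{Name: "parking", BaseWeight: 0, AgingFactor: 0.5}
)

// EffectivePriority applies base_weight + age_seconds * aging_factor.
func EffectivePriority(t Tier, ageSeconds float64) float64 {
	return t.BaseWeight + ageSeconds*t.AgingFactor
}

// SecondsToMatch returns how long a request in tier t must wait before its
// effective priority equals that of a fresh request in tier target.
// It returns -1 if t never catches up.
func SecondsToMatch(t, target Tier) float64 {
	if t.BaseWeight >= target.BaseWeight {
		return 0
	}
	if t.AgingFactor <= 0 {
		return -1
	}
	return (target.BaseWeight - t.BaseWeight) / t.AgingFactor
}

func main() {
	for _, t := range []Tier{Normal, Low, Parking} {
		fmt.Printf("%-8s matches a fresh critical request after %.0fs\n", t.Name, SecondsToMatch(t, Critical))
	}
	// A normal request that has waited seven minutes (420s) is still below 1000.
	fmt.Println(EffectivePriority(Normal, 420))
}
```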

Now for the clever part: batch soaking. When a burst GPU instance finishes its primary workload and enters cooldown — the load has dropped, but the billing hour hasn’t expired yet — the scaler marks it as “soaking.” During soaking, the dispatcher routes parking queue items (batch jobs) to that GPU. You’ve already paid for the hour. Those batch tokens cost you €0 in marginal compute.

Here’s the flow:

Incoming Request
  → Classify tier (agent type, endpoint, explicit header)
  → Tenant concurrency check
      → Over limit? → Reject 429
  → Enqueue to priority tier
  → Dispatcher pops highest effective_priority
      → Priority: critical > normal > low > parking
  → Route to backend (capability match + load scoring)
  → Response back to caller
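
The tenant concurrency check in that flow could look roughly like the sketch below. It is an assumption about how such a check might be built, not Meridian’s code: `TenantLimiter` and its methods are hypothetical names, and a single limit is shared by all tenants for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// TenantLimiter caps in-flight requests per tenant. The single shared
// limit is an assumption; a real gateway would likely make it per tenant.
type TenantLimiter struct {
	mu       sync.Mutex
	limit    int
	inflight map[string]int
}

func NewTenantLimiter(limit int) *TenantLimiter {
	return &TenantLimiter{limit: limit, inflight: make(map[string]int)}
}

// Acquire reserves a slot and reports false when the tenant is at its
// limit, which the HTTP layer would translate into a 429.
func (l *TenantLimiter) Acquire(tenant string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inflight[tenant] >= l.limit {
		return false
	}
	l.inflight[tenant]++
	return true
}

// Release frees a slot once the response has been returned.
func (l *TenantLimiter) Release(tenant string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inflight[tenant] > 0 {
		l.inflight[tenant]--
	}
}

func main() {
	lim := NewTenantLimiter(2)
	fmt.Println(lim.Acquire("tenant-xyz"), lim.Acquire("tenant-xyz"), lim.Acquire("tenant-xyz"))
	lim.Release("tenant-xyz")
	fmt.Println(lim.Acquire("tenant-xyz"))
}
```

Rejecting before enqueueing keeps one noisy tenant from filling the priority queues for everyone else.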

When a burst instance enters cooldown:

Scaler marks the backend as mode: soaking
Dispatcher routes only parking queue items to soaking backends
At billing hour expiry, the scaler terminates the instance
In-flight parking requests get re-queued — batch work, so latency doesn’t matter
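
A rough sketch of that routing rule, assuming the dispatcher sees a single list already ordered by effective priority. The `Mode`, `Item`, and `Queue` types are hypothetical; the article only tells us that soaking backends receive parking items and that in-flight parking work is re-queued on termination.

```go
package main

import "fmt"

// Mode is a hypothetical backend state flag; the article only names "soaking".
type Mode int

const (
	ModeActive Mode = iota
	ModeSoaking
)

// Item is a queued request. Tier is one of critical, normal, low, parking.
type Item struct {
	ID   string
	Tier string
}

// Queue holds items already ordered by effective priority.
type Queue struct {
	items []Item
}

// Next pops the best item a backend in the given mode may serve.
// Soaking backends only ever receive parking items.
func (q *Queue) Next(m Mode) (Item, bool) {
	for i, it := range q.items {
		if m == ModeSoaking && it.Tier != "parking" {
			continue
		}
		q.items = append(q.items[:i], q.items[i+1:]...)
		return it, true
	}
	return Item{}, false
}

// Requeue puts in-flight parking work back when a soaking instance is
// terminated at the end of its billing hour.
func (q *Queue) Requeue(inflight ...Item) {
	q.items = append(q.items, inflight...)
}

func main() {
	q := &Queue{items: []Item{{"c1", "critical"}, {"p1", "parking"}, {"n1", "normal"}}}
	it, _ := q.Next(ModeSoaking)
	fmt.Println(it.ID) // the parking item, even though a critical one is ahead of it
	q.Requeue(it)      // instance terminated mid-request
	it, _ = q.Next(ModeActive)
	fmt.Println(it.ID)
}
```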

Auto-scaling across EU providers

Static GPU provisioning is either wasteful or fragile. Too many instances and you’re bleeding money on idle hardware. Too few and your queue backs up during peak hours. Meridian’s auto-scaler handles this dynamically.

The scaler runs as a leader-elected singleton — only one instance across the entire cluster makes scaling decisions, coordinated through Redis. It collects three signals every ten seconds:

Queue pressure — how deep is each priority queue? How old is the oldest request? How fast are new requests arriving?

Capacity shortage — which capabilities are starved? How many requests are unroutable because no backend matches?

Fleet utilization — per-instance GPU utilization, memory usage, inflight request ratios, aggregated into fleet-wide averages.

Scale-up triggers when any of these fire: critical queue deeper than 5 requests for more than 15 seconds, fleet utilization above 85% for two minutes, or the router can’t find backends for a capability that’s in demand. Scale-down kicks in when a burst instance has been under 20% utilization for five minutes and the queues are clear.

The scaler talks to GPU providers through a clean abstraction:

type InstanceProvider interface {
    Provision(ctx context.Context, spec InstanceSpec) (*Instance, error)
    Terminate(ctx context.Context, instanceID string) error
    List(ctx context.Context) ([]*Instance, error)
    Status(ctx context.Context, instanceID string) (*InstanceStatus, error)
}

Four EU providers implement that interface today: Hetzner, OVHcloud, Scaleway, and Genesis Cloud. We’ve written about why EU-based providers matter. The short version is that hosting in an “EU region” of a US provider still leaves you exposed to the CLOUD Act. Only providers headquartered in the EU and operating under EU law give you genuine data sovereignty.

Budget guards prevent runaway costs. You set maximum instances, maximum cost per hour, maximum cost per day, and a threshold above which the scaler pauses and asks for human approval via webhook before provisioning more hardware.
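
As an illustration of how those four guards could combine, here is a small sketch. It is not the real scaler: the `Budget` type, the `Check` method, and the order of checks are assumptions, and we treat the approval threshold as a projected hourly cost because the article does not say what it measures.

```go
package main

import "fmt"

// Budget mirrors the four guards described above. Treating the approval
// threshold as an hourly cost is an assumption.
type Budget struct {
	MaxInstances      int
	MaxCostPerHour    float64 // EUR
	MaxCostPerDay     float64 // EUR
	ApprovalThreshold float64 // EUR per hour; above this a human must approve
}

type Decision string

const (
	Allow         Decision = "allow"
	NeedsApproval Decision = "needs-approval"
	Deny          Decision = "deny"
)

// Check decides whether one more instance costing extraHourly may be provisioned.
func (b Budget) Check(running int, currentHourly, spentToday, extraHourly float64) Decision {
	projected := currentHourly + extraHourly
	switch {
	case running+1 > b.MaxInstances:
		return Deny
	case projected > b.MaxCostPerHour:
		return Deny
	case spentToday+extraHourly > b.MaxCostPerDay: // at least one billed hour
		return Deny
	case projected > b.ApprovalThreshold:
		return NeedsApproval
	}
	return Allow
}

func main() {
	b := Budget{MaxInstances: 4, MaxCostPerHour: 20, MaxCostPerDay: 200, ApprovalThreshold: 10}
	fmt.Println(b.Check(1, 3, 40, 3))  // projected 6/h
	fmt.Println(b.Check(2, 8, 40, 3))  // projected 11/h
	fmt.Println(b.Check(4, 12, 40, 3)) // instance cap reached
}
```

Hard limits are checked before the approval threshold, so a human is only asked about requests that could actually be granted.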

The scaler also does predictive pre-warming. It builds demand profiles per hour of the week over rolling 7-day windows. If it sees that demand consistently spikes at 9 AM on Mondays, it’ll start provisioning instances 15 minutes early. No cold-start delays for your users.

Here’s what the provider landscape looks like for EU-sovereign GPU inference:

Hetzner (Germany) — RTX PRO 6000 with 96GB VRAM at €889/month. Best value for steady-state workloads. Runs Llama 70B at ~32 tokens/second per user.
OVHcloud (France) — single H100/H200 GPUs with hourly billing. Higher throughput, roughly 2.5x faster than Hetzner for the same model.
Scaleway (France) — H100 GPUs, pay-per-hour with no commitment. Good for variable workloads.
Genesis Cloud (Nordic/EU) — 8-GPU H200 nodes for large models like DeepSeek-R1. The only EU option that fits R1 at usable precision on a single node.

All EU-headquartered, all operating under EU law, zero US CLOUD Act exposure.

Predictive pre-warming builds demand profiles per hour of the week — if Monday 9 AM always spikes, instances start provisioning 15 minutes early.

What we gained

Six months of building Meridian instead of patching around LiteLLM’s limitations. Was it worth it? Here’s what changed:

Latency control. Interactive users don’t compete with batch jobs anymore. Critical requests get priority. Average time-to-first-token dropped because chat requests aren’t stuck behind a queue of background reprocessing jobs.

Cost visibility. Per-tenant billing tracks exactly how many tokens each user consumes, on which models, at what cost. Per-model cost tracking shows us where money actually goes. The batch soaking pattern recovers value from GPU time we’ve already paid for — parking queue work runs at €0 marginal cost during billing hour cooldowns.

Privacy guarantees. Zero third-party code in the request path. No Python proxy we didn’t write. No telemetry we didn’t configure. Every prompt stays on infrastructure we control, hosted by EU-headquartered providers under EU law. We’ve talked about how open-source models caught up with proprietary ones — Meridian is what lets us actually use them without compromising on privacy.

Operational observability. Every gateway instance exports Prometheus metrics: queue depths, latency percentiles, GPU utilization, token throughput, cost rates. The embedded admin dashboard shows real-time fleet status — which instances are healthy, which are draining, which are soaking batch work. Alert rules fire webhooks to Slack or PagerDuty when things go sideways.

Deployment flexibility. The same codebase runs embedded in LVDR (zero overhead), as a standalone multi-tenant SaaS, or as a single Docker image for B2B on-prem customers. One module, three deployment modes, no code forks.

What to do now

If you’re evaluating your own inference stack, here’s where to start:

Audit your current proxy. What code sits between your application and the GPU? Who wrote it? What telemetry does it send? Read the source — if you can’t, that’s your answer.
Check your provider’s jurisdiction. “EU region” on a US cloud provider doesn’t protect you from the CLOUD Act. Verify that your provider is headquartered in the EU and not a subsidiary of a US parent company.
Separate interactive and batch traffic. Even without rebuilding your proxy, you can run two instances — one for real-time, one for background work. It’s a band-aid, but it stops the worst latency spikes.
Map your agents to capabilities, not models. Start abstracting model names out of your application code now. When you switch models later (and you will), you’ll thank yourself.
Measure your GPU utilization. If your inference GPU sits below 30% utilization most of the day, you’re paying for idle hardware. Consider hourly billing or burst-based scaling.
Evaluate EU providers. Hetzner’s GEX131 at €889/month runs Llama 70B with 96GB of VRAM. That’s enough for a small team. OVHcloud and Scaleway offer single-GPU hourly billing if you need more flexibility.
Calculate your cost-per-token. Divide your monthly GPU cost by your actual token throughput. Compare that to API pricing from OpenAI or Anthropic. For high-volume workloads, self-hosted inference on EU hardware often costs less than €1.10 per million tokens.
Start with one backend, then grow. You don’t need a full fleet on day one. Run a single GPU with capability routing, validate the integration, then add auto-scaling and burst providers as your traffic justifies it.
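
For the cost-per-token step, the arithmetic is simple enough to script. The sketch below uses the Hetzner figures quoted earlier (€889/month, ~32 tokens/second for one user); the aggregate throughput and utilization in the second call are hypothetical, so plug in your own measurements.

```go
package main

import "fmt"

// CostPerMillionTokens divides a flat monthly GPU cost by the tokens
// generated in a 30-day month.
// tokensPerSecond is aggregate throughput across all concurrent requests;
// utilization is the fraction of the month the GPU is actually busy.
func CostPerMillionTokens(monthlyCost, tokensPerSecond, utilization float64) float64 {
	const secondsPerMonth = 30 * 24 * 3600
	tokens := tokensPerSecond * utilization * secondsPerMonth
	return monthlyCost / tokens * 1e6
}

func main() {
	// Article figures: EUR 889/month, ~32 tokens/s for a single user.
	fmt.Printf("single stream, always busy: %.2f EUR per 1M tokens\n", CostPerMillionTokens(889, 32, 1.0))
	// Hypothetical: 10 concurrent streams at the same per-user speed, busy half the time.
	fmt.Printf("10 streams, 50%% busy:       %.2f EUR per 1M tokens\n", CostPerMillionTokens(889, 320, 0.5))
}
```

The result is dominated by how many requests the GPU serves concurrently and how much of the month it stays busy, which is why measuring utilization comes before comparing against API pricing.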