Introduction

LLM Gateway is an intelligent routing layer that sits between your application and multiple LLM providers. It automatically selects the best model for each request based on cost, latency, reliability, and your preferences.

Cost Optimization

Save 30-50% on LLM costs with intelligent model selection

Automatic Failover

Stay online during provider outages with multi-provider fallbacks

Full Observability

Execution receipts for every request with routing decisions

Drop-in Compatible

Works with OpenAI SDK, LangChain, and any OpenAI-compatible client

Quick Start

Get up and running in under 5 minutes. Here's everything you need.

1. Get your API key

Sign up and create an API key from the dashboard.

2. Make your first request

Use cURL or any HTTP client to make a request:

First API Call (bash)
curl https://api.llmroute.xyz/v1/chat/completions \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Hello, world!"}
    ]
  }'

3. Or use the OpenAI SDK

Just change the base URL—everything else stays the same:

Python
from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://api.llmroute.xyz/v1"
)

response = client.chat.completions.create(
    model="auto",  # Let the gateway choose the best model
    messages=[
        {"role": "user", "content": "Hello, world!"}
    ]
)

print(response.choices[0].message.content)

Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'your-llm-gateway-api-key',
  baseURL: 'https://api.llmroute.xyz/v1',
});

const response = await client.chat.completions.create({
  model: 'auto',
  messages: [
    { role: 'user', content: 'Hello, world!' }
  ],
});

console.log(response.choices[0].message.content);

Authentication

All API requests require authentication using a Bearer token.

Authorization Header (http)
Authorization: Bearer llm_gw_xxxxxxxxxxxxxxxxxxxx

API Key Types

Type        | Prefix       | Use Case
Production  | llm_gw_prod_ | Live production traffic
Development | llm_gw_dev_  | Testing and development

Security Best Practices

  • Never expose API keys in client-side code
  • Use environment variables to store keys (see the sketch below)
  • Rotate keys periodically
  • Use different keys for development and production
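
For example, a minimal pattern with the OpenAI SDK, reading the key from the environment rather than hard-coding it:

Key from Environment (python)
import os

from openai import OpenAI

# The key lives in the environment, never in source control
client = OpenAI(
    api_key=os.environ["LLM_GATEWAY_API_KEY"],
    base_url="https://api.llmroute.xyz/v1",
)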

Auto-Routing

Auto-routing is the core feature of LLM Gateway. When you set model: "auto", the gateway evaluates multiple factors to select the optimal model.

How It Works

1

Request Analysis

We analyze your request including prompt complexity, expected output length, and any constraints you specify.

2

Provider Evaluation

We check real-time availability, latency, and cost across all enabled providers.

3

Model Selection

Using your preferences and our optimization algorithms, we select the best model.

4

Execution & Fallback

We execute the request with automatic fallback if the primary choice fails.

Routing Hints

You can provide hints to influence routing decisions:

Routing Hints (json)
{
  "model": "auto",
  "messages": [...],
  "x-routing-hints": {
    "priority": "cost",        // "cost" | "latency" | "quality"
    "max_latency_ms": 2000,    // Maximum acceptable latency
    "prefer_providers": ["anthropic", "openai"],
    "exclude_providers": ["cohere"],
    "min_quality_score": 0.8
  }
}
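
The SDKs have no dedicated parameter for these hints; assuming the gateway reads x-routing-hints from the request body as shown above, the OpenAI Python SDK can pass them through its extra_body escape hatch. A sketch:

Routing Hints via SDK (python)
from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://api.llmroute.xyz/v1",
)

# extra_body merges additional fields into the request JSON,
# which is where the gateway looks for routing hints
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
    extra_body={
        "x-routing-hints": {
            "priority": "cost",
            "max_latency_ms": 2000,
        }
    },
)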

Providers & Models

LLM Gateway supports all major LLM providers. You can use auto-routing or specify a model directly.

Supported Providers

ProviderModelsFeatures
OpenAIGPT-4o, GPT-4 Turbo, GPT-3.5 TurboFunctions, Vision, JSON mode
AnthropicClaude 3.5 Sonnet, Claude 3 Opus, Claude 3 HaikuLong context, Tool use
GoogleGemini Pro, Gemini UltraMultimodal, Long context
MistralMistral Large, Mistral Medium, Mistral SmallFast inference, Cost-effective
CohereCommand R+, Command RRAG optimized, Multilingual

Direct Model Access

You can also request a specific model directly:

Specific Model (json)
{
  "model": "openai/gpt-4o",
  "messages": [...]
}

// Or with provider prefix
{
  "model": "anthropic/claude-3-5-sonnet-20241022",
  "messages": [...]
}

Fallbacks & Reliability

LLM Gateway automatically handles failures with intelligent fallback strategies.

Automatic Fallback

When a provider fails, we automatically retry with the next best option:

Primary fails → Retry secondary → Success

Circuit Breaker

We implement circuit breakers to prevent cascading failures:

  • Closed: Normal operation, requests pass through
  • Open: Provider temporarily disabled after repeated failures
  • Half-Open: Testing provider recovery with limited traffic
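
Conceptually, each provider's breaker behaves like the following sketch (illustrative only, not the gateway's actual implementation):

Circuit Breaker States (python)
import time

class CircuitBreaker:
    """Minimal illustration of the closed -> open -> half-open cycle."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self):
        if self.opened_at is None:
            return True  # Closed: normal operation, requests pass through
        if time.time() - self.opened_at >= self.cooldown_seconds:
            return True  # Half-open: let a probe request test recovery
        return False  # Open: provider temporarily disabled

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # Probe succeeded: close the breaker

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()  # Trip open after repeated failures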

Custom Fallback Chain

Custom Fallback (json)
{
  "model": "auto",
  "messages": [...],
  "x-fallback-chain": [
    "anthropic/claude-3-5-sonnet",
    "openai/gpt-4o",
    "mistral/mistral-large"
  ]
}
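
As with routing hints, the chain can be sent through the OpenAI SDK's extra_body (a sketch, assuming the gateway reads x-fallback-chain from the request body as shown above):

Fallback Chain via SDK (python)
from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://api.llmroute.xyz/v1",
)

response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Draft a release note."}],
    # Models tried in order if the primary selection fails
    extra_body={
        "x-fallback-chain": [
            "anthropic/claude-3-5-sonnet",
            "openai/gpt-4o",
            "mistral/mistral-large",
        ]
    },
)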

Cost Optimization

LLM Gateway helps you save 30-50% on LLM costs through intelligent routing and real-time cost analysis.

How We Optimize Costs

Smart Model Selection

We match request complexity to the most cost-effective model that meets quality requirements.

Real-time Pricing

We track pricing changes across providers and adjust routing instantly.

Spend Limits

Set daily, weekly, or monthly spend limits to prevent unexpected charges.

Cost Alerts

Get notified when spending exceeds thresholds or anomalies are detected.

Setting Spend Limits

Spend Limits via Dashboard or API (json)
{
  "limits": {
    "daily_usd": 100,
    "monthly_usd": 2500,
    "per_request_usd": 0.50
  },
  "alerts": {
    "threshold_percent": 80,
    "webhook_url": "https://your-app.com/webhook/spend-alert"
  }
}
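
To set limits programmatically, the request below is a sketch only: the payload matches the config above, but the /v1/limits path is a hypothetical placeholder, so check the dashboard for the actual endpoint.

Spend Limits Sketch (python)
import os

import requests

# NOTE: "/v1/limits" is a hypothetical path used for illustration only
resp = requests.put(
    "https://api.llmroute.xyz/v1/limits",
    headers={"Authorization": f"Bearer {os.environ['LLM_GATEWAY_API_KEY']}"},
    json={
        "limits": {"daily_usd": 100, "monthly_usd": 2500, "per_request_usd": 0.50},
        "alerts": {
            "threshold_percent": 80,
            "webhook_url": "https://your-app.com/webhook/spend-alert",
        },
    },
)
resp.raise_for_status()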

Semantic Cache

Semantic caching reduces costs by up to 90% for repeated or similar queries. Unlike exact-match caching, semantic cache understands the meaning of your prompts and returns cached responses for semantically similar requests.

💰 Real Savings Example

A customer support chatbot handling 10,000 queries/day saw 73% cache hits, reducing their monthly LLM costs from $3,200 to $864—saving over $2,300/month.

How It Works

1

Embedding Generation

Your prompt is converted to a vector embedding that captures its semantic meaning.

2

Similarity Search

We search for cached prompts with similar embeddings using vector similarity (cosine distance).

3

Threshold Check

If similarity exceeds your threshold (default 95%), the cached response is returned instantly.

4

Cache Miss

If no match is found, the request goes to the LLM and the response is cached for future use.
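
Steps 2 and 3 come down to simple vector math; here's a minimal sketch of the cosine-similarity threshold check (toy vectors for illustration, not a real embedding model):

Similarity Check Sketch (python)
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" for two phrasings of the same question
cached = [0.12, 0.87, 0.33, 0.41]
incoming = [0.10, 0.85, 0.36, 0.40]

SIMILARITY_THRESHOLD = 0.95  # Default, matching similarity_threshold below

if cosine_similarity(cached, incoming) >= SIMILARITY_THRESHOLD:
    print("Cache hit: return the stored response")
else:
    print("Cache miss: forward to the LLM and cache the result")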

Example: Similar Queries Hit Cache

Query 1: "What is the capital of France?"

→ LLM call, response cached

Query 2: "What's France's capital city?"

→ Cache hit! 98% similarity, instant response

Query 3: "Tell me the capital of France"

→ Cache hit! 96% similarity, instant response

Configuration

Cache Settings per Organization (json)
{
  "cache_enabled": true,
  "similarity_threshold": 0.95,  // 0.90-0.99, higher = stricter matching
  "ttl_seconds": 3600,           // Cache expiration (1 hour default)
  "exclude_models": []           // Models to never cache
}

Cache is isolated per organization—your cached responses are never shared with other customers.

Provider Connections

Connect your own API keys from LLM providers to get direct pricing and use your existing quotas. LLM Gateway manages routing, fallbacks, and observability while you maintain your provider relationships.

Supported Providers

OpenAI

GPT-4o, GPT-4 Turbo, GPT-3.5

Anthropic

Claude 3.5 Sonnet, Claude 3 Opus

Google AI

Gemini Pro, Gemini Ultra

Mistral

Mistral Large, Medium, Small

Groq

Llama 3, Mixtral (ultra-fast)

Together AI

Open source models

Cohere

Command R+, Command R

AWS Bedrock

Claude, Llama, Titan

Benefits of BYOK (Bring Your Own Keys)

  • Direct pricing: Pay provider rates directly, no markup on API costs
  • Use existing quotas: Leverage your negotiated rate limits and credits
  • Enterprise agreements: Maintain your BAAs and compliance contracts
  • Gradual migration: Test LLM Gateway without changing billing

How to Connect a Provider

  1. Navigate to Connections in the dashboard
  2. Click Add Connection and select your provider
  3. Enter your API key (encrypted with AES-256)
  4. Test the connection to verify it works
  5. Enable the provider for routing

Security Note

Your API keys are encrypted at rest using AES-256 and envelope encryption. Keys are only decrypted in memory when making requests to providers.

AI Recommendations

LLM Gateway analyzes your usage patterns and provides personalized recommendations to optimize costs, improve performance, and enhance reliability.

Types of Recommendations

Cost Savings

Identify cheaper models that provide equivalent quality for your use cases.

"Switch from GPT-4 to Claude 3 Haiku for simple Q&A—save 85% with similar quality."

Performance Optimization

Find faster models or providers with lower latency for your region.

"Route to Groq for latency-sensitive requests—get 5x faster responses."

Reliability Improvements

Add fallback providers or adjust retry strategies based on failure patterns.

"Add Anthropic as fallback—reduce failed requests by 94%."

Usage Insights

Understand your traffic patterns and optimize accordingly.

"Enable semantic cache—73% of your queries are semantically similar."

How Recommendations Work

  • We analyze your last 30 days of usage data
  • Recommendations are generated weekly or on-demand
  • Each recommendation includes estimated savings/impact
  • Apply recommendations with one click in the dashboard
  • Track the impact after applying changes

GPU Compute

Run open-source models on dedicated GPU infrastructure for maximum control, privacy, and cost efficiency at scale.

🚀 When to Use GPU Compute

  • High volume: 100K+ requests/day where per-token costs add up
  • Data privacy: Keep all data on your own infrastructure
  • Custom models: Run fine-tuned or specialized models
  • Predictable costs: Fixed GPU pricing vs variable per-token

Available GPU Options

GPU              | VRAM | Best For                  | Price/hr
NVIDIA A10G      | 24GB | Llama 3 8B, Mistral 7B    | $1.20
NVIDIA A100 40GB | 40GB | Llama 3 70B, Mixtral      | $3.50
NVIDIA A100 80GB | 80GB | Large models, fine-tuning | $5.00
NVIDIA H100      | 80GB | Maximum performance       | $8.00

Cost Comparison Example

Scenario: 500,000 requests/day with Llama 3 70B

API Provider

~$4,500/month

Based on ~$0.0003/1K tokens

Dedicated GPU

~$2,520/month

A100 40GB at $3.50/hr × 720 hrs/month = $2,520

💰 Save $1,980/month (44%) with dedicated GPU
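
The comparison is straightforward arithmetic; a sketch using the figures above (substitute your own volumes and rates):

GPU vs API Break-Even (python)
# Figures from the scenario above
api_cost_per_month = 4_500.00            # ~$0.0003/1K tokens at this volume

gpu_price_per_hour = 3.50                # A100 40GB
gpu_cost_per_month = gpu_price_per_hour * 24 * 30  # Running 24/7

savings = api_cost_per_month - gpu_cost_per_month
print(f"GPU: ${gpu_cost_per_month:,.0f}/mo, saving ${savings:,.0f} "
      f"({savings / api_cost_per_month:.0%})")
# GPU: $2,520/mo, saving $1,980 (44%)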

Deployment Options

  • Managed: We handle deployment, scaling, and maintenance
  • Self-hosted: Run on your own infrastructure with our container images
  • Hybrid: Route between managed GPUs and API providers based on load

Chat Completions

The Chat Completions API is fully compatible with the OpenAI specification.

Request Format

POST /v1/chat/completions (json)
{
  "model": "auto",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1000,
  "stream": false
}

Response Format

Response (json)
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "anthropic/claude-3-haiku",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  },
  "x-llm-gateway": {
    "request_id": "req_xyz789",
    "routing_decision": "cost_optimized",
    "cost_usd": 0.00012,
    "latency_ms": 234,
    "fallback_used": false
  }
}

Streaming

Stream responses in real-time using Server-Sent Events (SSE).

Streaming Request (python)
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.llmroute.xyz/v1"
)

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Execution Receipts

Every request generates an execution receipt with full details about the routing decision, cost, and performance.

Execution Receipt (json)
{
  "request_id": "req_abc123xyz",
  "timestamp": "2024-01-15T10:30:00Z",
  "routing": {
    "strategy": "cost_optimized",
    "candidates_evaluated": 5,
    "winner": "anthropic/claude-3-haiku",
    "reason": "Lowest cost meeting quality threshold"
  },
  "execution": {
    "provider": "anthropic",
    "model": "claude-3-haiku-20240307",
    "latency_ms": 234,
    "routing_overhead_ms": 12,
    "tokens": {
      "prompt": 150,
      "completion": 89,
      "total": 239
    }
  },
  "cost": {
    "provider_cost_usd": 0.00023,
    "platform_fee_usd": 0.00002,
    "total_usd": 0.00025,
    "savings_vs_default_usd": 0.00089
  },
  "fallback": {
    "used": false,
    "attempts": []
  }
}

Access receipts via the dashboard or the x-llm-gateway response header.
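
With the OpenAI Python SDK, response headers are exposed through with_raw_response; assuming the x-llm-gateway header carries the receipt summary as JSON, you can read it like this:

Reading the Receipt Header (python)
import json

from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://api.llmroute.xyz/v1",
)

# .with_raw_response exposes the HTTP response alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}],
)

receipt_header = raw.headers.get("x-llm-gateway")
if receipt_header:
    print(json.dumps(json.loads(receipt_header), indent=2))

completion = raw.parse()  # The usual ChatCompletion object
print(completion.choices[0].message.content)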

Rate Limits

Rate limits protect the system and ensure fair usage across all customers.

Plan       | RPM    | TPM       | Daily Requests
Free       | 20     | 40,000    | 1,000
Starter    | 100    | 200,000   | 10,000
Pro        | 500    | 1,000,000 | 100,000
Enterprise | Custom | Custom    | Unlimited

Rate Limit Headers

Response Headers (http)
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704067260
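
A minimal client-side pattern, assuming the gateway returns HTTP 429 with these headers when a limit is hit (standard for OpenAI-compatible APIs, though not stated explicitly above):

Handling 429s with Backoff (python)
import os
import time

import requests

def chat_with_backoff(payload, max_attempts=5):
    """Retry on 429, waiting until X-RateLimit-Reset when provided."""
    url = "https://api.llmroute.xyz/v1/chat/completions"
    headers = {"Authorization": f"Bearer {os.environ['LLM_GATEWAY_API_KEY']}"}

    for attempt in range(max_attempts):
        resp = requests.post(url, headers=headers, json=payload)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()

        # Prefer the server's reset timestamp; fall back to exponential backoff
        reset = resp.headers.get("X-RateLimit-Reset")
        wait = max(int(reset) - time.time(), 1) if reset else 2 ** attempt
        time.sleep(wait)

    raise RuntimeError("Rate limited after retries")

result = chat_with_backoff({
    "model": "auto",
    "messages": [{"role": "user", "content": "Hello!"}],
})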

Dashboard Overview

The dashboard provides real-time visibility into your LLM usage, costs, and performance.

Analytics

Request volume, latency percentiles, and success rates

Cost Tracking

Real-time spend, savings analysis, and billing

Request Logs

Searchable logs with execution receipts

Configuration

Routing preferences, rate limits, and alerts

API Keys

Manage API keys for different environments and use cases.

Creating Keys

  1. Navigate to Settings → API Keys in the dashboard
  2. Click "Create New Key"
  3. Select environment (Production or Development)
  4. Set optional rate limits and expiration
  5. Copy and securely store your key

API keys are shown only once. Store them securely in your environment variables or secrets manager.

Billing & Usage

Pay only for what you use with transparent, per-request pricing.

Pricing Model

Provider cost | Pass-through
Platform fee  | 10% of provider cost
Your total    | Provider cost + 10%
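
For example, a request whose provider cost is $0.020 is billed $0.022 in total: $0.020 passed through plus a $0.002 platform fee.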

No monthly minimums. No hidden fees. Cancel anytime.

Routing Preferences

Customize how the auto-router selects models for your requests.

Routing Preferences (json)
{
  "default_priority": "cost",
  "quality_threshold": 0.8,
  "max_latency_ms": 3000,
  "allowed_providers": ["openai", "anthropic", "mistral"],
  "blocked_providers": [],
  "allowed_models": [],
  "blocked_models": [],
  "fallback_enabled": true,
  "fallback_chain": ["anthropic/claude-3-haiku", "mistral/mistral-small"]
}

Security Overview

LLM Gateway is built with security-first principles. Your data and API keys are protected by multiple layers of enterprise-grade security.

Encrypted at Rest & Transit

All data encrypted using AES-256. TLS 1.3 for all connections.

Secure API Keys

Keys are hashed with SHA-256. Raw keys shown only once on creation.

IP Allowlisting

Restrict API access to known IP addresses with CIDR support.

Full Audit Logging

Every action is logged for compliance and security review.

Infrastructure Security

  • Deployed on cloud infrastructure with industry-standard security
  • DDoS protection and Web Application Firewall (WAF)
  • Automated security updates and patching
  • Multi-region deployment for high availability

Access Controls

  • Role-based access control (RBAC) for team members
  • Per-API-key IP restrictions
  • Automatic key expiration and rotation reminders
  • Brute force protection with automatic lockout
  • Rate limiting at multiple levels (org, key, IP)

Data Protection

We take a privacy-first approach to handling your data. By default, we don't store your prompts or responses.

What We Store

Data Type         | Stored | Purpose
Request metadata  | Yes    | Billing, analytics, debugging
Token counts      | Yes    | Accurate billing
Latency metrics   | Yes    | Performance monitoring
Routing decisions | Yes    | Optimization & transparency
Prompt content    | No*    | Optional for debugging
Response content  | No*    | Optional for debugging

* Content retention can be enabled per-organization for debugging purposes.

PII Protection

Built-in guardrails can detect and redact sensitive information before it reaches LLM providers:

  • Email addresses and phone numbers
  • Credit card numbers and SSNs
  • API keys and tokens
  • Custom patterns via regex

PII Detection Configuration (json)
{
  "pii_detection_enabled": true,
  "pii_action": "redact",
  "pii_types": ["email", "phone", "ssn", "credit_card", "api_key"]
}
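
For a sense of what redaction does, here's a client-side sketch with a couple of simple patterns (illustrative only; the gateway's built-in detectors handle far more formats than these regexes):

PII Redaction Sketch (python)
import re

# Deliberately simple patterns for illustration; real detectors
# cover many more formats and edge cases
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

print(redact("Reach me at jane@example.com, SSN 123-45-6789."))
# Reach me at [REDACTED_EMAIL], SSN [REDACTED_SSN].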

Compliance

LLM Gateway is built with security and privacy in mind, providing tools to help you meet your compliance requirements.

Audit Logging

Comprehensive audit trail for all actions

Data Privacy

No prompt/response storage by default

Access Control

RBAC, IP restrictions, key rotation

Data Residency

Control where your data is processed:

  • US region (default)
  • EU region available for GDPR compliance
  • Custom regions available for Enterprise plans

Audit & Reporting

Comprehensive audit logs for security and compliance:

  • API key creation, rotation, and revocation
  • Configuration changes and policy updates
  • Access attempts and authentication events
  • Export logs in JSON or CSV format
  • Configurable retention (7-365 days)

Need a Security Review?

Enterprise customers can request security questionnaires, penetration test reports, and custom compliance documentation.

Contact Security Team

OpenAI SDK

Use LLM Gateway as a drop-in replacement for the OpenAI SDK.

Python
from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://api.llmroute.xyz/v1"
)

# All OpenAI SDK methods work the same
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}]
)

Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.LLM_GATEWAY_API_KEY,
  baseURL: 'https://api.llmroute.xyz/v1',
});

const response = await client.chat.completions.create({
  model: 'auto',
  messages: [{ role: 'user', content: 'Hello!' }],
});

LangChain

Integrate LLM Gateway with LangChain for complex AI workflows.

LangChain Integration (python)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="auto",
    openai_api_key="your-llm-gateway-api-key",
    openai_api_base="https://api.llmroute.xyz/v1"
)

# Use with chains, agents, etc. (LLMChain is deprecated; compose runnables instead)
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template("Write a short poem about {topic}.")

chain = prompt | llm
result = chain.invoke({"topic": "the moon"})
print(result.content)

LlamaIndex

Use LLM Gateway with LlamaIndex for RAG and document Q&A.

LlamaIndex Integration (python)
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(
    model="auto",
    api_key="your-llm-gateway-api-key",
    api_base="https://api.llmroute.xyz/v1"
)

# Now use LlamaIndex as normal
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")

Ready to get started?

Create your free account and start routing LLM requests in minutes.