Introduction
LLM Gateway is an intelligent routing layer that sits between your application and multiple LLM providers. It automatically selects the best model for each request based on cost, latency, reliability, and your preferences.
Cost Optimization
Save 30-50% on LLM costs with intelligent model selection
Automatic Failover
Minimize downtime with multi-provider fallbacks
Full Observability
Execution receipts for every request with routing decisions
Drop-in Compatible
Works with OpenAI SDK, LangChain, and any OpenAI-compatible client
Quick Start
Get up and running in under 5 minutes. Here's everything you need.
1. Get your API key
Sign up and create an API key from the dashboard.
2. Make your first request
Use cURL or any HTTP client to make a request:
curl https://api.llmroute.xyz/v1/chat/completions \
-H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{"role": "user", "content": "Hello, world!"}
]
}'
3. Or use the OpenAI SDK
Just change the base URL—everything else stays the same:
from openai import OpenAI
client = OpenAI(
api_key="your-llm-gateway-api-key",
base_url="https://api.llmroute.xyz/v1"
)
response = client.chat.completions.create(
model="auto", # Let the gateway choose the best model
messages=[
{"role": "user", "content": "Hello, world!"}
]
)
print(response.choices[0].message.content)
Or in TypeScript:
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: 'your-llm-gateway-api-key',
baseURL: 'https://api.llmroute.xyz/v1',
});
const response = await client.chat.completions.create({
model: 'auto',
messages: [
{ role: 'user', content: 'Hello, world!' }
],
});
console.log(response.choices[0].message.content);
Authentication
All API requests require authentication using a Bearer token.
Authorization: Bearer llm_gw_xxxxxxxxxxxxxxxxxxxx
API Key Types
| Type | Prefix | Use Case |
|---|---|---|
| Production | llm_gw_prod_ | Live production traffic |
| Development | llm_gw_dev_ | Testing and development |
Security Best Practices
- Never expose API keys in client-side code
- Use environment variables to store keys
- Rotate keys periodically
- Use different keys for development and production
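For instance, a key can be read from the environment instead of being hard-coded (the variable name `LLM_GATEWAY_API_KEY` matches the cURL example above; the fallback value here is a placeholder for local development only):

```python
import os

# Prefer the environment over hard-coded literals; the fallback is for local dev only.
api_key = os.environ.get("LLM_GATEWAY_API_KEY", "llm_gw_dev_placeholder")
```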
Auto-Routing
Auto-routing is the core feature of LLM Gateway. When you set model: "auto", the gateway evaluates multiple factors to select the optimal model.
How It Works
Request Analysis
We analyze your request including prompt complexity, expected output length, and any constraints you specify.
Provider Evaluation
We check real-time availability, latency, and cost across all enabled providers.
Model Selection
Using your preferences and our optimization algorithms, we select the best model.
Execution & Fallback
We execute the request with automatic fallback if the primary choice fails.
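The gateway's actual selection algorithm is not public, but conceptually the steps above amount to scoring each candidate against your constraints and picking the best. A purely illustrative sketch:

```python
# Illustrative only: the gateway's real algorithm is not documented.
# Score each candidate on cost, latency, and quality, then pick the best.
candidates = [
    {"model": "openai/gpt-4o",            "cost": 0.005,  "latency_ms": 900, "quality": 0.95},
    {"model": "anthropic/claude-3-haiku", "cost": 0.0004, "latency_ms": 400, "quality": 0.85},
    {"model": "mistral/mistral-small",    "cost": 0.0002, "latency_ms": 350, "quality": 0.78},
]

def score(c, priority="cost", min_quality=0.8):
    if c["quality"] < min_quality:
        return float("-inf")          # fails the quality constraint outright
    if priority == "cost":
        return -c["cost"]             # cheaper is better
    if priority == "latency":
        return -c["latency_ms"]       # faster is better
    return c["quality"]               # otherwise optimize for quality

best = max(candidates, key=lambda c: score(c, priority="cost"))
# -> claude-3-haiku: cheapest model that still meets the quality floor
```

The numbers attached to each candidate are made up for the example; in practice they would come from the real-time provider evaluation step.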
Routing Hints
You can provide hints to influence routing decisions:
{
"model": "auto",
"messages": [...],
"x-routing-hints": {
"priority": "cost", // "cost" | "latency" | "quality"
"max_latency_ms": 2000, // Maximum acceptable latency
"prefer_providers": ["anthropic", "openai"],
"exclude_providers": ["cohere"],
"min_quality_score": 0.8
}
}
Providers & Models
LLM Gateway supports all major LLM providers. You can use auto-routing or specify a model directly.
Supported Providers
| Provider | Models | Features |
|---|---|---|
| OpenAI | GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo | Functions, Vision, JSON mode |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | Long context, Tool use |
| Google | Gemini Pro, Gemini Ultra | Multimodal, Long context |
| Mistral | Mistral Large, Mistral Medium, Mistral Small | Fast inference, Cost-effective |
| Cohere | Command R+, Command R | RAG optimized, Multilingual |
Direct Model Access
You can also request a specific model directly:
{
"model": "openai/gpt-4o",
"messages": [...]
}
// Or with provider prefix
{
"model": "anthropic/claude-3-5-sonnet-20241022",
"messages": [...]
}
Fallbacks & Reliability
LLM Gateway automatically handles failures with intelligent fallback strategies.
Automatic Fallback
When a provider fails, we automatically retry with the next best option.
Circuit Breaker
We implement circuit breakers to prevent cascading failures:
- Closed: Normal operation, requests pass through
- Open: Provider temporarily disabled after repeated failures
- Half-Open: Testing provider recovery with limited traffic
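The three states above can be sketched in a few lines. This is a minimal illustration of the pattern, not the gateway's internal implementation; the threshold and timeout values are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal sketch of the closed -> open -> half-open cycle."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def allow_request(self):
        if self.state == "open":
            # After the timeout, allow limited traffic to probe recovery.
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"
                return True
            return False
        return True  # closed or half_open: requests pass through

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

breaker = CircuitBreaker(failure_threshold=3)
for _ in range(3):
    breaker.record_failure()
# After repeated failures the provider is temporarily disabled (state == "open").
```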
Custom Fallback Chain
{
"model": "auto",
"messages": [...],
"x-fallback-chain": [
"anthropic/claude-3-5-sonnet",
"openai/gpt-4o",
"mistral/mistral-large"
]
}
Cost Optimization
LLM Gateway helps you save 30-50% on LLM costs through intelligent routing and real-time cost analysis.
How We Optimize Costs
Smart Model Selection
We match request complexity to the most cost-effective model that meets quality requirements.
Real-time Pricing
We track pricing changes across providers and adjust routing instantly.
Spend Limits
Set daily, weekly, or monthly spend limits to prevent unexpected charges.
Cost Alerts
Get notified when spending exceeds thresholds or anomalies are detected.
Setting Spend Limits
{
"limits": {
"daily_usd": 100,
"monthly_usd": 2500,
"per_request_usd": 0.50
},
"alerts": {
"threshold_percent": 80,
"webhook_url": "https://your-app.com/webhook/spend-alert"
}
}
Semantic Cache
Semantic caching reduces costs by up to 90% for repeated or similar queries. Unlike exact-match caching, semantic cache understands the meaning of your prompts and returns cached responses for semantically similar requests.
💰 Real Savings Example
A customer support chatbot handling 10,000 queries/day saw 73% cache hits, reducing their monthly LLM costs from $3,200 to $864—saving over $2,300/month.
How It Works
Embedding Generation
Your prompt is converted to a vector embedding that captures its semantic meaning.
Similarity Search
We search for cached prompts with similar embeddings using vector similarity (cosine distance).
Threshold Check
If similarity exceeds your threshold (default 95%), the cached response is returned instantly.
Cache Miss
If no match is found, the request goes to the LLM and the response is cached for future use.
Example: Similar Queries Hit Cache
Query 1: "What is the capital of France?"
→ LLM call, response cached
Query 2: "What's France's capital city?"
→ Cache hit! 98% similarity, instant response
Query 3: "Tell me the capital of France"
→ Cache hit! 96% similarity, instant response
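The lookup behind these examples can be sketched as a cosine-similarity search over cached embeddings. The vectors below are toy stand-ins for real embeddings, and the threshold matches the default above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def lookup(cache, query_embedding, threshold=0.95):
    """Return the cached response if any stored prompt is similar enough."""
    best_entry, best_sim = None, 0.0
    for entry in cache:
        sim = cosine_similarity(entry["embedding"], query_embedding)
        if sim > best_sim:
            best_entry, best_sim = entry, sim
    if best_sim >= threshold:
        return best_entry["response"]  # cache hit: skip the LLM call
    return None                        # cache miss: call the LLM, then cache it

# Toy vectors standing in for real prompt embeddings.
cache = [{"embedding": [1.0, 0.0, 0.1], "response": "Paris"}]
hit = lookup(cache, [1.0, 0.02, 0.09])   # nearly identical direction -> hit
miss = lookup(cache, [0.0, 1.0, 0.0])    # unrelated direction -> miss
```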
Configuration
{
"cache_enabled": true,
"similarity_threshold": 0.95, // 0.90-0.99, higher = stricter matching
"ttl_seconds": 3600, // Cache expiration (1 hour default)
"exclude_models": [] // Models to never cache
}
Cache is isolated per organization—your cached responses are never shared with other customers.
Provider Connections
Connect your own API keys from LLM providers to get direct pricing and use your existing quotas. LLM Gateway manages routing, fallbacks, and observability while you maintain your provider relationships.
Supported Providers
OpenAI
GPT-4o, GPT-4 Turbo, GPT-3.5
Anthropic
Claude 3.5 Sonnet, Claude 3 Opus
Google AI
Gemini Pro, Gemini Ultra
Mistral
Mistral Large, Medium, Small
Groq
Llama 3, Mixtral (ultra-fast)
Together AI
Open source models
Cohere
Command R+, Command R
AWS Bedrock
Claude, Llama, Titan
Benefits of BYOK (Bring Your Own Keys)
- Direct pricing: Pay provider rates directly, no markup on API costs
- Use existing quotas: Leverage your negotiated rate limits and credits
- Enterprise agreements: Maintain your BAAs and compliance contracts
- Gradual migration: Test LLM Gateway without changing billing
How to Connect a Provider
- Navigate to Connections in the dashboard
- Click Add Connection and select your provider
- Enter your API key (encrypted with AES-256)
- Test the connection to verify it works
- Enable the provider for routing
Security Note
Your API keys are encrypted at rest using AES-256 and envelope encryption. Keys are only decrypted in memory when making requests to providers.
AI Recommendations
LLM Gateway analyzes your usage patterns and provides personalized recommendations to optimize costs, improve performance, and enhance reliability.
Types of Recommendations
Cost Savings
Identify cheaper models that provide equivalent quality for your use cases.
"Switch from GPT-4 to Claude 3 Haiku for simple Q&A—save 85% with similar quality."
Performance Optimization
Find faster models or providers with lower latency for your region.
"Route to Groq for latency-sensitive requests—get 5x faster responses."
Reliability Improvements
Add fallback providers or adjust retry strategies based on failure patterns.
"Add Anthropic as fallback—reduce failed requests by 94%."
Usage Insights
Understand your traffic patterns and optimize accordingly.
"Enable semantic cache—73% of your queries are semantically similar."
How Recommendations Work
- We analyze your last 30 days of usage data
- Recommendations are generated weekly or on-demand
- Each recommendation includes estimated savings/impact
- Apply recommendations with one click in the dashboard
- Track the impact after applying changes
GPU Compute
Run open-source models on dedicated GPU infrastructure for maximum control, privacy, and cost efficiency at scale.
🚀 When to Use GPU Compute
- High volume: 100K+ requests/day where per-token costs add up
- Data privacy: Keep all data on your own infrastructure
- Custom models: Run fine-tuned or specialized models
- Predictable costs: Fixed GPU pricing vs variable per-token
Available GPU Options
| GPU | VRAM | Best For | Price/hr |
|---|---|---|---|
| NVIDIA A10G | 24GB | Llama 3 8B, Mistral 7B | $1.20 |
| NVIDIA A100 40GB | 40GB | Llama 3 70B, Mixtral | $3.50 |
| NVIDIA A100 80GB | 80GB | Large models, fine-tuning | $5.00 |
| NVIDIA H100 | 80GB | Maximum performance | $8.00 |
Cost Comparison Example
Scenario: 500,000 requests/day with Llama 3 70B
API Provider
~$4,500/month
Based on ~$0.0003/1K tokens
Dedicated GPU
~$2,520/month
A100 40GB at $3.50/hr × 24 hrs × 30 days = $2,520
💰 Save $1,980/month (44%) with dedicated GPU
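The comparison can be checked with the rates from the tables in this section. Note the API figure assumes roughly 1K tokens per request, which is an assumption of this example rather than a stated average:

```python
# GPU side: A100 40GB at the hourly rate from the table above, running 24/7.
gpu_hourly = 3.50
gpu_monthly = gpu_hourly * 24 * 30        # -> ~$2,520

# API side: 500K requests/day, assuming ~1K tokens/request at ~$0.0003 per 1K tokens.
requests_per_day = 500_000
api_monthly = requests_per_day * 30 * 0.0003   # -> ~$4,500

savings = api_monthly - gpu_monthly            # -> ~$1,980
savings_pct = round(savings / api_monthly * 100)  # -> 44
```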
Deployment Options
- Managed: We handle deployment, scaling, and maintenance
- Self-hosted: Run on your own infrastructure with our container images
- Hybrid: Route between managed GPUs and API providers based on load
Chat Completions
The Chat Completions API is fully compatible with the OpenAI specification.
Request Format
{
"model": "auto",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7,
"max_tokens": 1000,
"stream": false
}
Response Format
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1704067200,
"model": "anthropic/claude-3-haiku",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33
},
"x-llm-gateway": {
"request_id": "req_xyz789",
"routing_decision": "cost_optimized",
"cost_usd": 0.00012,
"latency_ms": 234,
"fallback_used": false
}
}
Streaming
Stream responses in real-time using Server-Sent Events (SSE).
from openai import OpenAI
client = OpenAI(
api_key="your-api-key",
base_url="https://api.llmroute.xyz/v1"
)
stream = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
Execution Receipts
Every request generates an execution receipt with full details about the routing decision, cost, and performance.
{
"request_id": "req_abc123xyz",
"timestamp": "2024-01-15T10:30:00Z",
"routing": {
"strategy": "cost_optimized",
"candidates_evaluated": 5,
"winner": "anthropic/claude-3-haiku",
"reason": "Lowest cost meeting quality threshold"
},
"execution": {
"provider": "anthropic",
"model": "claude-3-haiku-20240307",
"latency_ms": 234,
"routing_overhead_ms": 12,
"tokens": {
"prompt": 150,
"completion": 89,
"total": 239
}
},
"cost": {
"provider_cost_usd": 0.00023,
"platform_fee_usd": 0.00002,
"total_usd": 0.00025,
"savings_vs_default_usd": 0.00089
},
"fallback": {
"used": false,
"attempts": []
}
}
Access receipts via the dashboard or the x-llm-gateway response header.
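Receipt fields can be inspected programmatically. The dictionary below is a trimmed copy of the sample payload above, not a live response:

```python
# Trimmed copy of the sample receipt shown above.
receipt = {
    "execution": {
        "latency_ms": 234,
        "routing_overhead_ms": 12,
        "tokens": {"prompt": 150, "completion": 89, "total": 239},
    },
    "cost": {"total_usd": 0.00025, "savings_vs_default_usd": 0.00089},
}

# Routing overhead as a fraction of total request latency.
overhead_pct = (receipt["execution"]["routing_overhead_ms"]
                / receipt["execution"]["latency_ms"] * 100)

# Effective price per 1K tokens for this request.
usd_per_1k_tokens = (receipt["cost"]["total_usd"]
                     / receipt["execution"]["tokens"]["total"] * 1000)
```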
Rate Limits
Rate limits protect the system and ensure fair usage across all customers.
| Plan | RPM | TPM | Daily Requests |
|---|---|---|---|
| Free | 20 | 40,000 | 1,000 |
| Starter | 100 | 200,000 | 10,000 |
| Pro | 500 | 1,000,000 | 100,000 |
| Enterprise | Custom | Custom | Unlimited |
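When a limit is exceeded, APIs conventionally respond with HTTP 429; a hedged retry sketch follows (429 and `Retry-After` handling here is the standard pattern, not documented gateway behavior, and `send` is a stand-in for your HTTP call):

```python
import time

def request_with_backoff(send, max_attempts=5):
    """Retry a callable that returns (status, headers, body), backing off on 429.

    `send` is a placeholder for your actual HTTP request function.
    """
    delay = 1.0
    for _ in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return body
        # Honor Retry-After if present; otherwise back off exponentially.
        wait = float(headers.get("Retry-After", delay))
        time.sleep(wait)
        delay *= 2
    raise RuntimeError("rate limited after retries")
```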
Rate Limit Headers
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704067260
Dashboard Overview
The dashboard provides real-time visibility into your LLM usage, costs, and performance.
Analytics
Request volume, latency percentiles, and success rates
Cost Tracking
Real-time spend, savings analysis, and billing
Request Logs
Searchable logs with execution receipts
Configuration
Routing preferences, rate limits, and alerts
API Keys
Manage API keys for different environments and use cases.
Creating Keys
- Navigate to Settings → API Keys in the dashboard
- Click "Create New Key"
- Select environment (Production or Development)
- Set optional rate limits and expiration
- Copy and securely store your key
API keys are shown only once. Store them securely in your environment variables or secrets manager.
Billing & Usage
Pay only for what you use with transparent, per-request pricing.
Pricing Model
No monthly minimums. No hidden fees. Cancel anytime.
Routing Preferences
Customize how the auto-router selects models for your requests.
{
"default_priority": "cost",
"quality_threshold": 0.8,
"max_latency_ms": 3000,
"allowed_providers": ["openai", "anthropic", "mistral"],
"blocked_providers": [],
"allowed_models": [],
"blocked_models": [],
"fallback_enabled": true,
"fallback_chain": ["anthropic/claude-3-haiku", "mistral/mistral-small"]
}
Security Overview
LLM Gateway is built with security-first principles. Your data and API keys are protected by multiple layers of enterprise-grade security.
Encrypted at Rest & Transit
All data encrypted using AES-256. TLS 1.3 for all connections.
Secure API Keys
Keys are hashed with SHA-256. Raw keys shown only once on creation.
IP Allowlisting
Restrict API access to known IP addresses with CIDR support.
Full Audit Logging
Every action is logged for compliance and security review.
Infrastructure Security
- Deployed on cloud infrastructure with industry-standard security
- DDoS protection and Web Application Firewall (WAF)
- Automated security updates and patching
- Multi-region deployment for high availability
Access Controls
- Role-based access control (RBAC) for team members
- Per-API-key IP restrictions
- Automatic key expiration and rotation reminders
- Brute force protection with automatic lockout
- Rate limiting at multiple levels (org, key, IP)
Data Protection
We take a privacy-first approach to handling your data. By default, we don't store your prompts or responses.
What We Store
| Data Type | Stored | Purpose |
|---|---|---|
| Request metadata | Yes | Billing, analytics, debugging |
| Token counts | Yes | Accurate billing |
| Latency metrics | Yes | Performance monitoring |
| Routing decisions | Yes | Optimization & transparency |
| Prompt content | No* | Optional for debugging |
| Response content | No* | Optional for debugging |
* Content retention can be enabled per-organization for debugging purposes.
PII Protection
Built-in guardrails can detect and redact sensitive information before it reaches LLM providers:
- Email addresses and phone numbers
- Credit card numbers and SSNs
- API keys and tokens
- Custom patterns via regex
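Conceptually, regex-based redaction works like this. The patterns below are simplified illustrations, not the gateway's actual detectors, and production systems use far more robust matching:

```python
import re

# Hypothetical, simplified patterns for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text):
    """Replace each detected PII span with a labeled placeholder."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{name.upper()}]", text)
    return text

clean = redact("Contact jane@example.com, SSN 123-45-6789.")
```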
{
"pii_detection_enabled": true,
"pii_action": "redact",
"pii_types": ["email", "phone", "ssn", "credit_card", "api_key"]
}
Compliance
LLM Gateway is built with security and privacy in mind, providing tools to help you meet your compliance requirements.
Audit Logging
Comprehensive audit trail for all actions
Data Privacy
No prompt/response storage by default
Access Control
RBAC, IP restrictions, key rotation
Data Residency
Control where your data is processed:
- US region (default)
- EU region available for GDPR compliance
- Custom regions available for Enterprise plans
Audit & Reporting
Comprehensive audit logs for security and compliance:
- API key creation, rotation, and revocation
- Configuration changes and policy updates
- Access attempts and authentication events
- Export logs in JSON or CSV format
- Configurable retention (7-365 days)
Need a Security Review?
Enterprise customers can request security questionnaires, penetration test reports, and custom compliance documentation.
Contact Security Team
OpenAI SDK
Use LLM Gateway as a drop-in replacement for the OpenAI SDK.
from openai import OpenAI
client = OpenAI(
api_key="your-llm-gateway-api-key",
base_url="https://api.llmroute.xyz/v1"
)
# All OpenAI SDK methods work the same
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Hello!"}]
)
Or in TypeScript:
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: process.env.LLM_GATEWAY_API_KEY,
baseURL: 'https://api.llmroute.xyz/v1',
});
const response = await client.chat.completions.create({
model: 'auto',
messages: [{ role: 'user', content: 'Hello!' }],
});
LangChain
Integrate LLM Gateway with LangChain for complex AI workflows.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="auto",
openai_api_key="your-llm-gateway-api-key",
openai_api_base="https://api.llmroute.xyz/v1"
)
# Use with chains, agents, etc.
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
prompt = PromptTemplate(
input_variables=["topic"],
template="Write a short poem about {topic}."
)
chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(topic="the moon")
LlamaIndex
Use LLM Gateway with LlamaIndex for RAG and document Q&A.
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
Settings.llm = OpenAI(
model="auto",
api_key="your-llm-gateway-api-key",
api_base="https://api.llmroute.xyz/v1"
)
# Now use LlamaIndex as normal
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")
Ready to get started?
Create your free account and start routing LLM requests in minutes.