Introduction
LLM Gateway is an intelligent routing layer that sits between your application and multiple LLM providers. It automatically selects the best model for each request based on cost, latency, reliability, and your preferences.
Cost Optimization
Save 30-50% on LLM costs with intelligent model selection
Automatic Failover
Never experience downtime with multi-provider fallbacks
Full Observability
Execution receipts for every request with routing decisions
Drop-in Compatible
Works with OpenAI SDK, LangChain, and any OpenAI-compatible client
Quick Start
Get up and running in under 5 minutes. Here's everything you need.
1. Get your API key
Sign up and create an API key from the dashboard.
2. Make your first request
Use cURL or any HTTP client to make a request:
curl https://llm-gateway-kqks.onrender.com/v1/chat/completions \
-H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{"role": "user", "content": "Hello, world!"}
]
}'

3. Or use the OpenAI SDK
Just change the base URL—everything else stays the same:
from openai import OpenAI
client = OpenAI(
api_key="your-llm-gateway-api-key",
base_url="https://llm-gateway-kqks.onrender.com/v1"
)
response = client.chat.completions.create(
model="auto", # Let the gateway choose the best model
messages=[
{"role": "user", "content": "Hello, world!"}
]
)
print(response.choices[0].message.content)

The same works with the Node.js SDK:

import OpenAI from 'openai';
const client = new OpenAI({
apiKey: 'your-llm-gateway-api-key',
baseURL: 'https://llm-gateway-kqks.onrender.com/v1',
});
const response = await client.chat.completions.create({
model: 'auto',
messages: [
{ role: 'user', content: 'Hello, world!' }
],
});
console.log(response.choices[0].message.content);

Authentication
All API requests require authentication using a Bearer token.
Authorization: Bearer llm_gw_xxxxxxxxxxxxxxxxxxxx

API Key Types
| Type | Prefix | Use Case |
|---|---|---|
| Production | llm_gw_prod_ | Live production traffic |
| Development | llm_gw_dev_ | Testing and development |
Security Best Practices
- Never expose API keys in client-side code
- Use environment variables to store keys (a sketch follows this list)
- Rotate keys periodically
- Use different keys for development and production
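For example, the key can be loaded from the environment rather than hard-coded. A minimal sketch in Python; the variable name LLM_GATEWAY_API_KEY is just a convention, not something the gateway requires:

import os
from openai import OpenAI

# Read the key from the environment instead of embedding it in source code
api_key = os.environ["LLM_GATEWAY_API_KEY"]

client = OpenAI(
    api_key=api_key,
    base_url="https://llm-gateway-kqks.onrender.com/v1",
)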
Auto-Routing
Auto-routing is the core feature of LLM Gateway. When you set model: "auto", the gateway evaluates multiple factors to select the optimal model.
How It Works
Request Analysis
We analyze your request including prompt complexity, expected output length, and any constraints you specify.
Provider Evaluation
We check real-time availability, latency, and cost across all enabled providers.
Model Selection
Using your preferences and our optimization algorithms, we select the best model.
Execution & Fallback
We execute the request with automatic fallback if the primary choice fails.
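The routing hints described in the next subsection are plain fields in the request body, so they can also be sent through the OpenAI Python SDK rather than raw HTTP. A minimal sketch using the SDK's extra_body parameter (the hint values below are illustrative, not recommendations):

from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://llm-gateway-kqks.onrender.com/v1",
)

# Attach routing hints as extra JSON fields on the request body
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
    extra_body={
        "x-routing-hints": {
            "priority": "latency",
            "max_latency_ms": 2000,
        }
    },
)
print(response.choices[0].message.content)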
Routing Hints
You can provide hints to influence routing decisions:
{
"model": "auto",
"messages": [...],
"x-routing-hints": {
"priority": "cost", // "cost" | "latency" | "quality"
"max_latency_ms": 2000, // Maximum acceptable latency
"prefer_providers": ["anthropic", "openai"],
"exclude_providers": ["cohere"],
"min_quality_score": 0.8
}
}

Providers & Models
LLM Gateway supports all major LLM providers. You can use auto-routing or specify a model directly.
Supported Providers
| Provider | Models | Features |
|---|---|---|
| OpenAI | GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo | Functions, Vision, JSON mode |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku | Long context, Tool use |
| Google | Gemini Pro, Gemini Ultra | Multimodal, Long context |
| Mistral | Mistral Large, Mistral Medium, Mistral Small | Fast inference, Cost-effective |
| Cohere | Command R+, Command R | RAG optimized, Multilingual |
Direct Model Access
You can also request a specific model directly:
{
"model": "openai/gpt-4o",
"messages": [...]
}
// Or with provider prefix
{
"model": "anthropic/claude-3-5-sonnet-20241022",
"messages": [...]
}

Fallbacks & Reliability
LLM Gateway automatically handles failures with intelligent fallback strategies.
Automatic Fallback
When a provider fails, we automatically retry with the next best option.
Circuit Breaker
We implement circuit breakers to prevent cascading failures:
- Closed: Normal operation, requests pass through
- Open: Provider temporarily disabled after repeated failures
- Half-Open: Testing provider recovery with limited traffic
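Conceptually, the breaker gates each provider before it is tried. The sketch below only illustrates the three states described above; it is not the gateway's actual implementation, and the threshold and cooldown values are made up:

import time

FAILURE_THRESHOLD = 5   # illustrative: consecutive failures before the breaker opens
COOLDOWN_SECONDS = 30   # illustrative: how long an open breaker waits before probing

class CircuitBreaker:
    def __init__(self):
        self.failures = 0
        self.opened_at = None  # None means the breaker is Closed

    def allow_request(self):
        if self.opened_at is None:
            return True                                   # Closed: pass through
        if time.time() - self.opened_at >= COOLDOWN_SECONDS:
            return True                                   # Half-Open: allow a probe request
        return False                                      # Open: provider temporarily disabled

    def record_success(self):
        self.failures = 0
        self.opened_at = None                             # recovery: back to Closed

    def record_failure(self):
        self.failures += 1
        if self.failures >= FAILURE_THRESHOLD:
            self.opened_at = time.time()                  # trip the breaker: Open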
Custom Fallback Chain
{
"model": "auto",
"messages": [...],
"x-fallback-chain": [
"anthropic/claude-3-5-sonnet",
"openai/gpt-4o",
"mistral/mistral-large"
]
}

Cost Optimization
LLM Gateway helps you save 30-50% on LLM costs through intelligent routing and real-time cost analysis.
How We Optimize Costs
Smart Model Selection
We match request complexity to the most cost-effective model that meets quality requirements.
Real-time Pricing
We track pricing changes across providers and adjust routing instantly.
Spend Limits
Set daily, weekly, or monthly spend limits to prevent unexpected charges.
Cost Alerts
Get notified when spending exceeds thresholds or anomalies are detected.
Setting Spend Limits
{
"limits": {
"daily_usd": 100,
"monthly_usd": 2500,
"per_request_usd": 0.50
},
"alerts": {
"threshold_percent": 80,
"webhook_url": "https://your-app.com/webhook/spend-alert"
}
}

Chat Completions
The Chat Completions API is fully compatible with the OpenAI specification.
Request Format
{
"model": "auto",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"temperature": 0.7,
"max_tokens": 1000,
"stream": false
}

Response Format
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1704067200,
"model": "anthropic/claude-3-haiku",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 8,
"total_tokens": 33
},
"x-llm-gateway": {
"request_id": "req_xyz789",
"routing_decision": "cost_optimized",
"cost_usd": 0.00012,
"latency_ms": 234,
"fallback_used": false
}
}

Streaming
Stream responses in real-time using Server-Sent Events (SSE).
from openai import OpenAI
client = OpenAI(
api_key="your-api-key",
base_url="https://llm-gateway-kqks.onrender.com/v1"
)
stream = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Execution Receipts
Every request generates an execution receipt with full details about the routing decision, cost, and performance.
{
"request_id": "req_abc123xyz",
"timestamp": "2024-01-15T10:30:00Z",
"routing": {
"strategy": "cost_optimized",
"candidates_evaluated": 5,
"winner": "anthropic/claude-3-haiku",
"reason": "Lowest cost meeting quality threshold"
},
"execution": {
"provider": "anthropic",
"model": "claude-3-haiku-20240307",
"latency_ms": 234,
"routing_overhead_ms": 12,
"tokens": {
"prompt": 150,
"completion": 89,
"total": 239
}
},
"cost": {
"provider_cost_usd": 0.00023,
"platform_fee_usd": 0.00002,
"total_usd": 0.00025,
"savings_vs_default_usd": 0.00089
},
"fallback": {
"used": false,
"attempts": []
}
}

Access receipts via the dashboard or the x-llm-gateway response header.
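Because the routing details also appear in the x-llm-gateway object of the chat completion response (see Response Format above), you can log a receipt per request directly from the response body. A minimal sketch using the requests library:

import os
import requests

# Assumes LLM_GATEWAY_API_KEY is set in the environment
resp = requests.post(
    "https://llm-gateway-kqks.onrender.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LLM_GATEWAY_API_KEY']}"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Hello, world!"}],
    },
)
resp.raise_for_status()
body = resp.json()

# Routing metadata returned alongside the standard OpenAI-style fields
receipt = body.get("x-llm-gateway", {})
print(receipt.get("request_id"), receipt.get("routing_decision"), receipt.get("cost_usd"))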
Rate Limits
Rate limits protect the system and ensure fair usage across all customers.
| Plan | Requests/min (RPM) | Tokens/min (TPM) | Daily Requests |
|---|---|---|---|
| Free | 20 | 40,000 | 1,000 |
| Starter | 100 | 200,000 | 10,000 |
| Pro | 500 | 1,000,000 | 100,000 |
| Enterprise | Custom | Custom | Unlimited |
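Requests beyond your plan's limits are rejected; assuming the gateway returns the standard HTTP 429 status, the OpenAI Python SDK surfaces this as a RateLimitError, which you can retry with a simple backoff. A client-side sketch (the retry policy here is an assumption, not gateway behavior):

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://llm-gateway-kqks.onrender.com/v1",
)

def create_with_backoff(messages, retries=5):
    # Retry with exponential backoff when the gateway rate-limits the request
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model="auto", messages=messages)
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("Still rate limited after retries")

response = create_with_backoff([{"role": "user", "content": "Hello!"}])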
Rate Limit Headers
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704067260

Dashboard Overview
The dashboard provides real-time visibility into your LLM usage, costs, and performance.
Analytics
Request volume, latency percentiles, and success rates
Cost Tracking
Real-time spend, savings analysis, and billing
Request Logs
Searchable logs with execution receipts
Configuration
Routing preferences, rate limits, and alerts
API Keys
Manage API keys for different environments and use cases.
Creating Keys
1. Navigate to Settings → API Keys in the dashboard
2. Click "Create New Key"
3. Select environment (Production or Development)
4. Set optional rate limits and expiration
5. Copy and securely store your key
API keys are shown only once. Store them securely in your environment variables or secrets manager.
Billing & Usage
Pay only for what you use with transparent, per-request pricing.
Pricing Model
No monthly minimums. No hidden fees. Cancel anytime.
Routing Preferences
Customize how the auto-router selects models for your requests.
{
"default_priority": "cost",
"quality_threshold": 0.8,
"max_latency_ms": 3000,
"allowed_providers": ["openai", "anthropic", "mistral"],
"blocked_providers": [],
"allowed_models": [],
"blocked_models": [],
"fallback_enabled": true,
"fallback_chain": ["anthropic/claude-3-haiku", "mistral/mistral-small"]
}

OpenAI SDK
Use LLM Gateway as a drop-in replacement for the OpenAI SDK.
from openai import OpenAI
client = OpenAI(
api_key="your-llm-gateway-api-key",
base_url="https://llm-gateway-kqks.onrender.com/v1"
)
# All OpenAI SDK methods work the same
response = client.chat.completions.create(
model="auto",
messages=[{"role": "user", "content": "Hello!"}]
)

And the same in Node.js:

import OpenAI from 'openai';
const client = new OpenAI({
apiKey: process.env.LLM_GATEWAY_API_KEY,
baseURL: 'https://llm-gateway-kqks.onrender.com/v1',
});
const response = await client.chat.completions.create({
model: 'auto',
messages: [{ role: 'user', content: 'Hello!' }],
});

LangChain
Integrate LLM Gateway with LangChain for complex AI workflows.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="auto",
openai_api_key="your-llm-gateway-api-key",
openai_api_base="https://llm-gateway-kqks.onrender.com/v1"
)
# Use with chains, agents, etc.
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
prompt = PromptTemplate(
input_variables=["topic"],
template="Write a short poem about {topic}."
)
chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(topic="the moon")

LlamaIndex
Use LLM Gateway with LlamaIndex for RAG and document Q&A.
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
Settings.llm = OpenAI(
model="auto",
api_key="your-llm-gateway-api-key",
api_base="https://llm-gateway-kqks.onrender.com/v1"
)
# Now use LlamaIndex as normal
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")

Ready to get started?
Create your free account and start routing LLM requests in minutes.