Introduction

LLM Gateway is an intelligent routing layer that sits between your application and multiple LLM providers. It automatically selects the best model for each request based on cost, latency, reliability, and your preferences.

Cost Optimization

Save 30-50% on LLM costs with intelligent model selection

Automatic Failover

Never experience downtime with multi-provider fallbacks

Full Observability

Execution receipts for every request with routing decisions

Drop-in Compatible

Works with OpenAI SDK, LangChain, and any OpenAI-compatible client

Quick Start

Get up and running in under 5 minutes. Here's everything you need.

1. Get your API key

Sign up and create an API key from the dashboard.

2. Make your first request

Use cURL or any HTTP client to make a request:

First API Call (bash)
curl https://llm-gateway-kqks.onrender.com/v1/chat/completions \
  -H "Authorization: Bearer $LLM_GATEWAY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "Hello, world!"}
    ]
  }'

3. Or use the OpenAI SDK

Just change the base URL—everything else stays the same:

Python
from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://llm-gateway-kqks.onrender.com/v1"
)

response = client.chat.completions.create(
    model="auto",  # Let the gateway choose the best model
    messages=[
        {"role": "user", "content": "Hello, world!"}
    ]
)

print(response.choices[0].message.content)

Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: 'your-llm-gateway-api-key',
  baseURL: 'https://llm-gateway-kqks.onrender.com/v1',
});

const response = await client.chat.completions.create({
  model: 'auto',
  messages: [
    { role: 'user', content: 'Hello, world!' }
  ],
});

console.log(response.choices[0].message.content);

Authentication

All API requests require authentication using a Bearer token.

Authorization Header (http)
Authorization: Bearer llm_gw_xxxxxxxxxxxxxxxxxxxx

API Key Types

Type          Prefix          Use Case
Production    llm_gw_prod_    Live production traffic
Development   llm_gw_dev_     Testing and development

Security Best Practices

  • Never expose API keys in client-side code
  • Use environment variables to store keys (see the sketch after this list)
  • Rotate keys periodically
  • Use different keys for development and production
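
For example, a minimal way to load the key from an environment variable instead of hard-coding it, using the OpenAI Python SDK from the Quick Start:

Environment Variables (python)
import os

from openai import OpenAI

# Read the key from the environment rather than committing it to source control
client = OpenAI(
    api_key=os.environ["LLM_GATEWAY_API_KEY"],
    base_url="https://llm-gateway-kqks.onrender.com/v1",
)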

Auto-Routing

Auto-routing is the core feature of LLM Gateway. When you set model: "auto", the gateway evaluates multiple factors to select the optimal model.

How It Works

  1. Request Analysis: We analyze your request, including prompt complexity, expected output length, and any constraints you specify.
  2. Provider Evaluation: We check real-time availability, latency, and cost across all enabled providers.
  3. Model Selection: Using your preferences and our optimization algorithms, we select the best model.
  4. Execution & Fallback: We execute the request, with automatic fallback if the primary choice fails.

Routing Hints

You can provide hints to influence routing decisions:

Routing Hints (json)
{
  "model": "auto",
  "messages": [...],
  "x-routing-hints": {
    "priority": "cost",        // "cost" | "latency" | "quality"
    "max_latency_ms": 2000,    // Maximum acceptable latency
    "prefer_providers": ["anthropic", "openai"],
    "exclude_providers": ["cohere"],
    "min_quality_score": 0.8
  }
}
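
With the OpenAI Python SDK, gateway-specific fields like this can be sent through extra_body; the following is a sketch that assumes the gateway reads x-routing-hints from the request body exactly as shown above:

Routing Hints via SDK (python)
from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://llm-gateway-kqks.onrender.com/v1",
)

# extra_body merges additional fields into the JSON request body,
# so the gateway receives "x-routing-hints" alongside model and messages.
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Summarize this paragraph in one sentence."}],
    extra_body={
        "x-routing-hints": {
            "priority": "cost",
            "max_latency_ms": 2000,
        }
    },
)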

Providers & Models

LLM Gateway supports all major LLM providers. You can use auto-routing or specify a model directly.

Supported Providers

Provider    Models                                              Features
OpenAI      GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo                  Functions, Vision, JSON mode
Anthropic   Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku    Long context, Tool use
Google      Gemini Pro, Gemini Ultra                            Multimodal, Long context
Mistral     Mistral Large, Mistral Medium, Mistral Small        Fast inference, Cost-effective
Cohere      Command R+, Command R                               RAG optimized, Multilingual

Direct Model Access

You can also request a specific model directly:

Specific Model (json)
{
  "model": "openai/gpt-4o",
  "messages": [...]
}

// Or with a version-pinned model ID
{
  "model": "anthropic/claude-3-5-sonnet-20241022",
  "messages": [...]
}

Fallbacks & Reliability

LLM Gateway automatically handles failures with intelligent fallback strategies.

Automatic Fallback

When a provider fails, we automatically retry with the next best option:

Primary fails → Retry secondary → Success

Circuit Breaker

We implement circuit breakers to prevent cascading failures:

  • Closed: Normal operation, requests pass through
  • Open: Provider temporarily disabled after repeated failures
  • Half-Open: Testing provider recovery with limited traffic
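
For illustration only (this is not the gateway's internal implementation), a minimal sketch of the three-state pattern described above:

Circuit Breaker Sketch (python)
import time

class CircuitBreaker:
    """Illustrative three-state breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "open":
            # After the cooldown, let a limited probe through (half-open)
            if time.time() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"
                return True
            return False
        return True  # closed or half-open: requests pass through

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.time()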

Custom Fallback Chain

Custom Fallback (json)
{
  "model": "auto",
  "messages": [...],
  "x-fallback-chain": [
    "anthropic/claude-3-5-sonnet",
    "openai/gpt-4o",
    "mistral/mistral-large"
  ]
}

Cost Optimization

LLM Gateway helps you save 30-50% on LLM costs through intelligent routing and real-time cost analysis.

How We Optimize Costs

Smart Model Selection

We match request complexity to the most cost-effective model that meets quality requirements.

Real-time Pricing

We track pricing changes across providers and adjust routing instantly.

Spend Limits

Set daily, weekly, or monthly spend limits to prevent unexpected charges.

Cost Alerts

Get notified when spending exceeds thresholds or anomalies are detected.

Setting Spend Limits

Spend Limits via Dashboard or API (json)
{
  "limits": {
    "daily_usd": 100,
    "monthly_usd": 2500,
    "per_request_usd": 0.50
  },
  "alerts": {
    "threshold_percent": 80,
    "webhook_url": "https://your-app.com/webhook/spend-alert"
  }
}
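
The API route for limits isn't listed here, so treat the following as a hypothetical sketch: the /v1/limits path is an assumption for illustration, not a documented endpoint.

Spend Limits via API (python)
import os

import requests

# NOTE: "/v1/limits" is a hypothetical path used for illustration only;
# check the dashboard or API reference for the actual route.
resp = requests.post(
    "https://llm-gateway-kqks.onrender.com/v1/limits",
    headers={"Authorization": f"Bearer {os.environ['LLM_GATEWAY_API_KEY']}"},
    json={
        "limits": {"daily_usd": 100, "monthly_usd": 2500, "per_request_usd": 0.50},
        "alerts": {
            "threshold_percent": 80,
            "webhook_url": "https://your-app.com/webhook/spend-alert",
        },
    },
)
resp.raise_for_status()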

Chat Completions

The Chat Completions API is fully compatible with the OpenAI specification.

Request Format

POST /v1/chat/completions (json)
{
  "model": "auto",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1000,
  "stream": false
}

Response Format

Response (json)
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "anthropic/claude-3-haiku",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 8,
    "total_tokens": 33
  },
  "x-llm-gateway": {
    "request_id": "req_xyz789",
    "routing_decision": "cost_optimized",
    "cost_usd": 0.00012,
    "latency_ms": 234,
    "fallback_used": false
  }
}

Streaming

Stream responses in real-time using Server-Sent Events (SSE).

Streaming Request (python)
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://llm-gateway-kqks.onrender.com/v1"
)

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Execution Receipts

Every request generates an execution receipt with full details about the routing decision, cost, and performance.

Execution Receipt (json)
{
  "request_id": "req_abc123xyz",
  "timestamp": "2024-01-15T10:30:00Z",
  "routing": {
    "strategy": "cost_optimized",
    "candidates_evaluated": 5,
    "winner": "anthropic/claude-3-haiku",
    "reason": "Lowest cost meeting quality threshold"
  },
  "execution": {
    "provider": "anthropic",
    "model": "claude-3-haiku-20240307",
    "latency_ms": 234,
    "routing_overhead_ms": 12,
    "tokens": {
      "prompt": 150,
      "completion": 89,
      "total": 239
    }
  },
  "cost": {
    "provider_cost_usd": 0.00023,
    "platform_fee_usd": 0.00002,
    "total_usd": 0.00025,
    "savings_vs_default_usd": 0.00089
  },
  "fallback": {
    "used": false,
    "attempts": []
  }
}

Access receipts via the dashboard or the x-llm-gateway field returned with each response.
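
For example, a minimal sketch that makes a plain HTTP request and reads the routing metadata from the response body, assuming the shape shown under Response Format:

Reading Receipt Data (python)
import os

import requests

resp = requests.post(
    "https://llm-gateway-kqks.onrender.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LLM_GATEWAY_API_KEY']}"},
    json={
        "model": "auto",
        "messages": [{"role": "user", "content": "Hello, world!"}],
    },
)
body = resp.json()

# Routing metadata is returned alongside the standard completion fields
receipt = body.get("x-llm-gateway", {})
print(receipt.get("request_id"), receipt.get("routing_decision"), receipt.get("cost_usd"))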

Rate Limits

Rate limits protect the system and ensure fair usage across all customers.

Plan         RPM       TPM          Daily Requests
Free         20        40,000       1,000
Starter      100       200,000      10,000
Pro          500       1,000,000    100,000
Enterprise   Custom    Custom       Unlimited

Rate Limit Headers

Response Headers (http)
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704067260
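
A minimal sketch of honoring these headers with a simple retry, assuming the gateway returns HTTP 429 when a limit is exceeded and that X-RateLimit-Reset is a Unix timestamp:

Handling Rate Limits (python)
import os
import time

import requests

URL = "https://llm-gateway-kqks.onrender.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['LLM_GATEWAY_API_KEY']}"}

def post_with_backoff(payload, max_attempts=3):
    for _ in range(max_attempts):
        resp = requests.post(URL, headers=HEADERS, json=payload)
        if resp.status_code != 429:
            return resp
        # Sleep until the window resets (fall back to 1 second if the header is missing)
        reset = int(resp.headers.get("X-RateLimit-Reset", 0))
        time.sleep(max(reset - time.time(), 1))
    return resp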

Dashboard Overview

The dashboard provides real-time visibility into your LLM usage, costs, and performance.

Analytics

Request volume, latency percentiles, and success rates

Cost Tracking

Real-time spend, savings analysis, and billing

Request Logs

Searchable logs with execution receipts

Configuration

Routing preferences, rate limits, and alerts

API Keys

Manage API keys for different environments and use cases.

Creating Keys

  1. Navigate to Settings → API Keys in the dashboard
  2. Click "Create New Key"
  3. Select environment (Production or Development)
  4. Set optional rate limits and expiration
  5. Copy and securely store your key

API keys are shown only once. Store them securely in your environment variables or secrets manager.

Billing & Usage

Pay only for what you use with transparent, per-request pricing.

Pricing Model

Provider cost   Pass-through
Platform fee    10% of provider cost
Your total      Provider cost + 10%

No monthly minimums. No hidden fees. Cancel anytime.
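
To make the math concrete, a tiny worked example using the provider cost from the execution receipt shown earlier:

Worked Example (python)
provider_cost_usd = 0.00023                   # pass-through cost reported by the provider
platform_fee_usd = provider_cost_usd * 0.10   # 10% platform fee
total_usd = provider_cost_usd + platform_fee_usd

print(f"total: {total_usd:.5f} USD")          # ~0.00025 USD, matching the receipt example above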

Routing Preferences

Customize how the auto-router selects models for your requests.

Routing Preferences (json)
{
  "default_priority": "cost",
  "quality_threshold": 0.8,
  "max_latency_ms": 3000,
  "allowed_providers": ["openai", "anthropic", "mistral"],
  "blocked_providers": [],
  "allowed_models": [],
  "blocked_models": [],
  "fallback_enabled": true,
  "fallback_chain": ["anthropic/claude-3-haiku", "mistral/mistral-small"]
}

OpenAI SDK

Use LLM Gateway as a drop-in replacement for the OpenAI SDK.

Python
from openai import OpenAI

client = OpenAI(
    api_key="your-llm-gateway-api-key",
    base_url="https://llm-gateway-kqks.onrender.com/v1"
)

# All OpenAI SDK methods work the same
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Hello!"}]
)

Node.js
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.LLM_GATEWAY_API_KEY,
  baseURL: 'https://llm-gateway-kqks.onrender.com/v1',
});

const response = await client.chat.completions.create({
  model: 'auto',
  messages: [{ role: 'user', content: 'Hello!' }],
});

LangChain

Integrate LLM Gateway with LangChain for complex AI workflows.

LangChain Integration (python)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="auto",
    openai_api_key="your-llm-gateway-api-key",
    openai_api_base="https://llm-gateway-kqks.onrender.com/v1"
)

# Use with chains, agents, etc.
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Write a short poem about {topic}."
)

chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(topic="the moon")

LlamaIndex

Use LLM Gateway with LlamaIndex for RAG and document Q&A.

LlamaIndex Integration (python)
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(
    model="auto",
    api_key="your-llm-gateway-api-key",
    api_base="https://llm-gateway-kqks.onrender.com/v1"
)

# Now use LlamaIndex as normal
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What is this document about?")

Ready to get started?

Create your free account and start routing LLM requests in minutes.