A configurable API gateway for multiple LLM providers (OpenAI, Anthropic, Gemini, Ollama, Mistral, Cohere) with built-in analytics, guardrails, and administrative controls.
- Create a file named Config.toml with the following content:
[openAIConfig]
apiKey = "Your_API_Key"
model = "gpt-4"
endpoint = "https://api.openai.com"
- Run the Docker command below. Port 8080 serves the OpenAI-compatible API, port 8081 the admin API, and port 8082 the gRPC service.
docker run -p \
8080:8080 -p 8081:8081 -p 8082:8082 \
-v $(pwd)/Config.toml:/home/ballerina/Config.toml \
chintana/ai-gateway:v1.1.0
- Start sending requests
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "x-llm-provider: openai" \
-d '
{
"messages": [
{
"role": "user",
"content": "Solve world hunger"
}
]
}
'
Use any OpenAI-compatible SDK to talk to the gateway. The following example uses the official OpenAI Python SDK.
- Install the official OpenAI Python SDK:
pip install openai
- Example client. Note that the SDK requires a model and an API key to be set; however, the gateway ignores both and uses whatever model and key are configured on the gateway side.
import openai
openai.api_key = '...' # Required by the SDK, AI Gateway will ignore this
# all client options can be configured just like the `OpenAI` instantiation counterpart
openai.base_url = "http://localhost:8080/v1/"
openai.default_headers = {"x-llm-provider": "openai"}
# Setting the model is enforced by the SDK. AI Gateway will ignore this value
completion = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Solve world hunger",
        },
    ],
)
print(completion.choices[0].message.content)
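The same client can be routed to any other configured provider by changing only the routing header, for example openai.default_headers = {"x-llm-provider": "ollama"} (assuming a matching [ollamaConfig] section exists in Config.toml); the request and response formats stay the same.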
- Multi-Provider Support: Route requests to OpenAI, Anthropic, Gemini, Ollama, Mistral, and Cohere
- Automatic Failover: When two or more providers are configured, automatically fails over to an alternative provider if the primary provider fails
- Rate Limiting: Enforce configurable request-rate policies per time window
- OpenAI compatible interface: Standardized input and output based on the OpenAI API interface
- Response Caching: In-memory cache with configurable TTL for improved performance and reduced API costs
- System Prompts: Inject system prompts into all LLM requests
- Response Guardrails: Configure content filtering and response constraints
- Analytics Dashboard: Monitor usage, tokens, and errors with visual charts
- Admin UI: Configure the gateway through the built-in web UI
- Administrative Controls: Configure gateway behavior via admin API
OpenAI compatible request interface
curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'x-llm-provider: ollama' \
--header 'Content-Type: application/json' \
--data '{
"messages": [{
"role": "user",
"content": "When will we have AGI? In 10 words"
}]
}
'
OpenAI API compatible response
{
"id": "01eff23c-208f-15a8-acdc-f400bba1bc6d",
"object": "chat.completion",
"created": 1740352553,
"model": "llama3.1:latest",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Estimating exact timeline uncertain, but likely within next few decades."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 27,
"completion_tokens": 14,
"total_tokens": 41
}
}
An example Python client is available in the grpc-client folder.
import grpc

import ai_gateway_pb2
import ai_gateway_pb2_grpc


def run():
    # Create a gRPC channel
    channel = grpc.insecure_channel('localhost:8082')

    # Create a stub (client)
    stub = ai_gateway_pb2_grpc.AIGatewayStub(channel)

    # Create a request
    request = ai_gateway_pb2.ChatCompletionRequest(
        llm_provider="ollama",
        messages=[
            ai_gateway_pb2.Message(
                role="system",
                content="You are a helpful assistant."
            ),
            ai_gateway_pb2.Message(
                role="user",
                content="What is the capital of France?"
            )
        ]
    )

    try:
        # Make the call
        response = stub.ChatCompletion(request)
        # ...
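The ai_gateway_pb2 and ai_gateway_pb2_grpc modules used above are generated from the gateway's proto definition with grpcio-tools; the proto file name and location below are assumptions, so adjust them to match the file shipped in the grpc-client folder.
# Install the gRPC runtime and code generator
pip install grpcio grpcio-tools
# Generate the Python stubs (proto file name assumed)
python -m grpc_tools.protoc -I . --python_out=. --grpc_python_out=. ai_gateway.proto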
Use the x-llm-provider HTTP header to route requests to different providers. The AI Gateway masks request-format differences between providers: always send requests in the OpenAI-compatible format, and the gateway will always respond in the OpenAI-compatible format.
LLM Provider | Header name | Header value |
---|---|---|
OpenAI | x-llm-provider | openai |
Ollama | x-llm-provider | ollama |
Anthropic | x-llm-provider | anthropic |
Gemini | x-llm-provider | gemini |
Mistral | x-llm-provider | mistral |
Cohere | x-llm-provider | cohere |
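For example, the same request can be routed to Anthropic by changing only the header value (this assumes an [anthropicConfig] section is present in Config.toml):
curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'x-llm-provider: anthropic' \
--header 'Content-Type: application/json' \
--data '{
"messages": [{
"role": "user",
"content": "When will we have AGI? In 10 words"
}]
}'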
The gateway automatically enables response caching to improve performance and save costs. The default cache duration is 1 hour. To bypass the cache for a specific request, send the Cache-Control: no-cache HTTP header with that request.
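For example, the following request skips the cache lookup:
curl --location 'http://localhost:8080/v1/chat/completions' \
--header 'x-llm-provider: openai' \
--header 'Content-Type: application/json' \
--header 'Cache-Control: no-cache' \
--data '{
"messages": [{
"role": "user",
"content": "Solve world hunger"
}]
}'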
The gateway can be configured using either the built-in admin UI or the admin REST API.
The main admin UI displays current stats for the server.

Configure Settings: system prompt, guardrails, and clear cache

Add/modify logging config

Add/modify rate limiting policy

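The rate limiting policy can also be set via the admin REST API: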
curl --location 'http://localhost:8081/admin/ratelimit' \
--header 'Content-Type: application/json' \
--data '{
"name": "basic",
"requestsPerWindow": 5,
"windowSeconds": 60
}'
Once rate limiting is enabled, the following three HTTP response headers announce the current limits. They are added to every HTTP response generated by the gateway.
Header name | Value | Description |
---|---|---|
RateLimit-Limit | number | Maximum number of requests allowed in the current policy |
RateLimit-Remaining | number | Number of requests that can be sent before rate limit policy is enforced |
RateLimit-Reset | number | How many seconds until current rate limit policy is reset |
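For example, the response headers can be inspected with curl -i; the header values shown below are illustrative, based on the 5-requests-per-60-seconds policy configured above.
curl -i --location 'http://localhost:8080/v1/chat/completions' \
--header 'x-llm-provider: openai' \
--header 'Content-Type: application/json' \
--data '{"messages": [{"role": "user", "content": "Hello"}]}'
# RateLimit-Limit: 5
# RateLimit-Remaining: 4
# RateLimit-Reset: 60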
The following GET call returns the currently configured rate limiting policy. If the response is empty, rate limiting is disabled.
curl --location 'http://localhost:8081/admin/ratelimit'
Response
{
"name": "basic",
"requestsPerWindow": 5,
"windowSeconds": 60
}
When two or more LLM providers are configured, the gateway attempts automatic failover if the provider chosen via the x-llm-provider header does not return a successful response.
The logs show a failover trail like the one below. In this example, the user sends the request to Ollama, while both Ollama and OpenAI are configured in the gateway.
First, a warning records that the primary provider failed. The following logs are formatted for clarity.
{
"timestamp": "2025-02-24T00:33:51.127868Z",
"level": "WARN",
"component": "failover",
"message": "Primary provider failed",
"metadata": {
"requestId": "01eff247-0444-1eb0-b153-61183107b722",
"provider": "ollama",
"error": "Something wrong with the connection:{}"
}
}
Then the first failover attempt is logged:
{
"timestamp": "2025-02-24T00:33:51.129457Z",
"level": "INFO",
"component": "failover",
"message": "Attempting failover",
"metadata": {
"requestId": "01eff247-0444-1eb0-b153-61183107b722",
"provider": "openai"
}
}
Admins can use the admin API to inject a system prompt into all outgoing requests. If the user supplies a system prompt in the request, the injected prompt is appended to it.
curl --location 'http://localhost:8081/admin/systemprompt' \
--header 'Content-Type: application/json' \
--data '{
"prompt": "respond only in chinese"
}'
The following GET request shows the current system prompt:
curl --location 'http://localhost:8081/admin/systemprompt'
Use the following API call to add guardrails
curl --location 'http://localhost:8081/admin/guardrails' \
--header 'Content-Type: application/json' \
--data '{
"bannedPhrases": ["obscene", "words"],
"minLength": 0,
"maxLength": 500000,
"requireDisclaimer": false
}'
Get the currently configured guardrails:
curl --location 'http://localhost:8081/admin/guardrails'
The gateway automatically enables response caching to save costs and improve responsiveness. The default cache duration is 1 hour. When a request is served from the cache, a corresponding entry is written to the logs.
The gateway looks for the Cache-Control: no-cache header and disables the cache lookup for those requests.
View current cached contents
curl --location 'http://localhost:8081/admin/cache'
Clear cache
curl --location --request DELETE 'http://localhost:8081/admin/cache'
Configure the following attributes in Config.toml to publish logs to Elasticsearch:
[defaultLoggingConfig]
enableElasticSearch = true
elasticSearchEndpoint = "http://localhost:9200"
elasticApiKey = "T2FtMks1VUIzVG..."
After that, at server startup you should see an index named "ai-gateway" being created in Elasticsearch. All subsequent logs are published to this index.
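To verify, query Elasticsearch directly (assuming the cluster is reachable at the configured endpoint; add an Authorization: ApiKey header if your cluster requires authentication):
# Check that the index exists
curl 'http://localhost:9200/_cat/indices/ai-gateway?v'
# Fetch a few recent log documents
curl 'http://localhost:9200/ai-gateway/_search?size=5'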
The following is a complete example of all the configuration options in the main gateway config file. At least one LLM provider config is mandatory. Create a Config.toml file:
[defaultLoggingConfig]
enableElasticSearch = false
elasticSearchEndpoint = "http://localhost:9200"
elasticApiKey = ""
enableSplunk = false
splunkEndpoint = ""
enableDatadog = false
datadogEndpoint = ""
[openAIConfig]
apiKey="your-api-key"
endpoint="https://api.openai.com"
model="gpt-4o"
[anthropicConfig]
apiKey="your-api-key"
model="claude-3-5-sonnet-20241022"
endpoint="https://api.anthropic.com"
[geminiConfig]
apiKey="your-api-key"
model="gemini-pro"
endpoint="https://generativelanguage.googleapis.com/v1/models"
[ollamaConfig]
apiKey=""
model="llama3.2"
endpoint="http://localhost:11434"
[mistralConfig]
apiKey = ""
model = "mistral-small-latest"
endpoint = "https://api.mistral.ai"
[cohereConfig]
apiKey = ""
model = "command-r-plus-08-2024"
endpoint = "https://api.cohere.com"
# Build and run the gateway
% bal run