codingstairs
NotesEDULifeContact
⌕Search⌘K
koen

Navigation

  • Intro
  • Blog
  • Life

Get in touch

Send without signing in. Add your email if you'd like a reply.

  • Leave a message anonymously →
  • ✉ warragon112@gmail.com
  • KakaoTalk Open Chat ↗

© 2026 codingstairs

  • Notes
  • EDU
  • Search
  • Life
  • Contact
  • Legal
  • RSS
  • GitHub
EDU›Local LLM · pgvector · building a RAG chatbot›Step 5

Step 5

Gemini · OpenAI-compatible APIs

0 views

Gemini · OpenAI-compatible APIs

Unify LM Studio, Gemini, OpenAI, Anthropic behind the OpenAI-compatible interface and switch with a single env var.

1. The de-facto standard

POST /v1/chat/completions schema — supported by LM Studio, Ollama, Gemini, Groq, Together, and more.

2. Single client abstraction

from openai import OpenAI

def make_client(provider):
    if provider == "lmstudio":
        return OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    if provider == "gemini":
        return OpenAI(
            base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
            api_key=os.environ["GEMINI_API_KEY"],
        )
    if provider == "openai":
        return OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    raise ValueError(provider)

client = make_client(os.environ.get("LLM_PROVIDER", "lmstudio"))

3. Model name mapping

Provider Chat model Embedding
LM Studio gemma-2-9b-it separate model
Gemini gemini-1.5-flash · gemini-2.0-flash-exp models/text-embedding-004
OpenAI gpt-4o-mini · gpt-4o text-embedding-3-small
Groq llama-3.1-70b-versatile —

4. Cost · latency (rough 2026)

Provider In 1M tok Out 1M tok p50
Gemini 1.5 flash $0.075 $0.30 500ms
GPT-4o mini $0.15 $0.60 600ms
Claude Haiku $0.25 $1.25 700ms
Local Gemma 9B (GPU) $0 $0 100–300ms

5. Streaming

stream = client.chat.completions.create(
    model=model, messages=[...], stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta: yield delta

Pair with FastAPI StreamingResponse.

6. Fallback

async def chat_with_fallback(messages):
    for p in ["lmstudio", "gemini", "openai"]:
        try:
            return make_client(p).chat.completions.create(model=MODEL_MAP[p]["chat"], messages=messages)
        except Exception as e:
            logger.warning(f"{p} failed: {e}")
    raise RuntimeError("all providers failed")

Local → free quota → paid.

7. Gotchas

  • Model typos — use client.models.list()
  • Token limits differ per provider (LM Studio at load time)
  • Sync vs async — openai ships AsyncOpenAI separately
  • API key leakage — env vars only, never logs

Closing

The OpenAI-compatible interface reduces vendor lock-in. One env var switches between dev, free quota, and production.

Next

  • 06-prompt-design

← Step 4

RAG pipeline

Step 6 →

Prompt design