Step 1

Why local LLMs · getting started with LM Studio


A one-line ChatGPT call is fast and easy. Still, there are places where a local LLM is the answer.

1. Four places local wins

  • Data cannot leave the machine: internal docs, health records, financial data
  • Per-request cost adds up: backends that make dozens of calls per second
  • Latency must be predictable: cloud tail latencies can exceed 500 ms
  • Offline or on a personal device: AI built into a Tauri desktop app

Quality and context length still favour Claude Opus or GPT-4 class models.

2. LM Studio — the standard local launcher

LM Studio is free and runs on macOS, Windows, and Linux. Pick a GGUF build and run Gemma, Llama, Qwen, or Mistral.

# Download LM Studio
# Search → gemma-2-9b-it · llama-3.2-3b · qwen2.5-coder
# Load Model → Server tab → Start Server (default http://localhost:1234)
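Once the server is started, one quick way to confirm it is reachable is to request its model list. The sketch below uses only the Python standard library. It assumes the server answers `GET /v1/models` with the usual OpenAI list shape (`{"data": [{"id": ...}]}`); the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234"  # LM Studio default address from the steps above


def models_url(base_url: str = BASE_URL) -> str:
    """Build the model-list URL for an OpenAI-compatible server."""
    return base_url.rstrip("/") + "/v1/models"


def parse_model_ids(payload: dict) -> list:
    """Extract model ids, assuming the OpenAI list shape {"data": [{"id": ...}]}."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_local_models(base_url: str = BASE_URL, timeout: float = 3.0) -> list:
    """Return the ids the server reports, or an empty list if it is unreachable."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (OSError, ValueError):  # connection refused, timeout, or a non-JSON reply
        return []
```

Calling `list_local_models()` while the server is stopped should return an empty list; with a model loaded it should return that model's id.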

3. OpenAI-compatible endpoint

The server speaks the OpenAI API, so the OpenAI SDK works as-is.

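from openai import OpenAI

# The SDK requires an api_key; a placeholder is enough for the local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
resp = client.chat.completions.create(
    model="gemma-2-9b-it",  # must match the id the server reports at /v1/models
    messages=[{"role": "user", "content": "Answer briefly: 1 + 1 = ?"}],
    temperature=0.3,
)
print(resp.choices[0].message.content)  # the reply text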

Change base_url and model (plus the API key) to switch between cloud and local.
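One way to make that switch a configuration change rather than a code change is to keep both targets in a small table. This is a minimal sketch: the `LLM_TARGET` environment variable, the helper names, and the cloud model name `gpt-4o-mini` are illustrative assumptions and do not come from the text above.

```python
import os

# Both targets speak the same chat-completions API; only these values differ.
# The cloud entry is an illustrative assumption.
TARGETS = {
    "local": {
        "base_url": "http://localhost:1234/v1",
        "api_key": "lm-studio",  # placeholder, as in the call above
        "model": "gemma-2-9b-it",
    },
    "cloud": {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-4o-mini",
    },
}


def pick_target(name=None):
    """Return settings for `name`, else for $LLM_TARGET, else for "local"."""
    name = name or os.environ.get("LLM_TARGET", "local")
    if name not in TARGETS:
        raise ValueError(f"unknown target {name!r}; expected one of {sorted(TARGETS)}")
    return TARGETS[name]


def ask(prompt, target=None):
    """Send one user message to the chosen target and return the reply text."""
    from openai import OpenAI  # imported here so the config helpers work without the SDK

    cfg = pick_target(target)
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
    )
    return resp.choices[0].message.content
```

With this layout, `ask("Answer briefly: 1 + 1 = ?", target="local")` and the same call with `target="cloud"` run the identical request against either backend.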

4. VRAM guide

Params    Quant     VRAM
3B        Q4_K_M    4 GB
7–9B      Q4_K_M    8–12 GB
14B       Q4_K_M    16 GB
32B       Q4_K_M    24 GB+

CPU-only inference works but typically generates only 1–5 tok/s. Use a GPU for real-time responses.
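The VRAM figures can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below assumes about 4.8 bits per weight for Q4_K_M (an approximation that varies by model) and a flat 1.5 GB allowance for KV cache and runtime buffers. That allowance is a guess and grows with context length.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.8, overhead_gb=1.5):
    """Rough VRAM need in GB: quantized weights plus a flat runtime allowance.

    Both defaults are assumptions for illustration, not measured values.
    """
    # 1e9 params * bits per weight / 8 bits per byte = size in GB
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


for size in (3, 9, 14, 32):
    print(f"{size}B at Q4_K_M: about {estimate_vram_gb(size)} GB")
```

Under these assumptions each size lands below the VRAM listed for it in the guide, which leaves headroom for longer contexts.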

5. Picking a model

  • Code · RAG summary — Qwen2.5-Coder · Gemma 2 9B
  • Korean quality — Gemma 2 9B · Gemma 4 e2b-it (2026)
  • Low VRAM — Llama 3.2 3B · Phi-3 mini

Start with Gemma 2 9B Q4_K_M.

6. Gotchas

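  • Model name mismatch: use the id returned by curl http://localhost:1234/v1/models
  • Temperature too high: use 0.1–0.4 for RAG, 0.7–1.0 for creative writing
  • Context accumulates: the message history you resend grows with every turn and nothing trims it automatically, so trim it yourself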
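The three gotchas above can each be handled with a small helper. This is a minimal sketch with hypothetical function names. It counts characters as a crude stand-in for tokens, and the temperature presets are simply values picked from inside the ranges above.

```python
def resolve_model_id(wanted, available_ids):
    """Map a short name to the server's exact id.

    Tries an exact match first, then a unique case-insensitive substring match.
    """
    if wanted in available_ids:
        return wanted
    matches = [m for m in available_ids if wanted.lower() in m.lower()]
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"{wanted!r} matched {len(matches)} of {available_ids}")


# Presets chosen from inside the ranges above (a choice, not a rule).
TEMPERATURE = {"rag": 0.2, "creative": 0.8}


def trim_history(messages, max_chars=8000):
    """Keep system messages plus the newest turns that fit in a character budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        budget -= len(msg["content"])
        if budget < 0:
            break
        kept.append(msg)
    return system + kept[::-1]  # restore chronological order
```

Passing `trim_history(messages)` as the `messages` argument on every call keeps the request bounded. If precision matters, the character count could be swapped for a real tokenizer.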

Closing

Start your first RAG against Gemini or OpenAI to validate the flow, then switch to local. Local is not a silver bullet; being able to switch on demand is the real win.

Next

  • Step 2: Embeddings, text to vectors