codingstairs
NotesEDULifeContact
⌕Search⌘K
koen

Navigation

  • Intro
  • Blog
  • Life

Get in touch

Send without signing in. Add your email if you'd like a reply.

  • Leave a message anonymously →
  • ✉ warragon112@gmail.com
  • KakaoTalk Open Chat ↗

© 2026 codingstairs

  • Notes
  • EDU
  • Search
  • Life
  • Contact
  • Legal
  • RSS
  • GitHub
Notes›ai

Local LLM — LM Studio · llama.cpp · Ollama · vLLM

Published 2026-04-28· Updated 2026-05-18·0 views

Local LLM — LM Studio · llama.cpp · Ollama · vLLM

The trend of running large language models directly on personal computers has taken hold rapidly since 2023. Where there used to be only cloud APIs, the combination of quantization formats like GGUF, inference runtimes like llama.cpp, and tools like LM Studio · Ollama · vLLM has made it possible to run reasonably large models even on a laptop.

1. About the tools

llama.cpp — A C/C++ inference runtime released by Georgi Gerganov in March 2023. It started as a way to run Meta's LLaMA weights on CPU. Since then it has expanded with GPU acceleration (CUDA · Metal · Vulkan · ROCm), a wider model lineup, and the GGUF format standardization. Many other tools use llama.cpp as their internal engine.

LM Studio — A desktop app released by Element Labs in 2023. On Windows · macOS · Linux it provides a GUI to search, download, and run inference. Internally it uses both llama.cpp and the MLX (Apple Silicon) backend. A strong point is the OpenAI-compatible local server (http://localhost:1234/v1) — turn it on and other apps connect as is. Free for non-commercial use; commercial use has a separate licensing policy to check.

Ollama — A CLI-centric tool released in 2023. The simplicity of ollama run llama3.2 pulls and runs a model in one line. It has its own server (default 127.0.0.1:11434) and an OpenAI-compatible endpoint. The Modelfile shape resembles Docker.

vLLM — A serving engine released by UC Berkeley Sky Computing Lab in 2023. It pushed up throughput with a KV cache management technique called PagedAttention. It sits in a different place than LM Studio · Ollama, which are for a single user running light inference — vLLM targets many concurrent requests · high-throughput serving, and a GPU is essentially a prerequisite.

2. GGUF and quantization

GGUF (GPT-Generated Unified Format) is the successor to GGML, a format that settled in around August 2023. It packs model weights + metadata (tokenizer · architecture) into a single file. Quantization is the technique of reducing weights from 16/32-bit floats to fewer bits.

Notation Bits Note
Q2_K 2~3 Smallest · large quality loss
Q4_0 / Q4_K_M 4~5 Common compromise
Q5_K_M 5~6 Quality·size balance
Q6_K 6~7 Good quality
Q8_0 8 Small loss · half size
F16 / BF16 16 Close to full precision

Variants with K are K-quants — block-wise quantization that improves quality at the same bit count. _M · _S mean medium · small.

3. Model size vs memory

Approximate RAM · VRAM footprint for a 7B model (grows further depending on context length · KV cache):

Model size F16 Q4_K_M Q8_0
7B 1314 GB 45 GB 78 GB
13B 2426 GB 78 GB 1314 GB
70B ~130 GB+ 3842 GB ~70 GB+

Exact numbers vary by model · tokenizer · implementation, so refer to the model card and measurements.

4. Inference backends

  • CUDA — NVIDIA GPU. Best supported by most tools.
  • Metal — Apple Silicon (M1/M2/M3/M4). Thanks to Unified Memory, the GPU can use a large portion of system RAM.
  • ROCm — AMD GPU. Support is gradually expanding.
  • Vulkan — General purpose. Performance is usually lower than CUDA · Metal.
  • CPU — Slowest, but runs anywhere.

5. Other tools

Tool Position Trait
LM Studio Desktop GUI Search · download · local server in one screen.
Ollama CLI · background daemon Simple with ollama run.
llama.cpp Library · binary Lowest layer.
Jan Desktop GUI Open-source LM Studio alternative.
GPT4All Desktop GUI Nomic-led · own model ecosystem.
vLLM Server engine High throughput · multi-user · GPU required.
TGI Server engine Hugging Face's serving tool.
MLX Apple Silicon framework Released by Apple in 2023.
llamafile Single executable Mozilla's Justine Tunney.

6. OpenAI-compatible server

LM Studio · Ollama · vLLM all provide OpenAI-compatible endpoints. Client code can stay almost untouched — only the base URL changes.

export OPENAI_BASE_URL=http://localhost:1234/v1
export OPENAI_API_KEY=anything

In Windows PowerShell: $env:OPENAI_BASE_URL = "...".

7. The thread of model selection

  • 7~8B — Practical on laptops · consumer GPUs. Fine for general dialogue · summarization.
  • 13~14B — One step up. 16 GB+ VRAM recommended.
  • 30~34B — 24 GB+ VRAM, or a Mac with large unified memory.
  • 70B+ — Datacenter-class or multiple GPUs.

Context length and KV cache — as context grows, the KV cache occupies memory separately from model size. Increasing 8k → 32k → 128k makes the cache grow more than proportionally and is a common cause of OOM.

8. Spots where you often get stuck

Model license differences — Llama · Gemma · Qwen · Mistral all have different licenses. Check commercial use · redistribution conditions on the model card.

Tokenizer mismatch — Cases where outputs go off due to tokenizer differences during GGUF conversion. Use the official conversion when possible.

Quantization limits — Low-bit quantization like Q2 · Q3 shows pronounced quality loss on small models. The observation is that larger models suffer less loss at the same bit count, relatively.

GPU memory + system memory split — When everything doesn't fit on the GPU, parts spill to CPU and speed drops sharply.

Driver · CUDA version — If the NVIDIA driver · CUDA version falls outside what the tool expects, GPU acceleration drops out and execution proceeds on CPU.

Exposing the local server externally — Default binding is usually 127.0.0.1. When exposing externally, handle authentication · firewall separately.

Difference between benchmarks and felt experience — Real usage after quantization can differ from short benchmarks. Compare directly with your own domain inputs.

Closing thoughts

Local LLMs are appealing for zero cost · privacy · internet independence. 7B Q4_K_M is the practical starting point on laptops · consumer GPUs. With large models + large context, the KV cache decides memory, so trimming context length down to what the domain actually requires is the standard flow.

Next

  • rag-pgvector
  • prompt-design

References: llama.cpp GitHub · LM Studio · Ollama · vLLM · GGUF spec · Apple MLX · NVIDIA CUDA.

More in ai

All in this category →
  • Google NotebookLM — source-grounded Gemini notebook (RAG-shaped tool)
  • Google AI Studio — Gemini-powered AI Web IDE + app builder
  • LLM Landscape — Closed · Open · Korean-Specialized · Evaluation · Pricing
  • AI Agents — Definition · Patterns · Frameworks · Autonomy
  • Embeddings Deep — Models · Dimensions · Benchmarks · Cache
  • Gemini — Google's Multimodal LLM Lineup