Local LLM — LM Studio · llama.cpp · Ollama · vLLM

The trend of running large language models directly on personal computers has taken hold rapidly since 2023. Where there used to be only cloud APIs, the combination of quantization formats like GGUF, inference runtimes like llama.cpp, and tools like LM Studio · Ollama · vLLM has made it possible to run reasonably large models even on a laptop.

1. About the tools

llama.cpp — A C/C++ inference runtime released by Georgi Gerganov in March 2023. It started as a way to run Meta's LLaMA weights on CPU. Since then it has expanded with GPU acceleration (CUDA · Metal · Vulkan · ROCm), a wider model lineup, and the GGUF format standardization. Many other tools use llama.cpp as their internal engine.

LM Studio — A desktop app released by Element Labs in 2023. On Windows · macOS · Linux it provides a GUI to search, download, and run inference. Internally it uses both llama.cpp and the MLX (Apple Silicon) backend. A strong point is the OpenAI-compatible local server (http://localhost:1234/v1) — turn it on and other apps connect as is. Free for non-commercial use; commercial use has a separate licensing policy to check.

Ollama — A CLI-centric tool released in 2023. The simplicity of ollama run llama3.2 pulls and runs a model in one line. It has its own server (default 127.0.0.1:11434) and an OpenAI-compatible endpoint. The Modelfile shape resembles Docker.

vLLM — A serving engine released by UC Berkeley Sky Computing Lab in 2023. It pushed up throughput with a KV cache management technique called PagedAttention. It sits in a different place than LM Studio · Ollama, which are for a single user running light inference — vLLM targets many concurrent requests · high-throughput serving, and a GPU is essentially a prerequisite.

2. GGUF and quantization

GGUF (GPT-Generated Unified Format) is the successor to GGML, a format that settled in around August 2023. It packs model weights + metadata (tokenizer · architecture) into a single file. Quantization is the technique of reducing weights from 16/32-bit floats to fewer bits.

Notation	Bits	Note
Q2_K	2~3	Smallest · large quality loss
Q4_0 / Q4_K_M	4~5	Common compromise
Q5_K_M	5~6	Quality·size balance
Q6_K	6~7	Good quality
Q8_0	8	Small loss · half size
F16 / BF16	16	Close to full precision

Variants with K are K-quants — block-wise quantization that improves quality at the same bit count. _M · _S mean medium · small.

3. Model size vs memory

Approximate RAM · VRAM footprint for a 7B model (grows further depending on context length · KV cache):

Model size	F16	Q4_K_M	Q8_0
7B	1314 GB	45 GB	78 GB
13B	2426 GB	78 GB	1314 GB
70B	~130 GB+	3842 GB	~70 GB+

Exact numbers vary by model · tokenizer · implementation, so refer to the model card and measurements.

4. Inference backends

CUDA — NVIDIA GPU. Best supported by most tools.
Metal — Apple Silicon (M1/M2/M3/M4). Thanks to Unified Memory, the GPU can use a large portion of system RAM.
ROCm — AMD GPU. Support is gradually expanding.
Vulkan — General purpose. Performance is usually lower than CUDA · Metal.
CPU — Slowest, but runs anywhere.

5. Other tools

Tool	Position	Trait
LM Studio	Desktop GUI	Search · download · local server in one screen.
Ollama	CLI · background daemon	Simple with `ollama run`.
llama.cpp	Library · binary	Lowest layer.
Jan	Desktop GUI	Open-source LM Studio alternative.
GPT4All	Desktop GUI	Nomic-led · own model ecosystem.
vLLM	Server engine	High throughput · multi-user · GPU required.
TGI	Server engine	Hugging Face's serving tool.
MLX	Apple Silicon framework	Released by Apple in 2023.
llamafile	Single executable	Mozilla's Justine Tunney.

6. OpenAI-compatible server

LM Studio · Ollama · vLLM all provide OpenAI-compatible endpoints. Client code can stay almost untouched — only the base URL changes.

export OPENAI_BASE_URL=http://localhost:1234/v1
export OPENAI_API_KEY=anything

In Windows PowerShell: $env:OPENAI_BASE_URL = "...".

7. The thread of model selection

7~8B — Practical on laptops · consumer GPUs. Fine for general dialogue · summarization.
13~14B — One step up. 16 GB+ VRAM recommended.
30~34B — 24 GB+ VRAM, or a Mac with large unified memory.
70B+ — Datacenter-class or multiple GPUs.

Context length and KV cache — as context grows, the KV cache occupies memory separately from model size. Increasing 8k → 32k → 128k makes the cache grow more than proportionally and is a common cause of OOM.

8. Spots where you often get stuck

Model license differences — Llama · Gemma · Qwen · Mistral all have different licenses. Check commercial use · redistribution conditions on the model card.

Tokenizer mismatch — Cases where outputs go off due to tokenizer differences during GGUF conversion. Use the official conversion when possible.

Quantization limits — Low-bit quantization like Q2 · Q3 shows pronounced quality loss on small models. The observation is that larger models suffer less loss at the same bit count, relatively.

GPU memory + system memory split — When everything doesn't fit on the GPU, parts spill to CPU and speed drops sharply.

Driver · CUDA version — If the NVIDIA driver · CUDA version falls outside what the tool expects, GPU acceleration drops out and execution proceeds on CPU.

Exposing the local server externally — Default binding is usually 127.0.0.1. When exposing externally, handle authentication · firewall separately.

Difference between benchmarks and felt experience — Real usage after quantization can differ from short benchmarks. Compare directly with your own domain inputs.

Closing thoughts

Local LLMs are appealing for zero cost · privacy · internet independence. 7B Q4_K_M is the practical starting point on laptops · consumer GPUs. With large models + large context, the KV cache decides memory, so trimming context length down to what the domain actually requires is the standard flow.

rag-pgvector
prompt-design

References: llama.cpp GitHub · LM Studio · Ollama · vLLM · GGUF spec · Apple MLX · NVIDIA CUDA.

Local LLM — LM Studio · llama.cpp · Ollama · vLLM

1. About the tools

2. GGUF and quantization

Notation	Bits	Note
Q2_K	2~3	Smallest · large quality loss
Q4_0 / Q4_K_M	4~5	Common compromise
Q5_K_M	5~6	Quality·size balance
Q6_K	6~7	Good quality
Q8_0	8	Small loss · half size
F16 / BF16	16	Close to full precision

Variants with K are K-quants — block-wise quantization that improves quality at the same bit count. _M · _S mean medium · small.

3. Model size vs memory

Approximate RAM · VRAM footprint for a 7B model (grows further depending on context length · KV cache):

Model size	F16	Q4_K_M	Q8_0
7B	1314 GB	45 GB	78 GB
13B	2426 GB	78 GB	1314 GB
70B	~130 GB+	3842 GB	~70 GB+

Exact numbers vary by model · tokenizer · implementation, so refer to the model card and measurements.

4. Inference backends

CUDA — NVIDIA GPU. Best supported by most tools.
Metal — Apple Silicon (M1/M2/M3/M4). Thanks to Unified Memory, the GPU can use a large portion of system RAM.
ROCm — AMD GPU. Support is gradually expanding.
Vulkan — General purpose. Performance is usually lower than CUDA · Metal.
CPU — Slowest, but runs anywhere.

5. Other tools

Tool	Position	Trait
LM Studio	Desktop GUI	Search · download · local server in one screen.
Ollama	CLI · background daemon	Simple with `ollama run`.
llama.cpp	Library · binary	Lowest layer.
Jan	Desktop GUI	Open-source LM Studio alternative.
GPT4All	Desktop GUI	Nomic-led · own model ecosystem.
vLLM	Server engine	High throughput · multi-user · GPU required.
TGI	Server engine	Hugging Face's serving tool.
MLX	Apple Silicon framework	Released by Apple in 2023.
llamafile	Single executable	Mozilla's Justine Tunney.

6. OpenAI-compatible server

LM Studio · Ollama · vLLM all provide OpenAI-compatible endpoints. Client code can stay almost untouched — only the base URL changes.

export OPENAI_BASE_URL=http://localhost:1234/v1
export OPENAI_API_KEY=anything

In Windows PowerShell: $env:OPENAI_BASE_URL = "...".

7. The thread of model selection

7~8B — Practical on laptops · consumer GPUs. Fine for general dialogue · summarization.
13~14B — One step up. 16 GB+ VRAM recommended.
30~34B — 24 GB+ VRAM, or a Mac with large unified memory.
70B+ — Datacenter-class or multiple GPUs.

8. Spots where you often get stuck

Model license differences — Llama · Gemma · Qwen · Mistral all have different licenses. Check commercial use · redistribution conditions on the model card.

Tokenizer mismatch — Cases where outputs go off due to tokenizer differences during GGUF conversion. Use the official conversion when possible.

Quantization limits — Low-bit quantization like Q2 · Q3 shows pronounced quality loss on small models. The observation is that larger models suffer less loss at the same bit count, relatively.

GPU memory + system memory split — When everything doesn't fit on the GPU, parts spill to CPU and speed drops sharply.

Driver · CUDA version — If the NVIDIA driver · CUDA version falls outside what the tool expects, GPU acceleration drops out and execution proceeds on CPU.

Exposing the local server externally — Default binding is usually 127.0.0.1. When exposing externally, handle authentication · firewall separately.

Difference between benchmarks and felt experience — Real usage after quantization can differ from short benchmarks. Compare directly with your own domain inputs.

Closing thoughts

rag-pgvector
prompt-design

References: llama.cpp GitHub · LM Studio · Ollama · vLLM · GGUF spec · Apple MLX · NVIDIA CUDA.

Local LLM — LM Studio · llama.cpp · Ollama · vLLM

Local LLM — LM Studio · llama.cpp · Ollama · vLLM

1. About the tools

2. GGUF and quantization

3. Model size vs memory

4. Inference backends

5. Other tools

6. OpenAI-compatible server

7. The thread of model selection

8. Spots where you often get stuck

Closing thoughts

Next

More in ai

Local LLM — LM Studio · llama.cpp · Ollama · vLLM

Local LLM — LM Studio · llama.cpp · Ollama · vLLM

1. About the tools

2. GGUF and quantization

3. Model size vs memory

4. Inference backends

5. Other tools

6. OpenAI-compatible server

7. The thread of model selection

8. Spots where you often get stuck

Closing thoughts

Next

More in ai