
LLM Landscape — Closed · Open · Korean-Specialized · Evaluation · Pricing

Published 2026-04-28 · Updated 2026-05-18

The LLM market shifts fast. Closed-API and open-weight models, English-centric and multilingual ones, cloud-hosted and self-hosted deployments, and models specialized for Korean all sit side by side.

1. Closed (API · weights private)

| Provider | Representative models | First release |
|---|---|---|
| OpenAI | GPT-3.5 · GPT-4 · GPT-4o · o1 · o3 | ChatGPT, 2022-11-30 |
| Anthropic | Claude · Claude 2 · 3 · 3.5 · 4 series | Claude, 2023-03 |
| Google DeepMind | Gemini 1.0 · 1.5 · 2.0 · 2.5 | Gemini, 2023-12-06 |
| Mistral AI | Mistral Large · Pixtral | 2023~ |
| Cohere | Command R · R+ | 2021~ |
| xAI | Grok series | 2023-11 |

Even within a single provider, model capability shifts quickly between generations and release dates.

2. Open weights

Model families whose weights can be downloaded and run for inference. License conditions differ per model.

| Model family | Origin | Note |
|---|---|---|
| Llama 2 / 3 / 3.1 / 3.2 / 3.3 | Meta | Custom license (conditional commercial). |
| Mistral · Mixtral · Codestral | Mistral AI | Mix of Apache 2.0 variants and non-commercial variants. |
| Gemma · Gemma 2 / 3 | Google | Gemma license. |
| Qwen / Qwen2 / Qwen2.5 / Qwen3 | Alibaba | Many Apache 2.0 variants. |
| DeepSeek (V2 · V3 · R1) | DeepSeek | License conditions vary per model. |
| Phi series | Microsoft | Known for small size. |
| Yi series | 01.AI | 2023~. |
| Falcon | TII (UAE) | 2023~. |
| OLMo | Allen AI | Aims to open even the training data. |
| StableLM · StableCode | Stability AI | 2023~. |

How "open" a model is varies. Releasing only the weights, releasing the training code as well, and releasing even the training data are different things, which is why many consider "open weights" a more accurate term than "open source."

3. Korean-specialized · Korean-company models

| Model | Origin | Note |
|---|---|---|
| HyperCLOVA X | Naver | Released 2023. Self-trained Korean LLM. |
| A.X (Adot X) | SK Telecom | Self-developed Korean model family. |
| Solar | Upstage | Open-weight variant published. |
| EXAONE | LG AI Research | Some open-weight variants published. |
| KoAlpaca · Polyglot-Ko | Community | Community Korean pre-training (Polyglot-Ko) and fine-tuning (KoAlpaca) efforts. |

Korean ability is best judged model by model. Even within one global model family, Korean quality can vary widely from one generation to the next.

4. Reasoning models · multimodal · context length

Reasoning models: A trend since late 2024 that includes OpenAI o1 and o3, DeepSeek R1, Claude's extended thinking, and Gemini 2.5's thinking mode. These models run a longer internal reasoning pass before responding, and spend more tokens and time accordingly.

Multimodal: Models that take images, audio, video, or documents as input alongside text are now standard. Examples include GPT-4o, Gemini, and Claude 3.x, though the supported modalities differ per model.

Context length expansion:

| Model | Context |
|---|---|
| GPT-4 (initial) | 8k · 32k |
| GPT-4-Turbo / GPT-4o | 128k |
| Claude 3 / 3.5 | 200k |
| Gemini 1.5 Pro | 1M (at release) |

A larger context isn't always the answer. Position effects such as "lost in the middle" (content in the middle of a long prompt is used less reliably) come along with higher cost and latency.

5. Evaluation sites

| Site | Operator | Trait |
|---|---|---|
| LMArena | LMSYS · UC Berkeley | Human blind comparison of two models → Elo. |
| LiveBench | Abacus.AI | Periodically refreshed eval set (mitigates data leakage). |
| MMLU | Hendrycks et al. 2020 | Multi-subject multiple choice. |
| BigBench / BBH | Google Research | Collection of various hard tasks. |
| HumanEval / MBPP | OpenAI · Google | Standard for coding eval. |
| SWE-bench | Princeton | Real GitHub-issue resolution rate. |
| GAIA | Hugging Face · Meta | General assistant tasks. |
| Open LLM Leaderboard | Hugging Face | Composite for open-weight models. |
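
Arena-style leaderboards turn pairwise human votes into ratings. The sketch below shows the textbook Elo update as one way to picture that step. It is not LMArena's actual pipeline; the K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes: (model A, model B, score for A).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]
for a, b, score_a in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a)

print(ratings)
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating is conserved; only the gap between models carries meaning.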

Limits of evaluation:

  • Suspicion of training-data leakage (cases where benchmarks ended up in training data).
  • Many evaluations are English-centric.
  • A single score doesn't translate directly into performance in your domain.

6. Pricing models

Per-token billing (API): Most closed models price input and output tokens separately, and output tokens are usually more expensive. Where context caching or prompt caching is offered, cached input is billed at a discount.

Cost per request ≈ (input tokens × input rate) + (output tokens × output rate)
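
As a worked instance of the formula above, here is a minimal sketch that also accounts for discounted cached input. The function name and all rates are hypothetical (USD per one million tokens); real prices come from each provider's pricing page.

```python
def request_cost(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m,
                 cached_tokens=0, cached_rate_per_m=None):
    """Estimate the USD cost of one request; rates are per one million tokens."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    if cached_rate_per_m is None:
        # No cache discount given: bill cached tokens like normal input.
        cached_rate_per_m = input_rate_per_m
    fresh_tokens = input_tokens - cached_tokens
    total = (fresh_tokens * input_rate_per_m
             + cached_tokens * cached_rate_per_m
             + output_tokens * output_rate_per_m)
    return total / 1_000_000


# Hypothetical rates: $3.00 input, $15.00 output, $0.30 cached input per 1M tokens.
no_cache = request_cost(10_000, 1_000, 3.00, 15.00)
with_cache = request_cost(10_000, 1_000, 3.00, 15.00,
                          cached_tokens=8_000, cached_rate_per_m=0.30)
print(f"no cache:   ${no_cache:.4f}")
print(f"with cache: ${with_cache:.4f}")
```

With these assumed rates, the 1,000 output tokens cost half as much as the 10,000 input tokens, which is why output length matters disproportionately in a budget.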

Subscription model (consumer): ChatGPT Plus / Team / Enterprise · Claude Pro / Team · Gemini Advanced · Perplexity Pro. These bundle a UI, a usage quota, and extra features.

Self-hosted: Open weights plus your own GPU or a cloud GPU. Per-call fees disappear, but GPU time, MLOps staffing, model updates, evaluation, and the general operational burden grow. As a rule of thumb, use an API for small, light workloads and self-host for high traffic or strict data-control needs. The break-even point depends on the workload.
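
The per-workload threshold can be estimated with simple arithmetic. The sketch below is a rough model under loud assumptions: every number is hypothetical, a fixed GPU allocation is assumed to handle the whole load, and staffing is folded into one flat monthly figure.

```python
def monthly_api_cost(requests_per_month: float, cost_per_request: float) -> float:
    """API spend scales linearly with request volume."""
    return requests_per_month * cost_per_request


def monthly_self_hosted_cost(gpu_hours: float, gpu_hourly_rate: float,
                             fixed_ops_cost: float) -> float:
    """Self-hosted spend is mostly fixed: GPU time plus operations overhead."""
    return gpu_hours * gpu_hourly_rate + fixed_ops_cost


def break_even_requests(cost_per_request: float, gpu_hours: float,
                        gpu_hourly_rate: float, fixed_ops_cost: float) -> float:
    """Monthly request volume above which self-hosting becomes cheaper."""
    fixed = monthly_self_hosted_cost(gpu_hours, gpu_hourly_rate, fixed_ops_cost)
    return fixed / cost_per_request


# Hypothetical inputs: $0.01 per API request, one GPU running 720 h/month
# at $2.00/h, and $3,000/month of operations overhead.
threshold = break_even_requests(0.01, 720, 2.00, 3_000)
print(f"break-even at about {threshold:,.0f} requests/month")
```

Below that volume the API is cheaper on cost alone; data-control requirements can still justify self-hosting earlier.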

Data usage policy: Even within the same provider, policies differ between the free tier, the paid API, and enterprise plans. Check the terms and the model card every time.

7. The thread of selection

  • Fast and cheap, in volume — GPT-4o-mini / Claude Haiku / Gemini Flash / small open models.
  • Quality first — GPT-4 / Claude Sonnet · Opus / Gemini Pro / large open models.
  • Reinforced reasoning — o1 · o3 / Claude extended thinking / Gemini Thinking / DeepSeek R1.
  • On-device · privacy — Small variants of Llama · Gemma · Phi · Qwen + LM Studio · Ollama.
  • Heavy Korean — Korean-specialized models · multilingual-strong global models, with your own domain evaluation.

8. Spots where you often get stuck

Aliasing volatility: Aliases like gpt-4 · gemini-1.5-pro-latest point at different underlying models over time. In production, pin to a dated snapshot identifier (for example, gpt-4-0613) rather than a floating alias.
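
One way to catch floating aliases before deployment is a naming check in CI. The sketch below is only a heuristic, under the assumption that pinned IDs end in a date-like suffix; it does not cover every provider's versioning scheme, and the function name is hypothetical.

```python
import re

# Date-like endings: -YYYY-MM-DD, -YYYYMMDD, or a 4-digit -MMDD suffix.
_DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8}|\d{4})$")


def looks_pinned(model_id: str) -> bool:
    """Heuristic: True if the model ID ends in a date-like suffix."""
    return bool(_DATED_SUFFIX.search(model_id))


for model_id in ["gpt-4", "gpt-4-0613", "gemini-1.5-pro-latest",
                 "gpt-4o-2024-08-06", "claude-3-5-sonnet-20241022"]:
    status = "pinned" if looks_pinned(model_id) else "floating alias?"
    print(f"{model_id:30s} {status}")
```

A check like this only flags names; the authoritative list of snapshot IDs is each provider's model documentation.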

Benchmark over-trust: The #1 model on a leaderboard isn't necessarily #1 in your domain.

License differences: Open weights do not automatically mean commercial use is allowed. Check the license on the model card.

Data training use: Free, paid, and enterprise tiers can have different policies on whether your inputs are used for training. Keep sensitive information out of the input.

Generation-change regression: A new model doesn't beat the old one in every respect. Regressions sometimes show up in your own tasks.
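
A per-task comparison on your own evaluation set makes such regressions visible before a model switch. The sketch below assumes scores are already computed per task on a 0 to 1 scale; the task names, scores, and the 0.02 tolerance are hypothetical.

```python
def find_regressions(old_scores: dict, new_scores: dict, tolerance: float = 0.02) -> dict:
    """Return tasks where the new model trails the old one by more than tolerance."""
    regressions = {}
    for task, old in old_scores.items():
        new = new_scores.get(task)
        if new is not None and old - new > tolerance:
            regressions[task] = (old, new)
    return regressions


# Hypothetical per-task accuracy for the current and the candidate model.
old = {"summarize_ko": 0.82, "extract_invoice": 0.91, "sql_gen": 0.74}
new = {"summarize_ko": 0.88, "extract_invoice": 0.85, "sql_gen": 0.75}

for task, (before, after) in find_regressions(old, new).items():
    print(f"regression on {task}: {before:.2f} -> {after:.2f}")
```

Here the candidate improves on two tasks yet regresses on one, which an averaged score would hide.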

Advertised vs actual context length: The advertised context window and a model's actual limits on input and output tokens sometimes differ. The output cap in particular is often much smaller than the window.
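
A pre-flight check against both limits avoids rejected requests. This is a minimal sketch assuming the context window is shared between input and output, which is common but not universal; the `ModelLimits` type and the numbers are hypothetical, and real limits come from the provider's model documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLimits:
    context_window: int      # total tokens (input + output), as assumed here
    max_output_tokens: int   # separate cap on generated tokens


def fits(limits: ModelLimits, input_tokens: int, requested_output_tokens: int) -> bool:
    """True if the request respects both the output cap and the shared window."""
    if requested_output_tokens > limits.max_output_tokens:
        return False
    return input_tokens + requested_output_tokens <= limits.context_window


# Hypothetical model: 128k window, but only 4,096 output tokens.
limits = ModelLimits(context_window=128_000, max_output_tokens=4_096)
print(fits(limits, 100_000, 4_000))   # within both limits
print(fits(limits, 100_000, 8_000))   # exceeds the output cap
print(fits(limits, 126_000, 4_000))   # exceeds the shared window
```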

Tokens of reasoning models: Providers differ on whether thinking tokens are returned in the response and on how they are billed.

"AGI" · "superhuman" expressions: Filter out marketing language like this when interpreting evaluation results.

Closing thoughts

The LLM landscape shifts fast, so depending on a single model in production carries regression risk. Four things are the baseline for stable operation: a pinned model version, your own domain evaluation set, a setup that lets you swap models by changing one environment variable, and cost monitoring.
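
The one-environment-variable swap might look like the sketch below. The variable name `LLM_MODEL` and both model IDs are hypothetical; the point is only that application code reads the model from configuration and falls back to a pinned default.

```python
import os

# Hypothetical pinned default; a real deployment would use a dated snapshot ID.
DEFAULT_MODEL = "example-model-2025-01-01"


def resolve_model(env=None) -> str:
    """Return the model ID from LLM_MODEL, or the pinned default if unset or blank."""
    env = os.environ if env is None else env
    model = env.get("LLM_MODEL", "").strip()
    return model or DEFAULT_MODEL


print(resolve_model({}))
print(resolve_model({"LLM_MODEL": "other-model-2025-06-01"}))
```

Passing a mapping instead of reading `os.environ` directly keeps the function easy to test; in production, calling `resolve_model()` with no argument picks up the process environment.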

References: LMArena · LiveBench · Open LLM Leaderboard · OpenAI Models · Anthropic Models · Gemini Models · Meta Llama · Mistral · DeepSeek.
