Text encoding and line endings

Published 2026-04-28 · Updated 2026-05-18


The same character can be stored as different bytes depending on the era and the environment. That mismatch is why Korean text turns into garbage and why git diff fills with meaningless line changes.

1. A short history of encoding

ASCII (1963): American Standard Code for Information Interchange, a 7-bit code (0-127) covering English letters, digits, and basic symbols. It is the bedrock of English text but cannot represent Korean, Japanese, or Chinese.

EUC-KR · CP949: EUC-KR is an 8-bit encoding for Korean that represents the hanja and Hangul ranges of KS X 1001 in 2 bytes each. Its Hangul range is limited to KS X 1001's 2,350 precomposed syllables, so less common syllables are left out. CP949 (code page 949) is Microsoft's extension of EUC-KR that covers all 11,172 Hangul syllables. The Korean Windows ANSI code page is CP949.
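To make the byte-level difference concrete, here is a minimal Python sketch that encodes the same Korean text as UTF-8 and as CP949, then decodes each with the wrong codec. The sample string and variable names are illustrative only.

```python
# Encode the same Korean text under two encodings and compare the bytes.
text = "한글"

utf8_bytes = text.encode("utf-8")    # 3 bytes per syllable
cp949_bytes = text.encode("cp949")   # 2 bytes per syllable

print(len(utf8_bytes), utf8_bytes.hex(" "))
print(len(cp949_bytes), cp949_bytes.hex(" "))

# Reading UTF-8 bytes as CP949 is what "broken Korean" looks like.
garbled = utf8_bytes.decode("cp949", errors="replace")
print(garbled)

# The reverse direction fails outright: CP949 bytes are not valid UTF-8.
try:
    cp949_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```

The two failure modes differ: a permissive decoder silently produces wrong characters, while a strict one raises an error.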

Unicode (1991) and UTF-8 (1993): Unicode is the unified character set the Unicode Consortium first published in 1991. Every character it covers receives a code point (Hangul '가' = U+AC00). UTF-8 is the variable-length encoding Ken Thompson and Rob Pike designed in 1992 and presented publicly in 1993:

  • The ASCII range (0-127) takes 1 byte and is fully ASCII compatible.
  • Hangul usually takes 3 bytes.
  • Supplementary plane characters such as emoji take 4 bytes.
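The byte counts in the list above can be checked with a short Python sketch. The three sample characters are arbitrary picks, one per range.

```python
# One sample character per range described above.
samples = {
    "A": "ASCII letter",
    "가": "Hangul syllable",
    "😀": "emoji (supplementary plane)",
}

# Map each character to the number of bytes UTF-8 needs for it.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_lengths[ch]} byte(s)")
```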

UTF-8's strengths are ASCII compatibility, self-synchronization (decoding can start mid-stream and find boundaries), and endian independence. The web, Unix, and most modern tools default to UTF-8. W3C has recommended UTF-8 in HTML since 2008.
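Self-synchronization follows from the byte layout: every continuation byte has the bit pattern 10xxxxxx, so a decoder dropped into the middle of a stream can skip forward to the next byte that is not a continuation. A minimal sketch, where next_boundary is a hypothetical helper name:

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that starts a UTF-8 character."""
    # Continuation bytes look like 0b10xxxxxx, so mask the top two bits.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "a가b".encode("utf-8")   # 1 + 3 + 1 bytes
start = next_boundary(data, 2)  # index 2 lands inside "가"
print(data[start:].decode("utf-8"))
```

Skipping at most three bytes is enough, because no UTF-8 character is longer than four bytes.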

UTF-16: the encoding with 16-bit code units used inside Java, .NET, Windows, and some older systems. Hangul takes 2 bytes and most emoji take 4 (a surrogate pair). Endianness (LE/BE) has to be tracked separately.
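A small Python sketch of the UTF-16 points above: the same Hangul syllable in both byte orders, and an emoji stored as a surrogate pair. The sample characters and variable names are illustrative.

```python
hangul = "가"   # U+AC00, inside the Basic Multilingual Plane
emoji = "😀"    # U+1F600, outside it

le = hangul.encode("utf-16-le")   # low byte first
be = hangul.encode("utf-16-be")   # high byte first
print(le.hex(" "), "|", be.hex(" "))

# Outside the BMP, UTF-16 needs two 16-bit units (a surrogate pair).
pair = emoji.encode("utf-16-be")
print(len(pair), pair.hex(" "))

# The plain "utf-16" codec prepends a BOM so readers can tell the byte order.
with_bom = hangul.encode("utf-16")
print(with_bom[:2].hex(" "))
```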

2. BOM (Byte Order Mark)

A BOM is a short byte sequence at the start of a file that announces the encoding and endianness.

| Encoding | BOM bytes |
| --- | --- |
| UTF-8 | EF BB BF |
| UTF-16 LE | FF FE |
| UTF-16 BE | FE FF |

UTF-8 has no endianness so a BOM is not required, but some tools rely on it for encoding detection:

  • Microsoft Excel: on Korean Windows, Excel assumes ANSI (CP949) when it opens a CSV, so UTF-8 CSVs commonly break. Saving with a BOM lets Excel recognize UTF-8.
  • Where it hurts: a BOM at the top of a shell script breaks the shebang (#!/usr/bin/env bash) because the kernel no longer sees #! as the first two bytes. Some JSON parsers also reject a leading BOM.
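Checking for the three BOMs in the table takes only a prefix comparison. The sketch below uses the constants from Python's codecs module. detect_bom is a hypothetical helper name, and it deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as UTF-16 LE.

```python
import codecs

# The BOMs from the table above, paired with a readable label.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(data: bytes):
    """Return the encoding a leading BOM announces, or None if there is none."""
    for bom, label in KNOWN_BOMS:
        if data.startswith(bom):
            return label
    return None


# The "utf-8-sig" codec drops the BOM while decoding, if one is present.
with_bom = codecs.BOM_UTF8 + "가".encode("utf-8")
print(detect_bom(with_bom), repr(with_bom.decode("utf-8-sig")))
```

Decoding the same bytes with plain "utf-8" keeps the BOM as an invisible U+FEFF character at the start of the string, which is a common source of confusing comparison failures.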

3. Line endings

| Notation | Bytes | Origin |
| --- | --- | --- |
| LF | 0A | Unix family, modern macOS, Linux, Web |
| CR | 0D | Classic Mac OS (Mac OS 9 and earlier). Nearly extinct. |
| CRLF | 0D 0A | DOS, Windows, some network protocols (HTTP and SMTP headers) |

CR (Carriage Return) and LF (Line Feed) come from mechanical teletypes — "carriage to the left edge" and "advance one line." DOS chose CRLF, which sends both. Unix simplified to a single LF.

Most modern tools handle either style automatically. Differences surface in specific places: git diff, shell script execution, some JSON or YAML parsers, Python's open(..., newline='').
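When a file's line endings are in doubt, counting the three styles on the raw bytes settles it. A minimal sketch, assuming the file fits in memory. count_line_endings and to_lf are hypothetical helper names.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone LF, and lone CR terminators in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,   # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,   # CR not followed by LF
    }


def to_lf(data: bytes) -> bytes:
    """Normalize every line ending to LF. CRLF must be replaced first."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


mixed = b"one\r\ntwo\nthree\rfour"
print(count_line_endings(mixed))
print(to_lf(mixed))
```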

4. git's line-ending handling

core.autocrlf:

| Value | Behavior |
| --- | --- |
| true | LF→CRLF on checkout, CRLF→LF on commit (recommended on Windows). |
| input | CRLF→LF on commit only (recommended on macOS, Linux). |
| false | No conversion. |

.gitattributes (recommended as the single source of truth, SSOT): declared at the repository level, so every clone follows the same rules. It takes precedence over core.autocrlf:

* text=auto
*.sh text eol=lf
*.bat text eol=crlf
*.cmd text eol=crlf
*.ps1 text eol=crlf
*.png binary
*.jpg binary

text=auto lets git normalize files it identifies as text. eol=lf or eol=crlf enforces line endings regardless of OS. Mark binary files as binary so git neither converts their line endings nor diffs them as text.

5. PowerShell 5.1 encoding traps

In Windows PowerShell 5.1, Out-File and the > redirection default to UTF-16 LE (with BOM), while Set-Content defaults to the system ANSI code page (CP949 on Korean Windows). Either way, files turn out broken in environments expecting UTF-8.

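# Safe patterns under 5.1
"hello" | Out-File -Encoding utf8 file.txt  # UTF-8, but 5.1 prepends a BOM
# .NET resolves relative paths against the process working directory, not the PowerShell location
[System.IO.File]::WriteAllText("file.txt", "hello", [System.Text.UTF8Encoding]::new($false))  # UTF-8 without BOM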

PowerShell 7+ defaults to UTF-8 (no BOM), so the issue largely disappears.

6. The VS Code indicator

The right side of the VS Code status bar shows the current file's encoding and line ending:

  • Encoding — UTF-8, UTF-8 with BOM, Windows 1252.
  • Line ending — LF, CRLF.

Clicking opens convert and reinterpret options. "Save with Encoding" rewrites the file, "Reopen with Encoding" reads a broken file with the right encoding.

{
  "files.encoding": "utf8",
  "files.eol": "\n"
}

7. Common commands

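| Task | macOS · Linux | Windows |
| --- | --- | --- |
| Check file encoding | file -I path (macOS), file -i path (Linux) | (Windows PowerShell 5.1) Get-Content path -Encoding Byte -TotalCount 3 shows the first three bytes for a BOM check. PowerShell 7 replaces -Encoding Byte with -AsByteStream. |
| Force convert to LF | dos2unix path | dos2unix path (Git Bash) |
| Force convert to CRLF | unix2dos path | unix2dos path (Git Bash) |
| Check git line endings | git ls-files --eol | Same |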

8. Common pitfalls

A .sh saved as CRLF gets \r appended to the shebang's interpreter, so the system tries to run bash\r and fails.
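A check for this pitfall can run before a script is ever executed. The sketch below inspects only the first line of the raw bytes. has_crlf_shebang is a hypothetical helper name.

```python
def has_crlf_shebang(data: bytes) -> bool:
    """True if the first line is a shebang that ends in a stray carriage return."""
    first_line = data.split(b"\n", 1)[0]
    return first_line.startswith(b"#!") and first_line.endswith(b"\r")


broken = b"#!/usr/bin/env bash\r\necho hi\r\n"
fixed = broken.replace(b"\r\n", b"\n")

print(has_crlf_shebang(broken))
print(has_crlf_shebang(fixed))
```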

Parsing UTF-8 JSON that starts with a BOM errors out in several standard libraries. JavaScript's JSON.parse and Python's json.loads both reject a string that begins with U+FEFF, so strip the BOM before parsing.
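In Python the failure and the fix fit in a few lines. This is a sketch with a made-up payload. load_json_bytes is a hypothetical helper name.

```python
import codecs
import json

# A made-up JSON payload saved "as UTF-8 with BOM".
raw = codecs.BOM_UTF8 + b'{"name": "\xea\xb0\x80"}'


def load_json_bytes(data: bytes):
    """Decode with utf-8-sig so a leading BOM, if any, is dropped first."""
    return json.loads(data.decode("utf-8-sig"))


# Plain utf-8 keeps the BOM as U+FEFF, and json.loads refuses the string.
try:
    json.loads(raw.decode("utf-8"))
    plain_ok = True
except json.JSONDecodeError:
    plain_ok = False

print(plain_ok, load_json_bytes(raw))
```

The utf-8-sig codec also accepts input without a BOM, so it is a safe default for files of unknown origin.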

Korean breakage in Excel UTF-8 CSV — add a BOM, or use the "Data → Get External Data" menu to specify the encoding.
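When the CSV is produced by a script, the BOM can be written at export time. A minimal Python sketch, assuming rows is a list of lists. write_excel_csv is a hypothetical helper name.

```python
import csv


def write_excel_csv(path, rows):
    """Write rows as UTF-8 CSV with a BOM so Excel detects the encoding."""
    # "utf-8-sig" emits EF BB BF before the first row.
    # newline="" leaves line endings to the csv module (CRLF by default).
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)
```

A call such as write_excel_csv("people.csv", [["이름", "도시"], ["가람", "서울"]]) would then produce a file that opens in Excel without the import dialog. The file name and rows here are placeholders.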

Older Notepad on Windows saves as ANSI by default, or as UTF-16 LE when "Unicode" is chosen, so Linux tools fail on its files. Notepad on Windows 10 1903+ defaults to UTF-8 (no BOM).

Python's open(path, "r") defaults the encoding to the OS locale. On Korean Windows that may be cp949, so pass encoding="utf-8" explicitly.
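Passing both encoding and newline explicitly removes the dependence on the OS. The sketch below shows one way to do it. Both helper names are hypothetical.

```python
def write_utf8_lf(path, text):
    """Write text as UTF-8 with LF line endings on every OS."""
    # newline="\n" writes "\n" unchanged instead of translating it to os.linesep.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)


def read_utf8_keep_newlines(path):
    """Read UTF-8 text without translating CRLF or CR to LF."""
    # newline="" returns line endings exactly as stored in the file.
    with open(path, "r", encoding="utf-8", newline="") as f:
        return f.read()
```

With the default newline=None, reading would turn every CRLF into LF, which hides exactly the difference this note is about.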

Closing thoughts

Once UTF-8 + LF + .gitattributes settle in as the SSOT trio, encoding and line-ending accidents from OS differences nearly disappear. The two spots still worth babysitting are PowerShell 5.1's UTF-16 LE default and Excel's CP949 guesswork.


The Unicode Standard · RFC 3629 UTF-8 · git-scm gitattributes · Microsoft Code Pages · Wikipedia Newline for reference.
