codingstairs
© 2026 codingstairs

Building public-data crawlers

Build an ethical crawler in six steps with Playwright, http_utils, and APScheduler.

Difficulty: intermediate
Lessons: 6

Public data sources such as NPS, DART, and HIRA are open to everyone, but automated collection comes with rules: robots.txt, rate limits, and terms of service. This course builds an ethical, sustainable crawler in six steps.

Who it's for

  • Developers who need more control than portal APIs offer
  • Anyone who has been blocked by a crawl target
  • Teams who want incremental collection, schedules, and observability

What you can do afterwards

  • Separate dynamic pages (Playwright) from static ones (BS4)
  • Apply robots.txt + rate limit + backoff
  • Schedule in KST with APScheduler
  • Combine public APIs, ministry CSVs, and web scraping
  • Incremental collection, dedup, checkpoints
  • Healthchecks and failure alerts
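As a small illustration of the robots.txt item above, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `data.example.org` URLs are made up for the example and do not come from the course or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the target site instead of hard-coding it.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-bot"  # assumed name, for illustration only


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url):
    """Return True if USER_AGENT may fetch the given URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
open_ok = is_allowed(parser, "https://data.example.org/open/list")
private_ok = is_allowed(parser, "https://data.example.org/private/report")
delay_seconds = parser.crawl_delay(USER_AGENT)  # None when no Crawl-delay is set
```

Checking `is_allowed` before every request, and sleeping at least `delay_seconds` between requests when it is set, is one simple way to respect the source site. Terms of service still have to be read by a person.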

Flow

[1] Ethics·law ──▶ [2] Tool choice ──▶ [3] Rate limit ──▶ [4] Schedule
                                                              │
                                                              ▼
                                  [6] Observability ◀── [5] Incremental·dedup

A crawler has two goals: keep our DB fresh without losing records, and avoid trespassing on the source site. The steps in the flow above strengthen both axes in turn.
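To make "refresh without loss" concrete, the sketch below shows one possible shape for deduplication with change detection: each record is stored under a unique key together with a content hash, so a re-crawl only writes rows that are new or changed. The in-memory `store` dict, the `upsert_records` helper, and the `id` and `name` fields are assumptions for illustration; the course may well use a database and a different key.

```python
import hashlib
import json


def content_hash(record):
    """Stable SHA-256 hash of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert_records(store, records, key_field="id"):
    """Insert new records, update changed ones, and skip unchanged ones.

    `store` maps unique key -> {"hash": ..., "record": ...}.
    Returns counts so a run can be logged or monitored.
    """
    counts = {"inserted": 0, "updated": 0, "unchanged": 0}
    for record in records:
        key = record[key_field]
        digest = content_hash(record)
        previous = store.get(key)
        if previous is None:
            counts["inserted"] += 1
        elif previous["hash"] != digest:
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
            continue  # nothing to write
        store[key] = {"hash": digest, "record": record}
    return counts


store = {}
first_run = upsert_records(store, [{"id": "A1", "name": "alpha"},
                                   {"id": "B2", "name": "beta"}])
second_run = upsert_records(store, [{"id": "A1", "name": "alpha"},
                                    {"id": "B2", "name": "beta v2"}])
```

Because running the same batch twice leaves the store unchanged, a crawl that is retried or triggered twice does not create duplicates.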

Steps

  1. Crawler ethics and legal boundaries — robots.txt · terms · personal data
  2. Static vs dynamic — BS4 + Playwright — pick the right tool
  3. Rate limiting · retries · backoff — exponential + jitter
  4. APScheduler + KST — idempotency · replace_existing=True · double-trigger defence
  5. Incremental collection · deduplication — checkpoints · unique keys · change detection
  6. Observability · alerts — success rate · latency · Slack · PagerDuty
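Step 3 mentions exponential backoff with jitter. A minimal sketch of that idea follows, using the "full jitter" variant in which the wait is drawn uniformly between zero and an exponentially growing ceiling. The function names, the default `base` and `cap` values, and the injectable `sleep` parameter are assumptions for illustration, not the course's `http_utils` API.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter delay in seconds for a zero-based attempt number."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retry(func, max_attempts=4, base=1.0, cap=60.0, sleep=time.sleep):
    """Call func(), retrying on any exception with jittered backoff.

    The last failure is re-raised so the caller can alert on it.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))


# Usage sketch: a fake fetch that fails twice and then succeeds.
calls = {"count": 0}
waits = []  # records delays instead of actually sleeping


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retry(flaky_fetch, sleep=waits.append)
```

Jitter spreads retries out so that several workers that failed at the same moment do not hit the source site again in lockstep. In a real crawler it would be reasonable to retry only on transient errors (timeouts, HTTP 429 or 5xx) rather than on every exception.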

Prerequisite: complete python-data-pipeline first.
