Building public-data crawlers
Build an ethical crawler in six steps with Playwright, http_utils, and APScheduler.
- Difficulty: intermediate
- Lessons: 6
Public data sources such as NPS, DART, and HIRA are open to everyone, but automated collection comes with rules: robots.txt, rate limits, and terms of service. This course covers six steps to an ethical and sustainable crawler.
Who it's for
- Developers who need more control than portal APIs offer
- Anyone who has been blocked by a crawl target
- Teams who want incremental collection, schedules, and observability
What you can do afterwards
- Separate dynamic pages (Playwright) from static ones (BS4)
- Apply robots.txt + rate limit + backoff
- Schedule in KST with APScheduler
- Combine public APIs, ministry CSVs, and web scraping
- Incremental collection, dedup, checkpoints
- Healthchecks and failure alerts
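As an illustration of the robots.txt and rate-limit items above, here is a minimal sketch using Python's standard `urllib.robotparser`. The user agent string, the sample robots.txt, and the helper names `build_parser` and `polite_delay` are assumptions made for this example, not part of the course material.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; a real crawler should identify itself honestly.
USER_AGENT = "example-public-data-bot"

# Sample robots.txt kept inline so the sketch runs without network access.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was fetched (or cached) earlier."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, default_seconds: float = 1.0) -> float:
    """Use the site's Crawl-delay when declared, else a conservative default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


parser = build_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch(USER_AGENT, "https://example.org/data/list")
blocked = parser.can_fetch(USER_AGENT, "https://example.org/private/report")
```

In a real crawler, each request would be gated on `can_fetch` and followed by a pause of `polite_delay(parser)` seconds. robots.txt is only one of the rules; the terms of service still need to be read separately.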
Flow
[1] Ethics·law ──▶ [2] Tool choice ──▶ [3] Rate limit ──▶ [4] Schedule
                                                           │
                                                           ▼
                                    [6] Observability ◀── [5] Incremental·dedup
A crawler has two goals: refresh our DB without losing data, and do so without trespassing on the source site. The flow above strengthens each of these in turn.
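One way to picture "refresh without loss" is an idempotent write keyed on a unique source identifier, with a content hash for change detection. The sketch below uses only the standard library. The `items` table, its columns, and the `upsert` helper are hypothetical names chosen for this example.

```python
import hashlib
import json
import sqlite3


def content_hash(record: dict) -> str:
    """Stable digest of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert(conn: sqlite3.Connection, source_id: str, record: dict) -> str:
    """Write a record once per source_id; return what happened."""
    digest = content_hash(record)
    body = json.dumps(record, ensure_ascii=False)
    row = conn.execute(
        "SELECT digest FROM items WHERE source_id = ?", (source_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO items (source_id, digest, body) VALUES (?, ?, ?)",
            (source_id, digest, body),
        )
        outcome = "inserted"
    elif row[0] != digest:
        conn.execute(
            "UPDATE items SET digest = ?, body = ? WHERE source_id = ?",
            (digest, body, source_id),
        )
        outcome = "updated"
    else:
        outcome = "unchanged"  # re-crawling the same page is a no-op
    conn.commit()
    return outcome


# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "source_id TEXT PRIMARY KEY, digest TEXT NOT NULL, body TEXT NOT NULL)"
)
```

Because the primary key rejects duplicates and unchanged content is skipped, running the same crawl twice leaves the table in the same state, which is what makes retries and double-triggered schedules safe.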
Steps
- Crawler ethics and legal boundaries — robots.txt · terms · personal data
- Static vs dynamic — BS4 + Playwright — pick the right tool
- Rate limiting · retries · backoff — exponential + jitter
- APScheduler + KST — idempotency · replace_existing=True · double-trigger defence
- Incremental collection · deduplication — checkpoints · unique keys · change detection
- Observability · alerts — success rate · latency · Slack · PagerDuty
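The "exponential + jitter" idea from the rate-limiting step can be sketched as follows. This is a minimal illustration of the "full jitter" variant under assumed defaults (1 second base, 60 second cap, 5 attempts). The function names `backoff_delay` and `retry` and the choice of retryable exceptions are hypothetical.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: a random wait between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def retry(func, max_attempts: int = 5, base: float = 1.0, cap: float = 60.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, sleeping with jittered exponential backoff between failures."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, base, cap))
```

The jitter spreads retries from many workers over time so they do not hit the source site in lockstep. The `sleep` parameter is injectable only to make the helper easy to test without real waiting.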
Prerequisites — complete python-data-pipeline.