Building public-data crawlers

Public data such as NPS, DART, and HIRA is open to everyone, but automated collection comes with rules: robots.txt, rate limits, and terms of service. This course walks through six steps toward an ethical and sustainable crawler.

Who it's for

Developers who need more control than portal APIs offer
Anyone who has been blocked by a crawl target
Teams who want incremental collection, schedules, and observability

What you can do afterwards

Separate dynamic pages (Playwright) from static ones (BS4)
Apply robots.txt + rate limit + backoff
Schedule in KST with APScheduler
Combine public APIs, ministry CSVs, and web scraping
Incremental collection, dedup, checkpoints
Healthchecks and failure alerts
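
As a rough illustration of the "robots.txt + rate limit" item above, the sketch below uses only the standard library's urllib.robotparser. The user agent name, the robots.txt body, and the example.org URLs are made up for the example; a real crawler would download the target site's own robots.txt and would also check its terms of service, which robots.txt does not cover.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical values, for illustration only.
USER_AGENT = "example-public-data-bot"
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_body: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if robots.txt permits USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(parser: RobotFileParser, default: float = 1.0) -> float:
    """Seconds to wait between requests: Crawl-delay if declared, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


parser = build_parser(ROBOTS_TXT)
```

With this sample file, is_allowed(parser, "https://example.org/private/x") is False and polite_delay(parser) is 2.0. Sleeping for polite_delay(parser) seconds between requests is one simple way to turn the declared Crawl-delay into a rate limit.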

Flow

[1] Ethics·law ──▶ [2] Tool choice ──▶ [3] Rate limit ──▶ [4] Schedule
                                                              │
                                                              ▼
                                  [6] Observability ◀── [5] Incremental·dedup

A crawler has two goals: keep our DB refreshed without losing data, and avoid trespassing on the source site. Each step in the flow above strengthens one of these two axes in turn.
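
One way to picture the "without loss" axis is an idempotent write keyed on a unique ID, plus a checkpoint that records how far the last run got. The sketch below uses the standard library's sqlite3; the table names, the source_id column, and the checkpoint name are assumptions made for the example, not a schema prescribed by the course.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create a records table with a unique key and a small checkpoint table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "source_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint ("
        "name TEXT PRIMARY KEY, last_seen TEXT NOT NULL)"
    )


def upsert_records(conn: sqlite3.Connection, rows) -> int:
    """Write (source_id, payload) rows; return how many were new or changed."""
    changed = 0
    for source_id, payload in rows:
        existing = conn.execute(
            "SELECT payload FROM records WHERE source_id = ?", (source_id,)
        ).fetchone()
        if existing is not None and existing[0] == payload:
            continue  # duplicate with identical content: nothing to do
        conn.execute(
            "INSERT OR REPLACE INTO records (source_id, payload) VALUES (?, ?)",
            (source_id, payload),
        )
        changed += 1
    conn.commit()
    return changed


def save_checkpoint(conn: sqlite3.Connection, name: str, last_seen: str) -> None:
    """Remember the last position (for example a date or page cursor) of a job."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoint (name, last_seen) VALUES (?, ?)",
        (name, last_seen),
    )
    conn.commit()


def load_checkpoint(conn: sqlite3.Connection, name: str):
    """Return the saved position for a job, or None if it has never run."""
    row = conn.execute(
        "SELECT last_seen FROM checkpoint WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

Because re-running upsert_records with the same rows changes nothing, a job that is triggered twice or restarted from an older checkpoint does not create duplicates.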

Steps

[1] Crawler ethics and legal boundaries: robots.txt · terms · personal data
[2] Static vs dynamic (BS4 + Playwright): pick the right tool
[3] Rate limiting · retries · backoff: exponential + jitter
[4] APScheduler + KST: idempotency · replace_existing=True · double-trigger defence
[5] Incremental collection · deduplication: checkpoints · unique keys · change detection
[6] Observability · alerts: success rate · latency · Slack · PagerDuty
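
To make the "exponential + jitter" item in the retry step concrete, here is a minimal sketch of retrying with exponential backoff and full jitter. The TransientError class, the function names, and the default values (1 second base, 60 second cap, 5 attempts) are assumptions for the example; a real crawler would map timeouts or HTTP 429/5xx responses onto whatever error it chooses to retry.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 429, 5xx)."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def fetch_with_retries(fetch, max_attempts: int = 5, base: float = 1.0,
                       cap: float = 60.0, sleep=time.sleep):
    """Call fetch(); on TransientError wait with backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt, base=base, cap=cap))
```

The delay ceiling doubles with each failed attempt until it reaches the cap, and the random factor spreads retries out so that several workers do not hit the source site at the same instant. Passing sleep and rng as parameters is only there to make the behaviour easy to test without real waiting.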

Prerequisite: complete python-data-pipeline first.
