codingstairs

Crawler ethics and tooling

Published 2026-04-28 · Updated 2026-05-18

Collecting public web data is a matter of ethics and law as well as technology. Requesting pages too frequently burdens the target server, and ignoring the terms of service can lead to legal trouble.

1. robots.txt

robots.txt is a plain-text file that tells web crawlers which paths they may and may not fetch. Martijn Koster proposed it in 1994; it served as a de facto standard for nearly three decades and was published by the IETF as RFC 9309 in September 2022.

User-agent: *
Disallow: /private/
Allow: /private/public.html
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml

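robots.txt is a voluntary convention rather than a legal mandate, but ignoring it can get a crawler blocked and can count against it in a legal dispute. Note that Crawl-delay is a widely used extension that RFC 9309 itself does not define, so crawlers differ in whether they honor it.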
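When an Allow and a Disallow rule both match a path, as with /private/public.html in the sample, RFC 9309 says the most specific (longest) matching rule applies, and an Allow wins a tie. The sketch below illustrates only that precedence rule for plain path prefixes. The RULES list and the is_allowed helper are hypothetical names, and the sketch ignores the * and $ wildcards as well as per-agent groups.

```python
# Rules from the sample file above, as (kind, path prefix) pairs.
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/public.html"),
]


def is_allowed(path, rules=RULES):
    """Return True if `path` may be fetched under longest-match precedence."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, prefix in rules:
        # An empty pattern matches nothing; non-matching rules are skipped.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        # The longer match wins; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


print(is_allowed("/private/public.html"))
print(is_allowed("/private/notes.html"))
print(is_allowed("/index.html"))
```

Parsers do not all implement the same precedence, so it is worth checking how the library in use resolves overlapping rules before relying on it.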

2. Crawling ethics

  • Identify yourself: send a User-Agent that states who is collecting what and for what purpose, and include a URL where the operator can be contacted.
  • Rate limit: cap concurrent connections and requests per second. If the site specifies Crawl-delay, honor it.
  • Use caches: store the ETag and Last-Modified values from each response and send them back as If-None-Match and If-Modified-Since, so unchanged pages come back as a 304 with no body.
  • Check the terms: read the site's terms of service or API terms for its policy on automated collection.
  • Personal data: even publicly posted data may be covered by PIPA (Korea), GDPR (EU), or CCPA (California, US).
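The first two points can be sketched in a few lines. This is a minimal illustration, not a complete crawler: the USER_AGENT value, its contact URL, and the Throttle class are all hypothetical, and the 5-second interval simply mirrors the Crawl-delay in the sample robots.txt. The clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import time

# Hypothetical identity: product token plus a URL where the operator can be reached.
USER_AGENT = "MyBot/1.0 (+https://example.com/bot-info)"


class Throttle:
    """Enforce a minimum interval between consecutive requests to one host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


throttle = Throttle(min_interval=5.0)  # e.g. the Crawl-delay value
throttle.wait()  # the first call returns immediately
# A real crawler would call throttle.wait() before every request to the host
# and send {"User-Agent": USER_AGENT} with each one.
```

Keeping one Throttle per host lets a crawler stay polite toward each site without slowing down requests to unrelated ones.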

3. Browser automation tools

| Tool | First appeared | Provider | Features |
|---|---|---|---|
| Selenium | 2004 | OSS, SeleniumHQ | The longest-standing standard. WebDriver spec (W3C). Multi-language. |
| Puppeteer | 2017 | Chrome team | Chrome DevTools Protocol (CDP). Node.js first. |
| Playwright | 2020 | Microsoft (team includes former Puppeteer engineers) | Multi-language (Node · Python · Java · .NET). Chromium · Firefox · WebKit. Auto-wait, tracing. |

Playwright arrived relatively late but is valued for operational conveniences such as auto-waiting, its selector engine, and the trace viewer. Puppeteer remains focused primarily on Chrome and Chromium.

4. HTML parsers

| Library | First appeared | Notes |
|---|---|---|
| BeautifulSoup | 2004, Leonard Richardson | A Python staple. Tolerant of messy, real-world HTML. |
| lxml | 2005 | C-based (libxml2). Fast. XPath. |
| parsel | around 2017 | Scrapy's selector packaged separately. CSS + XPath. |
| html5lib | 2008 | Faithful to the HTML5 spec. Slow but compatible. |
| Cheerio | 2012 | Node, jQuery-like API. |

BeautifulSoup lets you choose the underlying parser: html.parser (standard library), lxml, or html5lib. lxml is the fastest; html5lib is the most compatible because it follows the HTML5 parsing rules, at the cost of speed.

5. Static scraping vs browser automation

  • Static scraping (httpx + BeautifulSoup) — works on server-rendered HTML only. Fast and light.
  • Browser automation (Playwright) — sees the result after JS executes. Slow but essential for SPAs.

For cost and speed, it is reasonable to try static scraping first and move to a browser only when the page clearly depends on JavaScript.
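One rough way to decide when to step up is to check how much visible text the static HTML already contains. The sketch below uses only the standard library's html.parser. The needs_browser helper and its 200-character threshold are assumptions chosen for illustration, not an established rule, so tune them against pages you know.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def needs_browser(html, min_chars=200):
    """Heuristic: a nearly empty shell suggests the content is rendered by JS."""
    return len(visible_text(html)) < min_chars


spa_shell = '<html><body><div id="root"></div><script>render()</script></body></html>'
print(needs_browser(spa_shell))
```

A check for a specific element or phrase that must appear on the page is usually more reliable than a length threshold, when such a marker is known in advance.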

6. Limits of anti-bot evasion

Both crawl-blocking techniques and the ways around them keep evolving.

  • UA rotation: limited effectiveness. Other signals (header combinations, TLS fingerprint, behavior patterns) are often more decisive.
  • Headless detection evasion: sites inspect navigator.webdriver, font and canvas fingerprints, the WebGL renderer, and more. Helpers like playwright-stealth exist, but none is a permanent solution.
  • IP rotation: datacenter IPs are easily blocked, and residential proxies are costly and sit in a legal gray zone.
  • Cloudflare · Akamai · PerimeterX: these services combine JS challenges, device fingerprinting, and ML-based detection. Attempting to bypass them is likely to violate the terms of service.
  • CAPTCHA: automated solving is risky on both terms-of-service and legal grounds.

What is technically possible and what is ethically and legally permitted are not the same. A block is best read as the site operator's statement of intent.

7. API first

Korea's public data portal (data.go.kr, opened in 2013) provides many government datasets in OpenAPI form. When the same data is available both ways, an API is a better choice than scraping HTML: it is more stable, comes with explicit terms of use, and returns structured data. The US data.gov and the EU's data.europa.eu play a similar role.

8. Respecting robots.txt and using caches

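import urllib.robotparser
import httpx

USER_AGENT = 'MyBot/1.0'
cache = {}  # in-memory stand-in; swap in Redis or disk storage as needed

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # fetch and parse robots.txt once, then reuse the rules

def fetch(url):
    # Skip anything robots.txt disallows for this user agent.
    if not rp.can_fetch(USER_AGENT, url):
        return None
    headers = {'User-Agent': USER_AGENT}
    # Revalidate with the stored ETag so an unchanged page costs only a 304.
    if etag := cache.get(f'etag:{url}'):
        headers['If-None-Match'] = etag
    r = httpx.get(url, headers=headers)
    if r.status_code == 304:
        return cache.get(f'body:{url}')
    if 'ETag' in r.headers:  # remember the validator and body for next time
        cache[f'etag:{url}'], cache[f'body:{url}'] = r.headers['ETag'], r.text
    return r.text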
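The ETag revalidation above extends naturally to Last-Modified, which some servers send instead of, or alongside, an ETag. The two helpers below are hypothetical names for a minimal sketch: one records the validators from a response, the other turns them into the matching conditional request headers. They operate on plain dicts, so they are independent of the HTTP client in use.

```python
def remember_validators(response_headers):
    """Extract the cache validators from a response's headers (case-insensitive)."""
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return {
        "etag": lowered.get("etag"),
        "last_modified": lowered.get("last-modified"),
    }


def conditional_headers(validators):
    """Build request headers that ask the server for a 304 if nothing changed."""
    headers = {}
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    if validators.get("last_modified"):
        headers["If-Modified-Since"] = validators["last_modified"]
    return headers


saved = remember_validators({
    "ETag": '"abc123"',
    "Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT",
})
print(conditional_headers(saved))
```

Sending both headers is safe: a server that understands If-None-Match gives it precedence, and one that only tracks modification times can still answer with a 304.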

9. Concurrency limits

import asyncio
sem = asyncio.Semaphore(5)
async def fetch(url):
    async with sem:
        ...

A common policy is to cap per-domain concurrency at roughly 5 to 10 connections. Tune the cap to the size and policy of the target site.
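A single semaphore caps concurrency across the whole crawl, so a per-domain cap needs one semaphore per host. The following is a minimal sketch under that assumption. DomainLimiter is a hypothetical helper, and the fetch argument stands for whatever coroutine performs the actual request; fake_fetch below only simulates one.

```python
import asyncio
from urllib.parse import urlsplit


class DomainLimiter:
    """Cap the number of in-flight requests per host."""

    def __init__(self, limit=5):
        self._limit = limit
        self._semaphores = {}  # host -> asyncio.Semaphore

    def _semaphore_for(self, url):
        host = urlsplit(url).netloc.lower()
        if host not in self._semaphores:
            # Created lazily, so it is always built inside the running loop.
            self._semaphores[host] = asyncio.Semaphore(self._limit)
        return self._semaphores[host]

    async def run(self, url, fetch):
        """Await fetch(url) while holding that host's semaphore."""
        async with self._semaphore_for(url):
            return await fetch(url)


async def fake_fetch(url):
    # Stand-in for a real HTTP request.
    await asyncio.sleep(0.01)
    return url


async def crawl(urls, limit=5):
    limiter = DomainLimiter(limit)
    return await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))


pages = asyncio.run(crawl([f"https://example.com/page/{i}" for i in range(8)]))
print(len(pages))
```

Because the cap is keyed by host, a crawl that spans many sites can still run many requests in parallel overall while staying gentle on each individual server.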

10. Common pitfalls

Skipping the terms — there may be clauses banning automated collection. The bigger risk comes from law and terms, not technology.

Retry storms — unbounded retries on 5xx responses turn into a DDoS. Set backoff and a maximum retry count.

Session cookies and auth — scraping after authentication often falls under stricter terms.

License of stored HTML — redistributing crawl results requires copyright and database-right review.

Personal data — even when published, emails and contact details may have collection and retention restrictions.
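The retry-storm pitfall above can be bounded with exponential backoff and a hard retry limit. This is a sketch under stated assumptions: send is a hypothetical zero-argument callable that performs one request and returns a (status, body) pair, and the retryable status set, delays, and jitter range are illustrative defaults rather than recommended values.

```python
import random
import time

RETRYABLE = {500, 502, 503, 504}  # assumed set of transient server errors


def fetch_with_backoff(send, max_retries=3, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call send() until it succeeds or the retry budget is spent.

    Returns the last (status, body) pair, so the caller can still see the
    final error after giving up.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        # Stop on success, on a non-retryable error, or when out of retries.
        if status not in RETRYABLE or attempt == max_retries:
            return status, body
        # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
        delay = min(max_delay, base_delay * 2 ** attempt)
        # Random jitter keeps many workers from retrying in lockstep.
        sleep(delay + random.uniform(0, delay / 2))


# Usage with a stand-in that fails once and then succeeds.
responses = iter([(503, ""), (200, "ok")])
result = fetch_with_backoff(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

With the defaults, a persistently failing URL is requested four times in total and then abandoned, which keeps the load on a struggling server bounded.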

Closing thoughts

Crawling is more about the boundaries of law, terms, and ethics than about the technology itself. Wherever a public-data OpenAPI is available, that is the safer place to start. A crawler that respects the intent of a block tends to last longer than one that tries to bypass anti-bot measures.

See RFC 9309 — Robots Exclusion Protocol · Playwright · Puppeteer · Selenium · Beautiful Soup · Scrapy · data.go.kr · W3C WebDriver.
