© 2026 codingstairs


Step 1

Crawler ethics and legal boundaries


The tech is easy, the boundaries less so. The biggest risk isn't bans or lawsuits but accidental outages on someone else's site.

1. robots.txt

User-agent: *
Disallow: /admin/
Crawl-delay: 10
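  • Legal enforceability is weak: robots.txt is a convention (RFC 9309), not a contract
  • Ignoring it still raises the risk of bans and legal action
  • Crawl-delay is a non-standard extension, but honor it when a site sets one
  • Declare your crawler's User-Agent with a contact URL, e.g. MyBot/1.0 (+https://mysite.com/bot)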
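
A minimal sketch of checking these rules before a request, using Python's standard urllib.robotparser. The example.com URLs are placeholders, and the robots.txt text is the sample above passed in as a string so the check runs without network access; a real crawler would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Bot identity from the bullet above; replace with your own name and info URL.
USER_AGENT = "MyBot/1.0 (+https://mysite.com/bot)"

# The sample robots.txt shown above, inlined for an offline check.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Placeholder URLs: one outside and one inside the disallowed /admin/ path.
allowed_public = rules.can_fetch(USER_AGENT, "https://example.com/data/list")
allowed_admin = rules.can_fetch(USER_AGENT, "https://example.com/admin/users")

# crawl_delay() returns None when the file sets no Crawl-delay for this agent.
delay_seconds = rules.crawl_delay(USER_AGENT)
```

With the sample file, the public URL is allowed, the /admin/ URL is not, and the delay comes back as 10 seconds, which should override any faster default in your crawler.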

2. Self rate-limit

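import asyncio
import random
resp = await session.get(url)  # inside an async function; session is your async HTTP client
await asyncio.sleep(1 + random.random())  # pause 1–2 s (with jitter) before the next request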

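1–5 req/s is polite for most sites, and the snippet above stays under 1 req/s. 100 req/s borders on DoS. If robots.txt sets a Crawl-delay, use that value instead.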
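
When several coroutines share one crawler, a per-request sleep is easy to forget in one code path. One possible way to centralize it is a small limiter that every request must pass through. This is a sketch only: the class name PoliteLimiter and its parameters are made up for illustration, and the demo uses a short interval so it finishes quickly.

```python
import asyncio
import random
import time


class PoliteLimiter:
    """Keeps consecutive requests at least `min_interval` seconds apart."""

    def __init__(self, min_interval: float = 1.0, jitter: float = 1.0) -> None:
        self.min_interval = min_interval
        self.jitter = jitter  # upper bound of the random extra delay
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def wait(self) -> None:
        # The lock makes concurrent callers queue up one at a time.
        async with self._lock:
            delay = self._next_allowed - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            # Reserve the next slot: base interval plus random jitter.
            self._next_allowed = (
                time.monotonic() + self.min_interval + random.uniform(0, self.jitter)
            )


async def demo() -> float:
    # Create the limiter inside the running event loop.
    limiter = PoliteLimiter(min_interval=0.05, jitter=0.0)
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait()
        # A real crawler would issue its request here.
    return time.monotonic() - start


elapsed = asyncio.run(demo())
```

The first call passes immediately and each later call waits for its slot, so three calls take roughly two intervals. The defaults (1 s plus up to 1 s of jitter) mirror the sleep in the snippet above.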

3. Terms of Service

  • Many sites forbid automation
  • Public data is different — often there's an open API
  • Portal APIs are the safest path

4. No personal data

Emails, phone numbers, and names fall under PIPA (Korea) and GDPR (EU). Even when the data is publicly posted, bulk collection or repurposing is legally fraught.
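
If a page you are allowed to crawl happens to contain contact details, one defensive habit is to scrub them before anything is stored. The sketch below is a rough assumption-laden example: the regular expressions are approximate, the phone pattern only targets Korean-style numbers, and names cannot be caught by patterns like these at all.

```python
import re

# Approximate email pattern; it will miss unusual but valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Rough pattern for Korean-style numbers such as 010-1234-5678 or 02-123-4567.
PHONE_RE = re.compile(r"\b0\d{1,2}[-. ]?\d{3,4}[-. ]?\d{4}\b")


def scrub(text: str) -> str:
    """Replace email addresses and phone numbers with neutral markers."""
    text = EMAIL_RE.sub("[email removed]", text)
    return PHONE_RE.sub("[phone removed]", text)


sample = "Contact: hong@example.com, 010-1234-5678"
cleaned = scrub(sample)
```

Scrubbing is a safety net, not a licence: the safer choice is still to avoid pages whose main content is personal data.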

5. Copyright

  • Copying full text is out
  • Summaries + links are fine
  • Images carry both copyright and portrait rights
  • Databases can carry their own protection (e.g. the EU's sui generis database right)

6. No evasion

CAPTCHA bypass, IP rotation, and cookie tricks count as intentional circumvention. The CFAA (US) and similar unauthorized-access laws can apply.

7. Safer defaults

  • Use public data portals first
  • Obey robots.txt
  • 1–5 req/s
  • Announce your UA with contact
  • Skip personal data
  • Summarize and link for copyrighted text
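
One way to keep these defaults from drifting is to hold them in a single settings object that the rest of the crawler reads. The sketch below is illustrative only: CrawlPolicy, its field names, and the contact address are hypothetical, and the 5 req/s ceiling simply encodes the range given above. From is a standard HTTP request header for a contact email address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlPolicy:
    """Bundles the checklist defaults in one immutable object."""

    bot_name: str
    bot_version: str
    info_url: str  # page describing the bot and how to reach you
    contact_email: str
    max_requests_per_second: float = 1.0  # low end of the 1–5 req/s range
    obey_robots_txt: bool = True
    collect_personal_data: bool = False

    def __post_init__(self) -> None:
        # Refuse configurations faster than the polite ceiling.
        if not 0 < self.max_requests_per_second <= 5:
            raise ValueError("max_requests_per_second must be in (0, 5]")

    @property
    def min_interval(self) -> float:
        """Seconds to wait between consecutive requests."""
        return 1.0 / self.max_requests_per_second

    def headers(self) -> dict[str, str]:
        """Request headers that announce the bot and a contact address."""
        return {
            "User-Agent": f"{self.bot_name}/{self.bot_version} (+{self.info_url})",
            "From": self.contact_email,
        }


policy = CrawlPolicy(
    "MyBot", "1.0", "https://mysite.com/bot", "bot-admin@mysite.com",
    max_requests_per_second=2,
)
headers = policy.headers()
```

Passing policy.headers() to every request and policy.min_interval to the rate limiter keeps the polite settings in one reviewable place.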

8. If things go wrong

If an admin complains about an outage:

  1. Stop immediately
  2. Apologise, explain
  3. Share prevention steps
  4. Offer help with recovery

Most admins forgive honest mistakes. Dishonesty compounds the damage.

9. Public API portals first

  • data.go.kr, data.gov
  • opendart.fss.or.kr, data.nps.or.kr, opendata.hira.or.kr

10. Gotchas

  • Ignoring robots.txt
  • Concurrency too high
  • Personal data inclusion
  • Forged UAs
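
For the concurrency gotcha in particular, asyncio.Semaphore is one simple way to cap how many requests are in flight at once. In this sketch, crawl_all and fake_fetch are hypothetical names; fake_fetch stands in for a real HTTP call so the example runs offline and can record the peak concurrency.

```python
import asyncio
from typing import Awaitable, Callable


async def crawl_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 2,
) -> list[str]:
    """Run fetch() over urls with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> str:
        async with semaphore:  # waits here while the cap is reached
            return await fetch(url)

    # gather() returns results in the same order as the input urls.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def demo() -> tuple[list[str], int]:
    active = 0
    peak = 0

    async def fake_fetch(url: str) -> str:
        # Stand-in for a real request; tracks how many calls overlap.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return f"fetched {url}"

    urls = [f"https://example.com/page/{i}" for i in range(5)]
    results = await crawl_all(urls, fake_fetch, max_concurrency=2)
    return results, peak


results, peak = asyncio.run(demo())
```

A cap on concurrency limits parallel load but not request rate, so it works alongside a delay between requests rather than replacing it.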

Closing

Respect the fact that "your requests live on someone else's server". A modest request rate, politeness, and attribution cover 90% of cases.

Next

  • Step 2: Static vs dynamic (BS4 + Playwright)