
Step 3

Rate limit · retries · backoff


Two things make a crawler respectful of its target: the pace you set for yourself and how you back off after a failure.

1. Self rate-limit

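import asyncio
import random

async def polite_get(client, url):
    resp = await client.get(url)
    # Pause 1-2 s before returning so the caller's next request is delayed
    await asyncio.sleep(1 + random.random())
    return resp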

The random jitter keeps concurrent workers from drifting into lockstep and hitting the server in synchronized bursts.
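polite_get paces each caller separately, so the combined rate still grows with the number of tasks. One way to pace all tasks together is a shared limiter that spaces out request starts. This is a minimal sketch: MinIntervalLimiter and limited_get are hypothetical names, and client is assumed to be any object with an async get method.

```python
import asyncio
import random
import time


class MinIntervalLimiter:
    """Keeps the start of consecutive calls at least `interval` (+ jitter) apart."""

    def __init__(self, interval=1.0, jitter=1.0):
        self.interval = interval
        self.jitter = jitter
        self._lock = asyncio.Lock()
        self._next_ok = 0.0  # monotonic time at which the next call may start

    async def wait(self):
        # The lock is held while sleeping, so waiting tasks queue up in order
        async with self._lock:
            delay = max(0.0, self._next_ok - time.monotonic())
            if delay:
                await asyncio.sleep(delay)
            gap = self.interval + random.uniform(0, self.jitter)
            self._next_ok = time.monotonic() + gap


async def limited_get(client, url, limiter):
    await limiter.wait()  # wait for our turn before sending
    return await client.get(url)
```

Create the limiter inside the running event loop and pass the same instance to every task.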

2. Concurrency cap

sem = asyncio.Semaphore(3)  # at most 3 requests in flight

async def bounded_get(client, url):
    async with sem:  # the slot stays taken through polite_get's sleep as well
        return await polite_get(client, url)
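To see the cap at work, the sketch below fans ten URLs out through a semaphore of 3 and records the peak number of overlapping calls. FakeClient and the example.org URLs are stand-ins invented for the demo, and no real requests are made.

```python
import asyncio


class FakeClient:
    """Stand-in for an HTTP client; records how many calls overlap."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def get(self, url):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0.01)  # pretend network latency
        self.in_flight -= 1
        return f"body of {url}"


async def crawl(urls, limit=3):
    sem = asyncio.Semaphore(limit)  # created inside the running loop
    client = FakeClient()

    async def bounded_get(url):
        async with sem:
            return await client.get(url)

    bodies = await asyncio.gather(*(bounded_get(u) for u in urls))
    return bodies, client.peak


urls = [f"https://example.org/page/{n}" for n in range(10)]
bodies, peak = asyncio.run(crawl(urls))
print(len(bodies), peak)  # 10 results, never more than 3 in flight
```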

3. Exponential backoff

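import asyncio
import random
import httpx

async def fetch_with_retry(client, url, max_retries=4):
    for i in range(max_retries):
        try:
            resp = await client.get(url, timeout=30)
        except (httpx.TimeoutException, httpx.ConnectError):
            if i == max_retries - 1:
                raise  # out of attempts: surface the network error
        else:
            if resp.status_code not in (429, 503):
                resp.raise_for_status()  # other 4xx/5xx errors are not retried
                return resp
        if i < max_retries - 1:
            # Back off 1 s, 2 s, 4 s, ... plus jitter before the next attempt
            await asyncio.sleep(2 ** i + random.random())
    raise RuntimeError(f"max retries exceeded for {url}")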
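With a larger max_retries, 2 ** i grows quickly, so it can help to cap the delay. The helper below is a sketch of that idea. backoff_delay and its cap parameter are assumptions added for illustration, and the jitter matches the random.random() term used above.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay before retry number `attempt` (0-based): capped exponential plus jitter."""
    exp = min(cap, base * (2 ** attempt))
    return exp + rng()  # rng() returns a value in [0, 1)


# Print the schedule with jitter switched off to make the growth visible
print([backoff_delay(i, rng=lambda: 0.0) for i in range(8)])
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```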

4. Respect Retry-After

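from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

if resp.status_code == 429:  # inside the retry loop from section 3
    retry_after = resp.headers.get("Retry-After")
    if retry_after:  # delta-seconds, or an HTTP-date giving an absolute time
        wait = float(retry_after) if retry_after.isdigit() else (
            parsedate_to_datetime(retry_after) - datetime.now(timezone.utc)).total_seconds()
        await asyncio.sleep(max(wait, 0))
        continue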
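Retry-After may hold either a number of seconds or an HTTP-date, and the date form is an absolute time that has to be converted into a delay. A standalone helper makes that easier to test. retry_after_seconds is a hypothetical name, and returning None for unusable values is a design choice of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into a non-negative delay in seconds.

    Returns None when the value is missing or cannot be parsed.
    """
    if not value:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:  # treat a naive result as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

When the helper returns None, the caller can fall back to the exponential backoff delay.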

5. Circuit breaker

import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=60):
        self.fails = 0
        self.opened_at = None  # monotonic timestamp of when the circuit opened
        self.threshold = threshold
        self.cooldown = cooldown

    async def call(self, fn):
        # While open, reject calls until the cooldown has elapsed
        if self.opened_at is not None and time.monotonic() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open")
        try:
            r = await fn()
        except Exception:
            self.fails += 1
            if self.fails >= self.threshold:
                self.opened_at = time.monotonic()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count
        self.fails = 0
        self.opened_at = None
        return r
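The state changes are easier to follow with a fake clock. The sketch below restates the breaker in compact form so the demo can advance time by hand. Breaker, the clock parameter, and the failing and healthy coroutines are all invented for the illustration.

```python
import asyncio


class Breaker:
    """Compact restatement of the breaker above, with an injectable clock."""

    def __init__(self, threshold, cooldown, clock):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.fails = 0
        self.opened_at = None

    async def call(self, fn):
        if self.opened_at is not None and self.clock() - self.opened_at < self.cooldown:
            raise RuntimeError("circuit open")
        try:
            result = await fn()
        except Exception:
            self.fails += 1
            if self.fails >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.fails = 0
        self.opened_at = None
        return result


async def demo():
    now = [0.0]  # fake clock, advanced by hand
    breaker = Breaker(threshold=2, cooldown=60, clock=lambda: now[0])
    trace = []

    async def failing():
        raise ConnectionError("upstream down")

    async def healthy():
        return "ok"

    for fn in (failing, failing, healthy):  # two failures trip the breaker
        try:
            trace.append(await breaker.call(fn))
        except RuntimeError:
            trace.append("rejected")
        except ConnectionError:
            trace.append("failed")

    now[0] = 61.0  # cooldown elapsed: the next call is let through as a probe
    trace.append(await breaker.call(healthy))
    return trace


print(asyncio.run(demo()))  # → ['failed', 'failed', 'rejected', 'ok']
```

The third call is rejected without touching the target at all, which is the point: an open circuit gives a struggling server room to recover.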

6. If blocked

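A sustained run of 403 or 429 responses usually means you have been blocked.

  • Pause for several hours before trying again
  • Halve your request rate when you resume
  • Check your User-Agent header
  • Don't reach for a VPN or proxies: evading a block is usually the wrong call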
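Detecting the run of blocked responses can be automated. The sketch below counts consecutive 403/429 responses and, once a threshold is reached, tells the caller to pause and doubles the delay between requests, which halves the rate. BlockDetector and its default threshold of 5 are assumptions, not values from any particular site.

```python
class BlockDetector:
    """Tracks consecutive 403/429 responses and signals when to stop and slow down."""

    def __init__(self, threshold=5, delay=1.0):
        self.threshold = threshold  # hypothetical cut-off for "probably blocked"
        self.delay = delay          # current seconds between requests
        self.streak = 0

    def record(self, status_code):
        """Return True when the crawler should pause for a long cooldown."""
        if status_code in (403, 429):
            self.streak += 1
        else:
            self.streak = 0  # any other response breaks the streak
        if self.streak >= self.threshold:
            self.streak = 0
            self.delay *= 2  # halving the rate means doubling the delay
            return True
        return False


detector = BlockDetector(threshold=3)
signals = [detector.record(code) for code in (200, 429, 403, 429)]
print(signals, detector.delay)  # → [False, False, False, True] 2.0
```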

7. Distributed rate limit

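import time

async def acquire_token(key, capacity, window_sec=60):
    # Fixed-window counter shared by all workers: at most `capacity` calls per window.
    # `redis` is assumed to be a shared async client, e.g. redis.asyncio.Redis.
    bucket = int(time.time() // window_sec)
    k = f"rl:{key}:{bucket}"
    count = await redis.incr(k)
    if count == 1:
        await redis.expire(k, window_sec * 2)  # old windows expire on their own
    return count <= capacity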
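The Redis counter above is a fixed-window limiter, so it can let through a burst on each side of a window boundary. A token bucket refills continuously and smooths that out. The sketch below is a single-process version for illustration only: TokenBucket and the injectable clock are assumptions, and a distributed version would need the same arithmetic done atomically in Redis.

```python
import time


class TokenBucket:
    """In-process token bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Add the tokens earned since the last call, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With capacity=2 and refill_per_sec=0.5, two requests pass immediately and after that one more is allowed every two seconds.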

8. Logging

# Keyword fields assume a structlog-style logger; stdlib logging rejects arbitrary kwargs
logger.info("fetch", url=url, status=resp.status_code, attempt=i, wait=wait)

You can't tune what you don't measure.
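Log lines become useful once they are aggregated. The sketch below keeps running counts per status code and derives a success rate from them. FetchStats is a hypothetical helper, and counting only 2xx responses as successes is an assumption of this sketch.

```python
from collections import Counter


class FetchStats:
    """Running tallies of fetch outcomes, fed from the same place as the log call."""

    def __init__(self):
        self.by_status = Counter()
        self.retried = 0

    def record(self, status_code, attempt):
        self.by_status[status_code] += 1
        if attempt > 0:  # attempt is 0-based, so anything above 0 was a retry
            self.retried += 1

    def success_rate(self):
        total = sum(self.by_status.values())
        ok = sum(n for code, n in self.by_status.items() if 200 <= code < 300)
        return ok / total if total else 0.0


stats = FetchStats()
for _ in range(19):
    stats.record(200, attempt=0)
stats.record(429, attempt=1)
print(f"{stats.success_rate():.0%}")  # → 95%
```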

9. Timeouts

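# httpx.Timeout needs a default or all four of connect/read/write/pool;
# here 30 s is the default and connect is tightened to 10 s
async with httpx.AsyncClient(timeout=httpx.Timeout(30.0, connect=10.0)) as client:
    ...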

Never run without a timeout.
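Connect and read timeouts each bound one phase of a request, so a response that trickles in slowly can still take much longer in total. An overall deadline around the whole call covers that case. This is a sketch using only asyncio: with_deadline is a hypothetical helper, and returning None on expiry is one possible policy.

```python
import asyncio


async def with_deadline(coro, seconds):
    """Run `coro`, returning None if it has not finished within `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None  # the coroutine was cancelled; caller decides to retry or skip


async def demo():
    async def slow():
        await asyncio.sleep(1.0)
        return "late"

    async def fast():
        return "ok"

    return await with_deadline(slow(), 0.05), await with_deadline(fast(), 0.05)


print(asyncio.run(demo()))  # → (None, 'ok')
```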

10. Gotchas

  • No timeout — one request stalls the pipeline
  • Excess concurrency — 100 workers × 1/s = 100/s, easy ban
  • Infinite retries — always set max_retries
  • Ignoring Retry-After — looks hostile
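The concurrency arithmetic can be turned around: pick the total rate you are willing to send, then derive how long each worker must wait between requests. per_worker_delay is a hypothetical helper for that calculation.

```python
def per_worker_delay(workers, target_rps):
    """Seconds each worker should wait between requests to keep the total at target_rps."""
    if workers <= 0 or target_rps <= 0:
        raise ValueError("workers and target_rps must be positive")
    return workers / target_rps


# 100 workers sharing a budget of 2 requests per second overall
print(per_worker_delay(100, 2))  # → 50.0
```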

Closing

A crawler that fails frequently is under-tuned. Aim for a success rate of about 95%, and use backoff and circuit breakers so your retries don't hurt the site or its other users.
