Step 2

Static vs dynamic — BS4 + Playwright

The wrong tool can make a crawl 10x slower and 10x more likely to be blocked, so decide which kind of page you have before writing any code.

1. Static (server-rendered)

curl returns the data you want.

  • 100–300 ms per request
  • Light on resources
  • requests or httpx, plus BeautifulSoup

2. Dynamic (JS-rendered)

The page source contains an empty <div id="app"></div>, and JavaScript fills it in after load.

  • 2–10 s per page
  • Hundreds of MB of memory per browser
  • Playwright / Selenium

3. Decide fast

curl -s https://target.com/page | grep "the text you want"

No match? Open DevTools → Network → XHR. Often there's a JSON API you can call directly.
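
The same check can be scripted when there are many candidate pages. A minimal sketch using only the standard library; `fetch_html` and `is_server_rendered` are illustrative names, not part of any library, and the `MyBot/1.0` User-Agent is an assumption.

```python
import urllib.request


def fetch_html(url, user_agent="MyBot/1.0", timeout=10.0):
    """Download the raw HTML as curl would see it (no JavaScript runs)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def is_server_rendered(html, needle):
    """True if the text you want is already present in the raw HTML."""
    return needle in html
```

Calling `is_server_rendered(fetch_html("https://target.com/page"), "the text you want")` mirrors the curl and grep pipeline: a False result is the cue to open the Network tab.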

4. Hidden APIs

Many "dynamic" sites actually call REST APIs. Calling them directly beats Playwright in speed and stability.

5. httpx + BS4

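import asyncio
import httpx
from bs4 import BeautifulSoup

async def scrape_items(url):
    """Yield one dict per div.item found on a server-rendered page."""
    async with httpx.AsyncClient(headers={"User-Agent": "MyBot/1.0"}) as client:
        resp = await client.get(url)
        resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    for item in soup.select("div.item"):
        yield {
            "title": item.select_one(".title").text.strip(),
            "price": item.select_one(".price").text.strip(),
        }

async def main():
    async for row in scrape_items("https://example.com/page"):
        print(row)

asyncio.run(main())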

6. Playwright

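import asyncio
from playwright.async_api import async_playwright

async def scrape_titles(url):
    """Render the page in headless Chromium and return every .item title."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".item")  # wait until JS has rendered the list
            return await page.locator(".item .title").all_inner_texts()
        finally:
            await browser.close()  # release the browser even if a wait times out

titles = asyncio.run(scrape_titles("https://example.com/page"))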

7. Optimizations

await page.route("**/*.{png,jpg,gif,svg,woff,woff2,css}", lambda r: r.abort())

Blocking images, fonts, and stylesheets can make page loads 3–5x faster.
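
Extension-based globs miss assets served from URLs without a file extension. One possible alternative is to filter on the request's resource type instead. The sketch below assumes Playwright's async API, where `route.request.resource_type`, `route.abort()`, and `route.continue_()` are available; `BLOCKED_TYPES` and `block_heavy_resources` are illustrative names.

```python
BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}


async def block_heavy_resources(route):
    """Abort requests for heavy assets; let documents, scripts, and XHR through."""
    if route.request.resource_type in BLOCKED_TYPES:
        await route.abort()
    else:
        await route.continue_()
```

Register it once per page with `await page.route("**/*", block_heavy_resources)` before calling `page.goto`.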

8. Reusable context

context = await browser.new_context(user_agent="MyBot/1.0")
page1 = await context.new_page()
page2 = await context.new_page()

Pages opened from the same context share cookies and storage, so a login on one page carries over to the others.

9. Hybrid

Playwright once for login / JS-rendered listing; BS4 in parallel for details.

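import asyncio
import httpx

# Runs inside an async function; extract_urls_with_playwright and fetch_bs4 are your own helpers.
urls = await extract_urls_with_playwright(list_page)
async with httpx.AsyncClient() as client:
    details = await asyncio.gather(*[fetch_bs4(client, u) for u in urls])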

10. Gotchas

  • Playwright for static pages — wasteful
  • BS4 for SPAs — empty HTML
  • Missing hidden APIs — check Network tab
  • Playwright's default timeout (30 s) is too short for slow sites; raise it explicitly
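
For the timeout gotcha, Playwright lets you raise the limit per page or per call. A minimal sketch assuming the async API; `open_slow_page` and the 90-second value are illustrative choices, not figures from this guide.

```python
async def open_slow_page(page, url, timeout_ms=90_000):
    """Navigate with a longer timeout than Playwright's 30 s default."""
    page.set_default_timeout(timeout_ms)  # applies to later waits and actions on this page
    await page.goto(url, timeout=timeout_ms)  # explicit timeout for the navigation itself
```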

Closing

"curl first, hidden API next, Playwright last" — preserves speed, stability, and politeness.

Next

  • 03-rate-limit-backoff
