ARTENIS ALIJA
Python · Scraping · Playwright · 6 min read · 10 March 2025

Python Scrapers in 2025: The Async Playwright Approach

httpx + BeautifulSoup handles 70% of jobs. For the rest — JS-rendered SPAs, bot detection, infinite scroll — Playwright async is the cleanest tool I've found.

The scraping landscape splits cleanly into two tiers. For static HTML pages, use httpx for fast async requests and BeautifulSoup or lxml for parsing. You can keep 50 concurrent workers busy on a single core and process thousands of pages per minute. For JavaScript-rendered content, reach for Playwright.
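A minimal sketch of the static tier, assuming httpx and beautifulsoup4 are installed. The names crawl, extract_title and main are illustrative, not from any library. Both third-party packages are imported lazily so the worker-pool logic can be exercised on its own with any async fetch function.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List, Optional, Union

Fetch = Callable[[str], Awaitable[str]]


async def crawl(
    urls: Iterable[str], fetch: Fetch, workers: int = 50
) -> Dict[str, Union[str, Exception]]:
    """Fetch every URL with at most `workers` requests in flight."""
    limit = asyncio.Semaphore(workers)
    results: Dict[str, Union[str, Exception]] = {}

    async def fetch_one(url: str) -> None:
        async with limit:
            try:
                results[url] = await fetch(url)
            except Exception as exc:  # keep the error; one bad page should not stop the run
                results[url] = exc

    await asyncio.gather(*(fetch_one(url) for url in urls))
    return results


def extract_title(html: str) -> Optional[str]:
    """Example parse step: return the page <title>, if there is one."""
    from bs4 import BeautifulSoup  # assumed installed (beautifulsoup4)

    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


async def main(urls: List[str]) -> None:
    import httpx  # assumed installed

    # One shared client reuses connections across all workers.
    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:

        async def fetch(url: str) -> str:
            response = await client.get(url)
            response.raise_for_status()
            return response.text

        pages = await crawl(urls, fetch, workers=50)

    for url, page in pages.items():
        if isinstance(page, Exception):
            print(f"{url}: failed ({page!r})")
        else:
            print(f"{url}: {extract_title(page)}")
```

To try it, call asyncio.run(main(["https://example.com/"])). The semaphore is what caps concurrency; the right worker count depends on the target and on how polite you need to be.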

Playwright's async API changed how I write scrapers. Instead of synchronising on page load events, I listen for specific network responses — the XHR call that returns the actual product data, for example. This is faster and more reliable than waiting for the DOM to settle.
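As a sketch of the response-listening pattern: the block below registers a wait for one specific response before navigating, then reads its JSON body. The /api/products path, the function names and the 15-second timeout are assumptions for illustration; the real endpoint is whatever shows up in the browser's network tab for the target site.

```python
from urllib.parse import urlparse


def is_product_response(url: str, status: int, path_prefix: str = "/api/products") -> bool:
    """True for a successful response from the (hypothetical) product endpoint."""
    return status == 200 and urlparse(url).path.startswith(path_prefix)


async def fetch_product_data(page_url: str, timeout_ms: float = 15_000):
    # Imported lazily so the predicate above can be used without a browser installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            # Register the wait before navigating so the response cannot be missed.
            async with page.expect_response(
                lambda r: is_product_response(r.url, r.status),
                timeout=timeout_ms,
            ) as response_info:
                # "commit" returns once the document starts loading,
                # so we do not wait for the full load event.
                await page.goto(page_url, wait_until="commit")
            response = await response_info.value
            return await response.json()
        finally:
            await browser.close()
```

Run it with asyncio.run(fetch_product_data(url)). Keeping the match logic in a plain function makes it easy to check against captured URLs without launching a browser.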

Bot detection is the harder problem. Rotating residential proxies help. So does using playwright-stealth and simulating realistic mouse movement before any clicks. Browser fingerprinting is where most cheap scrapers fail; spending time on fingerprint entropy (screen resolution, WebGL renderer, timezone) pays off when scraping heavily defended targets.
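A rough sketch of two of these ideas, under stated assumptions: a small pool of internally consistent profiles turned into keyword arguments for Playwright's browser.new_context, and a jittered mouse path replayed with page.mouse.move before a click. The profile values, the proxy address and the function names are made up for illustration. playwright-stealth is left out because its API differs between releases, and the WebGL renderer is not something context options control.

```python
import asyncio
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Profile:
    width: int
    height: int
    locale: str
    timezone_id: str


# Hypothetical profiles; locale and timezone should match the proxy's region.
PROFILES = [
    Profile(1920, 1080, "en-US", "America/New_York"),
    Profile(1536, 864, "en-GB", "Europe/London"),
    Profile(1440, 900, "de-DE", "Europe/Berlin"),
]


def context_options(profile: Profile, proxy_server: Optional[str] = None) -> dict:
    """Build keyword arguments for browser.new_context(**options)."""
    size = {"width": profile.width, "height": profile.height}
    options = {
        "viewport": dict(size),
        "screen": dict(size),
        "locale": profile.locale,
        "timezone_id": profile.timezone_id,
    }
    if proxy_server:
        options["proxy"] = {"server": proxy_server}
    return options


def mouse_path(start: Point, end: Point, steps: int = 20, jitter: float = 3.0,
               rng: Optional[random.Random] = None) -> List[Point]:
    """Eased path from start to end with small random offsets on the way."""
    rng = rng or random.Random()
    points: List[Point] = []
    for i in range(1, steps):
        t = i / steps
        eased = t * t * (3 - 2 * t)  # smoothstep: slow start, slow finish
        x = start[0] + (end[0] - start[0]) * eased + rng.uniform(-jitter, jitter)
        y = start[1] + (end[1] - start[1]) * eased + rng.uniform(-jitter, jitter)
        points.append((x, y))
    points.append((float(end[0]), float(end[1])))  # land exactly on the target
    return points


async def humanlike_click(page, selector: str, start: Point = (0.0, 0.0)) -> None:
    """Move the mouse along a jittered path to the element, then click it."""
    box = await page.locator(selector).bounding_box()
    if box is None:
        raise RuntimeError(f"{selector!r} is not visible")
    target = (box["x"] + box["width"] / 2, box["y"] + box["height"] / 2)
    for x, y in mouse_path(start, target):
        await page.mouse.move(x, y)
        await asyncio.sleep(random.uniform(0.005, 0.02))
    await page.mouse.click(*target)
```

A context would then be created with something like await browser.new_context(**context_options(random.choice(PROFILES), proxy_server="http://proxy.example:8000")). None of this defeats serious fingerprinting on its own; it only removes the most obvious inconsistencies.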

For job management I use a Postgres table as a simple queue — slugs, status, retry count, last error. It's less clever than Redis Streams but it's trivially introspectable, doesn't require another service, and handles backfill operations cleanly.
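One way such a table can work as a queue, sketched under assumptions: the table name scrape_jobs, the column names, the status values and the retry cap are all hypothetical, and the helper functions expect a psycopg 3 connection. The claim query uses FOR UPDATE SKIP LOCKED so that concurrent workers never pick up the same row.

```python
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS scrape_jobs (
    slug        text PRIMARY KEY,
    status      text NOT NULL DEFAULT 'pending',
    retry_count integer NOT NULL DEFAULT 0,
    last_error  text,
    updated_at  timestamptz NOT NULL DEFAULT now()
);
"""

# Backfill-friendly: re-enqueueing a known slug is a no-op.
ENQUEUE_SQL = "INSERT INTO scrape_jobs (slug) VALUES (%s) ON CONFLICT (slug) DO NOTHING;"

# Atomically claim the oldest pending job; rows locked by other workers are skipped.
CLAIM_SQL = """
UPDATE scrape_jobs
SET status = 'running', updated_at = now()
WHERE slug = (
    SELECT slug FROM scrape_jobs
    WHERE status = 'pending'
    ORDER BY updated_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING slug;
"""

DONE_SQL = """
UPDATE scrape_jobs
SET status = 'done', last_error = NULL, updated_at = now()
WHERE slug = %s;
"""

# Requeue until the retry cap is hit, then park the job as 'failed'.
FAIL_SQL = """
UPDATE scrape_jobs
SET retry_count = retry_count + 1,
    last_error = %(error)s,
    status = CASE WHEN retry_count + 1 >= %(max_retries)s
                  THEN 'failed' ELSE 'pending' END,
    updated_at = now()
WHERE slug = %(slug)s;
"""


def claim_next(conn):
    """Return the slug of a newly claimed job, or None if the queue is empty."""
    with conn.transaction():
        row = conn.execute(CLAIM_SQL).fetchone()
    return row[0] if row else None


def mark_done(conn, slug: str) -> None:
    with conn.transaction():
        conn.execute(DONE_SQL, (slug,))


def mark_failed(conn, slug: str, error: str, max_retries: int = 3) -> None:
    with conn.transaction():
        conn.execute(FAIL_SQL, {"slug": slug, "error": error, "max_retries": max_retries})
```

The introspection benefit shows up immediately: a query such as SELECT status, count(*) FROM scrape_jobs GROUP BY status gives the state of a whole run, and resetting failed rows to 'pending' is a one-line UPDATE.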